Using StAX to create index for XML for quick access

You could work with a generated XML parser using ANTLR4. The following works very well on a ~17GB Wikipedia dump (/20170501/dewiki-20170501-pages-articles-multistream.xml.bz2), but I had to increase the heap size to 6 GB using -Xmx6g.

1. Get XML Grammar
   cd /tmp
   git clone https://github.com/antlr/grammars-v4
2. Generate Parser
   cd /tmp/grammars-v4/xml/
   mvn clean install
3. Copy Generated Java files to your Project …

STL deque accessing by index is O(1)?

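I found this description of a deque implementation on Wikipedia: "Storing contents in multiple smaller arrays, allocating additional arrays at the beginning or end as needed. Indexing is implemented by keeping a dynamic array containing pointers to each of the smaller arrays." I guess it answers my question: with fixed-size smaller arrays, an index lookup is one access into the pointer array plus one access into the selected smaller array, so it takes constant time.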
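To make the two-step lookup concrete, here is a minimal sketch of that layout in Python. It is only an illustration of the Wikipedia description, not how any particular STL implementation is written; the class name `BlockDeque` and the tiny `BLOCK_SIZE` are made up for the example.

```python
BLOCK_SIZE = 4  # hypothetical block size, kept tiny so the layout is easy to follow


class BlockDeque:
    """Toy deque: fixed-size blocks plus a list ("map") of references to them."""

    def __init__(self, items=()):
        self.blocks = []  # plays the role of the dynamic array of pointers
        self.start = 0    # offset of the first element inside blocks[0]
        self.size = 0
        for item in items:
            self.append(item)

    def append(self, value):
        pos = self.start + self.size
        if pos // BLOCK_SIZE == len(self.blocks):
            # The last block is full (or there is none yet): add one at the end.
            self.blocks.append([None] * BLOCK_SIZE)
        self.blocks[pos // BLOCK_SIZE][pos % BLOCK_SIZE] = value
        self.size += 1

    def appendleft(self, value):
        if self.start == 0:
            # No room before the first element: add a block at the front.
            # (list.insert shifts the map here; real implementations keep
            # spare slots at both ends of the map to avoid that.)
            self.blocks.insert(0, [None] * BLOCK_SIZE)
            self.start = BLOCK_SIZE
        self.start -= 1
        self.blocks[0][self.start] = value
        self.size += 1

    def __getitem__(self, i):
        if not 0 <= i < self.size:
            raise IndexError(i)
        pos = self.start + i
        # Two constant-time steps: pick the block, then the slot inside it.
        return self.blocks[pos // BLOCK_SIZE][pos % BLOCK_SIZE]
```

The lookup in `__getitem__` is one division, one remainder and two array accesses regardless of how many elements are stored, which is where the O(1) comes from.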

Python Random Access File

This seems like just the sort of thing mmap was designed for. A mmap object gives you a string-like interface to a file (bytes-like in Python 3):

    >>> from mmap import mmap
    >>> f = open("bonnie.txt", "wb")
    >>> f.write(b"My Bonnie lies over the ocean.")
    30
    >>> f.close()
    >>> f = open("bonnie.txt", "r+b")
    >>> mm = mmap(f.fileno(), 0)
    >>> print(mm[3:9])
    b'Bonnie'

In case you were wondering, mmap objects …
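As a rough sketch of how this could be wrapped for random access, the two helpers below read and overwrite a byte range through a mapping. The function names `read_slice` and `overwrite` are made up for illustration. Note that mapping an empty file raises an error, and that slice assignment on a mmap cannot change the file's length.

```python
import mmap


def read_slice(path, start, stop):
    """Return bytes [start:stop) of the file without reading the rest."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are loaded only when touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[start:stop]


def overwrite(path, offset, data):
    """Replace len(data) bytes at offset in place; the file size is unchanged."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:
            mm[offset:offset + len(data)] = data
            mm.flush()  # push the modified pages back to the file
```

With the bonnie.txt file from above, `read_slice("bonnie.txt", 3, 9)` would give the same six bytes as `mm[3:9]`.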

Compression formats with good support for random access within archives? [closed]

Take a look at dictzip. It is compatible with gzip and allows coarse random access. An excerpt from its man page: dictzip compresses files using the gzip(1) algorithm (LZ77) in a manner which is completely compatible with the gzip file format. An extension to the gzip file format (Extra Field, described in 2.3.1.1 of RFC …
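The general idea behind coarse random access can be sketched in a few lines of Python: compress the data in independent chunks, remember where each compressed chunk starts, and decompress only the chunks that cover the requested range. This is a simplified illustration using zlib, not the dictzip file format (dictzip keeps its chunk table in the gzip header's extra field so that the result is still a valid gzip file); the names `compress_chunked` and `read_range` and the chunk size are assumptions for the example.

```python
import zlib

CHUNK = 64 * 1024  # hypothetical uncompressed chunk size


def compress_chunked(data, chunk_size=CHUNK):
    """Compress data chunk by chunk; return (blob, index).

    index[k] is the (offset, length) of compressed chunk k inside blob.
    """
    blob = bytearray()
    index = []
    for i in range(0, len(data), chunk_size):
        compressed = zlib.compress(data[i:i + chunk_size])
        index.append((len(blob), len(compressed)))
        blob += compressed
    return bytes(blob), index


def read_range(blob, index, start, length, chunk_size=CHUNK):
    """Return data[start:start+length], decompressing only the chunks needed."""
    first = start // chunk_size
    last = (start + length - 1) // chunk_size
    out = bytearray()
    for k in range(first, min(last, len(index) - 1) + 1):
        offset, size = index[k]
        out += zlib.decompress(blob[offset:offset + size])
    skip = start - first * chunk_size  # bytes before `start` in the first chunk
    return bytes(out[skip:skip + length])
```

The access is "coarse" because the smallest unit that can be decompressed is a whole chunk: smaller chunks mean less wasted work per read but a worse compression ratio.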