Using StAX to create an index for XML for quick access
You could work with an XML parser generated by ANTLR4. The following works very well on a ~17 GB Wikipedia dump, /20170501/dewiki-20170501-pages-articles-multistream.xml.bz2, but I had to increase the heap size using -Xmx6g.

1. Get the XML grammar
   cd /tmp
   git clone https://github.com/antlr/grammars-v4
2. Generate the parser
   cd /tmp/grammars-v4/xml/
   mvn clean install
3. Copy the generated Java files to your project …
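
For comparison with the ANTLR route, the StAX idea from the title could look roughly like the sketch below. It is not taken from the answer above. It uses only the JDK's javax.xml.stream API and assumes the dump wraps each article in a <page> element that contains a <title> child. The class name PageIndexSketch and the method indexPages are hypothetical names chosen for illustration. The recorded offset is whatever Location.getCharacterOffset() reports, which counts characters of the decompressed stream (not bytes of the .bz2 file) and may be -1 if the parser cannot supply it.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: streams over the XML once and remembers where each page starts.
class PageIndexSketch {

    // Returns title -> character offset reported by the parser at the <page> start tag.
    static Map<String, Integer> indexPages(Reader in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // The dump needs neither a DTD nor external entities, so switch both off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

        XMLStreamReader reader = factory.createXMLStreamReader(in);
        Map<String, Integer> index = new LinkedHashMap<>();
        boolean inPage = false;   // true between <page> and its first <title>
        int pageOffset = -1;      // offset recorded for the current page

        try {
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String name = reader.getLocalName();
                if ("page".equals(name)) {
                    inPage = true;
                    pageOffset = reader.getLocation().getCharacterOffset();
                } else if (inPage && "title".equals(name)) {
                    // getElementText() reads the text content and stops at </title>.
                    index.put(reader.getElementText(), pageOffset);
                    inPage = false;
                }
            }
        } finally {
            reader.close();
        }
        return index;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Tiny stand-in for a dump; a real run would pass a Reader over the decompressed file.
        String xml = "<mediawiki>"
                + "<page><title>Alpha</title><text>a</text></page>"
                + "<page><title>Beta</title><text>b</text></page>"
                + "</mediawiki>";
        indexPages(new StringReader(xml))
                .forEach((title, offset) -> System.out.println(title + " @ " + offset));
    }
}
```

Because StAX pulls one event at a time, this pass keeps only the index in memory, so it should need far less heap than building a full parse tree. Turning the recorded character offsets into seekable positions in the compressed file would take extra work that this sketch does not attempt.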