Using StAX to create index for XML for quick access

You could work with a generated XML parser using ANTLR4. The following works very well on a ~17GB Wikipedia dump, /20170501/dewiki-20170501-pages-articles-multistream.xml.bz2, but I had to increase the heap size using -Xmx6g.
1. Get the XML grammar:
    cd /tmp
    git clone https://github.com/antlr/grammars-v4
2. Generate the parser:
    cd /tmp/grammars-v4/xml/
    mvn clean install
3. Copy the generated Java files to your project …

Is there a Java XML API that can parse a document without resolving character entities?

The StAX API supports leaving character entity references unreplaced, by way of the IS_REPLACING_ENTITY_REFERENCES property: "Requires the parser to replace internal entity references with their replacement text and report them as characters." This can be set on an XMLInputFactory, which is then in turn used to construct an XMLEventReader or …
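As a minimal sketch of that property in use, the class below parses a small document twice, once with replacement switched off and once with it on. It uses an XMLStreamReader; the class name, the `describe` helper, and the sample document with its `foo` entity are hypothetical and exist only for illustration. The behaviour described assumes the StAX implementation bundled with the JDK; other implementations may report entity references differently.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class EntityReferenceDemo {

    // Hypothetical sample: "foo" is an internal entity declared in the DTD subset.
    static final String SAMPLE =
            "<!DOCTYPE r [<!ENTITY foo \"bar\">]><r>a&foo;b</r>";

    /**
     * Parses xml and lists its text and entity-reference events as strings
     * of the form "TEXT:..." or "ENTITY_REF:name".
     */
    static List<String> describe(String xml, boolean replaceEntities) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // false asks the parser to report internal entity references
        // as separate events instead of expanding them into text.
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES,
                replaceEntities);

        List<String> events = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int type = reader.next();
                if (type == XMLStreamConstants.CHARACTERS) {
                    events.add("TEXT:" + reader.getText());
                } else if (type == XMLStreamConstants.ENTITY_REFERENCE) {
                    // For this event type, getLocalName() is the entity name.
                    events.add("ENTITY_REF:" + reader.getLocalName());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            // Wrapped only to keep the demo's call sites short.
            throw new IllegalStateException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(describe(SAMPLE, false));
        System.out.println(describe(SAMPLE, true));
    }
}
```

With replacement off, the reference to `foo` should show up as its own event between the surrounding text; with it on, the replacement text arrives as ordinary character data.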

StAX XMLStreamReader: check for the next event without moving ahead

“Going back” in a stream implies some kind of memory, so there is no point in sticking to the most memory-efficient tool. XMLEventReader can handle this with ease:

    public class Main {
        public static void main(String[] args) throws Exception {
            Unmarshaller aUnmarshaller = JAXBContext.newInstance(A.class).createUnmarshaller();
            Unmarshaller bUnmarshaller = JAXBContext.newInstance(B.class).createUnmarshaller();
            Unmarshaller cUnmarshaller = JAXBContext.newInstance(C.class).createUnmarshaller();
            try (InputStream input …
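The look-ahead itself can be sketched without JAXB, using only `XMLEventReader.peek()`, which returns the upcoming event without consuming it. This is an illustrative sketch rather than the excerpt's own code: the class name and both helper methods are hypothetical.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class PeekDemo {

    /**
     * Advances to the next start element and returns its local name
     * WITHOUT consuming it. Returns null if no start element is left.
     */
    static String peekStartElement(XMLEventReader reader)
            throws XMLStreamException {
        XMLEvent next;
        while ((next = reader.peek()) != null) {
            if (next.isStartElement()) {
                return next.asStartElement().getName().getLocalPart();
            }
            reader.nextEvent(); // not a start element: consume and keep looking
        }
        return null;
    }

    /**
     * Returns { name seen by peeking, name of the event actually read next }
     * for the first element in xml. Assumes xml contains at least one element.
     */
    static String[] peekThenRead(String xml) {
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            String peeked = peekStartElement(reader);
            // The peeked event is still pending, so nextEvent() returns it.
            XMLEvent consumed = reader.nextEvent();
            String read = consumed.asStartElement().getName().getLocalPart();
            reader.close();
            return new String[] { peeked, read };
        } catch (XMLStreamException e) {
            // Wrapped only to keep the demo's call sites short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] names = peekThenRead("<a><b/></a>");
        System.out.println(names[0] + " " + names[1]);
    }
}
```

Because the peeked event stays in the reader, a caller can inspect the element name first and only then decide which unmarshaller, if any, should consume that element.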