Are GCC and Clang parsers really handwritten?

There’s a folk-theorem that says C is hard to parse, and C++ essentially impossible.

It isn’t true.

What is true is that C and C++ are pretty hard to parse using LALR(1) parsers without hacking the parsing machinery and tangling in symbol table data. GCC in fact used to parse them this way, using YACC plus that kind of additional hackery, and yes, it was ugly. Now GCC uses handwritten parsers, but still with the symbol table hackery. The Clang folks never tried to use automated parser generators; AFAIK the Clang parser has always been hand-coded recursive descent.
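To make the symbol table tangle concrete, here is a minimal sketch. The function and variable names are made up for illustration and are not taken from GCC or Clang. The statement “T * x;” is token-for-token identical in both functions, yet it is a declaration in one and a multiplication in the other, so an LALR(1) parser cannot pick a rule without knowing what T currently names.

```cpp
#include <cassert>

// Hypothetical example: same tokens, two different parses.

// Here T names a type, so "T * x;" declares x as a pointer to int.
bool as_declaration() {
    typedef int T;
    T * x;                 // declaration: x has type int*
    x = nullptr;
    return x == nullptr;   // x really is a pointer variable
}

// Here T and x are ints, so "T * x;" is a multiplication whose result
// is discarded (compilers typically warn about the unused value).
int as_expression() {
    int T = 6, x = 7;
    T * x;                 // expression statement: value is thrown away
    return T * x;          // 6 * 7
}

// Usage check: both functions contain the identical statement "T * x;".
void check_both_readings() {
    assert(as_declaration());
    assert(as_expression() == 42);
}
```

A parser that feeds symbol table lookups back into the lexer or the grammar actions can resolve this on the spot, which is the hackery described above.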

What is true is that C and C++ are relatively easy to parse with stronger automatically generated parsers, e.g., GLR parsers, and you don’t need any hacks. The Elsa C++ parser is one example of this. Our C++ Front End is another (as are all our “compiler” front ends; GLR is pretty wonderful parsing technology).
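One way to picture what a stronger parser buys you is the rough sketch below. It is NOT a GLR engine, and it is not how Elsa or our front end is implemented; every name in it is hypothetical. It only illustrates the idea of keeping both readings of a statement shaped like “T * x;” at parse time, and letting a later pass that knows the type names throw away the wrong one.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical illustration only; a real GLR parser builds this from a grammar.
enum class Reading { Declaration, Multiplication };

// "Parse" step: for a statement shaped like `A * B ;`, both readings are
// grammatical, so keep both instead of consulting a symbol table.
std::vector<Reading> parse_star_statement(const std::vector<std::string>& tokens) {
    std::vector<Reading> readings;
    bool shape_matches = tokens.size() == 4 && tokens[1] == "*" && tokens[3] == ";";
    if (shape_matches) {
        readings.push_back(Reading::Declaration);
        readings.push_back(Reading::Multiplication);
    }
    return readings;
}

// Later semantic step: keep only the readings consistent with the type names.
std::vector<Reading> resolve(const std::vector<Reading>& readings,
                             const std::string& first_identifier,
                             const std::set<std::string>& type_names) {
    bool is_type = type_names.count(first_identifier) > 0;
    std::vector<Reading> kept;
    for (Reading r : readings) {
        if ((r == Reading::Declaration) == is_type) kept.push_back(r);
    }
    return kept;
}

// Usage check: the parse is identical; only the later pass differs.
void check_deferred_resolution() {
    std::vector<std::string> tokens = {"T", "*", "x", ";"};
    std::vector<Reading> both = parse_star_statement(tokens);
    assert(both.size() == 2);

    std::set<std::string> type_names;
    type_names.insert("T");
    std::vector<Reading> as_decl = resolve(both, tokens[0], type_names);
    assert(as_decl.size() == 1 && as_decl[0] == Reading::Declaration);

    std::vector<Reading> as_expr = resolve(both, tokens[0], std::set<std::string>());
    assert(as_expr.size() == 1 && as_expr[0] == Reading::Multiplication);
}
```

The design point is that the grammar stays purely syntactic; name lookup happens afterwards, where it belongs.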

Our C++ front end isn’t as fast as GCC’s, and is certainly slower than Elsa; we’ve put little energy into tuning it carefully because we have other more pressing issues (nonetheless, it has been used on millions of lines of C++ code). Elsa is likely slower than GCC simply because it is more general. Given processor speeds these days, these differences might not matter a lot in practice.

But the “real compilers” that are widely distributed today have their roots in compilers of 10 or 20 years ago or more. Inefficiencies then mattered much more, and nobody had heard of GLR parsers, so people did what they knew how to do. Clang is certainly more recent, but then folk theorems retain their “persuasiveness” for a long time.

You don’t have to do it that way anymore. You can very reasonably use GLR and other such parsers as front ends, with an improvement in compiler maintainability.

What is true is that getting a grammar that matches your friendly neighborhood compiler’s behavior is hard. While virtually all C++ compilers implement most of the original standard, they also tend to have lots of dark-corner extensions, e.g., DLL specifications in MS compilers. If you have a strong parsing engine, you can spend your time trying to get the final grammar to match reality, rather than trying to bend your grammar to match the limitations of your parser generator.
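For a flavor of what those dark-corner extensions look like, here is a small sketch. The macro and function names are invented for the example. The same function needs a different vendor-specific annotation under MS and GNU-style compilers, and neither spelling is part of the standard grammar, so a front end that wants to read real code has to accept both.

```cpp
#include <cassert>

// Hypothetical macro: picks the vendor-specific export annotation.
#if defined(_MSC_VER)
#define DEMO_EXPORT __declspec(dllexport)                    // MS dialect
#elif defined(__GNUC__)
#define DEMO_EXPORT __attribute__((visibility("default")))   // GNU dialect
#else
#define DEMO_EXPORT                                          // plain standard C++
#endif

// After preprocessing, the parser sees the vendor syntax in front of an
// otherwise ordinary function definition.
DEMO_EXPORT int exported_answer() { return 42; }

// Quick check that the annotated function still behaves like a normal one.
void check_exported_answer() { assert(exported_answer() == 42); }
```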

EDIT November 2012: Since writing this answer, we’ve improved our C++ front end to handle full C++11, including ANSI, GNU, and MS variant dialects. While there was lots of extra stuff, we didn’t have to change our parsing engine; we just revised the grammar rules. We did have to change the semantic analysis; C++11 is semantically very complicated, and this work swamps the effort to get the parser to run.

EDIT February 2015: … now handles full C++14. (See “get human readable AST from c++ code” for GLR parses of a simple bit of code, and C++’s infamous “most vexing parse”.)
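For readers who haven’t met the “most vexing parse”, a minimal sketch follows; the type and variable names are made up. The first declaration reads like an object constructed from a temporary, but the grammar also allows a function declaration, and the language rules prefer that reading.

```cpp
#include <type_traits>

// Hypothetical types for illustration.
struct Timer {};
struct Keeper { explicit Keeper(Timer) {} };

// Looks like "construct a Keeper from a temporary Timer", but it declares a
// function named vexing that takes a pointer to a function returning Timer
// and returns a Keeper.
Keeper vexing(Timer());
static_assert(std::is_function<decltype(vexing)>::value,
              "vexing is a function declaration, not an object");

// Brace initialization (C++11) cannot be read as a function declaration.
Keeper fixed{Timer{}};
static_assert(std::is_same<decltype(fixed), Keeper>::value,
              "fixed is a Keeper object");
```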

EDIT April 2017: Now handles (draft) C++17.
