Parsing strings of user input using the cparse library from git

Failing that, is there another way to parse strings of user input and use them as functions?

I would like to answer this part of your question because I have the feeling that a full-featured C parser might be a little bit too heavy for what might be your intention. (Btw. once you got the C parser running – how to process its output? Dynamic linking?)

Instead, I want to show you how to build a simple calculator (with recursive descent parser) on your own. For the documentation of the techniques I will use, I warmly recommend Compilers (Principles, Techniques & Tools) by Aho, Lam, Sethi, Ullman (better known as “Dragon books”) and especially Chapter 4.

The Tiny Calculator Project

In the following I describe my sample solution part by part.

Language Design

Before starting to write a compiler or interpreter, it’s reasonable to define a language which shall be accepted. I want to use a very limited sub-set of C: expressions consisting of

  • C like floating point numbers (constants)
  • C like identifiers (variables)
  • unary operators + and -
  • binary operators +, -, *, and /
  • parentheses ()
  • semicolons ; (to mark the end of an expression, mandatory).

Whitespaces (including line-breaks) will be simply ignored but may be used to separate things as well as to improve human readability. C or C++ like comments (and a lot of other sugar) I didn’t consider to keep the source code as minimal as possible. (For all that, I got nearly 500 lines.)

The specific example of the OP will fit into this language with an added semicolon:

a*(b+3);

There will be only one type supported: double. Thus, I don’t need types or any declaration which makes things easier.

Before I started to sketch the grammar of this language, I was thinking about the “target” of compiling and decided to make classes for…

An Abstract Syntax Tree

At first – a class to store variables:

// storage of variables

class Var {
  private:
    double _value;
  public:
    Var(): _value() { }
    ~Var() = default;
    double get() const { return _value; }
    void set(double value) { _value = value; }
};

The variables provide storage for a value but not for the identifier. The latter is stored separately as it is not needed for the actual usage of a variable but only to find it by name:

typedef std::map<std::string, Var> VarTable;

The usage of a std::map automates the creation of a variable. As known from many high level languages, a variable starts to exist when its accessed the first time.

The Abstract Syntax Tree is a

a tree representation of the abstract syntactic structure of source code written in a programming language. Each node of the tree denotes a construct occurring in the source code.

I took this text from the above linked Wikipedia article – couldn’t said it shorter. In the following my classes for the AST:

// abstract syntax tree -> storage of "executable"

namespace AST {

class Expr {
  protected:
    Expr() = default;
  public:
    virtual ~Expr() = default;
  public:
    virtual double solve() const = 0;
};

class ExprConst: public Expr {
  private:
    double _value;
  public:
    ExprConst(double value): Expr(), _value(value) { }
    virtual ~ExprConst() = default;
    virtual double solve() const { return _value; }
};

class ExprVar: public Expr {
  private:
    Var *_pVar;
  public:
    ExprVar(Var *pVar): Expr(), _pVar(pVar) { }
    virtual ~ExprVar() = default;
    virtual double solve() const { return _pVar->get(); }
};

class ExprUnOp: public Expr {
  protected:
    Expr *_pArg1;
  protected:
    ExprUnOp(Expr *pArg1): Expr(), _pArg1(pArg1) { }
    virtual ~ExprUnOp() { delete _pArg1; }
};

class ExprUnOpNeg: public ExprUnOp {
  public:
    ExprUnOpNeg(Expr *pArg1): ExprUnOp(pArg1) { }
    virtual ~ExprUnOpNeg() = default;
    virtual double solve() const
    {
      return -_pArg1->solve();
    }
};

class ExprBinOp: public Expr {
  protected:
    Expr *_pArg1, *_pArg2;
  protected:
    ExprBinOp(Expr *pArg1, Expr *pArg2):
      Expr(), _pArg1(pArg1), _pArg2(pArg2)
    { }
    virtual ~ExprBinOp() { delete _pArg1; delete _pArg2; }
};

class ExprBinOpAdd: public ExprBinOp {
  public:
    ExprBinOpAdd(Expr *pArg1, Expr *pArg2): ExprBinOp(pArg1, pArg2) { }
    virtual ~ExprBinOpAdd() = default;
    virtual double solve() const
    {
      return _pArg1->solve() + _pArg2->solve();
    }
};

class ExprBinOpSub: public ExprBinOp {
  public:
    ExprBinOpSub(Expr *pArg1, Expr *pArg2): ExprBinOp(pArg1, pArg2) { }
    virtual ~ExprBinOpSub() = default;
    virtual double solve() const
    {
      return _pArg1->solve() - _pArg2->solve();
    }
};

class ExprBinOpMul: public ExprBinOp {
  public:
    ExprBinOpMul(Expr *pArg1, Expr *pArg2): ExprBinOp(pArg1, pArg2) { }
    virtual ~ExprBinOpMul() = default;
    virtual double solve() const
    {
      return _pArg1->solve() * _pArg2->solve();
    }
};

class ExprBinOpDiv: public ExprBinOp {
  public:
    ExprBinOpDiv(Expr *pArg1, Expr *pArg2): ExprBinOp(pArg1, pArg2) { }
    virtual ~ExprBinOpDiv() = default;
    virtual double solve() const
    {
      return _pArg1->solve() / _pArg2->solve();
    }
};

Thus, using the AST classes, the representation of sample a*(b+3) would be

VarTable varTable;
Expr *pExpr
= new ExprBinOpMul(
    new ExprVar(&varTable["a"]),
    new ExprBinOpAdd(
      new ExprVar(&varTable["b"]),
      new ExprConst(3)));

Note:

There is no class derived from Expr to represent the parentheses () because this is simply not necessary. The processing of parentheses is considered when building the tree itself. Normally, the operators with higher precedence become children of the operators with lower precedence. As a result, the former are computed before the latter. In the above example, the instance of ExprBinOpAdd is a child of the instance of ExprBinOpMul (although precedence of multiply is higher than precedence of add) which results from the proper consideration of the parentheses.

Beside of storing a parsed expression, this tree can be used to compute the expression by calling the Expr::solve() method of the root node:

double result = pExpr->solve();

Having a backend for our Tiny Calculator, next is the front-end.

A Grammar

A formal language is best described by a grammar.

program
  : expr Semicolon program
  | <empty>
  ;

expr
  : addExpr
  ;

addExpr
  : mulExpr addExprRest
  ;

addExprRest
  : addOp mulExpr addExprRest
  | <empty>
  ;

addOp
  : Plus | Minus
  ;

mulExpr
  : unaryExpr mulExprRest
  ;

mulExprRest
  : mulOp unaryExpr mulExprRest
  | <empty>
  ;

mulOp
  : Star | Slash
  ;

unaryExpr
  : unOp unaryExpr
  | primExpr
  ;

unOp
  : Plus | Minus
  ;

primExpr
  : Number
  | Id
  | LParen expr RParen
  ;

with start symbol program.

The rules contain

  • terminal symbols (starting with uppercase letter) and
  • non-terminal symbols (starting with lowercase)
  • a colon (:) to separate left side from right side (non-terminal on left side may expand to the symbols on the right side).
  • vertical bars (|) to separate alternatives
  • an <empty> symbol for expanding to nothing (used to terminate recursions).

From the terminal symbols, I will derive the tokens for the scanner.

The non-terminal symbols will be transformed into the parser functions.

The separation of addExpr and mulExpr has been done intentionally. Thus, the precedence of multiplicative operators over additive operators will be “burnt in” the grammar itself. Obviously, the floating point constants, variable identifiers, or expressions in parentheses (accepted in primExpr) will have highest precedence.

The rules contain only right recursions. This is a requirement for recursive-descent parsers (as I’ve learnt theoretically in the Dragon books and from practical experiences in the debugger until I fully understood the reason why). Implementing a left recursive rule in a recursive-descent parser results in non-terminated recursions which, in turn, end up with a StackOverflow.

The Scanner

It’s usual to split the compiler in a scanner and a parser.

The Scanner processes the input character stream and groups characters together to tokens. The tokens are used as terminal symbols in the parser.

For the output of tokens, I made a class. In my professional projects, it stores additionally the exact file position to denote its origin. This is convenient to tag created objects with source code references as well as any output of error messages and debugging info. (…left out here to keep it as minimal as possible…)

// token class - produced in scanner, consumed in parser
struct Token {
  // tokens
  enum Tk {
    Plus, Minus, Star, Slash, LParen, RParen, Semicolon,
    Number, Id,
    EOT, Error
  };
  // token number
  Tk tk;
  // lexem as floating point number
  double number;
  // lexem as identifier
  std::string id;

  // constructors.
  explicit Token(Tk tk): tk(tk), number() { }
  explicit Token(double number): tk(Number), number(number) { }
  explicit Token(const std::string &id): tk(Id), number(), id(id) { }
};

There are two enumerators for special tokens:

  • EOT … end of text (remarks the end of input)
  • Error … generated for any character which does not fit in any other token.

The tokens are used as output of the actual scanner:

// the scanner - groups characters to tokens
class Scanner {
  private:
    std::istream &_in;
  public:
    // constructor.
    Scanner(std::istream &in): _in(in) { }
    /* groups characters to next token until the first character
     * which does not match (or end-of-file is reached).
     */
    Token scan()
    {
      char c;
      // skip white space
      do {
        if (!(_in >> c)) return Token(Token::EOT);
      } while (isspace(c));
      // classify character and build token
      switch (c) {
        case '+': return Token(Token::Plus);
        case '-': return Token(Token::Minus);
        case '*': return Token(Token::Star);
        case "https://stackoverflow.com/": return Token(Token::Slash);
        case '(': return Token(Token::LParen);
        case ')': return Token(Token::RParen);
        case ';': return Token(Token::Semicolon);
        default:
          if (isdigit(c)) {
            _in.unget(); double value; _in >> value;
            return Token(value);
          } else if (isalpha(c) || c == '_') {
            std::string id(1, c);
            while (_in >> c) {
              if (isalnum(c) || c == '_') id += c;
              else { _in.unget(); break; }
            }
            return Token(id);
          } else {
            _in.unget();
            return Token(Token::Error);
          }
      }
    }
};

The scanner is used in the parser.

The Parser

class Parser {
  private:
    Scanner _scanner;
    VarTable &_varTable;
    Token _lookAhead;

  private:
    // constructor.
    Parser(std::istream &in, VarTable &varTable):
      _scanner(in), _varTable(varTable), _lookAhead(Token::EOT)
    {
      scan(); // load look ahead initially
    }

    // calls the scanner to read the next look ahead token.
    void scan() { _lookAhead = _scanner.scan(); }

    // consumes a specific token.
    bool match(Token::Tk tk)
    {
      if (_lookAhead.tk != tk) {
        std::cerr << "SYNTAX ERROR! Unexpected token!" << std::endl;
        return false;
      }
      scan();
      return true;
    }

    // the rules:

    std::vector<AST::Expr*> parseProgram()
    {
      // right recursive rule
      // -> can be done as iteration
      std::vector<AST::Expr*> pExprs;
      for (;;) {
        if (AST::Expr *pExpr = parseExpr()) {
          pExprs.push_back(pExpr);
        } else break;
        // special error checking for missing ';' (usual error)
        if (_lookAhead.tk != Token::Semicolon) {
          std::cerr << "SYNTAX ERROR: Semicolon expected!" << std::endl;
          break;
        }
        scan(); // consume semicolon
        if (_lookAhead.tk == Token::EOT) return pExprs;
      }
      // error handling
      for (AST::Expr *pExpr : pExprs) delete pExpr;
      pExprs.clear();
      return pExprs;
    }

    AST::Expr* parseExpr()
    {
      return parseAddExpr();
    }

    AST::Expr* parseAddExpr()
    {
      if (AST::Expr *pExpr1 = parseMulExpr()) {
        return parseAddExprRest(pExpr1);
      } else return nullptr; // ERROR!
    }

    AST::Expr* parseAddExprRest(AST::Expr *pExpr1)
    {
      // right recursive rule for left associative operators
      // -> can be done as iteration
      for (;;) {
        switch (_lookAhead.tk) {
          case Token::Plus:
            scan(); // consume token
            if (AST::Expr *pExpr2 = parseMulExpr()) {
              pExpr1 = new AST::ExprBinOpAdd(pExpr1, pExpr2);
            } else {
              delete pExpr1;
              return nullptr; // ERROR!
            }
            break;
          case Token::Minus:
            scan(); // consume token
            if (AST::Expr *pExpr2 = parseMulExpr()) {
              pExpr1 = new AST::ExprBinOpSub(pExpr1, pExpr2);
            } else {
              delete pExpr1;
              return nullptr; // ERROR!
            }
            break;
          case Token::Error:
            std::cerr << "SYNTAX ERROR: Unexpected character!" << std::endl;
            delete pExpr1;
            return nullptr;
          default: return pExpr1;
        }
      }
    }

    AST::Expr* parseMulExpr()
    {
      if (AST::Expr *pExpr1 = parseUnExpr()) {
        return parseMulExprRest(pExpr1);
      } else return nullptr; // ERROR!
    }

    AST::Expr* parseMulExprRest(AST::Expr *pExpr1)
    {
      // right recursive rule for left associative operators
      // -> can be done as iteration
      for (;;) {
        switch (_lookAhead.tk) {
          case Token::Star:
            scan(); // consume token
            if (AST::Expr *pExpr2 = parseUnExpr()) {
              pExpr1 = new AST::ExprBinOpMul(pExpr1, pExpr2);
            } else {
              delete pExpr1;
              return nullptr; // ERROR!
            }
            break;
          case Token::Slash:
            scan(); // consume token
            if (AST::Expr *pExpr2 = parseUnExpr()) {
              pExpr1 = new AST::ExprBinOpDiv(pExpr1, pExpr2);
            } else {
              delete pExpr1;
              return nullptr; // ERROR!
            }
            break;
          case Token::Error:
            std::cerr << "SYNTAX ERROR: Unexpected character!" << std::endl;
            delete pExpr1;
            return nullptr;
          default: return pExpr1;
        }
      }
    }

    AST::Expr* parseUnExpr()
    {
      // right recursive rule for right associative operators
      // -> must be done as recursion
      switch (_lookAhead.tk) {
        case Token::Plus:
          scan(); // consume token
          // as a unary plus has no effect it is simply ignored
          return parseUnExpr();
        case Token::Minus:
          scan();
          if (AST::Expr *pExpr = parseUnExpr()) {
            return new AST::ExprUnOpNeg(pExpr);
          } else return nullptr; // ERROR!
        default:
          return parsePrimExpr();
      }
    }

    AST::Expr* parsePrimExpr()
    {
      AST::Expr *pExpr = nullptr;
      switch (_lookAhead.tk) {
        case Token::Number:
          pExpr = new AST::ExprConst(_lookAhead.number);
          scan(); // consume token
          break;
        case Token::Id: {
          Var &var = _varTable[_lookAhead.id]; // find or create
          pExpr = new AST::ExprVar(&var);
          scan(); // consume token
        } break;
        case Token::LParen:
          scan(); // consume token
          if (!(pExpr = parseExpr())) return nullptr; // ERROR!
          if (!match(Token::RParen)) {
            delete pExpr; return nullptr; // ERROR!
          }
          break;
        case Token::EOT:
          std::cerr << "SYNTAX ERROR: Premature EOF!" << std::endl;
          break;
        case Token::Error:
          std::cerr << "SYNTAX ERROR: Unexpected character!" << std::endl;
          break;
        default:
          std::cerr << "SYNTAX ERROR: Unexpected token!" << std::endl;
      }
      return pExpr;
    }

  public:

    // the parser function
    static std::vector<AST::Expr*> parse(
      std::istream &in, VarTable &varTable)
    {
      Parser parser(in, varTable);
      return parser.parseProgram();
    }
};

Basically, the parser consists essentially of a bunch of rule functions (according to the grammar rules) which are calling each other. The class around the rule functions is responsible to manage some global parser context. Hence, the only public method of class Parser is

static std::vector<AST::Expr*> Parser::parse();

which constructs an instance (with the private constructor) and calls the function Parser::parseProgram() corresponding to the start symbol program.

Internally, the parser calls the Scanner::scan() method to fill its look-ahead token.

This is done in Parser::scan() which is called always when a token has to be consumed.

Looking closer, a pattern becomes visible how the rules have been translated into the parser functions:

  • Each non-terminal on the left side becomes a parse function. (Looking closer in the source code, you will realize that I didn’t do this exactly. Some of the rules have been “inlined”. – From my point of view, I inserted some extra rules to simplify the grammar which I didn’t intend to transform from beginning. Sorry.)

  • Alternatives (|) are implemented as switch (_lookAhead.tk). Thereby, each case label corresponds to the first terminal(s) (token(s)) to which the left most symbol may expand. (I believe that’s why it is called “look-ahead parser” – decisions to apply rules are always done based on the look ahead token.) The dragon book has a subject about FIRST-FOLLOW sets which explains this in more detail.

  • For terminal symbols, Parser::scan() is called. In special cases, it is replaced by Parser::match() if precisely one terminal (token) is expected.

  • For non-terminal symbols, the call of the corresponding function is done.

  • Symbol sequences are simply done as sequences of the above calls.

The error-handling of this parser is the most simple I ever did. It could/should be done much more supporting (investing more effort, i.e. additional lines of code). (…but here I tried to keep it minimal…)

Put Together

For testing and demonstration, I prepared a main() function with some built-in samples (source code of program and data to process):

// a sample application

using namespace std;

int main()
{
  // the program:
  const char *sourceCode =
    "1 + 2 * 3 / 4 - 5;\n"
    "a + b;\n"
    "a - b;\n"
    "a * b;\n"
    "a / b;\n"
    "a * (b + 3);\n";
  // the variables
  const char *vars[] = { "a", "b" };
  enum { nVars = sizeof vars / sizeof *vars };
  // the data
  const double data[][nVars] = {
    { 4.0, 2.0 },
    { 2.0, 4.0 },
    { 10.0, 5.0 },
    { 42, 6 * 7 }
  };
  // compile program
  stringstream in(sourceCode);
  VarTable varTable;
  vector<AST::Expr*> program = Parser::parse(in, varTable);
  if (program.empty()) {
    cerr << "ERROR: Compile failed!" << endl;
    string line;
    if (getline(in, line)) {
      cerr << "Text at error: '" << line << "'" << endl;
    }
    return 1;
  }
  // apply program to the data
  enum { nDataSets = sizeof data / sizeof *data };
  for (size_t i = 0; i < nDataSets; ++i) {
    const char *sep = "";
    cout << "Data Set:" << endl;
    for (size_t j = 0; j < nVars; ++j, sep = ", ") {
      cout << sep << vars[j] << ": " << data[i][j];
    }
    cout << endl;
    // load data
    for (size_t j = 0; j < nVars; ++j) varTable[vars[j]].set(data[i][j]);
    // perform program
    cout << "Compute:" << endl;
    istringstream in(sourceCode);
    for (const AST::Expr *pExpr : program) {
      string line; getline(in, line);
      cout << line << ": " << pExpr->solve() << endl;
    }
    cout << endl;
  }
  // clear the program
  for (AST::Expr *pExpr : program) delete pExpr;
  program.clear();
  // done
  return 0;  
}

I compiled and tested on VS2013 (Windows 10) and got:

Data Set:
a: 4, b: 2
Compute:
1 + 2 * 3 / 4 - 5;: -2.5
a + b;: 6
a - b;: 2
a * b;: 8
a / b;: 2
a * (b + 3);: 20

Data Set:
a: 2, b: 4
Compute:
1 + 2 * 3 / 4 - 5;: -2.5
a + b;: 6
a - b;: -2
a * b;: 8
a / b;: 0.5
a * (b + 3);: 14

Data Set:
a: 10, b: 5
Compute:
1 + 2 * 3 / 4 - 5;: -2.5
a + b;: 15
a - b;: 5
a * b;: 50
a / b;: 2
a * (b + 3);: 80

Data Set:
a: 42, b: 42
Compute:
1 + 2 * 3 / 4 - 5;: -2.5
a + b;: 84
a - b;: 0
a * b;: 1764
a / b;: 1
a * (b + 3);: 1890

Please, consider that the parser itself ignores any spaces and line-breaks. However, to make the sample output formatting simple, I have to use one semicolon-terminated expression per line. Otherwise, it will be difficult to associate the source code lines with the corresponding compiled expressions.
(Remember my note above about the Token to which a source code reference (aka file position) might be added.)

The Complete Sample

…with source code and test run can be found on ideone.

Leave a Comment