How to provider user with autocomplete suggestions for given boost::spirit grammar?

Out of curiosity, I decided to try the concept.

Here’s my attempt.

Plan

Let’s parse arithmetic expressions with function calls.

Now, we want to parse (partial) expression with possible unknown identifiers.

In case of incomplete expressions, we want to “imply” the minimal token to complete it (and potentially continue parsing).

In case of unknown identifiers, we want to fuzzy-match the known identifiers in the domain (either variables or functions) and rank them in order of decreasing probability.


Base Definitions

Let’s start out by deciding we want our input to be in memory, so we can refer to locations/substrings by using string_views:

#include <boost/utility/string_view.hpp>

using Source = boost::string_view;
using Location = Source::const_iterator;

Completion Hints

Besides the AST, we want our parser to generate completion hints (the implicit completion tokens and suggestions)

namespace Completion {

    using Candidates = std::vector<std::string>;

    class Hints {
        struct ByLocation {
            template <typename T, typename U>
            bool operator()(T const& a, U const& b) const { return loc(a) < loc(b); }
          private:
            static Location loc(Source const& s) { return s.begin(); }
            static Location loc(Location const& l) { return l; }
        };

      public:
        std::map<Location, std::string, ByLocation> incomplete;
        std::map<Source, Candidates, ByLocation> suggestions;

        /*explicit*/ operator bool() const { return incomplete.size() || suggestions.size(); }
    };

In addition, let’s code up a quick & dirty fuzzy identifier match scoring function.

I opted for a simple synchronizing compare that

  • scores corresponding runs of characters progressively, and
  • favours skipping characters from candidates over skipping characters from the input (meaning that adj_diff is a good fuzzy match for adjacent_difference even though characters have been skipped from the candidate, but adj_qqq_diff is worse because the qqq from the input could not be matched)
  • the algorithm is done recursively and without allocations
  • first characters get a boost if matched (rate=1 at first invocation)
    static int fuzzy_match(Source input, boost::string_view candidate, int rate = 1) { // start with first-letter boost
        int score = 0;

        while (!(input.empty() || candidate.empty())) {
            if (input.front() != candidate.front()) {
                return score + std::max(
                    fuzzy_match(input.substr(1), candidate, std::max(rate-2,0)), //penalty for ignoring an input char
                    fuzzy_match(input, candidate.substr(1), std::max(rate-1,0)));
            }

            input.remove_prefix(1);
            candidate.remove_prefix(1);
            score += ++rate;
        }
        return score;
    }

} // namespace Completion

We’ll see how this is used in the grammar.

AST

A run-of-the-mill AST that can do binary expressions, string/numeric literals, variables and (nested) function calls:

#include <boost/variant.hpp>

namespace Ast {
    using NumLiteral = double;
    using StringLiteral = std::string; // de-escaped, not source view

    struct Identifier : std::string {
        using std::string::string;
        using std::string::operator=;
    };

    struct BinaryExpression;
    struct CallExpression;

    using Expression = boost::variant<
            NumLiteral,
            StringLiteral,
            Identifier,
            boost::recursive_wrapper<BinaryExpression>,
            boost::recursive_wrapper<CallExpression>
        >;

    struct BinaryExpression {
        Expression lhs;
        char op;
        Expression rhs;
    };

    using ArgList = std::vector<Expression>;

    struct CallExpression {
        Identifier function;
        ArgList args;
    };
}

Grammar


Basics

The grammar starts off pretty “basic” as well:

namespace Parsing {
    namespace qi  = boost::spirit::qi;
    namespace phx = boost::phoenix;

    template <typename It>
    struct Expression : qi::grammar<It, Ast::Expression()> {
        Expression(Completion::Hints& hints) : Expression::base_type(start), _hints(hints) {
            using namespace qi;

            start      = skip(space) [expression];

            expression = term   [_val = _1] >> *(char_("-+") >> expression) [_val = make_binary(_val, _1, _2)];
            term       = factor [_val = _1] >> *(char_("*/") >> term)       [_val = make_binary(_val, _1, _2)];
            factor     = simple [_val = _1] >> *(char_("^")  >> factor)     [_val = make_binary(_val, _1, _2)];

            simple     = call | variable | compound | number | string;

So far so good: The constructor stores a reference to the Completion::Hints& to be recorded. All these rules have been declared as qi::rule<It, Expression(), qi::space_type>.

Implied Tokens

Now it gets slightly more interesting, some rules here are lexemes¹

            number     = double_;
            identifier = raw[(alpha|'_') >> *(alnum|'_')];

And some productions are tolerant of missing closing tokens:

            // imply the closing quotes
            string   %= '"' >> *('\\' >> char_ | ~char_('"')) >> implied('"');
            compound %= '(' >> expression >> implied(')');

The implied magic makes it so that if the expected closing token is missing, it will be recorded as a hint, and parsing continues:

            auto implied = [=](char ch) {
                return copy(omit[lit(ch) | raw[eps][imply(_1, ch)]]);
            };

Of course, this begs the question what imply(_1, ch) really does, and we’ll see later.

For now, observe that we do raw[eps] to get an (empty) source iterator_range to construct a Location from in the semantic action.

Identifier Lookup

We store “symbol tables” in Domain members _variables and _functions:

        using Domain = qi::symbols<char>;
        Domain _variables, _functions;

Then we declare some rules that can do lookups on either of them:

        // domain identifier lookups
        qi::_r1_type _domain;
        qi::rule<It, Ast::Identifier(Domain const&)> maybe_known, known, unknown;

The corresponding declarations will be shown shortly.

Variables are pretty simple:

        variable   = maybe_known(phx::ref(_variables));

Calls are trickier. If a name is unknown we don’t want to assume it implies a function unless it’s followed by a '(' character. However, if an identifier is a known function name, we want even to imply the ( (this gives the UX the appearance of autocompletion where when the user types sqrt, it suggests the next character to be ( magically).

        // The heuristics:
        // - an unknown identifier followed by (
        // - an unclosed argument list implies )
        call %= ( known(phx::ref(_functions)) // known -> imply the parens
                     | &(identifier >> '(') >> unknown(phx::ref(_functions))
                     ) >> implied('(') >> -(expression % ',') >> implied(')');

It all builds on known, unknown and maybe_known:

            ///////////////////////////////
            // identifier loopkup, suggesting
            {
                maybe_known = known(_domain) | unknown(_domain);

                // distinct to avoid partially-matching identifiers
                using boost::spirit::repository::qi::distinct;
                auto kw     = distinct(copy(alnum | '_'));

                known       = raw[kw[lazy(_domain)]];
                unknown     = raw[identifier[_val=_1]] [suggest_for(_1, _domain)];
            }

Debug, Predefined Variables/Functions

Remaining is a bit of red tape:

            BOOST_SPIRIT_DEBUG_NODES(
                (start)
                (expression)(term)(factor)(simple)(compound)
                (call)(variable)
                (identifier)(number)(string)
                //(maybe_known)(known)(unknown) // qi::symbols<> non-streamable
            )

            _variables += "foo", "bar", "qux";
            _functions += "print", "sin", "tan", "sqrt", "frobnicate";
        }

      private:
        Completion::Hints& _hints;

        using Domain = qi::symbols<char>;
        Domain _variables, _functions;

        qi::rule<It, Ast::Expression()> start;
        qi::rule<It, Ast::Expression(), qi::space_type> expression, term, factor, simple;
        // completables
        qi::rule<It, Ast::Expression(), qi::space_type> compound;
        qi::rule<It, Ast::CallExpression(), qi::space_type> call;

        // implicit lexemes
        qi::rule<It, Ast::Identifier()> variable, identifier;
        qi::rule<It, Ast::NumLiteral()> number;
        qi::rule<It, Ast::StringLiteral()> string;

        // domain identifier lookups
        qi::_r1_type _domain;
        qi::rule<It, Ast::Identifier(Domain const&)> maybe_known, known, unknown;

Phoenix Helpers

Let’s start with the usual helper to construct binary AST nodes:

        ///////////////////////////////
        // binary expression factory
        struct make_binary_f {
            Ast::BinaryExpression operator()(Ast::Expression const& lhs, char op, Ast::Expression const& rhs) const {
                return {lhs, op, rhs};
            }
        };
        boost::phoenix::function<make_binary_f> make_binary;

Similarly, we can have a Phoenix function object to register implied chars:

        ///////////////////////////////
        // auto-completing incomplete expressions
        struct imply_f {
            Completion::Hints& _hints;
            void operator()(boost::iterator_range<It> where, char implied_char) const {
                auto inserted = 
                    _hints.incomplete.emplace(&*where.begin(), std::string(1, implied_char));
                // add the implied char to existing completion
                if (!inserted.second)
                    inserted.first->second += implied_char;
            }
        };
        boost::phoenix::function<imply_f> imply { imply_f { _hints } };

Note that it orders by location (which makes UX easier) and detects when a previous implied token lived at the same location, in which case we simply append to it.

By now it won’t be much of a surprise that generating relevant suggestions for unknown identifiers is also a phoenix actor:

        ///////////////////////////////
        // suggest_for
        struct suggester {
            Completion::Hints& _hints;

            void operator()(boost::iterator_range<It> where, Domain const& symbols) const {
                using namespace Completion;
                Source id(&*where.begin(), where.size());
                Candidates c;

                symbols.for_each([&](std::string const& k, ...) { c.push_back(k); });

                auto score = [id](Source v) { return fuzzy_match(id, v); };
                auto byscore = [=](Source a, Source b) { return score(a) > score(b); };

                sort(c.begin(), c.end(), byscore);
                c.erase( remove_if(c.begin(), c.end(), [=](Source s) { return score(s) < 3; }), c.end());

                _hints.suggestions.emplace(id, c);
            }
        };
        boost::phoenix::function<suggester> suggest_for {suggester{_hints}};

That’s all. If it looks more complicated, that’s because it does a lot more: it scores all candidates fuzzily, orders them by descending score and removes any candidates with score < 3.

    };
}

BONUS

Let’s alter things a little more and allow ',' to be implied inside argument lists:

        call %= ( known(phx::ref(_functions)) // known -> imply the parens
                | &(identifier >> '(') >> unknown(phx::ref(_functions))
                ) 
            >> implied('(') 
            >> -(expression % ',')
            >> implied(')')
            ;

Let’s replace ',' there:

            >> -(expression % (',' | !(')'|eoi) >> implied(',')))

NOTE: It might seem to make more sense to detect &expression to assert that an expression follows, instead asserting that the end of the input/argument list has not been reached.

Doing that goes bad, though, because any contained expressions could trigger more implied tokens as a side effect, leading to duplicated implied tokens.

This (side-effectful semantic actions) is one of the chief reasons why I usually avoid semantic actions, or use them to store effect only in the rule’s (exposed) attribute(s). See Boost Spirit: “Semantic actions are evil”?

TEST DRIVER

This post would be nothing without some nice test cases. So here goes:

int main() {
    for (Source const input : {
            "",       // invalid
            "(3",     // incomplete, imply ')'
            "3*(6+sqrt(9))^7 - 1e8", // completely valid
            "(3*(((6+sqrt(9))^7 - 1e8", // incomplete, imply ")))"
            "print(\"hello \\\"world!", // completes the string literal and the parameter list
            "foo",    // okay, known variable
            "baz",    // (suggest bar)
            "baz(",   // incomplete, imply ')', unknown function
            "taz(",   // incomplete, imply ')', unknown function
            "san(",   // 2 suggestions (sin/tan)
            "print(1, 2, \"three\", complicated(san(78",
            "(print sqrt sin 9)    -0) \"bye",
        })
    {
        std::cout << "-------------- '" << input << "'\n";
        Location f = input.begin(), l = input.end();

        Ast::Expression expr;
        Completion::Hints hints;
        bool ok = parse(f, l, Parsing::Expression<Location>{hints}, expr);

        if (hints) {
            std::cout << "Input: '" << input << "'\n";
        }
        for (auto& c : hints.incomplete) {
            std::cout << "Missing " << std::setw(c.first - input.begin()) << "" << "^ implied: '" << c.second << "'\n";
        }
        for (auto& id : hints.suggestions) {
            std::cout << "Unknown " << std::setw(id.first.begin() - input.begin()) << "" << std::string(id.first.size(), '^');
            if (id.second.empty()) 
                std::cout << " (no suggestions)\n";
            else {
                std::cout << " (did you mean ";
                once_t first;
                for (auto& s : id.second)
                    std::cout << (first?"":" or ") << "'" << s << "'";
                std::cout << "?)\n";
            }
        }

        if (ok) {
            std::cout << "AST:    " << expr << "\n";
        } else {
            std::cout << "Parse failed\n";
        }

        if (f!=l)
            std::cout << "Remaining input: '" << std::string(f,l) << "'\n";
    }
}

Note that, besides the first input ("") everything is heuristically parsed as some kind of expression! The last one is designed to show off how all the parameter lists are implied because print, sqrt and sin are known functions. Then some ',' are implied, before finally the unclosed string literal and remaining parentheses are closed. Here’s the (non-debug) output:

-------------- ''
Parse failed
-------------- '(3'
Input: '(3'
Missing   ^ implied: ')'
AST:    3
-------------- '3*(6+sqrt(9))^7 - 1e8'
AST:    ((3 * ((6 + (sqrt (9))) ^ 7)) - 1e+08)
-------------- '(3*(((6+sqrt(9))^7 - 1e8'
Input: '(3*(((6+sqrt(9))^7 - 1e8'
Missing                         ^ implied: ')))'
AST:    (3 * (((6 + (sqrt (9))) ^ 7) - 1e+08))
-------------- 'print("hello \"world!'
Input: 'print("hello \"world!'
Missing                      ^ implied: '")'
AST:    (print (hello "world!))
-------------- 'foo'
AST:    foo
-------------- 'baz'
Input: 'baz'
Unknown ^^^ (did you mean 'bar'?)
AST:    baz
-------------- 'baz('
Input: 'baz('
Missing     ^ implied: ')'
Unknown ^^^ (no suggestions)
AST:    (baz ())
-------------- 'taz('
Input: 'taz('
Missing     ^ implied: ')'
Unknown ^^^ (did you mean 'tan'?)
AST:    (taz ())
-------------- 'san('
Input: 'san('
Missing     ^ implied: ')'
Unknown ^^^ (did you mean 'sin' or 'tan'?)
AST:    (san ())
-------------- 'print(1, 2, "three", complicated(san(78'
Input: 'print(1, 2, "three", complicated(san(78'
Missing                                        ^ implied: ')))'
Unknown                      ^^^^^^^^^^^ (did you mean 'frobnicate' or 'print'?)
Unknown                                  ^^^ (did you mean 'sin' or 'tan'?)
AST:    (print (1, 2, three, (complicated ((san (78))))))
-------------- '(print sqrt sin 9)    -0) "bye'
Input: '(print sqrt sin 9)    -0) "bye'
Missing        ^ implied: '('
Missing             ^ implied: '('
Missing                 ^ implied: '('
Missing                           ^ implied: ','
Missing                               ^ implied: '"))'
AST:    (print ((sqrt (((sin (9)) - 0))), bye))

Full Listing / Live Demo

Live On Coliru

//#define BOOST_SPIRIT_DEBUG
#include <boost/utility/string_view.hpp>

using Source = boost::string_view;
using Location = Source::const_iterator;

#include <map>
#include <vector>

namespace Completion {

    static int fuzzy_match(Source input, boost::string_view candidate, int rate = 1) { // start with first-letter boost
        int score = 0;

        while (!(input.empty() || candidate.empty())) {
            if (input.front() != candidate.front()) {
                return score + std::max(
                    fuzzy_match(input.substr(1), candidate, std::max(rate-2,0)), //penalty for ignoring an input char
                    fuzzy_match(input, candidate.substr(1), std::max(rate-1,0)));
            }

            input.remove_prefix(1);
            candidate.remove_prefix(1);
            score += ++rate;
        }
        return score;
    }

    using Candidates = std::vector<std::string>;

    class Hints {
        struct ByLocation {
            template <typename T, typename U>
            bool operator()(T const& a, U const& b) const { return loc(a) < loc(b); }
          private:
            static Location loc(Source const& s) { return s.begin(); }
            static Location loc(Location const& l) { return l; }
        };

      public:
        std::map<Location, std::string, ByLocation> incomplete;
        std::map<Source, Candidates, ByLocation> suggestions;

        /*explicit*/ operator bool() const { return incomplete.size() || suggestions.size(); }
    };
}

#include <boost/variant.hpp>

namespace Ast {
    using NumLiteral = double;
    using StringLiteral = std::string; // de-escaped, not source view

    struct Identifier : std::string {
        using std::string::string;
        using std::string::operator=;
    };

    struct BinaryExpression;
    struct CallExpression;

    using Expression = boost::variant<
            NumLiteral,
            StringLiteral,
            Identifier,
            boost::recursive_wrapper<BinaryExpression>,
            boost::recursive_wrapper<CallExpression>
        >;

    struct BinaryExpression {
        Expression lhs;
        char op;
        Expression rhs;
    };

    using ArgList = std::vector<Expression>;

    struct CallExpression {
        Identifier function;
        ArgList args;
    };
}

#include <boost/fusion/adapted/struct.hpp>
BOOST_FUSION_ADAPT_STRUCT(Ast::BinaryExpression, lhs, op, rhs)
BOOST_FUSION_ADAPT_STRUCT(Ast::CallExpression, function, args)

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/spirit/repository/include/qi_distinct.hpp>

// for debug printing:
namespace {
    struct once_t { // an auto-reset flag
        operator bool() { bool v = flag; flag = false; return v; }
        bool flag = true;
    };
}

// for debug printing:
namespace Ast {

    static inline std::ostream& operator<<(std::ostream& os, std::vector<Expression> const& args) {
        os << "(";
        once_t first;
        for (auto& a : args) os << (first?"":", ") << a;
        return os << ")";
    }

    static inline std::ostream& operator<<(std::ostream& os, BinaryExpression const& e) {
        return os << boost::fusion::as_vector(e);
    }
    static inline std::ostream& operator<<(std::ostream& os, CallExpression const& e) {
        return os << boost::fusion::as_vector(e);
    }
}

namespace Parsing {
    namespace qi  = boost::spirit::qi;
    namespace phx = boost::phoenix;

    template <typename It>
    struct Expression : qi::grammar<It, Ast::Expression()> {
        Expression(Completion::Hints& hints) : Expression::base_type(start), _hints(hints) {
            using namespace qi;

            start      = skip(space) [expression];

            expression = term   [_val = _1] >> *(char_("-+") >> expression) [_val = make_binary(_val, _1, _2)];
            term       = factor [_val = _1] >> *(char_("*/") >> term)       [_val = make_binary(_val, _1, _2)];
            factor     = simple [_val = _1] >> *(char_("^")  >> factor)     [_val = make_binary(_val, _1, _2)];

            simple     = call | variable | compound | number | string;

            auto implied = [=](char ch) {
                return copy(omit[lit(ch) | raw[eps][imply(_1, ch)]]);
            };

            variable   = maybe_known(phx::ref(_variables));

            compound  %= '(' >> expression >> implied(')');

            // The heuristics:
            // - an unknown identifier followed by (
            // - an unclosed argument list implies )
            call %= ( known(phx::ref(_functions)) // known -> imply the parens
                    | &(identifier >> '(') >> unknown(phx::ref(_functions))
                    ) 
                >> implied('(') 
                >> -(expression % (',' | !(')'|eoi) >> implied(',')))
                >> implied(')')
                ;

            // lexemes, primitive rules
            identifier  = raw[(alpha|'_') >> *(alnum|'_')];

            // imply the closing quotes
            string     %= '"' >> *('\\' >> char_ | ~char_('"')) >> implied('"'); // TODO more escapes

            number      = double_; // TODO integral arguments

            ///////////////////////////////
            // identifier loopkup, suggesting
            {
                maybe_known = known(_domain) | unknown(_domain);

                // distinct to avoid partially-matching identifiers
                using boost::spirit::repository::qi::distinct;
                auto kw     = distinct(copy(alnum | '_'));

                known       = raw[kw[lazy(_domain)]];
                unknown     = raw[identifier[_val=_1]] [suggest_for(_1, _domain)];
            }

            BOOST_SPIRIT_DEBUG_NODES(
                (start)
                (expression)(term)(factor)(simple)(compound)
                (call)(variable)
                (identifier)(number)(string)
                //(maybe_known)(known)(unknown) // qi::symbols<> non-streamable
            )

            _variables += "foo", "bar", "qux";
            _functions += "print", "sin", "tan", "sqrt", "frobnicate";
        }

      private:
        Completion::Hints& _hints;

        using Domain = qi::symbols<char>;
        Domain _variables, _functions;

        qi::rule<It, Ast::Expression()> start;
        qi::rule<It, Ast::Expression(), qi::space_type> expression, term, factor, simple;
        // completables
        qi::rule<It, Ast::Expression(), qi::space_type> compound;
        qi::rule<It, Ast::CallExpression(), qi::space_type> call;

        // implicit lexemes
        qi::rule<It, Ast::Identifier()> variable, identifier;
        qi::rule<It, Ast::NumLiteral()> number;
        qi::rule<It, Ast::StringLiteral()> string;

        // domain identifier lookups
        qi::_r1_type _domain;
        qi::rule<It, Ast::Identifier(Domain const&)> maybe_known, known, unknown;

        ///////////////////////////////
        // binary expression factory
        struct make_binary_f {
            Ast::BinaryExpression operator()(Ast::Expression const& lhs, char op, Ast::Expression const& rhs) const {
                return {lhs, op, rhs};
            }
        };
        boost::phoenix::function<make_binary_f> make_binary;

        ///////////////////////////////
        // auto-completing incomplete expressions
        struct imply_f {
            Completion::Hints& _hints;
            void operator()(boost::iterator_range<It> where, char implied_char) const {
                auto inserted = 
                    _hints.incomplete.emplace(&*where.begin(), std::string(1, implied_char));
                // add the implied char to existing completion
                if (!inserted.second)
                    inserted.first->second += implied_char;
            }
        };
        boost::phoenix::function<imply_f> imply { imply_f { _hints } };

        ///////////////////////////////
        // suggest_for
        struct suggester {
            Completion::Hints& _hints;

            void operator()(boost::iterator_range<It> where, Domain const& symbols) const {
                using namespace Completion;
                Source id(&*where.begin(), where.size());
                Candidates c;

                symbols.for_each([&](std::string const& k, ...) { c.push_back(k); });

                auto score = [id](Source v) { return fuzzy_match(id, v); };
                auto byscore = [=](Source a, Source b) { return score(a) > score(b); };

                sort(c.begin(), c.end(), byscore);
                c.erase( remove_if(c.begin(), c.end(), [=](Source s) { return score(s) < 3; }), c.end());

                _hints.suggestions.emplace(id, c);
            }
        };
        boost::phoenix::function<suggester> suggest_for {suggester{_hints}};

    };
}

#include <iostream>
#include <iomanip>

int main() {
    for (Source const input : {
            "",       // invalid
            "(3",     // incomplete, imply ')'
            "3*(6+sqrt(9))^7 - 1e8", // completely valid
            "(3*(((6+sqrt(9))^7 - 1e8", // incomplete, imply ")))"
            "print(\"hello \\\"world!", // completes the string literal and the parameter list
            "foo",    // okay, known variable
            "baz",    // (suggest bar)
            "baz(",   // incomplete, imply ')', unknown function
            "taz(",   // incomplete, imply ')', unknown function
            "san(",   // 2 suggestions (sin/tan)
            "print(1, 2, \"three\", complicated(san(78",
            "(print sqrt sin 9)    -0) \"bye",
        })
    {
        std::cout << "-------------- '" << input << "'\n";
        Location f = input.begin(), l = input.end();

        Ast::Expression expr;
        Completion::Hints hints;
        bool ok = parse(f, l, Parsing::Expression<Location>{hints}, expr);

        if (hints) {
            std::cout << "Input: '" << input << "'\n";
        }
        for (auto& c : hints.incomplete) {
            std::cout << "Missing " << std::setw(c.first - input.begin()) << "" << "^ implied: '" << c.second << "'\n";
        }
        for (auto& id : hints.suggestions) {
            std::cout << "Unknown " << std::setw(id.first.begin() - input.begin()) << "" << std::string(id.first.size(), '^');
            if (id.second.empty()) 
                std::cout << " (no suggestions)\n";
            else {
                std::cout << " (did you mean ";
                once_t first;
                for (auto& s : id.second)
                    std::cout << (first?"":" or ") << "'" << s << "'";
                std::cout << "?)\n";
            }
        }

        if (ok) {
            std::cout << "AST:    " << expr << "\n";
        } else {
            std::cout << "Parse failed\n";
        }

        if (f!=l)
            std::cout << "Remaining input: '" << std::string(f,l) << "'\n";
    }
}

¹ Boost spirit skipper issues

Leave a Comment