How can I make my split work on only one real line and skip quoted parts of the string?

Update: By way of ‘thank you’ for awarding the bonus, I went and implemented 4 features that I had initially skipped as “You Ain’t Gonna Need It”.

  1. now supports partially quoted columns

    This is the problem you reported: e.g. with , as the delimiter, only test,"one,two",three would be valid, not test,one","two","three. Now both are accepted.

  2. now supports custom delimiter expressions

    Previously you could only specify single characters as delimiters. Now you can specify any Spirit Qi parser expression as the delimiter rule. E.g.:

      splitInto(input, output, ' ');                   // single space
      splitInto(input, output, +qi::lit(' '));         // one or more spaces
      splitInto(input, output, +qi::char_(" \t"));     // one or more spaces or tabs
      splitInto(input, output, qi::double_ >> !qi::lit('#')); // -- any parse expression
    

    Note: this changes behaviour for the default overload.

    The old version treated repeated spaces as a single delimiter by default. You now have to explicitly specify that (2nd example) if you want it.

  3. now supports quotes ("") inside quoted values (instead of just making them disappear)

    See the code sample. Quite simple of course. Note that the sequence "" outside a quoted construct still represents the empty string (for compatibility with e.g. existing CSV output formats which quote empty strings redundantly).

  4. support boost ranges in addition to containers as input (e.g. char[])

    Well, you ain’t gonna need it, but it was rather handy for me to be able to just write splitInto("a char array", ...) 🙂

As I had half expected, you were gonna need partially quoted fields (see your comment¹). Well, here you are (the bottleneck was getting it to work consistently across different versions of Boost).
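To make the quoting rules above concrete, here is a minimal standard-library sketch of the same semantics for one line and a single-character delimiter. It is not the Spirit grammar from this answer: splitQuotedLine and demoSplitQuotedLine are hypothetical names invented for illustration, and throwing on an unterminated quote is my assumption, standing in for the grammar's expectation failure.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of splitInto): splits ONE line on a
// single-character delimiter, following the quoting rules listed above.
std::vector<std::string> splitQuotedLine(const std::string& line, char delim)
{
    std::vector<std::string> fields(1); // a line always has at least one field
    bool inQuotes = false;

    for (std::string::size_type i = 0; i < line.size(); ++i)
    {
        const char c = line[i];
        if (inQuotes)
        {
            if (c != '"')
                fields.back() += c;          // delimiters are literal inside quotes
            else if (i + 1 < line.size() && line[i + 1] == '"')
            {
                fields.back() += '"';        // "" inside quotes is one literal quote
                ++i;
            }
            else
                inQuotes = false;            // closing quote
        }
        else if (c == '"')
            inQuotes = true;                 // a quoted part may start mid-field
        else if (c == delim)
            fields.push_back(std::string()); // start the next field
        else
            fields.back() += c;
    }

    if (inQuotes) // assumption: report this the way the grammar's '>' would
        throw std::runtime_error("unterminated quoted section");
    return fields;
}

// Usage: one check per quoting feature.
void demoSplitQuotedLine()
{
    // a fully quoted column may contain the delimiter
    std::vector<std::string> f = splitQuotedLine("test,\"one,two\",three", ',');
    assert(f.size() == 3 && f[1] == "one,two");

    // feature 1: partially quoted column
    f = splitQuotedLine("partially q\"oute\"d columns", ' ');
    assert(f.size() == 3 && f[1] == "qouted");

    // feature 3: "" inside quotes is a quote, "" on its own is an empty field
    f = splitQuotedLine("say \"\"\"hi\"\"\" \"\"", ' ');
    assert(f.size() == 3 && f[1] == "\"hi\"" && f[2].empty());
}
```

The real grammar does the same job with far less code and also handles multiple lines and arbitrary delimiter expressions; the loop is only meant to spell out what counts as inside and outside a quoted part.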

Introduction

Random notes and observations for the reader:

  • splitInto template function happily supports whatever you throw at it:

    • input from a vector or std::string or std::wstring
    • output to (some combinations are shown in the demo):
      • vector<string> (all lines flattened)
      • vector<vector<string>> (tokens per line)
      • list<list<string>> (if you prefer)
      • set<set<string>> (unique linewise tokensets)
      • … any container you dream up
  • for demo purposes, it shows off Karma output generation (in particular, handling nested containers)
    • note: \n in the output is shown as ? to make it easier to follow (see safechars)
  • it comes complete with handy plumbing for new Spirit users (legible rule naming, a commented-out DEBUG define in case you want to play with things)
  • you can specify any Spirit parse expression to match delimiters. This means that by passing +qi::lit(' ') instead of the default (' '), a run of repeated delimiters counts as a single delimiter, so you skip the empty fields in between
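
The last bullet is easy to misread, so here is a rough standard-library sketch of the difference between a single-character delimiter and a "one or more" delimiter, writing into any container that offers insert(position, value). It only mimics what the grammar does with ' ' versus +qi::lit(' '); splitOn, its collapseRuns flag and demoSplitOn are made-up names, not part of splitInto, and quoting is left out on purpose.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: splits one line on delim into any container that
// supports insert(position, value), e.g. vector, list or set.
template <typename Container>
void splitOn(const std::string& line, char delim, bool collapseRuns, Container& out)
{
    std::string field;
    for (std::string::size_type i = 0; i < line.size(); ++i)
    {
        if (line[i] != delim)
        {
            field += line[i];
            continue;
        }
        if (collapseRuns) // treat the whole run of delimiters as one separator
            while (i + 1 < line.size() && line[i + 1] == delim)
                ++i;
        out.insert(out.end(), field);
        field.clear();
    }
    out.insert(out.end(), field); // the last field has no trailing delimiter
}

void demoSplitOn()
{
    std::vector<std::string> single, runs;
    splitOn("a  b", ' ', false, single); // like the default ' '
    splitOn("a  b", ' ', true, runs);    // like +qi::lit(' ')
    assert(single.size() == 3 && single[1].empty()); // "a", "", "b"
    assert(runs.size() == 2 && runs[1] == "b");      // "a", "b"

    std::set<std::string> unique;        // a set keeps each token only once
    splitOn("b a b", ' ', false, unique);
    assert(unique.size() == 2);
}
```

Note that a leading or trailing run of delimiters still produces an empty first or last field in this sketch; collapsing only merges delimiters that sit next to each other.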

Versions required/tested

This was compiled using

  • gcc 4.4.5,
  • gcc 4.5.1 and
  • gcc 4.6.1.

It works (tested) against

  • boost 1.42.0 (possibly earlier versions too) all the way through
  • boost 1.47.0.

Note: The flattening of output containers only seems to work with Spirit V2.5 (boost 1.47.0).

(This might be something as simple as needing an extra include for older versions.)

The Code!

//#define BOOST_SPIRIT_DEBUG
#define BOOST_SPIRIT_DEBUG_PRINT_SOME 80

// YAGNI #4 - support boost ranges in addition to containers as input (e.g. char[])
#define SUPPORT_BOOST_RANGE // our own define for splitInto
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/karma.hpp>
#include <boost/spirit/include/phoenix.hpp> // for pre 1.47.0 boost only
#include <boost/spirit/version.hpp>
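#include <boost/range.hpp> // boost::begin/end and range_const_iterator for SUPPORT_BOOST_RANGE
#include <iostream>        // std::cout, std::cerr
#include <list>
#include <set>
#include <sstream>
#include <string>
#include <vector>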

namespace /*anon*/
{
    namespace phx=boost::phoenix;
    namespace qi =boost::spirit::qi;
    namespace karma=boost::spirit::karma;

    template <typename Iterator, typename Output> 
        struct my_grammar : qi::grammar<Iterator, Output()>
    {
        typedef qi::rule<Iterator> delim_t;

        // alternative signature: my_grammar(delim_t const& _delim)
        my_grammar(delim_t _delim)
            : my_grammar::base_type(rule, "quoted_delimited"), // base class is always initialized first
              delim(_delim)
        {
            using namespace qi;

            noquote = char_ - '"';
            plain   = +((!delim) >> (noquote - eol));
            quoted  = lit('"') > *(noquote | '"' >> char_('"')) > '"';

#if SPIRIT_VERSION >= 0x2050 // boost 1.47.0
            mixed   = *(quoted|plain);
#else
            // manual folding
            mixed   = *( (quoted|plain) [_a << _1]) [_val=_a.str()];
#endif

            // you gotta love simple truths:
            rule    = mixed % delim % eol;

            BOOST_SPIRIT_DEBUG_NODE(rule);
            BOOST_SPIRIT_DEBUG_NODE(plain);
            BOOST_SPIRIT_DEBUG_NODE(quoted);
            BOOST_SPIRIT_DEBUG_NODE(noquote);
            BOOST_SPIRIT_DEBUG_NODE(delim);
        }

      private:
        qi::rule<Iterator>                  delim;
        qi::rule<Iterator, char()>          noquote;
#if SPIRIT_VERSION >= 0x2050 // boost 1.47.0
        qi::rule<Iterator, std::string()>   plain, quoted, mixed;
#else
        qi::rule<Iterator, std::string()>   plain, quoted;
        qi::rule<Iterator, std::string(), qi::locals<std::ostringstream> > mixed;
#endif
        qi::rule<Iterator, Output()> rule;
    };
}

template <typename Input, typename Container, typename Delim>
    bool splitInto(const Input& input, Container& result, Delim delim)
{
#ifdef SUPPORT_BOOST_RANGE
    typedef typename boost::range_const_iterator<Input>::type It;
    It first(boost::begin(input)), last(boost::end(input));
#else
    typedef typename Input::const_iterator It;
    It first(input.begin()), last(input.end());
#endif

    try
    {
        my_grammar<It, Container> parser(delim);

        bool r = qi::parse(first, last, parser, result);

        r = r && (first == last);

        if (!r)
            std::cerr << "parsing failed at: \"" << std::string(first, last) << "\"\n";
        return r;
    }
    catch (const qi::expectation_failure<It>& e)
    {
        std::cerr << "FIXME: expected " << e.what_ << ", got '";
        std::cerr << std::string(e.first, e.last) << "'" << std::endl;
        return false;
    }
}

template <typename Input, typename Container>
    bool splitInto(const Input& input, Container& result)
{
    return splitInto(input, result, ' '); // default space delimited
}


/********************************************************************
 * replaces '\n' character by '?' so that the demo output is more   *
 * comprehensible (see when a \n was parsed and when one was output *
 * deliberately)                                                    *
 ********************************************************************/
void safechars(char& ch)
{
    switch (ch) { case '\r': case '\n': ch = '?'; break; } // '?' is a char; "?" would not compile
}

int main()
{
    using namespace karma; // demo output generators only :)
    std::string input;

#if SPIRIT_VERSION >= 0x2050 // boost 1.47.0
    // sample invocation: simple vector of elements in order - flattened across lines
    std::vector<std::string> flattened;

    input = "actually on\ntwo lines";
    if (splitInto(input, flattened))
        std::cout << format(*char_[safechars] % '|', flattened) << std::endl;
#endif
    std::list<std::set<std::string> > linewise, custom;

    // YAGNI #1 - now supports partially quoted columns
    input = "partially q\"oute\"d columns";
    if (splitInto(input, linewise))
        std::cout << format(( "set[" << ("'" << *char_[safechars] << "'") % ", " << "]") % '\n', linewise) << std::endl;

    // YAGNI #2 - now supports custom delimiter expressions
    input="custom delimiters: 1997-03-14 10:13am"; 
    if (splitInto(input, custom, +qi::char_("- 0-9:"))
     && splitInto(input, custom, +(qi::char_ - qi::char_("0-9"))))
        std::cout << format(( "set[" << ("'" << *char_[safechars] << "'") % ", " << "]") % '\n', custom) << std::endl;

    // YAGNI #3 - now supports quotes ("") inside quoted values (instead of just making them disappear)
    input = "would like ne\"\"sted \"quotes like \"\"\n\"\" that\"";
    custom.clear();
    if (splitInto(input, custom, qi::char_("() ")))
        std::cout << format(( "set[" << ("'" << *char_[safechars] << "'") % ", " << "]") % '\n', custom) << std::endl;

    return 0;
}

The Output

Output from the sample as shown:

actually|on|two|lines
set['columns', 'partially', 'qouted']
set['am', 'custom', 'delimiters']
set['', '03', '10', '13', '14', '1997']
set['like', 'nested', 'quotes like "?" that', 'would']

Update: output for your previously failing test case:

--server=127.0.0.1:4774/|--username=robota|--userdescr=robot A ? I am cool robot ||--robot|>|echo.txt

¹ I must admit I had a good laugh when reading that ‘it crashed’ [sic]. That sounds a lot like my end-users. Just to be precise: a crash is an unrecoverable application failure. What you ran into was a handled error, and was nothing more than ‘unexpected behavior’ from your point of view. Anyways, that’s fixed now 🙂
