Looking for a clear definition of what a “tokenizer”, a “parser”, and a “lexer” are, how they relate to each other, and how they are used?

A tokenizer breaks a stream of text into tokens, usually by looking for whitespace (tabs, spaces, new lines). A lexer is basically a tokenizer, but it usually attaches extra context to the tokens — this token is a number, that token is a string literal, this other token is an equality operator. A parser takes … Read more
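The distinction can be sketched in a few lines of Python; the token types here (NUMBER, STRING, EQ, NAME) are purely illustrative, not any particular language’s token set:

```python
import re

def tokenize(text):
    """Tokenizer: break the text into tokens on whitespace, no classification."""
    return text.split()

# Hypothetical token types, for illustration only.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),          # this token is a number
    ("STRING", r'"[^"]*"'),      # that token is a string literal
    ("EQ",     r"=="),           # this other token is an equality operator
    ("NAME",   r"[A-Za-z_]\w*"),
    ("SKIP",   r"\s+"),          # whitespace, discarded
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def lex(text):
    """Lexer: like the tokenizer, but attaches a type to each token."""
    return [(m.lastgroup, m.group())
            for m in MASTER.finditer(text) if m.lastgroup != "SKIP"]

print(tokenize('x == 42'))  # ['x', '==', '42']
print(lex('x == 42'))       # [('NAME', 'x'), ('EQ', '=='), ('NUMBER', '42')]
```

A parser would then consume the lexer’s typed token stream rather than raw text.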

How to Traverse an NLTK Tree object?

Maybe I’m overlooking things, but is this what you’re after?

    import nltk

    s = "(ROOT (S (NP (NNP Europe)) (VP (VBZ is) (PP (IN in) (NP (DT the) (JJ same) (NNS trends)))) (. .)))"
    tree = nltk.tree.Tree.fromstring(s)

    def traverse_tree(tree):
        # print("tree:", tree)
        for subtree in tree:
            if isinstance(subtree, nltk.tree.Tree):
                traverse_tree(subtree)

    traverse_tree(tree)

It traverses your tree depth-first.

Handling extra operators in Shunting-yard

Valid expressions can be validated with a regular expression, aside from mismatched parentheses. (Mismatched parentheses will be caught by the shunting-yard algorithm, as indicated on the Wikipedia page, so I’m ignoring those.) The regular expression is as follows:

PRE* OP POST* (INF PRE* OP POST*)*

where:

- PRE is a prefix operator or (
- POST is … Read more
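As a sketch, the pattern above can be translated directly into a Python regex. The operator classes here are made up for illustration (prefix - and (, postfix ! and ), infix + - * /); substitute your own:

```python
import re

# Hypothetical operator classes, for illustration only:
PRE  = r"(?:[-(]\s*)"           # PRE:  a prefix operator or (
POST = r"(?:\s*[!)])"           # POST: a postfix operator or )
OP   = r"(?:\d+|[A-Za-z_]\w*)"  # OP:   an operand (number or name)
INF  = r"(?:\s*[-+*/]\s*)"      # INF:  an infix operator

# PRE* OP POST* (INF PRE* OP POST*)*
EXPR = re.compile(f"^{PRE}*{OP}{POST}*(?:{INF}{PRE}*{OP}{POST}*)*$")

print(bool(EXPR.match("-(3 * x) + 1")))  # True
print(bool(EXPR.match("1 +")))           # False
print(bool(EXPR.match("(1 + 2")))        # True: mismatched parentheses pass,
                                         # which is why shunting-yard must catch them
```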

How would you go about parsing Markdown? [closed]

The only markdown implementation I know of that uses an actual parser is John MacFarlane’s peg-markdown. Its parser is based on a Parsing Expression Grammar parser generator called peg. EDIT: Mauricio Fernandez recently released his Simple Markup Markdown parser, which he wrote as part of his OcsiBlog Weblog Engine. Because the parser is written in … Read more

How does the ANTLR lexer disambiguate its rules (or why does my parser produce “mismatched input” errors)?

In ANTLR, the lexer is isolated from the parser, which means it will split the text into typed tokens according to the lexer grammar rules, and the parser has no influence on this process (it cannot say “give me an INTEGER now” for instance). It produces a token stream by itself. Furthermore, the parser doesn’t … Read more
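This separation can be illustrated with a toy Python stand-in (not ANTLR itself): a standalone lexer assigns token types by its own rules alone, with no parser feedback, using ANTLR’s disambiguation order of longest match first, earliest-defined rule on ties:

```python
import re

# Toy stand-in for an ANTLR-style standalone lexer (not ANTLR itself):
# the longest match wins, and on a tie the rule defined first wins.
# The parser receives this token stream as-is and cannot influence it.
RULES = [
    ("IF",      r"if"),
    ("ID",      r"[a-z]+"),
    ("INTEGER", r"\d+"),
    ("WS",      r"\s+"),
]

def lex(text):
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_text = None, ""
        for name, pattern in RULES:
            m = re.match(pattern, text[pos:])
            if m and len(m.group()) > len(best_text):  # strict '>' keeps the earlier rule on ties
                best_name, best_text = name, m.group()
        if best_name is None:
            raise SyntaxError(f"no lexer rule matches at position {pos}")
        if best_name != "WS":
            tokens.append((best_name, best_text))
        pos += len(best_text)
    return tokens

# 'if' lexes as IF (tie, first rule wins); 'iffy' lexes as ID (longer match wins):
print(lex("if iffy 42"))  # [('IF', 'if'), ('ID', 'iffy'), ('INTEGER', '42')]
```

If the parser expects an ID where the lexer has already emitted an IF, the result is exactly the “mismatched input” style of error the question describes.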

Parsing command line arguments in R scripts

There are three packages on CRAN:

- getopt: C-like getopt behavior
- optparse: a command line parser inspired by Python’s optparse library
- argparse: a command line optional and positional argument parser (inspired by Python’s argparse library). This package requires that a Python interpreter be installed with the argparse and json (or simplejson) modules.

Update: docopt: lets you … Read more
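For orientation, this is a minimal sketch of the Python argparse interface that the R argparse package mirrors; the infile and --count arguments are made up for the example:

```python
import argparse

# Minimal sketch of Python's argparse, the model for the R package of the
# same name. The 'infile' and '--count' arguments are hypothetical.
parser = argparse.ArgumentParser(description="toy example")
parser.add_argument("infile", help="a positional argument")
parser.add_argument("-n", "--count", type=int, default=1, help="an optional argument")

args = parser.parse_args(["data.csv", "-n", "3"])
print(args.infile, args.count)  # data.csv 3
```

The R package exposes the same add_argument/parse_args style, which is why it needs a Python interpreter underneath.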