Natural Language Processing in Ruby [closed]

Three excellent and mature NLP packages are Stanford CoreNLP, OpenNLP and LingPipe. There are Ruby bindings to the Stanford CoreNLP tools (GPL license) as well as the OpenNLP tools (Apache License).

On the more experimental side of things, I maintain a Text Retrieval, Extraction and Annotation Toolkit (Treat), released under the GPL, that provides a common API for almost every NLP-related gem that exists for Ruby. The following list of Treat's features also serves as a reference for stable natural language processing gems compatible with Ruby 1.9.

  • Text segmenters and tokenizers (punkt-segmenter, tactful_tokenizer, srx-english, scalpel).
  • Natural language parsers for English, French and German and named entity extraction for English (stanford-core-nlp).
  • Word inflection and conjugation (linguistics), stemming (ruby-stemmer, uea-stemmer, lingua, etc.).
  • WordNet interface (rwordnet), POS taggers (rbtagger, engtagger, etc.).
  • Language detection (whatlanguage), date/time extraction (chronic, kronic, nickel), keyword extraction (lda-ruby).
  • Text retrieval with indexation and full-text search (ferret).
  • Basic machine learning with decision trees (decisiontree), MLPs (ruby-fann), SVMs (rb-libsvm) and linear classification (tomz-liblinear-ruby-swig).
  • Text similarity metrics (levenshtein-ffi, fuzzy-string-match, tf-idf-similarity).
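
To give a feel for what the edit-distance gems in the last item compute, here is a minimal pure-Ruby sketch of Levenshtein distance using the usual dynamic-programming recurrence. The helper name levenshtein_distance is made up for illustration and is not the API of levenshtein-ffi or any other gem listed here; the gems are the better choice for real workloads, since they are typically backed by native code.

```ruby
# Levenshtein edit distance via dynamic programming, keeping one row at a time.
# NOTE: `levenshtein_distance` is a hypothetical helper for illustration only;
# it is not the interface of any gem mentioned in this answer.
def levenshtein_distance(source, target)
  return target.length if source.empty?
  return source.length if target.empty?

  # previous_row[j] is the distance from an empty source prefix to target[0, j].
  previous_row = (0..target.length).to_a

  source.each_char.with_index(1) do |source_char, i|
    # Turning the first i characters of source into "" takes i deletions.
    current_row = [i]

    target.each_char.with_index(1) do |target_char, j|
      substitution_cost = source_char == target_char ? 0 : 1
      current_row << [
        previous_row[j] + 1,                     # delete source_char
        current_row[j - 1] + 1,                  # insert target_char
        previous_row[j - 1] + substitution_cost  # substitute or match
      ].min
    end

    previous_row = current_row
  end

  previous_row.last
end

puts levenshtein_distance("kitten", "sitting") # prints 3
```

Keeping only the previous row holds memory at O(length of target) instead of a full matrix, which is usually enough when only the distance, not the edit script, is needed.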

Not included in Treat, but relevant to NLP: hotwater (string distance algorithms), yomu (bindings to Apache Tika for reading .doc, .docx, .pages, .odt, .rtf, .pdf), graph-rank (an implementation of GraphRank).
