Nokogiri, open-uri, and Unicode Characters

Summary: When feeding UTF-8 to Nokogiri through open-uri, use open(…).read and pass the resulting string to Nokogiri. Analysis: If I fetch the page using curl, the headers properly show Content-Type: text/html; charset=UTF-8 and the file content includes valid UTF-8, e.g. “Genealogía de Jesucristo”. But even with a magic comment on the Ruby file and setting …
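A minimal sketch of the summary above, assuming the page really is UTF-8 as its Content-Type header states. The helper name read_as_utf8 is hypothetical, and a StringIO stands in for the handle open-uri would return so the example runs without network access.

```ruby
require 'stringio'

# Hypothetical helper: read the whole body into a String and label it UTF-8,
# so the parser receives a plain string instead of an IO handle.
def read_as_utf8(io)
  body = io.read.dup.force_encoding(Encoding::UTF_8)
  # Fail loudly if the bytes are not valid UTF-8 rather than parse garbage.
  raise ArgumentError, 'body is not valid UTF-8' unless body.valid_encoding?
  body
end

# Stand-in for a fetched page: raw bytes with no encoding label attached.
raw_bytes = "<h1>Genealog\u00EDa de Jesucristo</h1>".b
html = read_as_utf8(StringIO.new(raw_bytes))
puts html.encoding
```

With a real page, the string would come from URI.open(url).read (plain open(url).read on older Rubies) and could then be handed to Nokogiri together with an explicit encoding, e.g. Nokogiri::HTML(html, nil, 'UTF-8').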

Installing Nokogiri on OSX 10.10 Yosemite

I managed to install Nokogiri under Yosemite (OS X 10.10 Preview). Step 1: Install Homebrew (skip this if brew is already installed): ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" Step 2: Install the brew libs: brew tap homebrew/dupes; brew install libxml2 libxslt; brew install libiconv Step 3: Download and install the Apple Command Line Tools for 10.10. It’s important that you …

Mac user and getting WARNING: Nokogiri was built against LibXML version 2.7.8, but has dynamically loaded 2.7.3

If you installed Nokogiri with gem install nokogiri, you can resolve this warning by running gem pristine nokogiri to recompile the gem’s C extension. If you installed Nokogiri with bundle install, you can resolve this warning by running bundle exec gem pristine nokogiri to recompile the C extension of the gem wherever Bundler installed it.

Nokogiri/Xpath namespace query

All namespaces need to be registered when parsing. Nokogiri automatically registers namespaces declared on the root node. Any namespaces that are not on the root node you have to register yourself. This should work: puts doc.xpath('//dc:title', 'dc' => "URI") Alternatively, you can remove namespaces altogether. Only do this if you are certain there will be no …

HTML-parser on Node.js [closed]

If you want to build a DOM, you can use jsdom. There’s also cheerio, which has the jQuery interface and is a lot faster than older versions of jsdom, although these days they are similar in performance. You might wanna have a look at htmlparser2, which is a streaming parser, and according to its benchmark, it …