information-retrieval - w3toppers.com

How to extract Highlighted Parts from PDF files

To extract highlighted parts, you can use PyMuPDF. Here is an example which works with this PDF file (direct download):

    # Based on https://stackoverflow.com/a/62859169/562769
    from typing import List, Tuple

    import fitz  # install with 'pip install pymupdf'


    def _parse_highlight(annot: fitz.Annot, wordlist: List[Tuple[float, float, float, float, str, int, int, int]]) -> str:
        points = annot.vertices
        quad_count …
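
The geometric idea behind matching highlights to words can be sketched without PyMuPDF installed. This is a minimal illustration, not the original answer's code: it assumes the highlight vertices arrive as a flat list of corner points (four per quad) and that each word is an 8-tuple shaped like the type hint above, with the bounding box first and the text at index 4. The helper names and sample coordinates are invented for the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # x0, y0, x1, y1
Word = Tuple[float, float, float, float, str, int, int, int]


def quad_to_rect(quad: Sequence[Point]) -> Rect:
    """Bounding rectangle of the four corner points of one quad."""
    xs = [p[0] for p in quad]
    ys = [p[1] for p in quad]
    return (min(xs), min(ys), max(xs), max(ys))


def rects_intersect(a: Rect, b: Rect) -> bool:
    """True if the two rectangles overlap with non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def words_under_highlight(vertices: Sequence[Point], words: List[Word]) -> str:
    """Join the text of every word whose box overlaps one of the quads."""
    quad_count = len(vertices) // 4  # hypothetical layout: 4 points per quad
    picked: List[str] = []
    for i in range(quad_count):
        rect = quad_to_rect(vertices[i * 4:i * 4 + 4])
        picked.extend(w[4] for w in words if rects_intersect(rect, w[:4]))
    return " ".join(picked)


# Invented sample data: two words on the first line, one on the second.
sample_words: List[Word] = [
    (10.0, 10.0, 40.0, 20.0, "Hello", 0, 0, 0),
    (45.0, 10.0, 80.0, 20.0, "world", 0, 0, 1),
    (10.0, 30.0, 40.0, 40.0, "again", 0, 1, 0),
]
# One quad covering only the first line.
sample_vertices: List[Point] = [(8.0, 9.0), (82.0, 9.0), (8.0, 21.0), (82.0, 21.0)]

highlighted = words_under_highlight(sample_vertices, sample_words)
print(highlighted)
```

In a real script the vertices and the word list would come from the PDF library rather than from hand-written tuples.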

What is the default list of stopwords used in Lucene’s StopFilter?

The default stop words set in StandardAnalyzer and EnglishAnalyzer is from StopAnalyzer.ENGLISH_STOP_WORDS_SET, as found in the source file:

    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with"

StopFilter itself defines no …
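
To show what filtering with this list does to a piece of text, here is a small Python sketch. It is only an illustration of the idea, not Lucene's API: the function name is made up, and tokenization is reduced to lowercasing and splitting on whitespace, which is an assumption for brevity.

```python
from typing import List

# The stop words quoted above, copied into a Python set.
ENGLISH_STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with",
})


def remove_stop_words(text: str) -> List[str]:
    """Lowercase, split on whitespace, and drop tokens found in the stop set."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in ENGLISH_STOP_WORDS]


kept = remove_stop_words("The quick brown fox is in the box")
print(kept)
```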

Image retrieval system by Colour from the web using C++ with openframeworks

As I mentioned in my comment, it's a matter of converting from the RGB colour space to the L*a*b* colour space and using the Euclidean distance to the average colour of each image in the database. Here's a basic demo:

    #include "testApp.h"
    //ported from http://cookbooks.adobe.com/post_Useful_color_equations__RGB_to_LAB_converter-14227.html
    struct Color{ float R,G,B,X,Y,Z,L,a,b; };
    #define REF_X 95.047 // Observer= 2°, Illuminant= D65
    #define …
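
The conversion and distance steps can be sketched compactly in Python. This is not a port of the openFrameworks demo: it assumes 8-bit sRGB input, uses the commonly published sRGB-to-XYZ coefficients with a D65 reference white, and the image names and average colours in the sample database are invented.

```python
import math
from typing import Dict, Tuple

# D65 reference white, 2° observer (REF_X matches the value in the demo above).
REF_X, REF_Y, REF_Z = 95.047, 100.0, 108.883

Lab = Tuple[float, float, float]
RGB = Tuple[int, int, int]


def _linearize(channel: int) -> float:
    """Undo the sRGB gamma curve for one 0..255 channel, giving 0..1."""
    c = channel / 255.0
    return ((c + 0.055) / 1.055) ** 2.4 if c > 0.04045 else c / 12.92


def _f(t: float) -> float:
    """Piecewise cube-root function used by the XYZ to L*a*b* formulas."""
    return t ** (1.0 / 3.0) if t > 0.008856 else 7.787 * t + 16.0 / 116.0


def rgb_to_lab(r: int, g: int, b: int) -> Lab:
    """Convert an 8-bit sRGB colour to L*a*b* via XYZ."""
    rl, gl, bl = (_linearize(v) * 100.0 for v in (r, g, b))
    x = rl * 0.4124 + gl * 0.3576 + bl * 0.1805
    y = rl * 0.2126 + gl * 0.7152 + bl * 0.0722
    z = rl * 0.0193 + gl * 0.1192 + bl * 0.9505
    fx, fy, fz = _f(x / REF_X), _f(y / REF_Y), _f(z / REF_Z)
    return (116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz))


def lab_distance(c1: Lab, c2: Lab) -> float:
    """Euclidean distance between two L*a*b* colours."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(c1, c2)))


def closest_image(query: RGB, averages: Dict[str, RGB]) -> str:
    """Name of the image whose average colour is nearest to the query."""
    query_lab = rgb_to_lab(*query)
    return min(averages, key=lambda name: lab_distance(query_lab, rgb_to_lab(*averages[name])))


# Hypothetical database: image name -> average RGB colour.
database = {"sunset": (255, 0, 0), "forest": (0, 128, 0), "sea": (0, 0, 255)}
best = closest_image((250, 10, 10), database)
print(best)
```

Computing the distance in L*a*b* rather than in RGB is the point of the approach: distances there track perceived colour difference more closely.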

How to parse the data from Google Alerts?

When you create the alert, set the “Deliver To” to “Feed” and then you can consume the feed XML as you would any other feed. This is much easier to parse and digest into a database.
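
Consuming such a feed could look like the following sketch, which uses only Python's standard library. It assumes the feed is Atom XML; the sample document is invented for illustration and is not actual Google Alerts output, and the field names in the returned dictionaries are arbitrary choices.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace prefix for ElementTree

# Invented sample feed; a real one would be downloaded from the alert's feed URL.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example alert</title>
  <entry>
    <title>First result</title>
    <link href="https://example.com/first"/>
    <published>2024-01-01T00:00:00Z</published>
  </entry>
  <entry>
    <title>Second result</title>
    <link href="https://example.com/second"/>
    <published>2024-01-02T00:00:00Z</published>
  </entry>
</feed>"""


def parse_feed(xml_text: str) -> List[Dict[str, str]]:
    """Return one dict per <entry> with its title, link and publication date."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.iter(ATOM + "entry"):
        link = entry.find(ATOM + "link")
        rows.append({
            "title": entry.findtext(ATOM + "title", default=""),
            "link": link.get("href", "") if link is not None else "",
            "published": entry.findtext(ATOM + "published", default=""),
        })
    return rows


rows = parse_feed(SAMPLE_FEED)
for row in rows:
    print(row["published"], row["title"], row["link"])
```

Each dictionary maps directly onto a database row, which is what makes the feed format convenient to store.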

Fast/Optimize N-gram implementations in python

Here are some attempts, with profiling. I thought using generators could improve the speed here, but the improvement was not noticeable compared to a slight modification of the original. If you don't need the full list at the same time, though, the generator functions should be faster.

    import timeit
    from itertools import tee, izip, islice

    def …
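
For reference, the two styles being compared might look like this in Python 3, where the built-in zip is already lazy and itertools.izip no longer exists. This is a sketch with made-up function names, not the profiled code from the answer, and it makes no claim about relative speed.

```python
from itertools import islice, tee
from typing import Iterable, Iterator, List, Sequence, Tuple


def ngrams_list(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Build the full list of n-grams by zipping n shifted slices."""
    return list(zip(*(tokens[i:] for i in range(n))))


def ngrams_lazy(tokens: Iterable[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield n-grams one at a time; works on any iterable, not just lists."""
    iterators = tee(tokens, n)
    # Advance the i-th copy by i items so the copies line up as a sliding window.
    shifted = (islice(it, i, None) for i, it in enumerate(iterators))
    return zip(*shifted)


words = "to be or not to be".split()
bigrams = ngrams_list(words, 2)
trigrams = list(ngrams_lazy(iter(words), 3))
print(bigrams)
print(trigrams)
```

The lazy version never materializes the whole result, which matters when the token stream is large or when only the first few n-grams are consumed.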

Python: tf-idf-cosine: to find document similarity

First off, if you want to extract count features and apply TF-IDF normalization and row-wise Euclidean normalization, you can do it in one operation with TfidfVectorizer:

    >>> from sklearn.feature_extraction.text import TfidfVectorizer
    >>> from sklearn.datasets import fetch_20newsgroups
    >>> twenty = fetch_20newsgroups()
    >>> tfidf = TfidfVectorizer().fit_transform(twenty.data)
    >>> tfidf
    <11314x130088 sparse matrix of type '<type 'numpy.float64'>' with 1787553 …
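
To make the three steps concrete (counts, TF-IDF weighting, row-wise Euclidean normalization), here is a dependency-free sketch on a toy corpus. It is not scikit-learn's implementation: it assumes whitespace tokenization and one common smoothed idf formula, so its numbers are not guaranteed to match TfidfVectorizer. The useful property it demonstrates is that once rows have unit length, cosine similarity reduces to a dot product.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Return one L2-normalized sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf (an assumption of this sketch).
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Dot product of two unit-length sparse vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


corpus = [
    "the cat sat on the mat",
    "the cat sat on the mat",
    "stock markets fell sharply",
]
vectors = tfidf_vectors(corpus)
same = cosine_similarity(vectors[0], vectors[1])
different = cosine_similarity(vectors[0], vectors[2])
print(same, different)
```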