How to extract highlighted parts from PDF files

To extract highlighted parts, you can use PyMuPDF. Here is an example that works with this PDF file (direct download):

    # Based on https://stackoverflow.com/a/62859169/562769
    from typing import List, Tuple

    import fitz  # install with 'pip install pymupdf'


    def _parse_highlight(annot: fitz.Annot, wordlist: List[Tuple[float, float, float, float, str, int, int, int]]) -> str:
        points = annot.vertices
        quad_count …
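Since the excerpt above is cut off, here is a minimal pure-Python sketch of the general idea rather than the original answer's code. It assumes the highlight's vertices arrive as groups of four points per quad (as `annot.vertices` provides for highlights) and that words use the tuple layout of PyMuPDF's `page.get_text("words")`. The helper names `quad_bounds` and `words_in_highlight` are hypothetical, and the "word centre inside the quad's bounding box" test is an assumption chosen for simplicity.

```python
from typing import List, Sequence, Tuple

# Word tuple layout as returned by PyMuPDF's page.get_text("words"):
# (x0, y0, x1, y1, text, block_no, line_no, word_no)
Word = Tuple[float, float, float, float, str, int, int, int]
Point = Tuple[float, float]


def quad_bounds(quad: Sequence[Point]) -> Tuple[float, float, float, float]:
    """Return the bounding box (x0, y0, x1, y1) of one four-point quad."""
    xs = [p[0] for p in quad]
    ys = [p[1] for p in quad]
    return min(xs), min(ys), max(xs), max(ys)


def words_in_highlight(vertices: Sequence[Point], words: List[Word]) -> str:
    """Join the words whose centre lies inside any quad of the highlight.

    Hypothetical helper: the centre-point test is an assumption, not
    necessarily what the truncated original does.
    """
    picked = []
    for i in range(0, len(vertices) - 3, 4):  # four vertices per quad
        x0, y0, x1, y1 = quad_bounds(vertices[i:i + 4])
        for w in words:
            cx = (w[0] + w[2]) / 2
            cy = (w[1] + w[3]) / 2
            if x0 <= cx <= x1 and y0 <= cy <= y1:
                picked.append(w[4])
    return " ".join(picked)


# Made-up sample data: three words on one line, the middle one highlighted.
sample_words: List[Word] = [
    (10, 10, 40, 20, "Hello", 0, 0, 0),
    (45, 10, 80, 20, "highlighted", 0, 0, 1),
    (85, 10, 110, 20, "world", 0, 0, 2),
]
sample_vertices = [(44, 9), (81, 9), (44, 21), (81, 21)]
print(words_in_highlight(sample_vertices, sample_words))
```

With a real document, the word list would come from `page.get_text("words")` and the vertices from each highlight annotation on the page.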

What is the default list of stopwords used in Lucene’s StopFilter?

The default stop word set in StandardAnalyzer and EnglishAnalyzer comes from StopAnalyzer.ENGLISH_STOP_WORDS_SET, as found in the source file:

    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with"

StopFilter itself defines no …
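To illustrate what filtering with this list amounts to, here is a small Python sketch. It is not Lucene code: the set is copied from the list above, `remove_stop_words` is a hypothetical helper, and lowercasing each token before the lookup is an assumption standing in for the lowercasing an analyzer chain would normally do earlier.

```python
from typing import FrozenSet, Iterable, List

# The 33 words listed above, copied into a Python set for illustration.
ENGLISH_STOP_WORDS: FrozenSet[str] = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with",
})


def remove_stop_words(
    tokens: Iterable[str],
    stop_words: FrozenSet[str] = ENGLISH_STOP_WORDS,
) -> List[str]:
    """Drop every token whose lowercase form is in the stop set."""
    return [t for t in tokens if t.lower() not in stop_words]


print(remove_stop_words("The quick brown fox is in the box".split()))
```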

Image retrieval system by colour from the web using C++ with openFrameworks

As I mentioned in my comment, it’s a matter of converting from the RGB colourspace to the L*a*b* colourspace and using the Euclidean distance to the average colour of each image in the database. Here’s a basic demo:

    #include "testApp.h"

    // ported from http://cookbooks.adobe.com/post_Useful_color_equations__RGB_to_LAB_converter-14227.html
    struct Color {
        float R, G, B, X, Y, Z, L, a, b;
    };

    #define REF_X 95.047  // Observer = 2°, Illuminant = D65
    #define …
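The demo is cut off above, so here is a compact Python sketch of the same idea, not a port of the original C++. It assumes 8-bit sRGB input and the commonly used D65 / 2° reference white (95.047, 100.0, 108.883); only the first of those three values appears in the excerpt, the other two are filled in as an assumption. The function names and the tiny colour "database" are hypothetical.

```python
import math
from typing import Dict, Tuple

RGB = Tuple[int, int, int]
Lab = Tuple[float, float, float]

# Reference white, Observer = 2°, Illuminant = D65 (Y and Z are assumed here).
REF_X, REF_Y, REF_Z = 95.047, 100.0, 108.883


def rgb_to_lab(rgb: RGB) -> Lab:
    """Convert an 8-bit sRGB colour to CIE L*a*b* via XYZ."""
    def linearise(channel: int) -> float:
        c = channel / 255.0
        c = ((c + 0.055) / 1.055) ** 2.4 if c > 0.04045 else c / 12.92
        return c * 100.0

    r, g, b = (linearise(c) for c in rgb)
    x = r * 0.4124 + g * 0.3576 + b * 0.1805
    y = r * 0.2126 + g * 0.7152 + b * 0.0722
    z = r * 0.0193 + g * 0.1192 + b * 0.9505

    def f(t: float) -> float:
        return t ** (1.0 / 3.0) if t > 0.008856 else 7.787 * t + 16.0 / 116.0

    fx, fy, fz = f(x / REF_X), f(y / REF_Y), f(z / REF_Z)
    return 116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz)


def lab_distance(p: Lab, q: Lab) -> float:
    """Euclidean distance between two L*a*b* colours."""
    return math.dist(p, q)


def closest_colour(query: RGB, database: Dict[str, RGB]) -> str:
    """Return the name of the database entry nearest to the query in L*a*b*."""
    target = rgb_to_lab(query)
    return min(database, key=lambda name: lab_distance(target, rgb_to_lab(database[name])))


# Made-up average colours standing in for images in a database.
averages = {"sunset.jpg": (255, 0, 0), "forest.jpg": (0, 255, 0), "ocean.jpg": (0, 0, 255)}
print(closest_colour((250, 10, 10), averages))
```

Comparing in L*a*b* rather than RGB is the point of the approach: distances in L*a*b* track perceived colour difference more closely than distances between raw RGB triples.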

Python: tf-idf-cosine: to find document similarity

First off, if you want to extract count features and apply TF-IDF normalization and row-wise Euclidean normalization, you can do it in one operation with TfidfVectorizer:

    >>> from sklearn.feature_extraction.text import TfidfVectorizer
    >>> from sklearn.datasets import fetch_20newsgroups
    >>> twenty = fetch_20newsgroups()
    >>> tfidf = TfidfVectorizer().fit_transform(twenty.data)
    >>> tfidf
    <11314x130088 sparse matrix of type '<type 'numpy.float64'>' with 1787553 …
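To show why row-wise Euclidean normalization matters for similarity, here is a dependency-free sketch of the underlying arithmetic. It is not scikit-learn's implementation: tokenization is a plain whitespace split, the smoothed idf formula ln((1 + n) / (1 + df)) + 1 is used as an assumption about a reasonable weighting, and the function names are hypothetical. Because every vector is scaled to unit length, the cosine similarity of two documents reduces to their dot product.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build L2-normalized TF-IDF vectors as sparse {term: weight} dicts."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenised for term in set(tokens))

    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vec = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Dot product of two unit-length sparse vectors."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())


docs = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "stock markets fell sharply",
]
vecs = tfidf_vectors(docs)
for i, other in enumerate(vecs):
    print(i, round(cosine_similarity(vecs[0], other), 3))
```

For a real corpus, the sparse matrix returned by TfidfVectorizer would replace these dicts, and the same dot-product shortcut applies to its normalized rows.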