Extract links from a web page

Download the web page as plain text/HTML and pass it through Jsoup or HtmlCleaner. Both are similar and can parse even malformed HTML 4.0 syntax. You can then use the usual HTML DOM parsing methods such as getElementsByTagName("a"), or, more conveniently in jsoup, simply use File input = new File("/tmp/input.html"); …
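The excerpt above names Jsoup and HtmlCleaner for Java. As a rough illustration of the same idea (parse leniently, then collect every anchor tag), here is a minimal Python sketch that swaps in the standard library's html.parser instead. The names LinkCollector, extract_links and the sample markup are hypothetical and not part of the original answer.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser hands us tag and attribute names in lower case.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the href values of all anchors in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup; the unclosed <p> is deliberate.
sample = '<p>See <a href="https://example.com/a">A</a> and <A HREF="/b">B</a>'
print(extract_links(sample))
```

Relative links such as /b come back exactly as written in the page; resolving them against the page URL (for example with urllib.parse.urljoin) would be a separate step.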

C# regex pattern to extract URLs from a given string – not only full HTML URLs but bare links as well

You can write a fairly simple regular expression to handle this, or take the more traditional route of string splitting plus LINQ.

Regex

    var linkParser = new Regex(@"\b(?:https?://|www\.)\S+\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);
    var rawString = "house home go www.monstermmorpg.com nice hospital http://www.monstermmorpg.com this is incorrect url http://www.monstermmorpg.commerged continue";
    foreach (Match m in linkParser.Matches(rawString))
        MessageBox.Show(m.Value);

Explanation

Pattern: \b -matches …
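To see what this pattern actually picks up, the same expression can be tried in Python's re module, which treats \b, \S and non-capturing groups the same way for this input. This is only a sketch for experimenting; find_links is a hypothetical helper name, and the sample string is the one from the excerpt.

```python
import re

# Same pattern as the C# example: a word boundary, then either a scheme
# or a bare "www." prefix, then non-whitespace up to the last word boundary.
LINK_PATTERN = re.compile(r"\b(?:https?://|www\.)\S+\b", re.IGNORECASE)


def find_links(text):
    """Return every substring of text that the pattern matches."""
    return LINK_PATTERN.findall(text)


raw_string = (
    "house home go www.monstermmorpg.com nice hospital "
    "http://www.monstermmorpg.com this is incorrect url "
    "http://www.monstermmorpg.commerged continue"
)

for link in find_links(raw_string):
    print(link)
```

The pattern cannot tell that the third link is malformed: any run of non-whitespace after the prefix is accepted, so http://www.monstermmorpg.commerged is returned as well. The trailing \b does mean that punctuation at the end of a sentence, such as a final period, is left out of the match.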

How do you extract a column from a multi-dimensional array?

    >>> import numpy as np
    >>> A = np.array([[1,2,3,4],[5,6,7,8]])
    >>> A
    array([[1, 2, 3, 4],
           [5, 6, 7, 8]])
    >>> A[:,2] # returns the third column
    array([3, 7])

See also: "numpy.arange" and "reshape" to allocate memory.

Example (allocating an array shaped as a 3×4 matrix):

    nrows = 3
    ncols = 4
    my_array = np.arange(nrows*ncols, dtype="double") …
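If NumPy is not available, the same column can be pulled out of a plain nested list. The sketch below is not the NumPy slicing shown above but a pure-Python alternative; column is a hypothetical helper name, and A reuses the values from the excerpt.

```python
def column(rows, index):
    """Return the value at position index from every row."""
    return [row[index] for row in rows]


A = [[1, 2, 3, 4], [5, 6, 7, 8]]

# Third column, mirroring A[:,2] in the NumPy version.
print(column(A, 2))

# zip(*A) transposes the rows, which yields every column at once.
columns = [list(col) for col in zip(*A)]
print(columns[2])
```

The list comprehension is enough for a single column; the zip(*A) form is handier when several columns are needed, though it silently truncates to the shortest row if the rows have different lengths.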

Extracting text from PDFs in C# [closed]

There may be some difficulty in doing this reliably. The problem is that PDF is a presentation format which attaches importance to good typography. Suppose you just wanted to output a single word: Tap. A PDF rendering engine might output this as two separate calls, as shown in this pseudo-code: moveto (x1, y); output ("T") …

Extract the text out of HTML string using JavaScript

Create an element, store the HTML in it, and get its textContent:

    function extractContent(s) {
      var span = document.createElement('span');
      span.innerHTML = s;
      return span.textContent || span.innerText;
    }

    alert(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>"));

Here’s a version that allows you to have spaces between nodes, although you’d probably want that for block-level elements only:

    function extractContent(s, space) {
      var span= …
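Outside the browser there is no document object to lean on, so the same idea needs a parser. The following Python sketch uses the standard library's html.parser to gather the text nodes. It is not a port of the truncated JavaScript version above; TextCollector, extract_content and the separator parameter are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Gather the text between tags, one chunk per text node."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_content(html, separator=""):
    """Return the text of html, joining text nodes with separator."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return separator.join(parser.chunks)


sample = "<p>Hello</p><a href='http://w3c.org'>W3C</a>"
print(extract_content(sample))       # text nodes run together
print(extract_content(sample, " "))  # a space between text nodes
```

As with the JavaScript version, a separator applied between every node also lands between inline elements, so it may split words that the page renders as one. The contents of script and style elements are collected like any other text.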