Here are three possible solutions.
library(stringi)
library(stringdist)
a <- "hello"
b <- "hel123l5678o"
## get all forward substrings of 'b'
sb <- stri_sub(b, 1, 1:nchar(b))
## extract them from 'a' if they exist
sstr <- na.omit(stri_extract_all_coll(a, sb, simplify=TRUE))
## match the longest one
sstr[which.max(nchar(sstr))]
# [1] "hel"
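For anyone without stringi installed, the same idea can be sketched in base R. This is only an illustration under the assumption that "match" means "a prefix of b that occurs somewhere in a"; the names prefixes and found are made up for this example.

```r
a <- "hello"
b <- "hel123l5678o"

## all forward substrings (prefixes) of 'b'
prefixes <- substring(b, 1, seq_len(nchar(b)))

## keep the prefixes that occur literally somewhere in 'a'
found <- prefixes[vapply(prefixes, grepl, logical(1), x = a, fixed = TRUE)]

## the longest surviving prefix
longest <- found[which.max(nchar(found))]
longest
```

With fixed = TRUE the prefixes are treated as literal strings rather than regular expressions, which mirrors what stri_extract_all_coll() is being used for here.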
There are also adist() and agrep() in base R, and the stringdist package has a few functions that run the LCS method. Here’s a look at stringdist(). With method="lcs" it returns the number of unpaired characters.
stringdist(a, b, method="lcs")
# [1] 7
## compare growing prefixes of 'b' and 'a'; keep the pairs whose LCS distance is 0
Filter("!", mapply(
stringdist,
stri_sub(b, 1, 1:nchar(b)),
stri_sub(a, 1, 1:nchar(b)),
MoreArgs = list(method = "lcs")
))
# h he hel
# 0 0 0
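The "number of unpaired characters" can be turned into the length of the longest common subsequence, since every character of a and b is either paired or unpaired. A rough base R sketch, assuming the lcs distance is the same as an edit distance where only insertions and deletions are useful (substitutions priced at 2 here); d and lcs_len are names made up for this example:

```r
a <- "hello"
b <- "hel123l5678o"

## edit distance where a substitution costs as much as a delete plus an insert,
## so only insertions and deletions are ever worth using
d <- drop(adist(a, b, costs = c(insertions = 1, deletions = 1, substitutions = 2)))

## paired characters are counted once in 'a' and once in 'b'
lcs_len <- (nchar(a) + nchar(b) - d) / 2

d
lcs_len
```

For these two strings that gives a distance of 7 and a common subsequence of length 5, i.e. all of "hello" can be paired up inside b.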
Now that I’ve explored this a bit more, I think adist() might be the way to go. If we set counts=TRUE, the result carries a "trafos" attribute, which is a sequence of Matches, Insertions, etc. If we give that to stri_locate_all_regex(), we can use the resulting matrix to get the matches from a to b.
ta <- drop(attr(adist(a, b, counts=TRUE), "trafos"))
# [1] "MMMIIIMIIIIM"
So the M values denote straight-across matches. We can go and get the substrings with stri_sub():
stri_sub(b, stri_locate_all_regex(ta, "M+")[[1]])
# [1] "hel" "l" "o"
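One caveat with indexing b by positions in the transformation string: a deletion (D) uses up a character of a but none of b, so the positions only line up with b when there are no deletions. A hedged base R sketch that drops the D entries first; tr, tr_b, m and matched are names invented for this illustration:

```r
a <- "hello"
b <- "hel123l5678o"

## transformation string from 'a' to 'b' (M, I, D, S per alignment column)
tr <- drop(attr(adist(a, b, counts = TRUE), "trafos"))

## deletions consume a character of 'a' only, so drop them to line up with 'b'
tr_b <- gsub("D", "", tr, fixed = TRUE)

## locate runs of matches and cut the same positions out of 'b'
m <- gregexpr("M+", tr_b)[[1]]
start <- as.integer(m)
matched <- if (start[1] == -1) {
  character(0)
} else {
  substring(b, start, start + attr(m, "match.length") - 1)
}

matched
```

In this example there are no deletions, so it should give the same pieces as the stringi version; the extra gsub() step only matters for inputs where a has characters that b lacks.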
Sorry I haven’t explained that very well as I’m not well versed in string distance algorithms.