Remove/collapse consecutive duplicate values in sequence

One easy way is to use rle:

Here’s your sample data:

x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
# Read 17 items

rle returns a list with two values: the run length (“lengths“), and the value that is repeated for that run (“values“).

rle(x)$values
# [1] "a" "b" "c" "d" "e" "a" "b" "e" "d"

Update: For a data.frame

If you are working with a data.frame, try something like the following:

## Sample data
mydf <- data.frame(
  V1 = c("a", "a", "a", "b", "c", "c", "d", "e", 
         "a", "a", "b", "b", "e", "e", "d", "d"),
  V2 = c(1, 2, 3, 2, 4, 1, 3, 9, 
         4, 8, 10, 199, 2, 5, 4, 10)
)

## Use rle, as before
X <- rle(mydf$V1)
## Identify the rows you want to keep
Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
Y
# [1]  1  4  5  7  8  9 11 13 15
mydf[Y, ]
#    V1 V2
# 1   a  1
# 4   b  2
# 5   c  4
# 7   d  3
# 8   e  9
# 9   a  4
# 11  b 10
# 13  e  2
# 15  d  4

Update 2

The “data.table” package has a function rleid that lets you do this quite easily. Using mydf from above, try:

library(data.table)
as.data.table(mydf)[, .SD[1], by = rleid(V1)]
#    rleid V2
# 1:     1  1
# 2:     2  2
# 3:     3  4
# 4:     4  3
# 5:     5  9
# 6:     6  4
# 7:     7 10
# 8:     8  2
# 9:     9  4

Leave a Comment