plyr - w3toppers.com

R: Split unbalanced list in data.frame column

    #Split by ; as before
    allJobs <- strsplit(df$b, ";", fixed=TRUE)
    #Replicate a by the number of jobs in each case
    n <- sapply(allJobs, length)
    id <- rep(df$a, times = n)
    #Turn allJobs into a vector
    job <- unlist(allJobs)
    #Retrieve position of each job
    jobNum <- unlist(lapply(n, seq_len))
    #Combine into a data frame
    df2 <- data.frame(id …
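The steps above can be tried end to end on a small example. This is only a sketch: the input values are invented, and because the final `data.frame` call is cut off in the excerpt, the column layout of `df2` below is an assumption rather than the original answer's.

```r
# Hypothetical input: column names a and b follow the excerpt, values are made up
df <- data.frame(a = c(1, 2, 3),
                 b = c("clerk;cook", "pilot", "nurse;baker;tailor"),
                 stringsAsFactors = FALSE)

# Split each string on ";" into a list of character vectors
allJobs <- strsplit(df$b, ";", fixed = TRUE)
# Number of jobs per row, then repeat each id that many times
n <- sapply(allJobs, length)
id <- rep(df$a, times = n)
# Flatten the list and record each job's position within its row
job <- unlist(allJobs)
jobNum <- unlist(lapply(n, seq_len))

# Assumed layout: one row per (id, job) pair
df2 <- data.frame(id = id, jobNum = jobNum, job = job,
                  stringsAsFactors = FALSE)
print(df2)
```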

Split Data Frame into Rows of Fixed Size

I don’t understand why a plyr solution is needed. split works perfectly well, and even hadley himself didn’t suggest a plyr/reshape2 solution when he looked at the earlier question:

    split(dfrm, (0:nrow(dfrm) %/% 300))  # integer division

This does produce a warning, because 0:nrow(dfrm) has one more element than the data frame has rows. Since you were expecting a result that does not divide evenly anyway, you can ignore it.
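A variant that avoids the warning builds the grouping index with exactly one entry per row. This is a sketch on a hypothetical 700-row data frame; the chunk size of 300 follows the excerpt.

```r
# Hypothetical data frame with 700 rows
dfrm <- data.frame(x = seq_len(700))

# Zero-based row positions, integer-divided by the chunk size:
# rows 1-300 -> 0, rows 301-600 -> 1, rows 601-700 -> 2
chunk <- (seq_len(nrow(dfrm)) - 1) %/% 300

pieces <- split(dfrm, chunk)
print(sapply(pieces, nrow))
```

The last piece simply holds the leftover rows when the row count is not a multiple of the chunk size.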

Returning first row of group

By reproducing the example data frame and testing it, I found a way of getting the needed result.

Order the data by the relevant columns (ID, Start):

    ordered_data <- data[order(data$ID, data$Start),]

Find the first row for each new ID:

    final <- ordered_data[!duplicated(ordered_data$ID),]
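The two steps can be checked on a small example. The data frame below is hypothetical; only the column names ID and Start come from the excerpt.

```r
# Hypothetical data: several rows per ID, in no particular order
data <- data.frame(ID = c(2, 1, 2, 1, 3),
                   Start = c(5, 9, 3, 4, 7))

# Sort by ID, then by Start within each ID
ordered_data <- data[order(data$ID, data$Start), ]

# duplicated() flags every repeat of an ID, so negating it keeps the first row per ID
final <- ordered_data[!duplicated(ordered_data$ID), ]
print(final)
```

Because the rows are sorted by Start within each ID, the row kept for each ID is the one with the smallest Start.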

How to get top n companies from a data frame in decreasing order

head and tail are really useful functions!

    head(sort(Forbes2000$profits,decreasing=TRUE), n = 50)

If you want the first 50 rows of the data.frame, then you can use the arrange function from plyr to sort the data.frame and then use head:

    library(plyr)
    head(arrange(Forbes2000,desc(profits)), n = 50)

Notice that I wrapped profits in a call to desc which means …
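For comparison, the same "sort descending, then take the top n" pattern can be sketched in base R. This swaps in `order()` for plyr's `arrange()`, and it uses a small hypothetical data frame in place of Forbes2000.

```r
# Hypothetical stand-in for Forbes2000: a name and a profits column
companies <- data.frame(name = c("A", "B", "C", "D"),
                        profits = c(1.2, NA, 5.0, 3.1),
                        stringsAsFactors = FALSE)

# order() returns row positions sorted by profits, largest first;
# missing values are placed last by default
sorted <- companies[order(companies$profits, decreasing = TRUE), ]

# Keep the top 2 rows (the excerpt uses n = 50)
top2 <- head(sorted, n = 2)
print(top2)
```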

Object not found error with ddply inside a function

Today’s solution to this question is to make summarize into here(summarize), e.g.

    myFunction <- function(x, y){
        NewColName = "a"
        z = ddply(x, y, here(summarize), Ave = mean(eval(parse(text=NewColName)), na.rm=TRUE))
        return(z)
    }

here(f), added to plyr in Dec 2012, captures the current context.

R: speeding up “group by” operations

Instead of the normal R data frame, you can use an immutable data frame, which returns pointers to the original when you subset and can be much faster:

    idf <- idata.frame(myDF)
    system.time(aggregateDF <- ddply(idf, c("year", "state", "group1", "group2"),
        function(df) wtd.mean(df$myFact, weights=df$weights)))
    #    user  system elapsed
    #  18.032   0.416  19.250

If I was to write a …

Joining aggregated values back to the original data frame [duplicate]

One line of code does the trick:

    new <- ddply( df, "group1", transform, numcolwise(mean))
    new
      group1 group2      values    meanValue
    1      1      A  0.48742905 -0.121033381
    2      1      A -0.04493361 -0.121033381
    3      1      C -0.62124058 -0.121033381
    4      1      C -0.30538839 -0.121033381
    5      2      A  1.51178117  0.004803931
    6      2      B  0.73832471  0.004803931
    7      2      A -0.01619026  0.004803931
    8 …
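The output above has a meanValue column, which suggests the group mean was passed to transform as a named argument. The sketch below assumes that form (`meanValue = mean(values)`), which is not confirmed by the excerpt, and first shows a base R equivalent with `ave()`. The data frame is hypothetical.

```r
# Hypothetical data: a grouping column and a numeric column
df <- data.frame(group1 = c(1, 1, 2, 2),
                 values = c(1, 3, 10, 20))

# Base R: ave() computes the mean per group and returns it aligned
# with the original rows, so it can be assigned as a new column
df$meanValue <- ave(df$values, df$group1, FUN = mean)
print(df)

# plyr form (assumed named-argument version), run only if plyr is installed
if (requireNamespace("plyr", quietly = TRUE)) {
  new <- plyr::ddply(df[c("group1", "values")], "group1", transform,
                     meanValue = mean(values))
  print(new)
}
```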

Why is plyr so slow?

Why is it so slow? A little research located a mailing list post from Aug. 2011 in which @hadley, the package author, states:

    This is a drawback of the way that ddply always works with data frames. It will be a bit faster if you use summarise instead of data.frame (because data.frame is very slow), …

Create columns from factors and count [duplicate]

You only need to make a slight modification to your code. You should use .(Name) instead of c("Name"):

    ddply(df1, .(Name), summarise,
          Score_1 = sum(Score == 1),
          Score_2 = sum(Score == 2),
          Score_3 = sum(Score == 3))

gives:

      Name Score_1 Score_2 Score_3
    1  Ben       1       1       0
    2 John       1       1       1

Other possibilities include: 1. …

Change value of variable with dplyr

We can use replace to change the values in 'mpg' that correspond to cyl==4 to NA.

    mtcars %>% mutate(mpg=replace(mpg, cyl==4, NA)) %>% as.data.frame()
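`replace()` is a base R function, so its effect can be checked without dplyr. A minimal sketch on the built-in mtcars data; the variable name `mpg2` is only for illustration.

```r
# replace(x, condition, value) returns a copy of x with the
# elements selected by the condition set to the given value
mpg2 <- replace(mtcars$mpg, mtcars$cyl == 4, NA)

# Every 4-cylinder car now has a missing mpg; the others are untouched
print(sum(is.na(mpg2)))
print(all(mpg2[mtcars$cyl != 4] == mtcars$mpg[mtcars$cyl != 4]))
```

The original mtcars data frame is left unmodified, since `replace()` works on a copy.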