I don’t know about bigmemory, but to meet your requirements you don’t need to read the whole file in. Simply pipe it through some bash/awk/sed/python/whatever processing that does the steps you want, i.e. throws out the NULL lines and randomly selects N lines, and then read the result in.
Here’s an example using awk (assuming you want 100 random lines from a file that has 1M lines).
# m = lines still to pick, n = total number of lines in the file
read.csv(pipe('awk -F, \'BEGIN{srand(); m = 100; n = 1000000;}
!/NULL/{if (rand() < m/(n - NR + 1)) {
print; m--;
if (m == 0) exit;
}}\' filename'
)) -> df
It wasn’t obvious to me what you meant by NULL, so I took it literally (lines containing the string NULL), but it should be easy to modify the filter to fit your needs.
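One caveat with the awk command above: the selection probability is computed against the total line count, which also includes the lines that get filtered out, so it can return fewer lines than requested when many lines are dropped. A possible variant, sketched below, reads the file twice: once to count the lines that survive the filter, and once to sample from them. The function name `sample_lines` and its arguments are my own naming, not part of any tool, and the NULL filter is the same literal match as above.

```shell
#!/bin/sh
# sample_lines FILE M  (hypothetical helper)
# Prints M randomly chosen lines of FILE that do not contain "NULL".
sample_lines() {
    file=$1
    m=$2
    # Pass 1: count the lines that survive the filter.
    total=$(awk '!/NULL/ { c++ } END { print c + 0 }' "$file")
    # Pass 2: selection sampling over the surviving lines only.
    awk -v m="$m" -v total="$total" '
        BEGIN { srand() }
        !/NULL/ {
            seen++
            # total - seen + 1 = surviving lines left, including this one
            if (m > 0 && rand() < m / (total - seen + 1)) {
                print
                if (--m == 0) exit
            }
        }
    ' "$file"
}
```

If the file has at least M surviving lines, this prints exactly M of them; otherwise it prints all of them. The output can be read from R through `pipe()` in the same way as the command above, at the cost of scanning the file twice.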