Read random lines from huge CSV file

import random

filesize = 1500                 #size of the really big file
offset = random.randrange(filesize)

f = open('really_big_file')
f.seek(offset)                  #go to random position
f.readline()                    # discard - bound to be partial line
random_line = f.readline()      # bingo!

# extra to handle last/first line edge cases
if len(random_line) == 0:       # we have hit the end
    f.seek(0)
    random_line = f.readline()  # so we'll grab the first line instead

As @AndreBoos pointed out, this approach will lead to biased selection. If you know min and max length of line you can remove this bias by doing the following:

Let’s assume (in this case) we have min=3 and max=15

1) Find the length (Lp) of the previous line.

Then if Lp = 3, the line is most biased against. Hence we should take it 100% of the time
If Lp = 15, the line is most biased towards. We should only take it 20% of the time as it is 5* more likely selected.

We accomplish this by randomly keeping the line X% of the time where:

X = min / Lp

If we don’t keep the line, we do another random pick until our dice roll comes good. 🙂

More Related Contents:

Leave a Comment Cancel reply