Counting lines or enumerating line numbers so I can loop over them – why is this an anti-pattern?

The shell (like virtually every programming language above assembly) already knows how to loop over the lines in a file; it does not need to know how many lines there will be in order to fetch the next one. Notably, in your example, sed already does this, so if the shell couldn't, you could simply loop over the output from sed instead.

The proper way to loop over the lines in a file in the shell is with while read. There are a couple of complications: you set IFS to an empty string so the shell doesn't trim leading and trailing whitespace from each line, and you use read -r to disable some pesky legacy backslash processing from the original Bourne shell's implementation of read, which has been retained for backward compatibility.

    while IFS='' read -r lineN; do
        # do things with "$lineN"
    done <"$1"
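As noted above, sed can drive the iteration itself, so you can also pipe its output straight into the same while read loop. A minimal illustrative sketch (the data and the transformation are invented for the example):

```shell
# Illustrative sketch: sed transforms each line, and the shell loops over
# sed's output; neither side ever needs to know the total line count.
printf 'alpha\nbeta\ngamma\n' |
sed 's/^/>> /' |
while IFS='' read -r line; do
    echo "shell saw: $line"
done
```

One caveat worth knowing: because the loop runs in a pipeline, many shells execute it in a subshell, so variables you set inside it will not survive past the done.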

Besides being much simpler than your sed script, this avoids the problem that you read the entire file once to obtain the line count, and then read the same file again and again, once per loop iteration. A typical modern disk subsystem will cache some of that repeated reading, but the basic fact remains that reading from disk is on the order of 1000x slower than not reading at all when you can avoid it. Especially with a large file, the cache will eventually fill up, and you end up reading in and discarding the same bytes over and over, adding significant CPU overhead and an even more significant amount of wall-clock time spent simply waiting for the disk to deliver the bytes you asked for.

In a shell script, you also want to avoid the overhead of an external process if you can. Invoking sed (or the functionally equivalent but even more expensive two-process head -n "$i" | tail -n 1) thousands of times in a tight loop will add significant overhead for any non-trivial input file. (On the other hand, if the body of your loop could be done in e.g. sed or Awk instead, that's going to be a lot more efficient than a native shell while read loop, because of the way read is implemented; this is why while read is itself frequently regarded as an antipattern. Make sure you are reasonably familiar with the standard palette of Unix text-processing tools: cut, paste, nl, pr, etc.)
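For instance, a per-line job that might tempt you into a shell loop can often be a one-line Awk program instead; a sketch with an invented task (numbering each line and reporting its length):

```shell
# Sketch: one Awk process does the whole per-line job; no shell loop,
# no per-line external commands.
printf 'foo\nquux\n' |
awk '{ print NR ": " $0 " (" length($0) " chars)" }'
# prints:
# 1: foo (3 chars)
# 2: quux (4 chars)
```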

The q in the sed script is only a partial remedy; frequently, you see variations which omit it, so that sed reads the entire input file through to the end on every iteration, even when it only wants to fetch one of the very first lines of the file.
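To make the difference concrete, here is a small sketch (with invented three-line input): both commands print line 2, but only the variant with q stops reading once it has it.

```shell
i=2
# Reads the whole input, printing only line $i:
printf 'one\ntwo\nthree\n' | sed -n "${i}p"
# Prints line $i and then quits immediately, without reading the rest:
printf 'one\ntwo\nthree\n' | sed -n "${i}{p;q;}"
```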

With a small input file, the effects are negligible, but perpetuating this bad practice just because it's not immediately harmful when the input is small is irresponsible. Don't teach this technique to beginners. At all.

If you really need to display the number of lines in the input file, at least make sure you don't spend a lot of time reading through to the end just to obtain that number. Perhaps stat the file for its size and keep track of how many bytes there are on each line, so you can project the number of lines you have left (and, instead of line 1/10345234, display something like line 1/approximately 10000000) ... or use an external tool like pv.
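One way to do that projection; this is only a sketch, and the function name and the 100-line sample size are arbitrary choices made for the example: sample the first few lines, and divide the file's total size by their average length.

```shell
# Sketch (estimate_lines and the sample size are invented): project the
# total number of lines from the file's byte size and the average length
# of its first 100 lines, without reading the whole file.
estimate_lines() {
    file=$1
    sample_lines=$(head -n 100 "$file" | wc -l)
    sample_bytes=$(head -n 100 "$file" | wc -c)
    total_bytes=$(wc -c <"$file")
    [ "$sample_bytes" -gt 0 ] || { echo 0; return; }
    echo $(( total_bytes * sample_lines / sample_bytes ))
}
```

The estimate is only as good as how representative the first lines are of the rest of the file, which is exactly why you label it "approximately".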

Tangentially, there is a vaguely related antipattern you want to avoid, too: reading an entire file into memory when you are only going to process one line at a time. Doing that with a for loop also has some additional gotchas, so don't do that, either; see https://mywiki.wooledge.org/DontReadLinesWithFor
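A quick sketch of the main gotcha (the file contents are invented): for iterates over whitespace-separated words, not lines, so the lines come apart.

```shell
tmp=$(mktemp)
printf 'hello world\nsecond line\n' >"$tmp"

# The for loop splits on any whitespace -- four iterations, not two:
for word in $(cat "$tmp"); do
    echo "for saw: [$word]"
done

# while read keeps each line intact -- two iterations:
while IFS='' read -r line; do
    echo "read saw: [$line]"
done <"$tmp"

rm -f "$tmp"
```

(Unquoted expansion also performs filename globbing, so a line containing a * can silently turn into a list of files; that's the other half of the gotcha.)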
