Implementing the Page Segmented Sieve of Eratosthenes in Javascript

Question

Although you have gathered from your reading that a Page Segmented Sieve of Eratosthenes is a fast way of finding the primes over a large range, your question code (even when corrected) does not implement a Page Segmented SoE, test the code over a large range, nor is it particularly fast as SoE implementations go. The following discussion will show how to progress toward using a true Page Segmented SoE over a large range.

Synopsis

Following is a staged progression of increasingly fast implementations leading to your intention, with commentary explaining the reasons and implementation details of every step. It includes runnable snippets in JavaScript, but these techniques are not limited to just JavaScript and other languages don’t place limits on some further refinements such as multi-threading of the resulting pages (other than by Web Workers, which are difficult to control as to order of processing), some further optimizations in extreme loop unrolling, and which last is related to the limited efficiency of the code due to having to be Just In Time (JIT) compiled to native code by the JavaScript engine in your browser; these limits are as compared to languages that compile directly to very efficient native code such as C/C++, Nim, Rust, Free Pascal, Haskell, Julia, etc.

Chapter 1 – Setting a Baseline

First, lets start with a working version of your current code algorithm used over a reasonably large range with timing information to establish a base line; the following code starts culling per prime at the square of the culling prime which avoids the problem of culling the given prime values and some redundant starting culls and there is no reason to generate an output array of resulting primes for the large range as we can produce the primes directly from the culling array; also, the determination of the answer is outside the timing because we will be developing better techniques to find an “answer” which for large ranges is usually the count of the number of primes found, the sum of the primes, the first occurrences of prime gaps, etc., none of which need to actually view the found primes:

"use strict";

function soePrimesTo(limit) {
  var sz = limit - 1;
  var cmpsts = new Uint8Array(sz); // index 0 represents 2; sz-1 is limit
  // no need to zero above composites array; zeroed on creation...
  for (var p = 2; ; ++p) {
    var sqr = p * p;
    if (sqr > limit) break; // while p is the square root of limit -> cull...
    if (cmpsts[p - 2] == 0 >>> 0) // 0/1 is false/true; false means prime...
      for (var c = sqr - 2; c < sz; c += p) // set true for composites...
          cmpsts[c] = 1 >>> 0; // use asm.js for some extra efficiency...
  }
  var bi = 0
  return function () {
    while (bi < sz && cmpsts[bi] != 0 >>> 0) ++bi;
    if (bi >= sz) return null;
    return bi++ + 2;
  };
}

// show it works...
var gen = soePrimesTo(100);
var p = gen();
var output = [];
while (p != null) { output.push(p); p = gen(); }
console.log("Primes to 100 are:  " + output + ".");

var n = 1000000000; // set the range here...
var elpsd = -new Date();
gen = soePrimesTo(n);
elpsd += +new Date();
var count = 0;
while (gen() != null) { count++; }
console.log("Found " + count + " primes up to " + n + " in " + elpsd + " milliseconds.");

Now, there is little purpose in quoting my run times for the above code sieving to a billion as your times will almost certainly be faster as I am using a very low end tablet CPU in the Intel x5-Z8350 running at 1.92 Gigahertz (CPU clock speed for a single active thread). I am going to quote my execution time exactly once as an example of how to calculate average CPU clock cycles per cull: I take the execution time for the above code of about 43350 milliseconds is 43.35 seconds times 1.92 billion clocks per second divided by about 2.514 billion culling operations to sieve to a billion (which can be calculated from the formula or from the table for no wheel factorization on the SoE Wikipedia page) to get about 33.1 CPU clock cycles per cull to a billion. From now on, we will only use CPU clock cycles per cull to compare performance.

Further note that these performance scores are very dependent on the browser JavaScript engine used, with the above score as run on Google Chrome (version 72); Microsoft Edge (version 44) is about seven times slower for the version of Page Segmented SoE towards which we are moving, and Firefox and Safari are likely to be close to Chrome in performance.

This performance is likely better than for the previous answers code due to using the Uint8Array TypedArray and more asm.js but the timing for such “one huge array sieves” (a GigaByte of memory used here) are subject to bottlenecks due to memory access speed for main RAM memory outside the CPU cache limits. This is why we are working toward a Page Segmented Sieve, but first let’s do something about reducing the amount of memory used and the number of required culling cycles.

Chapter 2 – Bit Packing and Odds Only Wheel Factorization

The following code does bit packing, which requires slightly more complexity in the tight inner culling loop, but as it uses just one bit per composite number considered, it reduces the memory use by a factor of eight; as well, since two is the only even prime, it uses basic wheel factorization to sieve odds only for a further reduction of memory use by a factor of two and a reduction in number of culling operations by a factor of about 2.5.

This minimal wheel factorization of odds-only works as follows:

We make a “wheel” with two positions on it, one half that “hits” on the even numbers, and the other half that hits on the odd numbers; thus this “wheel” has a circumferential span of two numbers as we “roll” it over all the potential prime numbers, but it only “windows” one out of two or half of the numbers over which it “rolls”.
We then divide all of numbers over which we “roll” into two bit planes of successive values, in one bit plane we place all of the evens starting at four, and in the other bit plane we place all of the odds starting at three.
Now we discard the even plane as none of the represented numbers there can be prime.
The starting index of p * p for considered odd base primes always is an odd number because multiplying an odd number by an odd number is always an odd number.
When we increment the index into the odd bit plane by the prime value, we are actually adding two times the value since adding an odd and an odd produces an even, which would be in the bit plane we discarded, so we add the prime value again to get back to the odd bit plane again.
The odd bit plane index positions automatically account for this since we have thrown away all the even values that were previously between each odd bit position index.
That is why we can cull by just repeatedly adding prime to the index in the following code.

"use strict";

function soeOddPrimesTo(limit) {
  var lmti = (limit - 3) >> 1; // bit index for limit value
  var sz = (lmti >> 3) + 1; // size in bytes
  var cmpsts = new Uint8Array(sz); // index 0 represents 3
  // no need to zero above composites array; zeroed on creation...
  for (var i = 0; ; ++i) {
    var p = i + i + 3; // the square index is (p * p - 3) / 2 but we
    var sqri = (i << 1) * (i + 3) + 3; // calculate start index directly 
    if (sqri > lmti) break; // while p is < square root of limit -> cull...
    // following does bit unpacking to test the prime bit...
    // 0/1 is false/true; false means prime...
    // use asm.js with the shifts to make uint8's for some extra efficiency...
    if ((cmpsts[i >> 3] & ((1 >>> 0) << (i & 7))) == 0 >>> 0)
      for (var c = sqri; c <= lmti; c += p) // set true for composites...
          cmpsts[c >> 3] |= (1 >>> 0) << (c & 7); // masking in the bit
  }
  var bi = -1
  return function () { // return function to return successive primes per call...
    if (bi < 0) { ++bi; return 2 } // the only even prime is a special case
    while (bi <= lmti && (cmpsts[bi >> 3] & ((1 >>> 0) << (bi & 7))) != 0 >>> 0) ++bi;
    if (bi > lmti) return null; // return null following the last prime
    return (bi++ << 1) + 3; // else calculate the prime from the index
  };
}

// show it works...
var gen = soeOddPrimesTo(100);
var p = gen();
var output = [];
while (p != null) { output.push(p); p = gen(); }
console.log("Primes to 100 are:  " + output + ".");

var n = 1000000000; // set the range here...
var elpsd = -new Date();
gen = soeOddPrimesTo(n);
elpsd += +new Date();
var count = 0;
while (gen() != null) { count++; }
console.log("Found " + count + " primes up to " + n + " in " + elpsd + " milliseconds.");

The performance is now about 34.75 CPU clock cycles per cull operation for 1.0257 billion culling operations for a range of a billion for odds only (from Wikipedia), meaning that the reduction in time is almost entirely due to the reduction in number of culling operations, with the extra complexity of the “bit-twiddling” bit packing only taking about the same amount of extra time as the savings due to the reduction in memory use by a factor of sixteen. Thus, this version uses one sixteenth the memory and is about 2.5 times faster.

But we aren’t done yet, page segmentation can speed us up even more just as your source told you.

Chapter 3 – Page Segmentation Including Faster Primes Counting

So what is Page Segmentation applied to the SoE and what does it do for us?

Page Segmentation is dividing the work of sieving from one huge array sieved at once to a series of smaller pages that are sieved successively. It then requires a little more complexity in that there must be a separate stream of base primes available, which can be obtained by sieving recursively with an inner sieve generating the list of memoized base primes to be used by the main sieve. As well, the generation of output results is generally a little more complex as it involves successive scanning and reduction for each of the generated sieved pages.

Page Segmentation has many benefits as follows:

It further reduces memory requirements. To sieve to a billion with the previous odds-only code takes about 64 MegaBytes of RAM, but couldn’t use that algorithm to sieve beyond about a range of 16 to 32 billion (even if we wanted to wait quite a long time) due to using all the memory available to JavaScript. By sieving by a page size of (say) a square root of that amount, we could sieve to a range of that amount squared or beyond anything we would have time for which to wait. We also have to store the base primes up to the square root of the desired range, but that is only a few tens of MegaBytes to any range we would care to consider.
It increases memory access speed which has a direct impact on the number of CPU clack cycles per culling operation. If culling operations mostly occur within the CPU caches, memory access changes from tens of CPU clock cycles per access for main RAM memory (depending on the CPU and the quality and type of RAM) to about one clock cycle for the CPU L1 cache (smaller at about 16 KiloByte’s) to about eight clock cycles from the CPU L2 cache (larger for at least about 128 KiloByte’s) and we can work out culling algorithms so that we use all the caches to their best advantage, using the small fast L1 cache for the majority of the operations with small base prime values, the larger little-bit-slower L2 cache for middle sized base primes, and only use main RAM access for the vary few operations using large base primes for extremely huge ranges.
It opens the possibility of multi-threading by assigning the work of sieving each largish page to a different CPU core for a fairly coarse grained concurrency, although this doesn’t apply to JavaScript other than through the use of Web Workers (messy).

Other than the added complexity, Page Segmentation has one other problem to work around: unlike the “one huge array” sieve where start indexes are calculated easily once and then used for the entire array, segmented sieves require either that the start addresses be calculated by modulo (division) operations per prime per page (computationally expensive), or additional memory needs to be used to store the index reached per base prime per page so the start index doesn’t have to be recalculated, but this last technique precludes multi-threading as these arrays get out of sync. The best solution that will be used in the Ultimate version is to use a combination of these techniques, where several page segments are grouped to form a reasonably large work unit for threading so that these address calculations take an acceptably small portion of the total time, and the index storage tables are used for the base primes across these larger work units per thread so that the complex calculations need only be done once per larger work unit. Thus we get both the possibility of multi-threading and a reduced overhead. However, the following code doesn’t reduce this overhead, which costs about 10% to 20% when sieving to a billion.

In addition to Page Segmentation, the following code adds an efficient counting of the found primes by using an Counting Look Up Table (CLUT) population counting algorithm that counts 32 bits at a time so that the overhead of continuously finding the result of the number of found primes a a small percentage of the sieving time. If this were not done, enumerating the individual found primes just to determine how many there are would take at least as long as it takes to sieve over a given range. Similar fast routines can easily be developed to do such things as sum the found primes, find prime gaps, etc.

START_EDIT:

The following code adds one other speed-up: for smaller primes (where this optimization is effective), the code does a form of loop separation by recognizing that the culling operations follow a eight step pattern. This is because a byte has an even number of bits and we are culling by odd primes that will return to the same bit position in a byte every eight culls; this means that for each of the bit positions, we can simplify the inner culling loop to mask a constant bit, greatly simplifying the inner loop and making culling up to about two times faster due to each culling loop in the pattern not needing to do the “bit-twiddling” bit packing operations. This change saves about 35% of the execution time sieving to a billion. It can be disabled by changing the 64 to 0. It also sets the stage for the native code extreme unrolling of the eight loop due to this pattern that can increase the culling operation speed by another factor of about two when native code compilers are used.

A further minor modification makes the loop faster for the larger primes (greater than 8192) by using a Look Up Table (LUT) for the mask value rather than the left shift operation to save about half a CPU clock cycle per culling operation on average when culling a range of a billion; this saving will be increased slightly as ranges go up from a billion but isn’t that effective in JavaScript and has been removed.

END_EDIT

ANOTHER_EDIT:

As well as the above edits, we have removed the LUT BITMASK but now zero the Sieve Buffer by fast byte copying from a zero buffer of the same size, and added a Counting LUT population count technique for about a 10% overall gain in speed.

END_ANOTHER_EDIT

FINAL_EDIT:

Made the sieve buffer size about the CPU L2 cache size instead of the L1, as the culling loop speed will never be limited by the L2 cache access speed. This results in a slight increase in speed and a 64 times increase in range while maintaining efficiency.

END_FINAL_EDIT

// JavaScript implementation of Page Segmented Sieve of Eratosthenes...
// This takes almost no memory as it is bit-packed and odds-only,
// and only uses memory proportional to the square root of the range;
// it is also quite fast for large ranges due to imrproved cache associativity...

"use strict";

const PGSZBYTES = 16384 * 8;
const PGSZBITS = PGSZBYTES * 8;
const ZEROSPTRN = new Uint8Array(PGSZBYTES);

function soePages(bitsz) {
  let len = bitsz >> 3;
  let bpa = [];
  let buf =  new Uint8Array(len);
  let lowi = 0;
  let gen;
  return function () {
let nxt = 3 + ((lowi + bitsz) << 1); // just beyond the current page
buf.set(ZEROSPTRN.subarray(0,buf.length)); // clear sieve array!
if (lowi <= 0 && bitsz < 131072) { // special culling for first page as no base primes yet:
  for (let i = 0, p = 3, sqr = 9; sqr < nxt; ++i, p += 2, sqr = p * p)
    if ((buf[i >> 3] & (1 << (i & 7))) === 0)
      for (let j = (sqr - 3) >> 1; j < 131072; j += p)
        buf[j >> 3] |= 1 << (j & 7);
} else { // other than the first "zeroth" page:
  if (!bpa.length) { // if this is the first page after the zero one:
    gen = basePrimes(); // initialize separate base primes stream:
    bpa.push(gen()); // keep the next prime (3 in this case)
  }
  // get enough base primes for the page range...
  for (let p = bpa[bpa.length - 1], sqr = p * p; sqr < nxt;
       p = gen(), bpa.push(p), sqr = p * p);
  for (let i = 0; i < bpa.length; ++i) { // for each base prime in the array
    let p = bpa[i] >>> 0;
    let s = (p * p - 3) >>> 1; // compute the start index of the prime squared
    if (s >= lowi) // adjust start index based on page lower limit...
      s -= lowi;
    else { // for the case where this isn't the first prime squared instance
      let r = (lowi - s) % p;
      s = (r != 0) ? p - r : 0;
    }
    if (p <= 32) {
      for (let slmt = Math.min(bitsz, s + (p << 3)); s < slmt; s += p) {
        let msk = ((1 >>> 0) << (s & 7)) >>> 0;
        for (let c = s >>> 3, clmt = bitsz >= 131072 ? len : len; c < clmt | 0; c += p)
          buf[c] |= msk;
      }
    }
    else
      // inner tight composite culling loop for given prime number across page
      for (let slmt = bitsz >= 131072 ? bitsz : bitsz; s < slmt; s += p)
        buf[s >> 3] |=  ((1 >>> 0) << (s & 7)) >>> 0;
  }
}
let olowi = lowi;
lowi += bitsz;
return [olowi, buf];
  };
}

function basePrimes() {
  var bi = 0;
  var lowi;
  var buf;
  var len;
  var gen = soePages(256);
  return function () {
while (true) {
  if (bi < 1) {
    var pg = gen();
    lowi = pg[0];
    buf = pg[1];
    len = buf.length << 3;
  }
  //find next marker still with prime status
  while (bi < len && buf[bi >> 3] & ((1 >>> 0) << (bi & 7))) bi++;
  if (bi < len) // within buffer: output computed prime
    return 3 + ((lowi + bi++) << 1);
  // beyond buffer range: advance buffer
  bi = 0;
  lowi += len; // and recursively loop to make a new page buffer
}
  };
}

const CLUT = function () {
  let arr = new Uint8Array(65536);
  for (let i = 0; i < 65536; ++i) {
let nmbts = 0|0; let v = i;
while (v > 0) { ++nmbts; v &= (v - 1)|0; }
arr[i] = nmbts|0;
  }
  return arr;
}();

function countPage(bitlmt, sb) {
  let lst = bitlmt >> 5;
  let pg = new Uint32Array(sb.buffer);
  let cnt = (lst << 5) + 32;
  for (let i = 0 | 0; i < lst; ++i) {
let v = pg[i]; cnt -= CLUT[v & 0xFFFF]; cnt -= CLUT[v >>> 16];
  }
  var n = pg[lst] | (0xFFFFFFFE << (bitlmt & 31));
  cnt -= CLUT[n & 0xFFFF]; cnt -= CLUT[n >>> 16];
  return cnt;
}

function countSoEPrimesTo(limit) {
  if (limit < 3) {
if (limit < 2) return 0;
return 1;
  }
  var cnt = 1;
  var lmti = (limit - 3) >>> 1;
  var lowi;
  var buf;
  var len;
  var nxti;
  var gen = soePages(PGSZBITS);
  while (true) {
var pg = gen();
lowi = pg[0];
buf = pg[1];
len = buf.length << 3;
nxti = lowi + len;
if (nxti > lmti) {
  cnt += countPage(lmti - lowi, buf);
  break;
}
cnt += countPage(len - 1, buf);
  }
  return cnt;
}

var limit = 1000000000; // sieve to this limit...
var start = +new Date();
var answr = countSoEPrimesTo(limit);
var elpsd = +new Date() - start;
console.log("Found " + answr + " primes up to " + limit + " in " + elpsd + " milliseconds.");

As implemented here, that code takes about 13.0 CPU clock cycles per cull sieving to a billion. The extra about 20% saving by improving the address calculation algorithm is automatically improved when using the following maximum wheel factorization algorithm due to the effective page sized increasing by a factor of 105 so that this overhead becomes only a few percent and comparable to the few percent used for array filling and result counting.

Chapter 4 – Further Advanced Work to add Maximal Wheel Factorization

Now we consider the more extensive changes to use maximum wheel factorization (by not only 2 for “odds-only”, but also 3, 5, and 7 for a wheel that covers a span of 210 potential primes instead of a span of just 2) and also pre-culling on initialization of the small sieving arrays so that it is not necessary to cull by the following primes of 11, 13, 17, and 19. This reduces the number of composite number culling operations when using the page segmented sieve by a factor of about four to a range of a billion (as shown in the tables/calculated from the formulas in the Wikipedia article – “combo wheel”) and can be written so that it runs about four times as fast due to the reduced operations with each culling operation about the same speed as for the above code.

The way of doing the 210-span wheel factorization efficiently is to follow on the “odds-only” method: the current algorithm above can be thought of as sieving one bit-packed plane out of two as explained in the previous chapter where the other plane can be eliminated as it contains only the even numbers above two; for the 210-span we can define 48 bit-packed arrays of this size representing possible primes of 11 and above where all the other 162 planes contain numbers which are factors of two, three, five, or seven and thus don’t need to be considered. Each bit plane can than be culled just by repeated indexing in increments by the base prime just as the odd number plane was done with the multiplications automatically being taken care of by the structure just as for odds only. In this way it is just as efficient to sieve with less memory requirements (by 48/210 as compared to “odds-only” at 1/2) and with as much efficiency as for odds only, where one 48-plane “page” represents for a 16 Kilobytes = 131072 bits per plane size times 210 is a range of 27,525,120 numbers per sieve page segment, thus only almost 40 page segments to sieve to a billion (instead of almost four thousand as above), and therefore less overhead in start address calculation per base prime per page segment for a further gain in efficiency.

Although the extended code described above is a couple of hundred lines and long to post here, it can count the number of primes to a billion in about two seconds on my low end Intel 1.92 Gigahertz CPU using the Google V8 JavaScript engine, which is about five times slower than the same algorithm run in native code.

Although the above code is quite efficient up to a range of about 16 billion, other improvements can help maintain the efficiency to even larger ranges of several tens of thousands of billions such as 1e14 or more. We accomplish this by adjusting the effective page sizes upward so that they are never smaller than the square root of the full range being sieved, yet sieve them incrementally by 16 KiloByte chunks for small primes, 128 KiloByte chunks for medium primes, and only sieve the a huge array as per our baseline implementation for the very few culling operations used for the largest base prime sizes. In this way, our clocks per cull doesn’t increase more than a small factor of up to about two for the largest range we would likely consider.

As this answer is close to the limited size of 30,000 characters, further discussion on Maximal Wheel Factorization is continued on my followup Chapter 4.5a and (future) Chapter 4.5b answers for actual implementations of the techniques described above.

Chapter 5 – What JavaScript (and other VM Languages) Can’t Do

For JavaScript and other Virtual Machine languages, the minimum culling loop time is in the order of ten CPU cycles per culling loop and that is not likely to change much. This is about three to four times slower than the about three CPU clock cycles that can easily be achieved with languages that compile directly to efficient machine code using the same algorithm such as C/C++, Nim, Rust, Free Pascal, Haskell, Julia, etc.

In addition, there are extreme loop unrolling techniques that can be used with at least some of those languages that can produce a reduction in average culling operation cycles by a further factor of about two that is denied to JavaScript.

Multi-threading can reduce the execution time by about the factor of effective CPU cores used, but with JavaScript the only way to get this is though the use of Web Workers and that is messy to synchronize. On my machine, I have four cores but only gain a factor of three in speed due the the CPU clock rate being reduced to three quarters with all cores active; this factor of three is not easy to gain for JavaScript.

So this is about the state-of-the-art using JavaScript and other current VM languages have about the same limitation other than they can easily use multi-threading, with the combination of the above factors meaning native code compilers can be about twenty times faster than JavaScript (including multi-threading, and even more with new CPU’s with huge numbers of cores).

However, I believe that the future of Web programming in three to five years will be Web Assembly, and that has the potential to overcome all of these limitations. It is very near to supporting multi-threading now, and although currently only about 30% faster than JavaScript for this algorithm on Chrome, it is up to only a little slower than native code on some current browsers when compiled from some languages using some Web Assembly compilers. It is still early days of development for efficient compilers to Web Assembly and for efficient browser compilation to native code, but as Web Assembly is closer to native code than most VM’s, it could easily be improved to produce native code that is as fast or nearly as fast as code from those other notive-code compiling languages.

However, other than for compilation of JavaScript libraries and Frameworks into Web Assembly, I don’t believe the future of the Web will be JavaScript to Web Assembly compilers, but rather compilation from some other language. My favourite picks of the future of Web programming is F# with perhaps the Fable implementation converted to produce Web Assembly rather than JavaScript (asm.js) or Nim. There is even some possibility that Web Assembly could be produced that supports and shows the advantage of the extreme loop unrolling for very close to the fastest of known Page Segmented SoE speeds.

Conclusion

We have built a Page Segmented Sieve of Eratosthenes in JavaScript that is suitable for sieving large ranges in the billions, and have a means to further extend this work. The resulting code has about ten times less culling operations (when fully wheel factorized) and the culling operations are about three times faster meaning the code per given (large) range is about 30 times faster while the reduced memory use means that one can sieve to ranges of the 53 bit mantissa of about 9e15 in something of the order of a year (just leave a Chrome browser tab open and back up the power supply).

Although there are some other minor tweaks possible, this is about the state of the art in sieving primes using JavaScript: while it isn’t a fast as native code for the given reasons, it is fast enough to find the number of primes to 1e14 in a day or two (even on a mid range smartphone) by leaving a browser tab open for the required computation time; this es quite amazing as the number of primes in this range wasn’t known until 1985 and then by using numerical analysis techniques and not by using a sieve as the computers of that day weren’t fast enough using the fastest coding techniques to do it in a reasonable and economic amount of time. Although we can do this just in just a few hours using these algorithms for the best of the native code compilers on modern desktop computers, now we can do it in an acceptable time with JavaScript on a smart phone!