C#: How to make Sieve of Atkin incremental

Question

Here is a solution to unbounded prime sieving in C#, which can be implemented using the Sieve of Eratosthenes (SoE) or the Sieve of Atkin (SoA) algorithms; however, I maintain that it is hardly worth the extreme complexity of an optimized SoA solution given than the true SoE gives about the same performance without as much complexity. Thus, this is perhaps only a partial answer in that, while it shows how to implement a better SoA algorithm and shows how to implement an unbounded sequence using SoE, it only hints at how to combine these to write a reasonably efficient SoA.

Note that if only discussion on the very fastest implementations of these ideas is desired, jump to the bottom of this answer.

First we should comment on the point of this exercise in producing an unbounded sequence of primes to permit using the IEnumerable extension methods such as Take(), TakeWhile(), Where(), Count(), etcetera, as these methods give away some performance due to the added levels of method calling, adding at least 28 machine cycles for every call and return to enumerate one value and adding several levels of method calls per function; that said, having an (effectively) infinite sequence is still useful even if one uses more imperative filtering techniques on the results of the enumerations for better speed.

Note, in the the simpler examples following, I have limited the range to that of unsigned 32-bit numbers (uint) as much past that range the basic SoE or SoA implementations aren’t really appropriate as to efficiency and one needs to switch over to a “bucket” or other form of incremental sieve for part of the sieving for efficiency; however, for a fully optimized page segmented sieve as in the fastest implementation, the range is increased to the 64-bit range although as written one likely would not want to sieve past about a hundred trillion (ten to the fourteenth power) as run time would take up to hundreds of years for the full range.

As the question chooses the SoA for probably the wrong reasons in first mistaking a Trial Division (TD) prime sieve for a true SoE and then using a naive SoA over the TD sieve, let’s establish the true bounded SoE which can be implemented in a few lines as a method (which could be converted to a class as per the question’s implementation coding style) as follows:

static IEnumerable<uint> primesSoE(uint top_number) {
  if (top_number < 2u) yield break;
  yield return 2u; if (top_number < 3u) yield break;
  var BFLMT = (top_number - 3u) / 2u;
  var SQRTLMT = ((uint)(Math.Sqrt((double)top_number)) - 3u) / 2u;
  var buf = new BitArray((int)BFLMT + 1,true);
  for (var i = 0u; i <= BFLMT; ++i) if (buf[(int)i]) {
      var p = 3u + i + i; if (i <= SQRTLMT) {
        for (var j = (p * p - 3u) / 2u; j <= BFLMT; j += p)
          buf[(int)j] = false; } yield return p; } }

This calculates the primes to 2 million in about 16 milliseconds on an i7-2700K (3.5 GHz) and the 203,280,221 primes up to 4,294,967,291 in the 32-bit number range in about 67 seconds on the same machine (given a spare 256 MegaBytes of RAM memory).

Now, using the version above to compare to the SoA is hardly fair as the true SoA automatically ignores the multiples of 2, 3, and 5 so introducing wheel factorization to do the same for the SoE yields the following code:

static IEnumerable<uint> primesSoE(uint top_number) {
  if (top_number < 2u) yield break;
  yield return 2u; if (top_number < 3u) yield break;
  yield return 3u; if (top_number < 5u) yield break;
  yield return 5u; if (top_number < 7u) yield break;
  var BFLMT = (top_number - 7u) / 2u;
  var SQRTLMT = ((uint)(Math.Sqrt((double)top_number)) - 7u) / 2u;
  var buf = new BitArray((int)BFLMT + 1,true);
  byte[] WHLPTRN = { 2, 1, 2, 1, 2, 3, 1, 3 };
  for (uint i = 0u, w = 0u; i <= BFLMT; i += WHLPTRN[w], w = (w < 7u) ? ++w : 0u)
    if (buf[(int)i]) { var p = 7u + i + i; if (i <= SQRTLMT) {
        var pX2 = p + p; uint[] pa = { p, pX2, pX2 + p };
        for (uint j = (p * p - 7u) / 2u, m = w; j <= BFLMT;
                               j += pa[WHLPTRN[m] - 1u], m = (m < 7u) ? ++m : 0u)
          buf[(int)j] = false; } yield return p; } }

The above code calculates the primes to 2 million in about 10 milliseconds and the primes to the 32-bit number range in about 40 seconds on the same machine as above.

Next, let’s establish whether a version of the SoA that we are likely to write here actually has any benefit as compared to the SoE as per the above code as far as execution speed goes. The code below follows the model of the SoE above and optimizes the SoA pseudo-code from the Wikipedia article as to tuning the ranges of the ‘x’ and ‘y’ variables using separate loops for the individual quadratic solutions as suggested in that article, solving the quadratic equations (and the square free eliminations) only for odd solutions, combining the “3*x^2” quadratic to solve for both the positive and negative second terms in the same pass, and eliminating the computationally expensive modulo operations, to produce code that is over three times faster than the naive version of the pseudo-code posted there and as used in the question here. The code for the bounded SoA is as then follows:

static IEnumerable<uint> primesSoA(uint top_number) {
  if (top_number < 2u) yield break;
  yield return 2u; if (top_number < 3u) yield break;
  yield return 3u; if (top_number < 5u) yield break;
  var BFLMT = (top_number - 3u) / 2u; var buf = new BitArray((int)BFLMT + 1, false);
  var SQRT = (uint)(Math.Sqrt((double)top_number)); var SQRTLMT = (SQRT - 3u) / 2u;
  for (uint x = 1u, s = 1u, mdx12 = 5u, dmdx12 = 0u; s <= BFLMT; ++x, s += ((x << 1) - 1u) << 1) {
    for (uint y = 1u, n = s, md12 = mdx12, dmd12 = 8u; n <= BFLMT; y += 2, n += (y - 1u) << 1) {
      if ((md12 == 1) || (md12 == 5)) buf[(int)n] = buf[(int)n] ^ true;
      md12 += dmd12; if (md12 >= 12) md12 -= 12; dmd12 += 8u; if (dmd12 >= 12u) dmd12 -= 12u; }
    mdx12 += dmdx12; if (mdx12 >= 12u) mdx12 -= 12u; dmdx12 += 8u; if (dmdx12 >= 12u) dmdx12 -= 12u; }
  for (uint x = 1u, s = 0u, mdx12 = 3u, dmdx12 = 8u; s <= BFLMT; ++x, s += x << 1) {
    int y = 1 - (int)x, n = (int)s, md12 = (int)mdx12, dmd12 = ((-y - 1) << 2) % 12;
    for (; (y < 0) && (uint)n <= BFLMT; y += 2, n += (-y + 1) << 1) {
      if (md12 == 11) buf[(int)n] = buf[(int)n] ^ true;
      md12 += dmd12; if (md12 >= 12) md12 -= 12; dmd12 += 4; if (dmd12 >= 12) dmd12 -= 12; }
    if (y < 1) { y = 2; n += 2; md12 += 4; dmd12 = 0; } else { n += 1; md12 += 2; dmd12 = 8; }
    if (md12 >= 12) md12 -= 12; for (; (uint)n <= BFLMT; y += 2, n += (y - 1) << 1) {
      if (md12 == 7) buf[(int)n] = buf[(int)n] ^ true;
      md12 += dmd12; if (md12 >= 12) md12 -= 12; dmd12 += 8; if (dmd12 >= 12) dmd12 -= 12; }
    mdx12 += dmdx12; if (mdx12 >= 12) mdx12 -= 12; dmdx12 += 4; if (dmdx12 >= 12) dmdx12 -= 12; }
  for (var i = 0u; i<=BFLMT; ++i) if (buf[(int)i]) { var p = 3u+i+i; if (i<=SQRTLMT) { var sqr = p*p;
        for (var j = (sqr - 3ul) / 2ul; j <= BFLMT; j += sqr) buf[(int)j] = false; } yield return p; } }

This is still over twice as slow as the wheel factorization SoE algorithm posted due to the not fully optimized number of operations. Further optimizations can be made to the SoA code as in using modulo 60 operations as for the original (non-pseudo-code) algorithm and using bit packing to only deal with multiples excluding 3 and 5 to get the code fairly close in performance to SoE or even exceed it slightly, but we get closer and closer to the complexity of the Berstein implementation to achieve this performance. My point is “Why SoA, when one works very hard to get about the same performance as SoE?”.

Now for the unbounded primes sequence, the very easiest way to do this is to define a const top_number of Uint32.MaxValue and eliminate the argument in the primesSoE or primesSoA methods. This is somewhat inefficient in that the culling is still done over the full number range no matter the actual prime value being processed, which makes the determination for small ranges of primes take much longer than it should. There are also other reasons to go to a segmented version of the primes sieve other than this and extreme memory use: First, algorithms that use data that is primarily within the CPU L1 or L2 data caches process faster because of more efficient memory access, and secondly because segmentation allows the job to be easily split into pages that can be culled in the background using multi-core processors for a speed-up that can be proportional to the number of cores used.

For efficiency, we should choose an array size of the CPU L1 or L2 cache size which is at least 16 Kilobytes for most modern CPU’s in order to avoid cache thrashing as we cull composites of primes from the array, meaning that the BitArray can have a size of eight times that large (8 bits per byte) or 128 Kilobits. Since we only need to sieve odd primes, this represents a range of numbers of over a quarter million, meaning that a segmented version will only use eight segments to sieve to 2 million as required by Euler Problem 10. One could save the results from the last segment and continue on from that point, but that would preclude adapting this code to the multi-core case where one relegates culling to the background on multiple threads to take full advantage of multi-core processors. The C# code for an (single thread) unbounded SoE is as follows:

static IEnumerable<uint> primesSoE() { yield return 2u; yield return 3u; yield return 5u;
  const uint L1CACHEPOW = 14u + 3u, L1CACHESZ = (1u << (int)L1CACHEPOW); //for 16K in bits...
  const uint BUFSZ = L1CACHESZ / 15u * 15u; //an even number of wheel rotations
  var buf = new BitArray((int)BUFSZ);
  const uint MAXNDX = (uint.MaxValue - 7u) / 2u; //need maximum for number range
  var SQRTNDX = ((uint)Math.Sqrt(uint.MaxValue) - 7u) / 2u;
  byte[] WHLPTRN = { 2, 1, 2, 1, 2, 3, 1, 3 }; //the 2,3,5 factorial wheel, (sum) 15 elements long
  byte[] WHLPOS = { 0, 2, 3, 5, 6, 8, 11, 12 }; //get wheel position from index
  byte[] WHLNDX = { 0, 0, 1, 2, 2, 3, 4, 4, 5, 5, 5, 6, 7, 7, 7, //get index from position
                    0, 0, 1, 2, 2, 3, 4, 4, 5, 5, 5, 6, 7 }; //allow for overflow
  byte[] WHLRNDUP = { 0, 2, 2, 3, 5, 5, 6, 8, 8, 11, 11, 11, 12, 15, //allow for overflow...
                      15, 15, 17, 17, 18, 20, 20, 21, 23, 23, 26, 26, 26, 27 };
  uint BPLMT = (ushort.MaxValue - 7u) / 2u; var bpbuf = new BitArray((int)BPLMT + 1, true);
  for (var i = 0; i <= 124; ++i) if (bpbuf[i]) { var p = 7 + i + i; //initialize baseprimes array
      for (var j = (p * p - 7) / 2; j <= BPLMT; j += p) bpbuf[j] = false; } var pa = new uint[3];
  for (uint i = 0u, w = 0, si = 0; i <= MAXNDX;
        i += WHLPTRN[w], si += WHLPTRN[w], si = (si >= BUFSZ) ? 0u : si, w = (w < 7u) ? ++w : 0u) {
    if (si == 0) { buf.SetAll(true);
      for (uint j = 0u, bw = 0u; j <= BPLMT; j += WHLPTRN[bw], bw = (bw < 7u) ? ++bw : 0u)
        if (bpbuf[(int)j]) { var p = 7u+j+j; var pX2=p+p; var k = p * (j + 3u) + j;
          if (k >= i + BUFSZ) break; pa[0] = p; pa[1] = pX2; pa[2] = pX2 + p; var sw = bw; if (k < i) {
            k = (i - k) % (15u * p); if (k != 0) { var os = WHLPOS[bw]; sw = os + ((k + p - 1u) / p);
              sw = WHLRNDUP[sw]; k = (sw - os) * p - k; sw = WHLNDX[sw]; } } else k -= i;
          for (; k<BUFSZ; k+=pa[WHLPTRN[sw]-1], sw=(sw<7u) ? ++sw : 0u) buf[(int)k]=false; } }
    if (buf[(int)si]) yield return 7u + i + i; } }

The above code takes about 16 milliseconds to sieve the primes up to two million and about 30 seconds to sieve the full 32-bit number range. This code is faster than the similar version using the factorial wheel without segmentation for large number ranges because, even though the code is more complex as to computations, a great deal of time is saved in not thrashing the CPU caches. Much of the complexity is in the computation of the new starting indexes per base prime per segment, which could have been reduced by saving the state of each per prime per segment but the above code can easily be converted so as to run the culling processes on separate background threads for a further four times speed up for a machine with four cores, even more with eight cores. Not using the BitArray class and addressing the individual bit locations by bit masks would provide a further speed up by about a factor of two, and further increases of the factorial wheel would provide about another reduction in time to about two thirds. Better packing of the bit array in eliminated indexes for values eliminated by the wheel factorization would slightly increase efficiency for large ranges, but would also add somewhat to the complexity of the bit manipulations. However, all of these SoE optimizations would not approach the extreme complexity of the Berstein SoA implementation, yet would run at about the same speed.

To convert the above code from SoE to SoA, we only need to change to the SoA culling code from the bounded case but with the modification that the starting addresses are recalculated for every page segment something like as the starting address is calculated for the SoE case, but even somewhat more complex in operation as the quadratics advance by squares of numbers rather than by simple multiples of primes. I’ll leave the required adaptations to the student, although I don’t really see the point given that the SoA with reasonable optimizations is not faster than the current version of the SoE and is considerably more complex 😉

EDIT_ADD:

Note: The below code has been corrected, as while the original posted code was correct as to the provided primes sequence, it was half again slower than it needed to be as it culled for all numbers below the square root of the range rather than only the found base primes up to the square root of the range.

The most effective and simplest optimization is relegating the culling operations per segment page to background threads so that, given enough CPU cores, the main limit in enumerating the primes is the enumeration loop itself, in which case all the primes in the 32-bit number range can be enumerated in about ten seconds on the above reference machine (in C#) without other optimizations, with all other optimizations including writing the IEnumerable interface implementations rather than using the built-in iterators reducing this to about six seconds for all the 203,280,221 primes in the 32-bit number range (1.65 seconds to one billion), again with most of the time spent just enumerating the results. The following code has many of those modifications including using background tasks from the Dotnet Framework 4 ThreadPool used by Tasks, using a state machine implemented as a look up table to implement further bit packing to make the primes enumeration quicker, and re-writing the algorithm as an enumerable class using “roll-your-own” iterators for increased efficiency:

class fastprimesSoE : IEnumerable<uint>, IEnumerable {
  struct procspc { public Task tsk; public uint[] buf; }
  struct wst { public byte msk; public byte mlt; public byte xtr; public byte nxt; }
  static readonly uint NUMPROCS = (uint)Environment.ProcessorCount + 1u; const uint CHNKSZ = 1u;
  const uint L1CACHEPOW = 14u, L1CACHESZ = (1u << (int)L1CACHEPOW), PGSZ = L1CACHESZ >> 2; //for 16K in bytes...
  const uint BUFSZ = CHNKSZ * PGSZ; //number of uints even number of caches in chunk
  const uint BUFSZBTS = 15u * BUFSZ << 2; //even in wheel rotations and uints (and chunks)
  static readonly byte[] WHLPTRN = { 2, 1, 2, 1, 2, 3, 1, 3 }; //the 2,3,5 factorial wheel, (sum) 15 elements long
  static readonly byte[] WHLPOS = { 0, 2, 3, 5, 6, 8, 11, 12 }; //get wheel position from index
  static readonly byte[] WHLNDX = { 0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 6, 6, 7, 0, 0, 0 }; //get index from position
  static readonly byte[] WHLRNDUP = { 0, 2, 2, 3, 5, 5, 6, 8, 8, 11, 11, 11, 12, 15, 15, 15, //allow for overflow...
                                      17, 17, 18, 20, 20, 21, 23, 23, 26, 26, 26, 27, 30, 30, 30 }; //round multiplier up
  const uint BPLMT = (ushort.MaxValue - 7u) / 2u; const uint BPSZ = BPLMT / 60u + 1u;
  static readonly uint[] bpbuf = new uint[BPSZ]; static readonly wst[] WHLST = new wst[64];
  static void cullpg(uint i, uint[] b, int strt, int cnt) { Array.Clear(b, strt, cnt);
    for (uint j = 0u, wp = 0, bw = 0x31321212u, bi = 0u, v = 0xc0881000u, bm = 1u; j <= BPLMT;
      j += bw & 0xF, wp = wp < 12 ? wp += bw & 0xF : 0, bw = (bw > 3u) ? bw >>= 4 : 0x31321212u) {
      var p = 7u + j + j; var k = p * (j + 3u) + j; if (k >= (i + (uint)cnt * 60u)) break;
      if ((v & bm) == 0u) { if (k < i) { k = (i - k) % (15u * p); if (k != 0) {
            var sw = wp + ((k + p - 1u) / p); sw = WHLRNDUP[sw]; k = (sw - wp) * p - k; } }
        else k -= i; var pd = p / 15;
        for (uint l = k / 15u + (uint)strt * 4u, lw = ((uint)WHLNDX[wp] << 3) + WHLNDX[k % 15u];
               l < (uint)(strt + cnt) * 4u; ) { var st = WHLST[lw];
          b[l >> 2] |= (uint)st.msk << (int)((l & 3) << 3); l += st.mlt * pd + st.xtr; lw = st.nxt; } }
      if ((bm <<= 1) == 0u) { v = bpbuf[++bi]; bm = 1u; } } }
  static fastprimesSoE() {
    for (var x = 0; x < 8; ++x) { var p = 7 + 2 * WHLPOS[x]; var pr = p % 15;
      for (int y = 0, i = ((p * p - 7) / 2); y < 8; ++y) { var m = WHLPTRN[(x + y) % 8]; i %= 15;
        var n = WHLNDX[i]; i += m * pr; WHLST[x * 8 + n] = new wst { msk = (byte)(1 << n), mlt = m,
                                                                     xtr = (byte)(i / 15),
                                                                     nxt = (byte)(8 * x + WHLNDX[i % 15]) }; }
    } cullpg(0u, bpbuf, 0, bpbuf.Length);  } //init baseprimes
  class nmrtr : IEnumerator<uint>, IEnumerator, IDisposable {
    procspc[] ps = new procspc[NUMPROCS]; uint[] buf;
    Task dlycullpg(uint i, uint[] buf) {  return Task.Factory.StartNew(() => {
        for (var c = 0u; c < CHNKSZ; ++c) cullpg(i + c * PGSZ * 60, buf, (int)(c * PGSZ), (int)PGSZ); }); }
    public nmrtr() {
      for (var i = 0u; i < NUMPROCS; ++i) ps[i] = new procspc { buf = new uint[BUFSZ] };
      for (var i = 1u; i < NUMPROCS; ++i) { ps[i].tsk = dlycullpg((i - 1u) * BUFSZBTS, ps[i].buf); } buf = ps[0].buf;  }
    public uint Current { get { return this._curr; } } object IEnumerator.Current { get { return this._curr; } }
    uint _curr; int b = -4; uint i = 0, w = 0; uint v, msk = 0;
    public bool MoveNext() {
      if (b < 0) if (b == -1) { _curr = 7; b += (int)BUFSZ; }
        else { if (b++ == -4) this._curr = 2u; else this._curr = 7u + ((uint)b << 1); return true; }
      do {  i += w & 0xF; if ((w >>= 4) == 0) w = 0x31321212u; if ((this.msk <<= 1) == 0) {
          if (++b >= BUFSZ) { b = 0; for (var prc = 0; prc < NUMPROCS - 1; ++prc) ps[prc] = ps[prc + 1];
            ps[NUMPROCS - 1u].buf = buf; var low = i + (NUMPROCS - 1u) * BUFSZBTS;
            ps[NUMPROCS - 1u].tsk = dlycullpg(i + (NUMPROCS - 1u) * BUFSZBTS, buf);
            ps[0].tsk.Wait(); buf = ps[0].buf; } v = buf[b]; this.msk = 1; } }
      while ((v & msk) != 0u); if (_curr > (_curr = 7u + i + i)) return false; else return true;  }
    public void Reset() { throw new Exception("Primes enumeration reset not implemented!!!"); }
    public void Dispose() { }
  }
  public IEnumerator<uint> GetEnumerator() { return new nmrtr(); }
  IEnumerator IEnumerable.GetEnumerator() { return new nmrtr(); } }

Note that this code isn’t faster than the last version for small ranges of primes as in up to one or two million due to the overhead of setting up and initializing arrays but is considerably faster for larger ranges up to four billion plus. It is about 8 times faster than the question’s implementation of the Atkin sieve, but goes to from 20 to 50 times as fast for larger ranges up to four billion plus. There are defined constants in the code setting the base cache size and how many of these to cull per thread (CHNKSZ) that may slightly tweak the performance. Some slight further optimizations could be tried that might reduce the execution time for large primes by up to a factor of two such as further bit packing as for the 2,3,5,7 wheel rather than just the 2,3,5 wheel for a reduction of about 25% in packing (perhaps cutting the execution time to two thirds) and pre-culling the page segments by the wheel factorial up to the factor of 17 for a further reduction of about 20%, but these are about the extent of what can be done in pure C# code as compared to C code which can take advantage of unique native code optimizations.

At any rate, it is hardly worth further optimizations if one intends to use the IEnumberable interface for output as the question desires as about two thirds of the time is used just enumerating the found primes and only about one third of the time is spent in culling the composite numbers. A better approach would be to write methods on the class to implement the desired results such as CountTo, ElementAt, SumTo, etcetera so as to avoid enumeration entirely.

But I did do the extra optimizations as an additional answer which shows additional benefits in spite of additional complexity, and which further shows why one doesn’t want to use the SoA because a fully optimized SoE is actually better.

More Related Contents:

Leave a Comment Cancel reply