The Sieve of Atkin

The Sieve of Atkin pseudo code from the Wikipedia article you’ve quoted contains the answers to your questions, as does the discussion about the article for which Will Ness has provided a link, although you may not be able to put the information together. Short answers are as follows:

  1. The three equations come from Atkin’s mathematical proof that all
    primes will occur as the solutions of one or more of those three
    equations, with the appropriate modulo conditions, for all valid
    values of the variables ‘x’ and ‘y’. In fact, he further proved
    that the true primes are those numbers that have an odd number of
    solutions to those three equations (left as True when toggled an odd
    number of times from an initial condition of False) under the modulo
    conditions, excluding those numbers which are divisible by the
    squares of found primes less than or equal to the square root of the
    tested number.

  2. The true Sieve of Atkin is based on a set of modulo 60 conditions;
    this pseudo code represents a simplification in which there are fewer
    ranges of conditions for each equation, using a set of modulo 12
    conditions (12 × 5 = 60) – however, this results in about 20%
    extra work being done because of redundancy introduced by this new
    set of conditions. It is also the reason that this simplified
    pseudo code needs to start its prime scan at 5 rather than the
    normal 7, and to do the base prime squares-free eliminations
    starting at a base prime of 5, at an added cost in execution time,
    as the factors of 5 are not otherwise handled properly. The reason
    for this simplification is perhaps to sacrifice some extra code
    overhead in order to ease the code complexity of having to check
    values, although that checking can be done with a single table
    look-up, whereas the over 30% extra work due to using modulo 12
    cannot be reduced.

  3. The “reminders” (which should be spelled remainders) are found by the
    ‘mod’ operators in the pseudo code you quoted, which stand for the
    modulo operation in whatever computer language one might use, often
    represented by the “%” symbol in computer languages such as C, Java,
    C#, F#, et cetera. This operator yields the integer remainder
    after an integer division of the dividend (the first of the numbers,
    before the ‘mod’) by the divisor (the second of the numbers, after
    the ‘mod’ symbol). The reason that the remainders are just four
    rather than the 16 as used in the full modulo 60 algorithm is due to
    the simplifications of reducing to a modulo 12 algorithm. You’ll notice
    that with the modulo 12 conditions, the “4x²+y²” condition passes 25,
    which would normally be eliminated by the modulo 60 conditions; thus
    many composites containing 25 as a factor need to be eliminated by the
    additional prime-of-5 squares-free check. Similarly, 55 passes the
    “3x²+y²” check and 35 passes the “3x²−y²” check, where they
    wouldn’t for the full modulo 60 algorithm.
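These leaks are easy to see by expanding each modulo 12 class into its five modulo 60 classes. The short Python sketch below (the set names are my own; the modulo 12 residue sets are those of the quoted pseudo code) shows exactly which residues slip through:

```python
# Expand the simplified modulo-12 residue tests into modulo-60 classes and
# compare with the residue sets of the full modulo-60 Sieve of Atkin.
mod60_sets = {
    "4x^2+y^2": {1, 13, 17, 29, 37, 41, 49, 53},
    "3x^2+y^2": {7, 19, 31, 43},
    "3x^2-y^2": {11, 23, 47, 59},
}
mod12_sets = {"4x^2+y^2": {1, 5}, "3x^2+y^2": {7}, "3x^2-y^2": {11}}

extras = {}
for form, m12 in mod12_sets.items():
    expanded = {r for r in range(60) if r % 12 in m12}  # five mod-60 classes each
    extras[form] = sorted(expanded - mod60_sets[form])
    print(form, extras[form])
# 4x^2+y^2 [5, 25]
# 3x^2+y^2 [55]
# 3x^2-y^2 [35]
```

Every residue that slips through (5, 25, 35, 55) is divisible by 5, which is exactly why the simplified version must start its scan at 5 and also sieve with the square of the prime 5.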

As discussed in the Wikipedia article’s “Talk” section and hinted above, this quoted pseudo code is never much faster than even basic odds-only Sieve of Eratosthenes (SoE) implementations, let alone one that uses the same degree of wheel factorization, due to its inefficiencies: the variables ‘x’ and ‘y’ do not need to range over the full range up to the square root of the range sieved for many cases (mentioned in the article); the proper use of the modulo 60 wheel recovers the redundancies in the use of the modulo 12 simplification (as I mention above); and there are modulo lattice patterns in the solutions to the quadratic equations such that the conditions using the (computationally slow) modulo operations need not be tested at all, by using an algorithm that automatically skips those solutions that would not satisfy the modulo conditions according to the lattice patterns (only mentioned very obscurely in the full Atkin and Bernstein paper).
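Atkin’s odd-number-of-solutions criterion from point 1 above can be checked by brute force. This Python sketch (the function name and sample values are my own) counts representations of a number by the quadratic form that its modulo 60 residue selects:

```python
def count_reps(n):
    """Count representations (x, y > 0) of n by the quadratic form that
    Atkin's theorem assigns to n's residue class modulo 60."""
    r = n % 60
    if r in {1, 13, 17, 29, 37, 41, 49, 53}:        # n = 4x^2 + y^2
        return sum(1 for x in range(1, n) for y in range(1, n)
                   if 4 * x * x + y * y == n)
    if r in {7, 19, 31, 43}:                        # n = 3x^2 + y^2
        return sum(1 for x in range(1, n) for y in range(1, n)
                   if 3 * x * x + y * y == n)
    if r in {11, 23, 47, 59}:                       # n = 3x^2 - y^2, x > y
        return sum(1 for x in range(1, n) for y in range(1, x)
                   if 3 * x * x - y * y == n)
    return 0                                        # n shares a factor with 60

print(count_reps(13))    # 1 (odd)  -> 13 is prime
print(count_reps(91))    # 2 (even) -> 91 = 7 x 13 stays composite
print(count_reps(169))   # 1 (odd), but 169 = 13^2 is not squarefree, which
                         # is why the squares-of-found-primes pass is needed
```

The 169 case shows that the toggling alone is not enough: an odd solution count only certifies primality for squarefree candidates, hence the separate square-elimination loop.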

To answer a question you didn’t ask but should have: “Why use the Sieve of Atkin?”.

The main reason that the Sieve of Atkin (SoA) is implemented rather than the Sieve of Eratosthenes (SoE) is that it is “Internet Knowledge” that the SoA is faster. There are two reasons for this belief, as follows:

  1. The SoA is assumed to be faster because the asymptotic computational
    complexity is less for it than for the SoE by a factor of log(log n)
    where n is the range of primes sieved. Practically speaking, going
    from a range of two to the power 32 (four billion plus) to two to
    the power 64 (about 2 followed by 19 zeros), this factor is six over
    five equals 1.2. That means that the true SoE should take 1.2 times
    as long as expected by just a linear ratio when sieving to the
    64-bit number range as compared to the 32-bit number range, whereas
    the SoA will have a linear relationship if all were ideal. You
    can appreciate that, first, this is a very small factor for such a
    huge difference in number ranges and, second, that this benefit only
    holds if the two algorithms had the same or close to the same
    performance at some reasonable range of primes.

    What isn’t clearly understood by that “Internet knowledge” is that
    these figures only apply when one is comparing the ratio of
    performance over a given range as compared to another given range
    for the same algorithm, not as a comparison between different
    algorithms. Thus it is useless as a proof that the SoA will be
    faster than the SoE since the SoA could start with a larger overhead
    for a given sieve range of a particular SoE algorithm as in the
    following optimized SoE example.
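For concreteness, the factor quoted above can be computed directly, using base-2 logarithms as the six-over-five figure does:

```python
import math

# The log(log n) factor: log2(log2 2^64) = log2(64) = 6 and
# log2(log2 2^32) = log2(32) = 5, so the ratio is 6/5 = 1.2.
factor = math.log2(math.log2(2.0 ** 64)) / math.log2(math.log2(2.0 ** 32))
print(factor)   # 1.2
```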

  2. The SoA is believed to be faster because of the computational
    comparison made and published by Atkin and Bernstein as per their
    paper linked in the Wikipedia article. While the work is accurate,
    it only applies to the artificial limitations that they made on the
    SoE algorithm they compared: Since the SoA algorithm is based on
    modulo 60 factorization which represents two 2,3,5 factorization
    wheel rotations, they also limited the SoE algorithm to that same
    wheel factorization. Doing this, the SoE does about 424 million
    composite number cull operations over the one-billion range tested,
    whereas a fully optimized SoA as tested has about 326 million combined
    toggle and square-free cull operations, each of which takes about the
    same time as a composite number cull operation in the SoE due to
    being written in the same style. This means that the SoA is about
    30% faster than the SoE for this particular set of wheel
    factorization conditions, and that is just about exactly what the
    Atkin and Bernstein comparison test showed.

    However, the SoA does not respond to further levels of wheel
    factorization as the 2,3,5 level is “baked-in” to the algorithm,
    whereas the SoE does respond to further levels: using a 2,3,5,7
    wheel factorization results in about the same number of operations
    meaning about the same performance for each. One can even use a
    partial higher level of wheel factorization beyond the 2,3,5,7 level
    to get the number of operations for the SoE about 16.7% less than
    for the SoA, for proportionally better performance. The optimizations required
    to implement these extra levels of wheel factorization are actually
    simpler than the code complexity to implement the original optimized
    SoA. The memory footprint for implementing these comparable page
    segmented implementations is about the same, being the size of the
    page buffers plus the base primes array.

    In addition, both would benefit from being written in a “machine
    state look-up” style, which would help in better optimization using
    inlined code and extreme loop unrolling, but the SoE benefits more
    from this style than the SoA due to being a simpler algorithm.

So for sieve ranges up to about the 32-bit number range, the maximally optimized SoE is about 20% faster (or more with further wheel factorization) than the SoA; however, the SoA has the asymptotic computational complexity advantage, so there will be some point at which it catches up. That point will be at about the range where the ratio of log(log n) to log(log 2^32) (which is 5) equals 1.2, which is a range of about 2 times ten to the nineteenth power, an exceedingly large number. If an optimized SoA run on a modern desktop computer were to take about a third of a second to compute the primes in the 32-bit number range, and if the implementation were ideal in taking a linear time increase with increasing range, it would take about 45 years to compute the primes to this range. However, analysis of prime numbers in these high ranges is often done in small chunks, for which the use of the SoA would be theoretically practical as compared to the SoE for very large number sieves, but by a very small factor.
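The arithmetic above can be checked in a few lines, assuming the one-third-of-a-second figure and ideal linear scaling:

```python
# Crossover range: the log2(log2 n) ratio reaches 1.2 when log2(log2 n) = 6,
# i.e. at n = 2^64; scaling 1/3 second at 2^32 linearly up to 2^64 takes decades.
crossover = 2.0 ** 64                   # about 1.8e19, "2 times 10^19"
print(f"{crossover:.1e}")               # 1.8e+19
seconds = (2.0 ** 64 / 2.0 ** 32) / 3   # 2^32 chunks at a third of a second each
years = seconds / (365.25 * 24 * 3600)
print(round(years))                     # 45
```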

EDIT_ADD: In actual fact, neither the page segmented SoE nor the SoA continues to run in linear time for these huge ranges as compared to the lower ranges, as both run into problems with the “toggling” and “culling” operations losing efficiency due to having to skip large numbers of pages per jump; this means that both will require modified algorithms to handle this “page jumping”, and the very slight advantage of the SoA may be completely erased if there are any slight differences in the way that this is done. The SoA has many more terms in its “jump tables” than the SoE, by about the inverse of the ratio of the number of primes found up to the square root of the range to that range, and this will likely add an O(log n) term to the processing of both, but with a larger constant-factor increase for the SoA, which has the higher number of entries in its “jump table”. This extra fact will pretty much completely cancel out any advantage of the SoA over the SoE, even for extremely large ranges. Further, the SoA has a constant overhead of more individual loops required for the many more cases in implementing the conditions for the three separate quadratic equations, plus the “prime squares free” loop instead of just a primes culling loop, so it can never have as low an average computational time per operation as the SoE when fully optimized. END_EDIT_ADD

EDIT_ADD2: It is my opinion that much of the confusion about the Sieve of Atkin is due to the deficiencies of the pseudo code from the Wikipedia article as quoted in the question, so I have come up with a somewhat better version of the pseudo code that addresses at least some of the deficiencies, to do with the range of the ‘x’ and ‘y’ variables and the modulo 12 versus modulo 60 confusion, as follows:

limit ← 1000000000        // arbitrary search limit

// Initialize the sieve
for i in {7,11,13,17,19,23,29,31, 37,41,43,47,49,53,59,61,...}:
    is_prime(i) ← false

// Put in candidate primes:
// integers which have an odd number of
// representations by certain quadratic forms.
while n ≤ limit, n ← 4x²+y² where x ∈ {1,2,...} and y ∈ {1,3,...} // odd y's
    if n mod 60 ∈ {1,13,17,29,37,41,49,53}:
        is_prime(n) ← ¬is_prime(n)
while n ≤ limit, n ← 3x²+y² where x ∈ {1,3,...} and y ∈ {2,4,...} // only odd 
    if n mod 60 ∈ {7,19,31,43}:                            // x's and even y's
        is_prime(n) ← ¬is_prime(n)
while n ≤ limit, n ← 3x²-y² where x ∈ {2,3,...} and y ∈ {x-1,x-3,...,1} //all 
    if n mod 60 ∈ {11,23,47,59}:                   // even/odd odd/even combos
        is_prime(n) ← ¬is_prime(n)

// Eliminate composites by sieving, only for those occurrences on the wheel
for n² ≤ limit where n ∈ {7,11,13,17,19,23,29,31, 37,41,43,47,49,53,59,61,...}:
    if is_prime(n):
        // n is prime, omit multiples of its square; this is
        // sufficient because composites which managed to get
        // on the list cannot be square-free
        while c ≤ limit, c ← n² × k where
                      k ∈ {1,7,11,13,17,19,23,29, 31,37,41,43,47,49,53,59,...}:
            is_prime(c) ← false

output 2, 3, 5
for n ≤ limit when n ← 60 × k + x where
  k ∈ {0..} and
  x ∈ {7,11,13,17,19,23,29,31, 37,41,43,47,49,53,59,61}:
    if is_prime(n): output n
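As a sanity check, the pseudo code above can be transcribed almost line for line into Python. This is an illustrative sketch (the function name and the exact loop bounds are my own choices), not an optimized implementation:

```python
def sieve_of_atkin(limit):
    """Direct transcription of the modulo-60 pseudo code above."""
    is_prime = [False] * (limit + 1)
    s1 = {1, 13, 17, 29, 37, 41, 49, 53}      # residues for 4x^2 + y^2
    s2 = {7, 19, 31, 43}                      # residues for 3x^2 + y^2
    s3 = {11, 23, 47, 59}                     # residues for 3x^2 - y^2

    x = 1
    while 4 * x * x + 1 <= limit:             # all x, odd y
        for y in range(1, int((limit - 4 * x * x) ** 0.5) + 1, 2):
            n = 4 * x * x + y * y
            if n % 60 in s1:
                is_prime[n] = not is_prime[n]
        x += 1

    x = 1
    while 3 * x * x + 4 <= limit:             # odd x, even y
        for y in range(2, int((limit - 3 * x * x) ** 0.5) + 1, 2):
            n = 3 * x * x + y * y
            if n % 60 in s2:
                is_prime[n] = not is_prime[n]
        x += 2

    x = 2
    while 3 * x * x - (x - 1) * (x - 1) <= limit:   # all x, y < x, opposite parity
        for y in range(x - 1, 0, -2):
            n = 3 * x * x - y * y
            if n <= limit and n % 60 in s3:
                is_prime[n] = not is_prime[n]
        x += 1

    # Eliminate non-squarefree composites: clear multiples of each base
    # prime's square for base primes on the modulo-60 wheel.
    wheel = {1, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 49, 53, 59}
    for n in range(7, int(limit ** 0.5) + 1):
        if n % 60 in wheel and is_prime[n]:
            for c in range(n * n, limit + 1, n * n):
                is_prime[c] = False

    return [2, 3, 5] + [n for n in range(7, limit + 1) if is_prime[n]]

print(sieve_of_atkin(100))
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59,
#  61, 67, 71, 73, 79, 83, 89, 97]
```

Note that the final output loop of the pseudo code collapses to a simple list comprehension here, because toggles only ever happen at indices already on the wheel.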

The above seems quite simple and is quite a good algorithm, except that it still isn’t faster than a basic Sieve of Eratosthenes that uses the same 2,3,5 factorization wheel, because it wastes almost half of its inner loop toggle operations on tests that fail the modulo conditions. To demonstrate:

Following is the repeating 4x²+y² pattern of qualifying modulo 60’s across a range of 15 values of ‘x’ (every value) vertically and 15 odd values of ‘y’ horizontally; both starting at one. Notice that they are symmetrical about a vertical axis but only 128 out of 225 or 256 out of 450 for a full 30 number range of ‘x’ values are valid toggle operations:

  0 13 29 53  0  0 53 49 53  0  0 53 29 13  0
 17  0 41  0 37 17  0  1  0 17 37  0 41  0 17
 37  0  1  0  0 37  0  0  0 37  0  0  1  0 37
  0 13 29 53  0  0 53 49 53  0  0 53 29 13  0
 41 49  0 29  1 41 29  0 29 41  1 29  0 49 41
  0  0 49 13  0  0 13  0 13  0  0 13 49  0  0
 17  0 41  0 37 17  0  1  0 17 37  0 41  0 17
 17  0 41  0 37 17  0  1  0 17 37  0 41  0 17
  0  0 49 13  0  0 13  0 13  0  0 13 49  0  0
 41 49  0 29  1 41 29  0 29 41  1 29  0 49 41
  0 13 29 53  0  0 53 49 53  0  0 53 29 13  0
 37  0  1  0  0 37  0  0  0 37  0  0  1  0 37
 17  0 41  0 37 17  0  1  0 17 37  0 41  0 17
  0 13 29 53  0  0 53 49 53  0  0 53 29 13  0
  1  0  0 49  0  1 49  0 49  1  0 49  0  0  1

Following is the repeating 3x²+y² pattern of qualifying modulo 60’s across a range of 5 odd values of ‘x’ vertically and 15 even values of ‘y’ horizontally. Notice that they are symmetrical about a horizontal axis but only 48 out of 75, or 144 out of 450 for a full 30 number range of ‘x’ values, are valid toggle operations, as all of the invalid combinations with even ‘x’ and odd ‘y’ have already been eliminated by the loop conditions:

  7 19  0  7 43  0 19 19  0 43  7  0 19  7  0
 31 43  0 31  7  0 43 43  0  7 31  0 43 31  0
 19 31  0 19  0  0 31 31  0  0 19  0 31 19  0
 31 43  0 31  7  0 43 43  0  7 31  0 43 31  0
  7 19  0  7 43  0 19 19  0 43  7  0 19  7  0

Following is the repeating 3x²-y² pattern of qualifying modulo 60’s across a range of 5 odd values of ‘x’ vertically and 15 even values of ‘y’ horizontally. Notice that they are symmetrical about a horizontal axis but only 48 out of 75, or 144 out of 450 for a full 30 number range of ‘x’ values, are valid toggle operations:

 59 47  0 59 23  0 47 47  0 23 59  0 47 59  0
 23 11  0 23 47  0 11 11  0 47 23  0 11 23  0
 11 59  0 11  0  0 59 59  0  0 11  0 59 11  0
 23 11  0 23 47  0 11 11  0 47 23  0 11 23  0
 59 47  0 59 23  0 47 47  0 23 59  0 47 59  0

Following is the repeating 3x²-y² pattern of qualifying modulo 60’s across a range of 5 even values of ‘x’ vertically and 15 odd values of ‘y’ horizontally. Notice that they are symmetrical about a vertical axis but only 48 out of 75, or 144 out of 450 for a full 30 number range of ‘x’ values, are valid toggle operations:

 11  0 47 23  0 11 23  0 23 11  0 23 47  0 11
 47  0 23 59  0 47 59  0 59 47  0 59 23  0 47
 47  0 23 59  0 47 59  0 59 47  0 59 23  0 47
 11  0 47 23  0 11 23  0 23 11  0 23 47  0 11
 59  0  0 11  0 59 11  0 11 59  0 11  0  0 59

As one can determine by inspection of the above tables, for the pseudo code as above there is an overall repeating range of 30 values of ‘x’ (including both evens and odds) which has only 688 valid operations out of a total of 1125 combined operations; thus it wastes almost half of its processing just conditionally skipping over values, with those non-productive skipping operations being part of the innermost loops. There are two possible methods to eliminate that redundancy efficiently, as follows:
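The 688-out-of-1125 figure can be confirmed by brute force over one repeating block of 30 ‘x’ values (a small Python sketch; the residue set names are my own):

```python
# Count how many toggle tests pass a modulo-60 condition over one repeating
# block of 30 'x' values, mirroring the loop structure of the pseudo code.
s1 = {1, 13, 17, 29, 37, 41, 49, 53}
s2 = {7, 19, 31, 43}
s3 = {11, 23, 47, 59}

hits = tries = 0
for x in range(1, 31):
    for y in range(1, 31, 2):            # 4x^2+y^2, all x, odd y: 450 tests
        tries += 1
        hits += (4 * x * x + y * y) % 60 in s1
for x in range(1, 31, 2):
    for y in range(2, 31, 2):            # 3x^2+y^2, odd x, even y: 225 tests
        tries += 1
        hits += (3 * x * x + y * y) % 60 in s2
for x in range(1, 31):
    for y in range(1 + x % 2, 31, 2):    # 3x^2-y^2, opposite parity: 450 tests
        tries += 1
        hits += (3 * x * x - y * y) % 60 in s3
print(hits, tries)   # 688 1125
```

(The counting deliberately ignores the y < x restriction of the third form, just as the tables above do; the restriction affects which cells are visited, not the repeating modulo pattern.)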

  1. The method outlined in the Atkin and Bernstein paper, which uses the
    recognized patterns in the combined modulos of ‘x’ and ‘y’ to process
    only the modulo “hits”: once one locates a given modulo in the pattern,
    there is an infinite sequence of values at that same modulo (equals
    wheel bit position), with each pattern separated by an easily calculated
    (variable) offset of the form “current position plus A times the current
    value of ‘x’ plus B” and “current position plus C times the current
    value of ‘y’ plus D”, where A, B, C, and D are simple constants (simple
    meaning small and easily manipulated). Bernstein used that method with
    the additional help of many Look Up Tables (LUTs).

  2. The method of creating overall pattern state Look Up Tables (LUTs)
    (one for each of the four tables above plus one for the minor prime
    squares-free loop), indexed by the modulo current values of ‘x’ combined
    with the modulo of ‘y’, to find the A, B, C, and D parameters to skip,
    not to the next pattern, but just to the next position in the horizontal
    sequence. This is likely the higher-performance option, as it more
    easily allows for extreme clock-cycle-per-operation reduction using
    in-lining of the unrolled loop code, and will produce more efficient
    code for large ranges when page segmenting, as the jumps per loop are
    smaller on average. This will likely bring the clock cycles per loop
    close to those of a highly optimized Sieve of Eratosthenes as in
    primesieve, but is not likely to get quite that low due to having to
    compute the variable step sizes rather than being able to use fixed
    prime offsets as for the SoE.
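To make the A, B, C, and D offsets of the first method concrete: for the form n = 4x² + y², stepping ‘y’ by 2 at fixed ‘x’ advances n by 4y + 4, and stepping ‘x’ by 1 at fixed ‘y’ advances n by 8x + 4, so the inner loops can run on additions alone (the constants differ for the 3x² ± y² forms, but the idea is identical). A small check, with sample values of my own choosing:

```python
# Incremental stepping for n = 4x^2 + y^2: no multiplications in the inner loop.
x, y = 3, 5
n = 4 * x * x + y * y            # n = 61
step_y = 4 * y + 4               # (y+2)^2 - y^2 = 4y + 4
assert n + step_y == 4 * x * x + (y + 2) * (y + 2)
step_x = 8 * x + 4               # 4(x+1)^2 - 4x^2 = 8x + 4
assert n + step_x == 4 * (x + 1) * (x + 1) + y * y
print(n, step_y, step_x)         # 61 24 28
```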

So there are four governing objectives in reducing the running time for a prime sieve, as follows:

  1. A successful sieve reduces the number of operations, a measure by which even the “hit optimized” SoA falls short of a highly wheel-factorized SoE by about 16.7% for ranges of a few billion.

  2. A successful sieve reduces the CPU clock cycles per operation, where the SoA falls short of a highly optimized SoE such as primesieve because its operations are more complex due to the variable increments, again likely by 20% to 60%. Atkin and Bernstein’s primegen (SoA) takes about 4.5 clock cycles per operation as compared to about 2.5 for primesieve (SoE) for a range of one billion for each, but perhaps the SoA could be sped up somewhat by borrowing some of the optimization techniques from primesieve.

  3. A successful sieve has reasonably low code complexity so that it can be more easily extended to larger ranges using such techniques as “bucket sieving” and other page segmentation optimizations. Here the Sieve of Atkin fails miserably, as it gets exponentially more complex when page segmenting large number ranges. It is extremely complex to write a SoA program that would compete with even Bernstein’s primegen (SoA), let alone with primesieve, whereas it is quite easy to write SoE code that comes close to the same performance as primesieve.

  4. A successful sieve has a lower asymptotic complexity, which the SoA does have over the SoE by a factor of log(log n), where n is the upper range to be sieved, but that extra small ratio is not likely ever enough to make up for the combined loss ratios of the top two points above, as this factor gets smaller with increasing range. END_EDIT_ADD2

So the answer to the question “Why use the Sieve of Atkin?” is “There is no reason to use it at all if the SoE is implemented with maximum wheel factorization optimizations until the numbers sieved are extremely large (64-bit number range and higher) and then the SoA advantage is very small and perhaps not realizable at all depending on very minor tweaks in relative optimizations.”.

As an answer to another similar Sieve of Atkin question, I have posted a page segmented C# version of the SoE optimized as per the above at: https://stackoverflow.com/a/20559687/549617.
