Is there a reason Python 3 enumerates slower than Python 2?

A summary answer of what I’ve learned from this question might be of help to others who wonder the same things as I did:

The reason for the slowdown is that all integer variables in Python 3.x are now “infinite precision” as the type that used to be called “long” in Python 2.x but is now the only integer type as decided by PEP 237. As per that document, “short” integers that had the bit-depth of the base architecture no longer exist (or only internally).
The old “short” variable operations could run reasonably fast because they could use the underlying machine code operations directly and optimized the allocation of new “int” objects because they always had the same size.
The “long” type is currently only represented by a class object allocated in memory as it could exceed a given fixed length register/memory location’s bit-depth; since these object representations could grow or shrink for various operations and thus have a variable size, they cannot be given a fixed memory allocation and left there.
These “long” types (currently) don’t use a full machine architecture word size but reserve a bit (normally the sign bit) to do overflow checks, thus the “infinite precision long” is divided (currently) into 15-bit/30-bit slice “digits” for 32-bit/64-bit architectures, respectively.
Many of the common uses of these “long” integers won’t require more than one (or maybe two for 32-bit architectures) “digits” as the range of one “digit” is about one billion/32768 for 64-bit/32-bit architectures, respectively.
The ‘C’ code is reasonably efficient for doing one or two “digit” operations, so the performance cost over the simpler “short” integers isn’t all that high for many common uses as far as actual computation goes as compared to the time required to run the byte-code interpreter loop.
The biggest performance hit is likely the constant memory allocations/deallocations, one pair for each loop integer operations which is quite expensive, especially as Python moves to support multi-threading with synchronization locks (which is likely why Python 3.4 is worse than 3.3).
Currently, the implementation always ensures sufficient “digits” by allocating one extra “digit” above the actual size of “digits” used for the biggest operand if there is a possibility that it might “grow”, doing the operation (which may or may not actually use that extra “digit”), and then normalizes the result length to account for the actual number of “digits” used, which may actually stay the same (or possibly “shrink” for some operations); this is done by just reducing the size count in the “long” structure without a new allocation so may waste one “digit” of memory space but saving the performance cost of yet another allocation/deallocation cycle.
There is hope for an performance improvement: For many operations it is possible to predict whether the operation will cause a “grow” or not – for instance, for an addition one just needs to look at the Most Significant Bits (MSB’s) and the operation cannot grow if both MSB’s are zero, which will be the case for many loop/counter operations; a subtraction won’t “grow” depending on the signs and MSB’s of the two operands; a left shift will only “grow” if the MSB is a one; et cetera.
For those cases where the statement is something like “cnt += 1″https://stackoverflow.com/”i += step” and so on (opening the possibility of operations in place for many use cases), an “in place” version of the operations could be called which would do the appropriate quick checks and only allocate a new object if a “grow” was necessary, otherwise doing the operation in place of the first operand. The complication would be that the compiler would need to produce these “in-place” byte codes, however, that has already been done, with appropriate special “in-place operation” byte codes produced, just that the current byte-code interpreter directs them to the usual version as described above because they have not yet been implemented (zero’d/null values in the table).
It may well be that all that has to be done is write versions of these “in-place operations” and fill them into the “long” methods table with the byte-code interpreter already finding and running them if they exist or minor changes to a table to cause it to call them being all that is required.

Note that floats are always the same size, so could have this same improvements made, although floats are allocated in blocks of spare locations for better efficiency; it would be much harder to do that for “long”‘s as they take a variable amount of memory.

Also note that this would break the immutability of “long”‘s (and optionally float’s), which is why there are no inplace operators defined, but the fact that they are treated as mutable only for these special cases doesn’t affect the outside world as it would never realize that sometimes a given object has the same address as the old value (as long as equality comparisons look at contents and not just object addresses).

I believe that by avoiding the memory allocation/delocation for these common use cases, the performance of Python 3.x will be quite close to Python 2.7.

Much of what I’ve learned here comes from the Python trunk ‘C’ source file for the “long” object

EDIT_ADD: Whoops, forgot that if variables are sometimes mutable, then closures on local variables don’t work or don’t work without major changes, meaning that the above inplace operations would “break” closures. It would seem that a better solution would be to get advance spare allocation apace working for “long”‘s just as it used to for short integer’s and does for float’s, even if only for the cases where the “long” size doesn’t change (which covers the majority of the time such as for loops and counters as per the question). Doing this should mean that the code doesn’t run that much slower than Python 2 for typical use.

More Related Contents:

Leave a Comment Cancel reply