Optimization with Numba
This notebook discusses how to achieve high performance for vectorized operations on large chunks of data with numba.
Example
The example we shall start with is motivated by a nuclear physics problem. We need to compute the following term in a Hamiltonian representing the application of a two-body potential:

$$(V\psi)(\vec{x}) = \sum_{i<j} V(r_{ij})\,\psi(\vec{x}),$$

where $\vec{x} = (x_0, x_1, \dots)$ are a set of internal (Jacobi) coordinates, and $r_{ij} = \sum_k c^{ij}_k x_k$ are linear combinations of these representing the various relative coordinates in the problem. Here we define the potential:
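A minimal sketch of what this cell could have looked like (the Gaussian form of `V` and all coefficients other than the quoted `(0, 2)` entry are illustrative assumptions, not the original code):

```python
import numpy as np

def V(r):
    # Mock Gaussian two-body potential (an assumption, not the original).
    return np.exp(-r**2 / 2.0)

# coeff[i, j] holds the coefficients c such that r_ij = c[0]*x[0] + c[1]*x[1].
coeff = {
    (0, 1): np.array([1.0, 0.0]),    # illustrative
    (0, 2): np.array([0.5, 1.0]),    # r_02 = 0.5*x[0] + x[1]
    (1, 2): np.array([-0.5, 1.0]),   # illustrative
}
print(coeff[0, 2])
```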
This tells us that the relative coordinate between particle 0 and particle 2 is `0.5*x[0] + x[1]`.
Now we make a mock wavefunction, and then apply the potential. This serves as the benchmark (but will be slow):
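The original cell is not preserved; here is a minimal sketch reusing the illustrative `V` and `coeff` from above (grid size and wavefunction are assumptions):

```python
# Mock wavefunction on an N x N grid of Jacobi coordinates.
N = 64
x = np.linspace(-10, 10, N)
X0, X1 = np.meshgrid(x, x, indexing='ij')
psi = np.exp(-(X0**2 + X1**2) / 2.0)

def apply_V_slow(psi):
    """Benchmark: pure-python loops over every grid point (slow)."""
    Vpsi = np.zeros_like(psi)
    for i in range(N):
        for j in range(N):
            for (a, b), c in coeff.items():
                r = c[0]*x[i] + c[1]*x[j]   # relative coordinate r_ab
                Vpsi[i, j] += V(r) * psi[i, j]
    return Vpsi

Vpsi = apply_V_slow(psi)
```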
To vectorize this loop, one would generally evaluate the potential at all the required points, then perform the multiplication and update. The problem is that the resulting array would be so large that we would not have space to store it. Thus, some sort of iterative loop is really required. Our first step is to consider `numba`:
Numba
Start by perusing the Numba Language Specification, which shows what one can and cannot do.
We start with a simplified loop (assuming all abscissae to be the same, etc.) in order to ensure an easy transition to a compiled version.
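A sketch of such a simplified core (again illustrative, reusing `V` from above):

```python
def apply_V_core(psi, Vpsi, x, c):
    # Simplified core: the same abscissa x for every coordinate and a
    # single relative coordinate with coefficients c.
    N = len(x)
    for i in range(N):
        for j in range(N):
            r = c[0]*x[i] + c[1]*x[j]
            Vpsi[i, j] += V(r) * psi[i, j]
```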
Now we try applying the `numba.autojit` decorator:
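(`autojit` has since been removed from Numba; in current releases a bare `numba.jit` plays the same role, inferring the types on the first call. Roughly:)

```python
import numba

# Old style: apply_V_core_nb = numba.autojit(apply_V_core)
# Modern equivalent -- types are inferred when the function is first called:
apply_V_core_nb = numba.jit(apply_V_core)
```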
One way to see what is taking time is to use the annotate feature of numba. Unfortunately, this does not yet work very well in a notebook, so we shall first write a version of this code in an external file:
The output looks something like this:
Note that almost every line has an "*". This means that the "object layer" is being used, which goes through the python interpreter and so is slow. Now we can try to learn how to improve this so that the compiler can generate efficient code. Here is a list of the problems (a version with these fixes applied is sketched after the list):

- `x.size` is not recognized, but one can use `len(x)`.
- Functions like `np.dot` and `np.linalg.norm` are not supported outside of the object layer.
- Slicing for assignment like `ri[:] = ...` seems not to be supported outside of the object layer.
- Ensure that objects have the correct type. (Apparently numba does not properly cast arguments, even if the type is explicitly specified.) I was having problems with the argument `p`, which was being passed as a list of integers, for example.
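A sketch of the core with these workarounds applied (the potential is inlined, since calling back into a Python function is not supported in the compiled layer; names remain illustrative):

```python
import numpy as np
from numba import jit

@jit(nopython=True)
def apply_V_core_nb(psi, Vpsi, x, c):
    N = len(x)                          # len(x) instead of x.size
    for i in range(N):
        for j in range(N):
            r = c[0]*x[i] + c[1]*x[j]   # "dot product" written out by hand
            # element-wise update instead of slice assignment:
            Vpsi[i, j] += np.exp(-r*r/2.0) * psi[i, j]

# Cast the arguments explicitly -- passing e.g. a list of integers causes problems.
c = np.asarray(coeff[0, 2], dtype=float)
Vpsi = np.zeros_like(psi)
apply_V_core_nb(psi, Vpsi, x, c)
```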
Now we see that everything happens in under a microsecond. Unfortunately, annotate no longer works because of the line `r = c[0] * ri + c[1] * rj`... Not sure why.
Static Compilation
One problem with using JIT compilation is that it can take a while to get going since the compiler needs to be called. To alleviate this, it would be great if the compilation results could be cached between runs, but this feature is not yet available in numba. (See Issue #224.) Instead, we must resort to static compilation. Here we test this with both functions and classes (the latter can be useful to maintain state).
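A minimal sketch of the static (ahead-of-time) route using `numba.pycc` (module and function names are illustrative; recent Numba releases deprecate this interface):

```python
# build_module.py -- run once to produce a compiled extension module.
from numba.pycc import CC

cc = CC('compiled_core')             # name of the generated extension module

@cc.export('square', 'f8(f8)')       # exported name and explicit signature
def square(x):
    return x * x

if __name__ == '__main__':
    cc.compile()                     # writes compiled_core.so (or .pyd)
```

After running this script, `from compiled_core import square` imports the pre-compiled function with no JIT warm-up.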
SMP: prange
Here we demonstrate how to use multiple cores on a single node using `numba.prange`. A couple of points to note (a sketch follows the list):

- `prange` cannot be nested, so you must unravel the loops you want to parallelize.
- One cannot use `if` statements in a `prange` loop, but they can be hidden inside a function call.
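A sketch in current Numba syntax (`prange` now requires `parallel=True` on the decorator; the loop body is illustrative):

```python
import numpy as np
from numba import njit, prange

@njit
def weight(r):
    # Conditionals are fine here, hidden inside a function call.
    return 0.0 if r < 0 else np.exp(-r)

@njit(parallel=True)
def total_weight(x):
    total = 0.0
    for i in prange(len(x)):    # a single, un-nested parallel loop
        total += weight(x[i])   # reduction accumulated across threads
    return total

print(total_weight(np.linspace(-1, 1, 10**6)))
```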
Benchmarking Vectorization
Here we compare several techniques for vectorization. We shall apply the `sin` operator to a complex array and sum the result at the end to ensure the calculation proceeded correctly. Note that the choice of `sin` may not be optimal, as the time it takes depends on the argument, but if we use the same initial array, we should get a reasonable comparison. As a benchmark, we start with a straightforward C++ version.
Single Thread C++
sin.cpp:17:9: warning: unknown pragma ignored [-Wunknown-pragmas]
#pragma omp parallel for private(i)
^
1 warning generated.
Loop took 11s
res = (1.12921,-0.618922)
real 0m10.872s
user 0m10.697s
sys 0m0.173s
Python Numpy
This is the canonical way of performing such calculations in python. It would be extremely inefficient to use a `for` loop, as python does not handle them well. If you can structure your calculation to use `numpy` in this manner – applying a single function to an array – then things will be much faster, as the loop over elements of the array is performed inside the compiled functions. Unfortunately, to realize this speedup, the function must be compiled with the looping inside, so you cannot easily use this functionality for custom functions. (There is a `numpy.vectorize` function that appears to do this, but it is only for convenience, not for performance.)
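The original cell is not shown; the equivalent numpy version would look roughly like this (the array size is an assumption):

```python
import numpy as np

N = 10**7
np.random.seed(0)
X = np.random.random(N) + 1j*np.random.random(N)   # complex test array

Y = np.sin(X)    # the element loop runs inside the compiled ufunc
print(Y.sum())   # checksum to verify the result
```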
NumExpr
Numexpr evaluates compiled expressions on a virtual machine, and pays careful attention to memory bandwidth. The first time a function is called, it will be compiled – subsequent calls will be fast. One can also use an explicit instantiation of a function via the `NumExpr` class (but this is not very well documented – check the `numexpr` sources). One strategy that I often use is to generate the expression symbolically using `sympy`. One can then perform some symbolic optimizations (or compute derivatives and other symbolic manipulations) and compile the final expressions into an efficient function. This approach is also one of the easiest ways to leverage multiple cores (the threading is done using the `ctypes` interface, which allows `numexpr` to release the GIL).
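A minimal sketch, reusing the `X` defined above:

```python
import numexpr as ne

# The expression is compiled on first use and runs on all available cores.
Y = ne.evaluate('sin(X)')
print(Y.sum())
```

For the `sympy` route, `sympy.lambdify(x, expr, 'numexpr')` produces such a compiled function directly from a symbolic expression.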
scipy.weave.inline
The `scipy.weave.inline` function can be used to compile code on the fly. The advantage here is that one can allocate and prepare the arrays in python, then just drop into C++ for the core loop. Another less obvious benefit is that one can use python as a meta-language to dynamically generate the C++ code. (One use is to generate multiply-nested loops to run over N dimensions where N changes at runtime.)

The generated boilerplate provides both 1D and multi-dimensional access to the array elements, size, etc. through various macros. Unfortunately, there is no easy way that I know of to utilize multiple cores with this approach (though, of course, one could use the full power of C++, but this is not simple).

NOTE: Weave only works with Python 2.
File "<ipython-input-2-e17d59fa9d50>", line 22
print Y.sum()
^
SyntaxError: invalid syntax
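The failure above is just the Python 3 `print` statement; weave itself is never reached. Under Python 2, the core of such a cell looked roughly like the following sketch (the `X1(i)`, `Y1(i)` element macros and the `NX` dimension array are generated by weave for the listed arrays):

```python
# Python 2 only.
import numpy as np
from scipy import weave

N = 10**7
X = np.random.random(N) + 1j*np.random.random(N)
Y = np.empty_like(X)

code = """
for (int i = 0; i < NX[0]; ++i) {
    Y1(i) = std::sin(X1(i));   // elements are std::complex<double>
}
"""
weave.inline(code, ['X', 'Y'], headers=['<complex>'])
print Y.sum()
```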
Numba
The `numba` package provides a way to convert a subset of python code into LLVM, which then gets compiled. There are two options – the simplest is to use the `autojit` decorator, which will determine the required types from the arguments when the function is called. The second approach is to use the `jit` decorator, which allows one to specify the types (and which will perform the compilation on import).
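A sketch of the type-inferred version (again, `autojit` is gone from current Numba; a bare `jit` now infers the types on the first call, and the loop body is illustrative):

```python
import cmath
import numpy as np
from numba import jit

@jit   # types are inferred from the arguments on the first call
def nb_sin(X, Y):
    for i in range(len(X)):
        Y[i] = cmath.sin(X[i])

Y = np.empty_like(X)   # X is the complex test array from above
nb_sin(X, Y)
print(Y.sum())
```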
Now we use explicit types with `jit`:
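Roughly, in current Numba syntax (a sketch, not the original cell):

```python
from numba import jit

# With an explicit signature, compilation happens at decoration time.
@jit('void(complex128[:], complex128[:])', nopython=True)
def nb_sin_typed(X, Y):
    for i in range(len(X)):
        Y[i] = cmath.sin(X[i])

nb_sin_typed(X, Y)
print(Y.sum())
```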
Numba allows one to use multiple threads by using `numba.prange`, but this places even more restrictions on what can appear in the loop. For example, one cannot use conditionals (but they can be used in nested function calls). This takes a LOT longer to compile, though.
000010:INFO:numba.llvm_types:
INFO -- llvm_types:107:build_int_cast
Warning: Perfoming downcast. May lose information.
000011:INFO:numba.llvm_types:
INFO -- llvm_types:107:build_int_cast
Warning: Perfoming downcast. May lose information.
000012:INFO:numba.llvm_types:
INFO -- llvm_types:107:build_int_cast
Warning: Perfoming downcast. May lose information.