Open-ended session with a short piece of code to try some of these techniques out on.
Competition: Who can achieve the greatest speed-up?
There is more content in these slides than we'll cover today, but hopefully they will be a useful reference for the future.
The optimisation cycle
1. Profile
2. Develop regression tests
3. Baseline timing
4. Optimise identified bottlenecks
5. Time
6. Test
7. Profile, then go back to step 4 or call it a day and bask in glory.
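For the baseline-timing and re-timing steps, the standard-library timeit module is usually enough. A minimal sketch, assuming the practical's code_prac package with a press_schechter module exposing main() (as used in the practical later):

import timeit

# Baseline-timing sketch: average over a few runs before touching the code.
n_runs = 10
total = timeit.timeit(
    "code_prac.press_schechter.main()",
    setup="import code_prac",
    number=n_runs,
)
print(f"Baseline: {total / n_runs:.3f} s per call")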
cProfile
cProfile is part of the Python standard library (you don't need to manually install it).
It gives statistics for function calls and is often the best place to start when profiling Python code.
import cProfile
import pstats
profile = cProfile.Profile()
profile.enable()
dn_dlogm_list = []
for z in z_list:
    dn_dlogm_list.append(press_schechter(M_list, z, A_std_func()))
profile.disable()
ps = pstats.Stats(profile).strip_dirs().sort_stats(pstats.SortKey.TIME)
ps.print_stats(10)
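cProfile can also profile a whole script from the command line; the -o flag writes the raw stats to a file you can browse interactively with pstats afterwards (the script name below is just a placeholder):

python -m cProfile -o output.prof your_script.py
python -m pstats output.prof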
Dropping down to a lower level using Numba, Cython, etc.
Parallelisation
Memoisation
The idea
Sometimes we have a function that may be called many times with the same input. In these situations it may be faster to store the results and reuse them rather than recalculating them each time.
Works well when
You have an expensive function being called a number of times with the same input.
You have any non-trivial function being called many, many times with the same small input.
Memoisation
A (very contrived) example
import numpy as np
points = np.random.rand(10_000, 2)
def myfunc(x, y):
    return (np.log10(np.sqrt(x**2 + y**2) * 5) / 0.2)**6

result = 0.0
for _ in range(100):
    result += np.array([myfunc(x, y) for x, y in points])
5.02 s ± 242 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Memoisation
A (very contrived) example
from functools import cache
import numpy as np
points = np.random.rand(10_000, 2)

@cache
def myfunc(x, y):
    return (np.log10(np.sqrt(x**2 + y**2) * 5) / 0.2)**6
result = 0.0
for _ in range(100):
    result += np.array([myfunc(x, y) for x, y in points])
1.28 s ± 423 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
A factor of ~4 speed-up by adding 2 lines... Not bad!
Memoisation
However...
functools.cache stores the result in RAM and needs to compare each input argument against the cached results.
This works well when the arguments and output are small, e.g. floats, ints, etc.
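One concrete limitation: functools.cache hashes the arguments to build its lookup key, so unhashable inputs such as NumPy arrays fail outright. A small sketch with a hypothetical function:

from functools import cache
import numpy as np

@cache
def norm(arr):  # hypothetical function, just to illustrate the limitation
    return np.sqrt((arr**2).sum())

norm(np.arange(3.0))  # raises TypeError: unhashable type: 'numpy.ndarray'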
joblib.Memory
These limitations can be overcome in some cases with joblib.Memory.
It caches to disk and removes much of the overhead with large input and output arrays.
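A minimal sketch of the joblib.Memory pattern (the cache directory and function here are just placeholders):

from joblib import Memory
import numpy as np

memory = Memory("./cachedir", verbose=0)  # cache directory is a placeholder

@memory.cache
def myfunc(arr):
    # Any expensive, deterministic computation on a (possibly large) array.
    return (np.log10(np.sqrt(arr[:, 0]**2 + arr[:, 1]**2) * 5) / 0.2)**6

points = np.random.rand(10_000, 2)
first = myfunc(points)   # computed and written to disk
second = myfunc(points)  # loaded from the on-disk cache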
Going lower level...
Sometimes the best way to improve the speed of our code is to use another language!...
But fear not! This does not mean having to re-write your code in C/C++/Rust etc.
Numba: Does Just In Time (JIT) compilation of Python code. It can be incredibly easy to use and can provide big speed gains. It can also be used for GPU programming.
Cython: A Python-superset language that can be used to create efficient C extensions. As well as providing similar speed-ups to Numba (with a bit more effort), it can be used in more complex situations and also provides a great and flexible way to interface C libraries with Python.
Numba
Using Numba can be incredibly easy:
import numba
import numpy as np
@numba.njit
def myfunc(arr, n_loops):
    result = np.zeros(arr.shape[0])
    for _ in range(n_loops):
        result += (np.log10(np.sqrt(arr[:, 0]**2 + arr[:, 1]**2) * 5) / 0.2)**6
    return result
points = np.random.rand(10_000, 2)
result = myfunc(points, 100)
njit means compile in "no-python" mode. This requires the function to be simple and use only supported functions and types (most of Numpy), but not Python classes or dicts.
Loops are fine! They will be optimised by LLVM (the compiler) for your CPU.
15.7 ms with the decorator vs. 42.5 ms without = ~2.7x speed-up
Where Numba might not be enough (WARNING: very subjective!)
It only supports a subset of Python + Numpy (although a very large one).
It can be hard to debug if things don't work as expected.
There isn't much scope for workarounds. If it doesn't "just work" then often it won't work.
Despite these, I definitely recommend Numba as the first stop for speeding up numerical code!
Parallelisation
Almost all computers have multiple cores. In some cases we can take advantage of this to speed up our calculations. Numpy & Scipy actually do this for some functions without you necessarily realising.
This technique works well when:
you have an expensive function
being called multiple times with different arguments
independently of each other.
Extra: Native parallelisation in Python is expensive
The GIL
The Global Interpreter Lock (GIL) is a mutex which ensures only one thread can use the interpreter at any one time. It's a feature that makes Python simple and easy to use.
But it's a hurdle to efficient parallelisation.
To get around this natively we can use the multiprocessing standard library.
It uses multiple processes, each running their own Python interpreter, with their own private memory space. But this is expensive, so we need to have enough work for each process to make it worthwhile.
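A minimal multiprocessing sketch along the lines of the earlier examples (the function and point array are placeholders):

from multiprocessing import Pool
import numpy as np

def myfunc(point):
    x, y = point
    return (np.log10(np.sqrt(x**2 + y**2) * 5) / 0.2)**6

if __name__ == "__main__":
    points = np.random.rand(10_000, 2)
    # Each worker process handles a chunk of points; the per-call work here is
    # tiny, so in practice the process overhead may outweigh the gain.
    with Pool(processes=4) as pool:
        results = pool.map(myfunc, points)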
Parallelisation with Numba
As it turns out, Numba (does) and Cython (can) release the GIL!
Numba can parallelise code automatically in some cases (see the docs), but we can also manually request it:
import numba
import numpy as np
@numba.njit(parallel=True)
def myfunc(arr, n_loops):
    result = np.zeros(arr.shape[0])
    for _ in numba.prange(n_loops):
        result += (np.log10(np.sqrt(arr[:, 0]**2 + arr[:, 1]**2) * 5) / 0.2)**6
    return result
points = np.random.rand(10_000, 2)
result = myfunc(points, 10_000)
Even with Numba, the best performance gains only come when the units of work are large enough.
For more information and a good, flexible library for multiple different parallelisation backends, see joblib.
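For instance, a minimal joblib sketch (the function is a placeholder; n_jobs and the backend choice are up to you):

from joblib import Parallel, delayed
import numpy as np

def myfunc(point):
    x, y = point
    return (np.log10(np.sqrt(x**2 + y**2) * 5) / 0.2)**6

points = np.random.rand(10_000, 2)
# n_jobs=-1 uses all available cores; joblib picks a sensible default backend.
results = Parallel(n_jobs=-1)(delayed(myfunc)(p) for p in points)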
Practical time!
Practical
Main suggestion
Optimise code_prac/press_schechter.py and see how fast you can make it.
Time your attempts using python -m timeit -s 'import code_prac' 'code_prac.press_schechter.main()'
If you followed the setup instructions in the repo then you should have all of the libraries and packages you may need installed.
Compete for the title of "Optimizer Prime" !
There are no rules, except that we must be able to call code_prac.press_schechter.mass_function with the same arguments as the starting code and get the same result.
Practical
and/or
Take advantage of the knowledge of your peers and practice your skills with the press_schechter code.
Some ideas include:
Add inline documentation and build docs using Sphinx
Tip: use the Napoleon extension and NumPy- or Google-style docstrings.
Bonus: Fork the code_prac_hwsa2021 repo on GitHub and serve the docs online using GitHub Pages.
Develop unit tests for the code
Bonus: Use the coverage.py tool to measure your unit test coverage and aim for >90% coverage.
Bonus bonus: Fork the code_prac_hwsa2021 repo and use GitHub Actions to automatically test the code on every push.
Add type annotations to the code, then use mypy to check them.
If you manage to speed up the press_schechter function enough (or ask me for a faster version), try making an interactive tool for exploring the PS mass function using Jupyter Notebooks and ipywidgets (or a tool of your choice).