Dask is a modern parallel computing library written entirely in Python. It is extremely flexible: it works well on a laptop, using all available cores in parallel, and scales up to a cluster of hundreds of nodes.
Instead of forcing you to rewrite your code around the map-reduce paradigm, it mimics the NumPy array and pandas DataFrame interfaces, so you can keep working the way you always have.
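To give a feel for how closely the array interface follows NumPy, here is a minimal sketch, assuming dask and numpy are installed. The array size and chunk size are arbitrary illustrative choices, not recommendations.

```python
import numpy as np
import dask.array as da

# Plain NumPy array used as the data source (size chosen only for illustration).
x_np = np.arange(1_000_000, dtype="float64")

# Wrap it as a dask array split into chunks of 100,000 elements each.
x = da.from_array(x_np, chunks=100_000)

# Same expression you would write with NumPy; nothing is computed yet.
result = (x ** 2).mean()

# compute() runs the chunked computation and returns a concrete value.
value = result.compute()
```

The expression is the same one you would write against `x_np` directly; the main visible difference is the explicit `compute()` call at the end.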
Dask's main abstraction is a directed acyclic graph called a "dask" (distributed task), implemented as a simple dictionary. The different interfaces (bag, array, dataframe) create these dasks, which are later computed in a distributed fashion using a suitable scheduler.
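The graph-as-dictionary idea can be sketched in plain Python. In the snippet below, keys name results and a tuple whose first element is a callable represents a task. The `toy_get` function is a hypothetical, single-threaded stand-in for a real scheduler, written only for illustration; it is not part of dask.

```python
from operator import add, mul


def inc(x):
    return x + 1


# A task graph as a plain dictionary: each key maps either to a literal
# value or to a task tuple of the form (function, arg1, arg2, ...).
dsk = {
    "x": 1,
    "y": (inc, "x"),       # y = inc(x)
    "z": (add, "y", 10),   # z = y + 10
    "w": (mul, "z", "y"),  # w = z * y
}


def toy_get(graph, key):
    """Hypothetical recursive evaluator, for illustration only."""
    value = graph[key]
    if isinstance(value, tuple) and value and callable(value[0]):
        func, *args = value
        # String arguments that name another key are resolved first.
        resolved = [
            toy_get(graph, a) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        return func(*resolved)
    return value


result = toy_get(dsk, "w")  # (1 + 1 + 10) * (1 + 1) = 24
```

A real scheduler walks the same kind of dictionary, but runs independent tasks in parallel and avoids recomputing shared intermediate results, which this toy version recomputes on every visit.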
Forget about the JVM overhead. The future is now, the future is dask!
A failed physicist, I became a geek interested in Python, web development, DevOps, data analysis and other paraphernalia. Since I like to talk a lot, I sometimes give talks.