Big Data

Distributed computing with Dask

Israel Saeta Pérez

  • October 9, 2016, 12:30 - 13:10
  • Room: Sala Cajamar
  • Language: English

Dask is a modern parallel computing library written entirely in Python. It is extremely flexible: it works well on a laptop, using all available cores in parallel, and scales up to clusters of hundreds of nodes.

Instead of forcing you to rewrite your code around the map-reduce paradigm, it mimics the numpy array and pandas dataframe interfaces, so you can keep working the way you always have.
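As a minimal sketch of that numpy-style interface (assuming dask and numpy are installed; the array shape and chunk size here are arbitrary examples):

```python
import dask.array as da

# Build a 10000x10000 array of ones, split into 1000x1000 chunks.
# Nothing is computed yet -- this only builds a task graph.
x = da.ones((10000, 10000), chunks=(1000, 1000))

# The same expression you would write against a plain numpy array.
total = x.sum()

# .compute() triggers the actual, chunk-parallel execution.
print(total.compute())  # -> 100000000.0
```

Until `.compute()` is called, every operation is lazy: it only extends the underlying task graph.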

Dask's main abstraction is a directed acyclic graph called a "dask" (distributed task graph), implemented as a plain dictionary. The different interfaces (bag, array, dataframe) build these dasks, which are then executed in parallel or distributed fashion by a suitable scheduler.
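The dict-of-tasks idea can be sketched without dask at all: keys name results, and values are either literal data or tuples of `(function, *arguments)`, where string arguments refer to other keys. The toy recursive `get` below is an illustration of a scheduler, not dask's actual implementation:

```python
from operator import add, mul

# A task graph as a plain dictionary. String arguments that match a
# key ("x", "y", "z") are references to other tasks in the graph.
graph = {
    "x": 1,
    "y": 2,
    "z": (add, "x", "y"),       # z = x + y
    "result": (mul, "z", 10),   # result = z * 10
}

def get(graph, key):
    """Toy single-threaded scheduler: recursively resolve one key."""
    value = graph[key]
    if isinstance(value, tuple):
        func, *args = value
        resolved = [get(graph, a) if isinstance(a, str) and a in graph else a
                    for a in args]
        return func(*resolved)
    return value

print(get(graph, "result"))  # -> 30
```

A real scheduler walks the same structure but decides where and in what order tasks run (threads, processes, or cluster workers), which is why the same graph can execute on a laptop or on hundreds of nodes.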

Forget about the JVM overhead. The future is now, the future is dask!

About

Failed physicist turned geek, interested in Python, web development, DevOps, data analysis and other paraphernalia. Since I like to talk a lot, I sometimes give talks.