The Blaze Ecosystem provides Python users high-level access to efficient computation on inconveniently large data. Blaze can refer to both a particular library as well as an ecosystem of related projects that have spun off of Blaze development.
Parts of the Blaze ecosystem are described below:
Several projects have come out of Blaze development other than the Blaze project itself.
The Blaze Project: Translates NumPy/Pandas-like syntax to data computing systems (e.g. database, in-memory, distributed-computing). This provides Python users with a familiar interface to query data living in a variety of other data storage systems. One Blaze query can work across data ranging from a CSV file to a distributed database.
Blaze presents a pleasant and familiar interface to us regardless of what computational solution or database we use (e.g. Spark, Impala, SQL databases, No-SQL data-stores, raw-files). It mediates our interaction with files, data structures, and databases, optimizing and translating our query as appropriate to provide a smooth and interactive session. It allows the data scientists and analyst to write their queries in a unified way that does not have to change because the data is stored in another format or a different data-store. It also provides a server-component that allows URIs to be used to easily serve views on data and refer to Data remotely in local scripts, queries, and programs.
DataShape: A data type system
DataShape combines NumPy’s dtype and shape and extends to missing data, variable length strings, ragged arrays, and more arbitrary nesting. It allows for the common description of data types from databases to HDF5 files, to JSON blobs.
Odo: Migrates data between formats.
Odo moves data between formats (CSV, JSON, databases) and locations (local, remote, HDFS) efficiently and robustly with a dead-simple interface by leveraging a sophisticated and extensible network of conversions.
DyND: In-memory dynamic arrays
DyND is a dynamic ND-array library like NumPy that implements the datashape type system. It supports variable length strings, ragged arrays, and GPUs. It is a standalone C++ codebase with Python bindings. Generally it is more extensible than NumPy but also less mature.
Dask.array: Multi-core / on-disk NumPy arrays
Dask.dataframe : Multi-core / on-disk Pandas data-frames
Dask.arrays provide blocked algorithms on top of NumPy to handle larger-than-memory arrays and to leverage multiple cores. They are a drop-in replacement for a commonly used subset of NumPy algorithms.
Dask.dataframes provide blocked algorithms on top of Pandas to handle larger-than-memory data-frames and to leverage multiple cores. They are a drop-in replacement for a subset of Pandas use-cases.
Dask also has a general “Bag” type and a way to build “task graphs” using simple decorators as well as nascent distributed schedulers in addition to the multi-core and multi-threaded schedulers.
These projects are mutually independent. The rest of this documentation is
just about the Blaze project itself. See the pages linked to above for
Blaze is a high-level user interface for databases and array computing systems. It consists of the following components:
- A symbolic expression system to describe and reason about analytic queries
- A set of interpreters from that query system to various databases / computational engines
This architecture allows a single Blaze code to run against several computational backends. Blaze interacts rapidly with the user and only communicates with the database when necessary. Blaze is also able to analyze and optimize queries to improve the interactive experience.
- Basic Queries
- Split-Apply-Combine – Grouping
- Pandas to Blaze
- SQL to Blaze
- URI strings
- Tips for working with CSV files
- Interacting with SQL Databases
- Out of Core Processing
- Time Series Operations
- What Blaze Doesn’t Do
- Release Notes