Pandas to Blaze

This page maps pandas constructs to blaze constructs.

Imports and Construction

import numpy as np
import pandas as pd
from blaze import data, by, join, merge, concat

# construct a DataFrame
df = pd.DataFrame({
   'name': ['Alice', 'Bob', 'Joe', 'Bob'],
   'amount': [100, 200, 300, 400],
   'id': [1, 2, 3, 4],
})

# put the `df` DataFrame into a Blaze Data object
df = data(df)
Computation: Pandas vs. Blaze

Column Arithmetic

# pandas
df.amount * 2
# blaze
df.amount * 2

Multiple Columns

# pandas
df[['id', 'amount']]
# blaze
df[['id', 'amount']]

Selection

# pandas
df[df.amount > 300]
# blaze
df[df.amount > 300]

Group By

# pandas
df.groupby('name').amount.mean()
df.groupby(['name', 'id']).amount.mean()
# blaze
by(df.name, amount=df.amount.mean())
by(merge(df.name, df.id),
   amount=df.amount.mean())

Join

# df2 is a second table that shares the 'name' column
# pandas
pd.merge(df, df2, on='name')
# blaze
join(df, df2, 'name')

Map

# pandas
df.amount.map(lambda x: x + 1)
# blaze
df.amount.map(lambda x: x + 1,
              'int64')

Relabel Columns

# pandas
df.rename(columns={'name': 'alias',
                   'amount': 'dollars'})
# blaze
df.relabel(name='alias',
           amount='dollars')

Drop duplicates

# pandas
df.drop_duplicates()
df.name.drop_duplicates()
# blaze
df.distinct()
df.name.distinct()

Reductions

# pandas
df.amount.mean()
df.amount.value_counts()
# blaze
df.amount.mean()
df.amount.count_values()

Concatenate

# pandas
pd.concat((df, df))
# blaze
concat(df, df)

Column Type Information

# pandas
df.dtypes
df.amount.dtype
# blaze
df.dshape
df.amount.dshape
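
As a quick sanity check of the pandas side of this comparison, the sketch below runs several of the pandas expressions on the sample data from the top of the page. It uses pandas only, so it runs without Blaze installed. The variable names (`pdf`, `doubled`, and so on) are illustrative and not part of either library.

```python
import pandas as pd

# Same sample data as the DataFrame constructed at the top of this page.
# `pdf` is an illustrative name, used to avoid confusion with the Blaze `df`.
pdf = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Joe', 'Bob'],
    'amount': [100, 200, 300, 400],
    'id': [1, 2, 3, 4],
})

# Column arithmetic
doubled = (pdf.amount * 2).tolist()

# Selection: rows whose amount exceeds 300
selected_names = pdf[pdf.amount > 300].name.tolist()

# Group by: mean amount per name
mean_by_name = pdf.groupby('name').amount.mean().to_dict()

# Relabel columns
renamed_columns = list(
    pdf.rename(columns={'name': 'alias', 'amount': 'dollars'}).columns
)

# Drop duplicates on a single column
unique_names = pdf.name.drop_duplicates().tolist()

# Concatenate the frame with itself
stacked = pd.concat((pdf, pdf))

print(doubled)
print(mean_by_name)
```

The two rows named "Bob" are what make the group-by and drop-duplicates results differ from the raw column.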

Blaze can simplify some common IO tasks that one would otherwise do with pandas, and make them more readable. These examples use the odo library. In many cases, Blaze can handle datasets that don't fit into main memory, which is not easy to do with pandas.

import glob
import sqlalchemy as sa
from odo import odo

IO Operation: Pandas vs. Blaze

Load directory of CSV files

# pandas
df = pd.concat([pd.read_csv(filename)
                for filename in
                glob.glob('path/to/*.csv')])
# blaze
df = data('path/to/*.csv')

Save result to CSV file

# pandas
df[df.amount < 0].to_csv('output.csv')
# blaze
odo(df[df.amount < 0],
    'output.csv')

Read from SQL database

# pandas (connection string)
df = pd.read_sql('select * from t', con='sqlite:///db.db')
# pandas (SQLAlchemy engine)
df = pd.read_sql('select * from t',
                 con=sa.create_engine('sqlite:///db.db'))
# blaze
df = data('sqlite:///db.db::t')
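
For comparison with the out-of-memory point above, working through a large CSV in pandas alone usually means iterating over it in chunks by hand. The following is a minimal sketch of that pattern, assuming a CSV with an `amount` column. The helper `count_negative_amounts` and the tiny temporary file are hypothetical and exist only for illustration.

```python
import os
import tempfile

import pandas as pd


def count_negative_amounts(csv_path, chunksize=2):
    """Count rows with amount < 0 without loading the whole file at once.

    Hypothetical helper: `chunksize` is deliberately tiny here so that the
    sample file below is split into several chunks.
    """
    total = 0
    # With `chunksize` set, read_csv yields DataFrames of at most that many rows.
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        total += int((chunk.amount < 0).sum())
    return total


# Write a small sample file, then process it chunk by chunk.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.csv')
    pd.DataFrame({'amount': [100, -50, 300, -5, 20]}).to_csv(path, index=False)
    negatives = count_negative_amounts(path)

print(negatives)
```

Each reduction has to be accumulated across chunks manually, which is the bookkeeping that the Blaze expressions above leave to the library.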