Pandas to Blaze
This page maps pandas constructs to their Blaze equivalents.
Imports and Construction
import numpy as np
import pandas as pd
from blaze import data, by, join, merge, concat
# construct a DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Joe', 'Bob'],
    'amount': [100, 200, 300, 400],
    'id': [1, 2, 3, 4],
})
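# wrap the `df` DataFrame in a Blaze data object
# note: this rebinds `df`, so in the tables below the Pandas column assumes the
# original DataFrame and the Blaze column assumes this Blaze object
df = data(df)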
Computation | Pandas | Blaze |
---|---|---|
Column Arithmetic | `df.amount * 2` | `df.amount * 2` |
Multiple Columns | `df[['id', 'amount']]` | `df[['id', 'amount']]` |
Selection | `df[df.amount > 300]` | `df[df.amount > 300]` |
Group By | `df.groupby('name').amount.mean()`<br>`df.groupby(['name', 'id']).amount.mean()` | `by(df.name, amount=df.amount.mean())`<br>`by(merge(df.name, df.id), amount=df.amount.mean())` |
Join | `pd.merge(df, df2, on='name')` | `join(df, df2, 'name')` |
Map | `df.amount.map(lambda x: x + 1)` | `df.amount.map(lambda x: x + 1, 'int64')` |
Relabel Columns | `df.rename(columns={'name': 'alias', 'amount': 'dollars'})` | `df.relabel(name='alias', amount='dollars')` |
Drop duplicates | `df.drop_duplicates()`<br>`df.name.drop_duplicates()` | `df.distinct()`<br>`df.name.distinct()` |
Reductions | `df.amount.mean()`<br>`df.amount.value_counts()` | `df.amount.mean()`<br>`df.amount.count_values()` |
Concatenate | `pd.concat((df, df))` | `concat(df, df)` |
Column Type Information | `df.dtypes`<br>`df.amount.dtype` | `df.dshape`<br>`df.amount.dshape` |
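The Pandas column above can be tried directly on the sample data. The following is a minimal sketch of that side only, so it runs without Blaze installed. The name `pdf` and the second frame `cities` (standing in for the `df2` of the Join row, which this page does not define) are hypothetical and chosen for illustration.

```python
import pandas as pd

# Same sample data as the construction example, kept under the
# hypothetical name `pdf` so it is not confused with the Blaze `df`.
pdf = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Joe', 'Bob'],
    'amount': [100, 200, 300, 400],
    'id': [1, 2, 3, 4],
})

# Hypothetical second table, standing in for `df2` in the Join row.
cities = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'city': ['NYC', 'LA'],
})

# Group By: mean amount per name
mean_by_name = pdf.groupby('name').amount.mean()

# Selection: rows whose amount is strictly greater than 300
large = pdf[pdf.amount > 300]

# Join: inner join on the shared 'name' column
joined = pd.merge(pdf, cities, on='name')

# Map: add one to every amount
bumped = pdf.amount.map(lambda x: x + 1)

# Relabel Columns
renamed = pdf.rename(columns={'name': 'alias', 'amount': 'dollars'})

# Drop duplicates on a single column
unique_names = pdf.name.drop_duplicates()

# Reductions: how often each name occurs
name_counts = pdf.name.value_counts()

# Concatenate: stack the frame on top of itself
doubled = pd.concat((pdf, pdf))

print(mean_by_name.to_dict())  # {'Alice': 100.0, 'Bob': 300.0, 'Joe': 300.0}
print(large.id.tolist())       # [4]
print(unique_names.tolist())   # ['Alice', 'Bob', 'Joe']
print(len(joined))             # 3
print(len(doubled))            # 8
```

Each statement corresponds to one row of the table, so the Blaze expression in the same row is expected to describe the same result when evaluated against the Blaze `df`.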
Blaze can make some common IO tasks simpler and more readable than their pandas equivalents. These examples use the odo library. In many cases Blaze can also handle datasets that do not fit in main memory, which is hard to do with pandas alone.
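import glob               # used by the pandas CSV example
import sqlalchemy as sa   # used by the pandas SQL example
from odo import odo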
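To illustrate why larger-than-memory data is awkward in plain pandas, here is a minimal sketch of the usual workaround: streaming a CSV file in chunks and keeping running totals by hand. The file name, the tiny data set and the chunk size are hypothetical and only make the example self-contained.

```python
import os
import tempfile

import pandas as pd

# Write a small CSV so the sketch is self-contained; in practice the
# file would be too large to load with a single read_csv call.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'amounts.csv')  # hypothetical file name
pd.DataFrame({'amount': range(1, 11)}).to_csv(path, index=False)

# Stream the file in fixed-size chunks and keep only running totals,
# so at most `chunksize` rows are held in memory at a time.
total = 0
count = 0
for chunk in pd.read_csv(path, chunksize=4):
    total += chunk['amount'].sum()
    count += len(chunk)

mean_amount = total / count
print(mean_amount)  # 5.5
```

This works for a simple mean, but each new computation needs its own hand-written accumulation logic, which is the kind of bookkeeping the paragraph above refers to.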
Operation | Pandas | Blaze |
---|---|---|
Load directory of CSV files | `df = pd.concat([pd.read_csv(filename) for filename in glob.glob('path/to/*.csv')])` | `df = data('path/to/*.csv')` |
Save result to CSV file | `df[df.amount < 0].to_csv('output.csv')` | `odo(df[df.amount < 0], 'output.csv')` |
Read from SQL database | `df = pd.read_sql('select * from t', con='sqlite:///db.db')`<br>`df = pd.read_sql('select * from t', con=sa.create_engine('sqlite:///db.db'))` | `df = data('sqlite:///db.db::t')` |
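The Pandas column of the IO table can be exercised end to end with temporary files. The sketch below covers that side only. The file names and sample values are hypothetical, and it departs from the table in two ways: it passes a standard-library `sqlite3` connection to `pd.read_sql` so that SQLAlchemy is not required, and it writes CSV files with `index=False` so they round-trip cleanly.

```python
import glob
import os
import sqlite3
import tempfile

import pandas as pd

workdir = tempfile.mkdtemp()

# Two hypothetical input files standing in for 'path/to/*.csv'.
pd.DataFrame({'amount': [10, -5]}).to_csv(
    os.path.join(workdir, 'a.csv'), index=False)
pd.DataFrame({'amount': [-20, 30]}).to_csv(
    os.path.join(workdir, 'b.csv'), index=False)

# Load directory of CSV files: read every match and stack the results.
# sorted() fixes the row order, since glob does not guarantee one.
filenames = sorted(glob.glob(os.path.join(workdir, '*.csv')))
frame = pd.concat([pd.read_csv(f) for f in filenames], ignore_index=True)

# Save result to CSV file: keep only the negative amounts.
out_path = os.path.join(workdir, 'output.csv')
frame[frame.amount < 0].to_csv(out_path, index=False)
saved = pd.read_csv(out_path)

# Read from SQL database: store the frame as table `t` in an in-memory
# SQLite database, then query it back.
con = sqlite3.connect(':memory:')
frame.to_sql('t', con, index=False)
from_sql = pd.read_sql('select * from t', con)
con.close()

print(frame.amount.tolist())     # [10, -5, -20, 30]
print(saved.amount.tolist())     # [-5, -20]
print(from_sql.amount.tolist())  # [10, -5, -20, 30]
```

Compared with the Blaze column, the pandas version has to spell out the file loop and the connection handling explicitly, which is the readability difference the table is meant to show.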