Interacting with SQL Databases

How to

Typically one provides a SQL connection string to the data constructor

>>> db = data('postgresql:///user:pass@hostname')  

or

>>> t = data('postgresql://user:pass@hostname::my-table-name')  

Alternatively users familiar with SQLAlchemy can pass any SQLAlchemy engine, metadata, or Table objects to data. This can be useful if you need to specify more information that does not fit comfortably into a URI (like a desired schema.)

>>> import sqlalchemy  
>>> engine = sqlalchemy.create_engine('postgreqsql://hostname')  

>>> db = data(engine)  

How does it work?

As you manipulate a Blaze expression Blaze in turn manipulates a SQLAlchemy expression. When you ask for a result SQLAlchemy generates the SQL appropriate for your database and sends the query to the database to be run.

What databases does Blaze support?

Blaze derives all SQL support from SQLAlchemy so really one should ask, What databases does SQLAlchemy support?. The answer is quite a few in the main SQLAlchemy project and most when you include third party libraries.

However, URI support within Blaze is limited to a smaller set. For exotic databases you may have to create a sqlalchemy.engine explicitly as shown above.

What operations work on SQL databases?

Most tabular operations, but not all. SQLAlchemy translation is a high priority. Failures include array operations like slicing and dot products don’t make sense in SQL. Additionally some operations like datetime access are not yet well supported through SQLAlchemy. Finally some databases, like SQLite, have limited support for common mathematical functions like sin.

How can I try this out?

The easiest way to play with SQL is to download a SQLite database. We recommend the Lahman baseball statistics database. After downloading one could connect blaze to that database with the following code

>>> from blaze import data
>>> db = data('sqlite:///Downloads/lahman2013.sqlite') 
>>> db.<tab>  # see available tables  
db.AllstarFull          db.FieldingOF           db.Schools              db.fields
db.Appearances          db.FieldingPost         db.SchoolsPlayers       db.isidentical
db.AwardsManagers       db.HallOfFame           db.SeriesPost           db.like
db.AwardsPlayers        db.Managers             db.Teams                db.map
db.AwardsShareManagers  db.ManagersHalf         db.TeamsFranchises      db.relabel
db.AwardsSharePlayers   db.Master               db.TeamsHalf            db.schema
db.Batting              db.Pitching             db.apply                db.temp
db.BattingPost          db.PitchingPost         db.data
db.Fielding             db.Salaries             db.dshape
>>> db.Teams.peek()  # view one particular database  
    yearID lgID teamID franchID divID  Rank   G  Ghome   W   L     ...       \
0     1871   NA    BS1      BNA  None     3  31    NaN  20  10     ...
1     1871   NA    CH1      CNA  None     2  28    NaN  19   9     ...
2     1871   NA    CL1      CFC  None     8  29    NaN  10  19     ...
3     1871   NA    FW1      KEK  None     7  19    NaN   7  12     ...

    DP    FP                     name                               park  \
0  NaN  0.83     Boston Red Stockings                South End Grounds I
1  NaN  0.82  Chicago White Stockings            Union Base-Ball Grounds
2  NaN  0.81   Cleveland Forest Citys       National Association Grounds
3  NaN  0.80     Fort Wayne Kekiongas                     Hamilton Field

    attendance  BPF  PPF  teamIDBR  teamIDlahman45  teamIDretro
0          NaN  103   98       BOS             BS1          BS1
1          NaN  104  102       CHI             CH1          CH1
2          NaN   96  100       CLE             CL1          CL1
3          NaN  101  107       KEK             FW1          FW1
...

One can then query and compute results as with a normal blaze workflow.

Connecting to a Schema Other than public with PostgreSQL

To connect to a non-default schema, one may pass a sqlalchemy.MetaData object to data. For example:

>>> from blaze import data
>>> from sqlalchemy import MetaData
>>> ds = data(MetaData('postgresql://localhost/test', schema='my_schema')) 
>>> ds.dshape  
dshape("{table_a: var * {a: ?int32}, table_b: var * {b: ?int32}}")

Foreign Keys and Automatic Joins

Often times one wants to access the columns of a table into which we have a foreign key.

For example, given a products table with this schema:

create table products (
    id integer primary key,
    name text
)

and an orders table with this schema:

create table orders (
    id integer primary key,
    product_id integer references (id) products,
    quantity integer
)

we want to get the name of the products in every order. In SQL, you would write the following join:

select
    o.id, p.name
from
    orders o
        inner join
    products p
        on o.product_id = p.id

This is fairly straightforward. However, when you have more than two joins the SQL gets unruly and hard to read. What we really want is a syntactically simply way to follow the chain of foreign key relationships and be able to access columns in foreign tables without having to write a lot of code. This is where blaze comes in.

Blaze can generate the above joins for you, so instead of writing a bunch of joins in SQL you can simply access the columns of a foreign table as if they were columns on the foreign key column.

The previous example in blaze looks like this:

>>> from blaze import data, compute
>>> d = data('postgresql://localhost/db')  
>>> d.fields  
['products', 'orders']
>>> expr = d.orders.product_id.name  
>>> print(compute(expr))  
SELECT orders.id, p.name
FROM orders as o, products as p
WHERE o.product_id = p.id

Warning

The above feature is very experimental right now. We would appreciate bug reports and feedback on the API.