Server¶
Blaze provides uniform access to a variety of common data formats. Blaze Server builds on this uniform interface to host data remotely through a JSON web API.
Setting up a Blaze Server¶
To demonstrate the use of the Blaze server we serve the iris CSV file.
>>> # Server code, run this once. Leave running.
>>> from blaze import *
>>> from blaze.utils import example
>>> csv = CSV(example('iris.csv'))
>>> data(csv).peek()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
...
Then we host this publicly on port 6363:
from blaze.server import Server
server = Server(csv)
server.run(host='0.0.0.0', port=6363)
A Server is built from the following:
- A dataset that Blaze understands, or a dictionary of such datasets
- A Flask app
With this code our machine is now hosting our CSV file through a web application on port 6363. We can now access our CSV file, through Blaze, as a service from a variety of applications.
Serving Data from the Command Line¶
Blaze ships with a command line tool called blaze-server to serve up data
specified in a YAML file.
Note
To use the YAML specification feature of Blaze server please install
the pyyaml library. This can be done easily with conda:
conda install pyyaml
YAML Specification¶
The structure of the specification file is as follows:
name1:
  source: path or uri
  dshape: optional datashape

name2:
  source: path or uri
  dshape: optional datashape

...

nameN:
  source: path or uri
  dshape: optional datashape
Note
When source is a directory, Blaze will recurse into the directory tree
and call odo.resource on the leaves of the tree.
Here’s an example specification file:
iriscsv:
  source: ../examples/data/iris.csv

irisdb:
  source: sqlite:///../examples/data/iris.db

accounts:
  source: ../examples/data/accounts.json.gz
  dshape: "var * {name: string, amount: float64}"
The previous YAML specification will serve the following dictionary:
>>> from odo import resource
>>> resources = {
...     'iriscsv': resource('../examples/data/iris.csv'),
...     'irisdb': resource('sqlite:///../examples/data/iris.db'),
...     'accounts': resource('../examples/data/accounts.json.gz',
...                          dshape="var * {name: string, amount: float64}")
... }
The only required key for each named data source is the source key, which
is passed to odo.resource. You can optionally specify a dshape
parameter, which is passed into odo.resource along with the source key.
Advanced YAML usage¶
If odo.resource requires extra keyword arguments for a particular resource
type and they are provided in the YAML file, these will be forwarded on to the
resource call.
If there is an imports entry for a resource whose value is a list of module
or package names, Blaze server will import each of these modules or
packages before calling resource.
For example:
name1:
  source: path or uri
  dshape: optional datashape
  kwarg1: extra kwarg
  kwarg2: etc.

name2:
  source: path or uri
  imports: ['mod1', 'pkg2']
For this YAML file, Blaze server will pass on kwarg1=... and kwarg2=...
to the resource() call for name1 in addition to the dshape=...
keyword argument.
Also, before calling resource on the source of name2, Blaze server
will first execute an import mod1 and import pkg2 statement.
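As a rough sketch of what this amounts to (kwarg1, kwarg2, mod1 and pkg2 are the placeholder names from the YAML above, not real arguments or modules), the server behaves roughly as if it ran:
>>> import importlib
>>> from odo import resource
>>> # name1: extra YAML keys are forwarded as keyword arguments to resource()
>>> name1 = resource('path or uri', dshape='optional datashape',
...                  kwarg1='extra kwarg', kwarg2='etc.')  # doctest: +SKIP
>>> # name2: the listed modules are imported first so that any resource
>>> # handlers they register become available, then resource() is called
>>> for mod in ['mod1', 'pkg2']:  # doctest: +SKIP
...     importlib.import_module(mod)
>>> name2 = resource('path or uri')  # doctest: +SKIP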
Command Line Interface¶
- UNIX
# YAML file specifying resources to load and optionally their datashape
$ cat example.yaml
iriscsv:
  source: ../examples/data/iris.csv
irisdb:
  source: sqlite:///../examples/data/iris.db
accounts:
  source: ../examples/data/accounts.json.gz
  dshape: "var * {name: string, amount: float64}"

# serve data specified in a YAML file and follow symbolic links
$ blaze-server example.yaml --follow-links

# You can also construct a YAML file from a heredoc to pipe to blaze-server
$ cat <<EOF | blaze-server
datadir:
  source: /path/to/data/directory
EOF
- Windows
# If you're on Windows you can do this with powershell
PS C:\> @'
datadir:
  source: C:\path\to\data\directory
'@ | blaze-server
Interacting with the Web Server from the Client¶
Computation is now available on this server at
localhost:6363/compute.json. To communicate the computation to be done
we pass Blaze expressions in JSON format through the request. See the examples
below.
Fully Interactive Python-to-Python Remote work¶
The highest level of abstraction, and the level at which most users will probably want to work, is interactively sending computations to a Blaze server process from a client.
We can use Blaze server to have one Blaze process control another. Given our iris web server, we can use Blaze on the client to drive the server to do work for us:
# Client code, run this in a separate process from the Server
>>> from blaze import data, by
>>> t = data('blaze://localhost:6363') # doctest: +SKIP
>>> t # doctest: +SKIP
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
...
>>> by(t.species, min=t.petal_length.min(),
... max=t.petal_length.max()) # doctest: +SKIP
species max min
0 Iris-virginica 6.9 4.5
1 Iris-setosa 1.9 1.0
2 Iris-versicolor 5.1 3.0
We interact on the client machine through the data object, but computations on this object trigger communication through the web API, resulting in seamlessly interactive remote computation.
The blaze server and client can be configured to support various serialization
formats. These formats are exposed in the blaze.server module. The server
and client must both be told to use the same serialization format.
For example:
# Server setup.
>>> from blaze import Server
>>> from blaze.server import msgpack_format, json_format
>>> Server(my_data, formats=(msgpack_format, json_format)).run() # doctest: +SKIP
# Client code, run this in a separate process from the Server
>>> from blaze import Client, data
>>> from blaze.server import msgpack_format, json_format
>>> msgpack_client = data(Client('localhost', msgpack_format)) # doctest: +SKIP
>>> json_client = data(Client('localhost', json_format)) # doctest: +SKIP
In this example, msgpack_client will make requests to the
/compute.msgpack endpoint and will send and receive data using the msgpack
protocol; however, the json_client will make requests to the
/compute.json endpoint and will send and receive data using the json
protocol.
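Once constructed, either handle is used like any other interactive Blaze data object; only the wire format differs. For example (a sketch, assuming my_data is the iris table from earlier; the first query travels over /compute.msgpack, the second over /compute.json):
>>> msgpack_client.species.distinct()  # doctest: +SKIP
>>> json_client.petal_length.mean()  # doctest: +SKIP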
Using the Python Requests Library¶
Moving down the stack, we can interact with the Blaze server at the HTTP request
level using the requests library.
# Client code, run this in a separate process from the Server
>>> import json
>>> import requests
>>> query = {'expr': {'op': 'sum',
... 'args': [{'op': 'Field',
... 'args': [':leaf', 'petal_length']}]}}
>>> r = requests.get('http://localhost:6363/compute.json',
... data=json.dumps(query),
... headers={'Content-Type': 'application/vnd.blaze+json'}) # doctest: +SKIP
>>> json.loads(r.content) # doctest: +SKIP
{u'data': 563.8000000000004,
u'names': ['petal_length_sum'],
u'datashape': u'{petal_length_sum: float64}'}
Now we use Blaze to generate the query programmatically:
>>> from blaze import symbol
>>> from blaze.server import to_tree
>>> from pprint import pprint
>>> # Build a Symbol like our served iris data
>>> dshape = """var * {
... sepal_length: float64,
... sepal_width: float64,
... petal_length: float64,
... petal_width: float64,
... species: string
... }""" # matching schema to csv file
>>> t = symbol('t', dshape)
>>> expr = t.petal_length.sum()
>>> d = to_tree(expr, names={t: ':leaf'})
>>> query = {'expr': d}
>>> pprint(query)
{'expr': {'args': [{'args': [':leaf', 'petal_length'], 'op': 'Field'},
                   [0],
                   False],
          'op': 'sum'}}
Alternatively we build a query to grab a single column
>>> pprint(to_tree(t.species, names={t: ':leaf'}))
{'args': [':leaf', 'species'], 'op': 'Field'}
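The generated tree can then be posted to the server just like the hand-written query earlier; a minimal sketch, assuming the iris server from the previous sections is still running on port 6363:
>>> query = {'expr': to_tree(t.species, names={t: ':leaf'})}
>>> r = requests.get('http://localhost:6363/compute.json',
...                  data=json.dumps(query),
...                  headers={'Content-Type': 'application/vnd.blaze+json'})  # doctest: +SKIP
>>> json.loads(r.content)['datashape']  # doctest: +SKIP
'var * {species: string}'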
Using curl¶
In fact, any tool that is capable of sending requests to a server is able to send computations to a Blaze server.
We can use standard command line tools such as curl to interact with the
server:
$ curl \
-H "Content-Type: application/vnd.blaze+json" \
-d '{"expr": {"op": "Field", "args": [":leaf", "species"]}}' \
localhost:6363/compute.json
{
"data": [
"Iris-setosa",
"Iris-setosa",
...
],
"datashape": "var * {species: string}",
}
$ curl \
-H "Content-Type: application/vnd.blaze+json" \
-d '{"expr": {"op": "sum",
"args": [{"op": "Field",
"args": [":leaf", "petal_length"]}]}}' \
localhost:6363/compute.json
{
"data": 563.8000000000004,
"datashape": "{petal_length_sum: float64}",
}
These queries deconstruct the Blaze expression as nested JSON. The ":leaf"
string is a special case pointing to the base data. Constructing these queries
by hand can be difficult; fortunately, Blaze can help you build them.
Adding Data to the Server¶
Data resources can be added to the server from the client by sending a resource URI to the server. For the server to accept additions, the data it initially hosts must have a dictionary-like interface.
>>> from blaze.utils import example
>>> query = {'accounts': example('accounts.csv')}
>>> r = requests.get('http://localhost:6363/add',
... data=json.dumps(query),
... headers={'Content-Type': 'application/vnd.blaze+json'}) # doctest: +SKIP
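If the request succeeds, the new dataset should appear alongside the existing ones. A sketch of checking this from a client, assuming the server was started with a dictionary of datasets as required above:
>>> r.ok  # doctest: +SKIP
True
>>> t = data('blaze://localhost:6363')  # doctest: +SKIP
>>> t.accounts.peek()  # doctest: +SKIP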
Advanced Use¶
Blaze servers may host any data that Blaze understands, from a single integer
>>> server = Server(1)
To a dictionary of several heterogeneous datasets
>>> server = Server({
... 'my-dataframe': df,
... 'iris': resource('iris.csv'),
... 'baseball': resource('sqlite:///baseball-statistics.db')
... })
A variety of hosting options are available through the Flask project:
>>> help(server.app.run)
Help on method run in module flask.app:
run(self, host=None, port=None, debug=None, **options) method of flask.app.Flask instance
Runs the application on a local development server. If the
:attr:`debug` flag is set the server will automatically reload
for code changes and show a debugger in case an exception happened.
...
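For instance, any keyword argument accepted by Flask's run may be passed straight through; the host, port, and debug settings below are purely illustrative:
>>> server.app.run(host='127.0.0.1', port=8080, debug=True)  # doctest: +SKIP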
Caching¶
Caching results of frequently run queries may significantly improve the user experience in some cases. One may wrap a Blaze server in a traditional web-based caching system like memcached, or use a data-centric solution.
The Blaze CachedDataset might be appropriate in some situations. A cached
dataset holds a normal dataset and a dict-like object.
>>> dset = {'my-dataframe': df,
... 'iris': resource('iris.csv'),
... 'baseball': resource('sqlite:///baseball-statistics.db')}
>>> from blaze.cached import CachedDataset
>>> cached = CachedDataset(dset, cache=dict())
Queries and results executed against a cached dataset are stored in the cache
(here a normal Python dict) for fast future access.
If accumulated results are likely to fill up memory then other, on-disk
dict-like structures can be used like Shove or Chest.
>>> from chest import Chest
>>> cached = CachedDataset(dset, cache=Chest())
These cached objects can be used anywhere normal objects can be used in Blaze,
including an interactive (and now performance cached) data object
>>> d = data(cached)
or a Blaze server
>>> server = Server(cached)
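For example, repeated queries against the interactive object are computed once and then answered from the cache on subsequent calls (a sketch, assuming the dset dictionary defined above):
>>> d = data(cached)  # doctest: +SKIP
>>> d.iris.species.distinct()  # computed and stored in the cache  # doctest: +SKIP
>>> d.iris.species.distinct()  # answered from the cache  # doctest: +SKIP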
Flask Blueprint¶
If you would like to use the Blaze server endpoints from within another Flask application, you can register the Blaze API blueprint with your application. For example:
>>> from blaze.server import api, json_format
>>> my_app.register_blueprint(api, data=my_data, formats=(json_format,))
When registering the API, you must pass the data that the API endpoints will serve. You must also pass an iterable of serialization format objects that the server will respond to.
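A slightly fuller sketch, where the host application and the dictionary of served data are stand-ins invented for illustration:
>>> from flask import Flask
>>> from odo import resource
>>> from blaze.utils import example
>>> from blaze.server import api, json_format
>>> my_app = Flask('my_app')  # hypothetical host application
>>> my_data = {'iris': resource(example('iris.csv'))}  # datasets the endpoints will serve
>>> my_app.register_blueprint(api, data=my_data, formats=(json_format,))  # doctest: +SKIP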
Profiling¶
The blaze server allows users and server administrators to profile computations run on the server. This helps developers understand the performance profile of their computations so they can better tune their queries or the backend code executing them. The profiling also tracks the time spent serializing the data.
By default, blaze servers will not allow profiling. To enable profiling on the
blaze server, pass allow_profiler=True to the
Server object. Now when we try to compute against
this server, we may pass profile=True to compute. For example:
>>> client = Client(...)
>>> compute(expr, client, profile=True)
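The corresponding server-side setup is a single keyword argument (my_data stands in for whatever dataset the server hosts):
>>> Server(my_data, allow_profiler=True).run()  # doctest: +SKIP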
After running the above code, the server will have written a new pstats file
containing the results of the run. This file will be found at:
profiler_output/<md5>/<timestamp>. We use the md5 hash of the str of the
expression so that users can more easily track down their stats
information. Users can find the hash of their expression with
expr_md5().
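For example, assuming expr_md5 can be imported from blaze.server as referenced above:
>>> from blaze.server import expr_md5  # doctest: +SKIP
>>> expr_md5(expr)  # doctest: +SKIP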
The profiler output directory may be configured with the profiler_output
argument to the Server.
Clients may also request that the profiling data be sent back in the response so that analysis may happen on the client. To do this, we change our call to compute to look like:
>>> from io import BytesIO
>>> buf = BytesIO()
>>> compute(expr, client, profile=True, profiler_output=buf)
After that computation, buf will have the marshalled stats data suitable
for reading with pstats. This feature is useful when blaze servers are
being run behind a load balancer and we do not want to search all of the servers
to find the output.
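One way to inspect the returned buffer is to write it to a file and load it with pstats; a minimal sketch (the filename is arbitrary):
>>> import pstats
>>> with open('remote-profile.pstats', 'wb') as f:  # doctest: +SKIP
...     f.write(buf.getvalue())
>>> pstats.Stats('remote-profile.pstats').sort_stats('cumulative').print_stats(10)  # doctest: +SKIP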
Note
Because the data is serialized with marshal it must be read by the
same version of python as the server. This means that a python 2 client
cannot unmarshal the data written by a python 3 server. This is to conform
with the file format expected by pstats, the standard profiling output
inspection library.
System administrators may also configure all computations to be profiled by
default. This is useful if the client code cannot easily be changed or if
passing extra arguments to compute is awkward in an application setting. This may be set with
profile_by_default=True when constructing the server.
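For example, a server that profiles every computation and writes its output to a custom directory (my_data and the path are illustrative):
>>> Server(my_data,
...        allow_profiler=True,
...        profile_by_default=True,
...        profiler_output='profiler_output').run()  # doctest: +SKIP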
Conclusion¶
Because this process builds on Blaze expressions, it works equally well for data stored in any format that Blaze understands, including in-memory DataFrames, SQL/Mongo databases, or even Spark clusters.