======
Server
======

Blaze provides uniform access to a variety of common data formats.  Blaze
Server builds off of this uniform interface to host data remotely through a
JSON web API.

Setting up a Blaze Server
=========================

To demonstrate the use of the Blaze server we serve the iris csv file.

.. code-block:: python

   >>> # Server code, run this once.  Leave running.
   >>> from blaze import *
   >>> from blaze.utils import example
   >>> csv = CSV(example('iris.csv'))
   >>> data(csv).peek()
      sepal_length  sepal_width  petal_length  petal_width      species
   0           5.1          3.5           1.4          0.2  Iris-setosa
   1           4.9          3.0           1.4          0.2  Iris-setosa
   2           4.7          3.2           1.3          0.2  Iris-setosa
   3           4.6          3.1           1.5          0.2  Iris-setosa
   4           5.0          3.6           1.4          0.2  Iris-setosa
   5           5.4          3.9           1.7          0.4  Iris-setosa
   6           4.6          3.4           1.4          0.3  Iris-setosa
   7           5.0          3.4           1.5          0.2  Iris-setosa
   8           4.4          2.9           1.4          0.2  Iris-setosa
   9           4.9          3.1           1.5          0.1  Iris-setosa
   ...

Then we host this publicly on port 6363.

.. code-block:: python

   from blaze.server import Server

   server = Server(csv)
   server.run(host='0.0.0.0', port=6363)

A Server consists of the following:

1. A dataset that Blaze understands, or a dictionary of such datasets
2. A Flask_ app

With this code our machine is now hosting our CSV file through a web
application on port 6363.  We can now access our CSV file, through Blaze, as a
service from a variety of applications.

Serving Data from the Command Line
==================================

Blaze ships with a command line tool called ``blaze-server`` to serve up data
specified in a YAML file.

.. note::

   To use the YAML specification feature of Blaze server please install the
   :mod:`pyyaml` library.  This can be done easily with ``conda``:

   .. code-block:: sh

      conda install pyyaml

YAML Specification
------------------

The structure of the specification file is as follows:

.. code-block:: yaml

   name1:
     source: path or uri
     dshape: optional datashape

   name2:
     source: path or uri
     dshape: optional datashape

   ...

   nameN:
     source: path or uri
     dshape: optional datashape

.. note::

   When ``source`` is a directory, Blaze will recurse into the directory tree
   and call ``odo.resource`` on the leaves of the tree.

Here's an example specification file:

.. code-block:: yaml

   iriscsv:
     source: ../examples/data/iris.csv

   irisdb:
     source: sqlite:///../examples/data/iris.db

   accounts:
     source: ../examples/data/accounts.json.gz
     dshape: "var * {name: string, amount: float64}"

The previous YAML specification will serve the following dictionary:

.. code-block:: python

   >>> from odo import resource
   >>> resources = {
   ...     'iriscsv': resource('../examples/data/iris.csv'),
   ...     'irisdb': resource('sqlite:///../examples/data/iris.db'),
   ...     'accounts': resource('../examples/data/accounts.json.gz',
   ...                          dshape="var * {name: string, amount: float64}")
   ... }

The only required key for each named data source is the ``source`` key, which
is passed to ``odo.resource``.  You can optionally specify a ``dshape``
parameter, which is passed to ``odo.resource`` along with ``source``.

Advanced YAML usage
-------------------

If ``odo.resource`` requires extra keyword arguments for a particular resource
type and they are provided in the YAML file, they will be forwarded on to the
``resource`` call.  If there is an ``imports`` entry for a resource whose
value is a list of module or package names, Blaze server will ``import`` each
of these modules or packages before calling ``resource``.  For example:

.. code-block:: yaml

   name1:
     source: path or uri
     dshape: optional datashape
     kwarg1: extra kwarg
     kwarg2: etc.

   name2:
     source: path or uri
     imports: ['mod1', 'pkg2']

For this YAML file, Blaze server will pass on ``kwarg1=...`` and ``kwarg2=...``
to the ``resource()`` call for ``name1`` in addition to the ``dshape=...``
keyword argument.  Also, before calling ``resource`` on the ``source`` of
``name2``, Blaze server will first execute an ``import mod1`` and an
``import pkg2`` statement.
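For illustration, the server's handling of this file is roughly equivalent to
the following Python sketch.  The sources, datashape, keyword arguments, and
module names are the placeholders from the YAML above, not real resources:

.. code-block:: python

   >>> import importlib
   >>> from odo import resource

   >>> # name1: extra YAML keys become keyword arguments to resource()
   >>> name1 = resource('path or uri', dshape='optional datashape',
   ...                  kwarg1='extra kwarg', kwarg2='etc.')  # doctest: +SKIP

   >>> # name2: listed modules are imported before resource() is called
   >>> importlib.import_module('mod1')  # doctest: +SKIP
   >>> importlib.import_module('pkg2')  # doctest: +SKIP
   >>> name2 = resource('path or uri')  # doctest: +SKIP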
Command Line Interface
----------------------

1. UNIX

   .. code-block:: shell

      # YAML file specifying resources to load and optionally their datashape
      $ cat example.yaml
      iriscsv:
        source: ../examples/data/iris.csv

      irisdb:
        source: sqlite:///../examples/data/iris.db

      accounts:
        source: ../examples/data/accounts.json.gz
        dshape: "var * {name: string, amount: float64}"

      # serve data specified in a YAML file and follow symbolic links
      $ blaze-server example.yaml --follow-links

2. Windows (PowerShell)

   .. code-block:: powershell

      # You can also construct a YAML file from a here-string to pipe to blaze-server
      @'
      datadir:
        source: C:\path\to\data\directory
      '@ | blaze-server

Interacting with the Web Server from the Client
===============================================

Computation is now available on this server at
``localhost:6363/compute.json``.  To communicate the computation to be done we
pass Blaze expressions in JSON format through the request.  See the examples
below.

Fully Interactive Python-to-Python Remote Work
----------------------------------------------

The highest level of abstraction, and the level at which most users will
probably want to work, is interactively sending computations to a Blaze server
process from a client.  We can use Blaze server to have one Blaze process
control another.  Given our iris web server, we can use Blaze on the client to
drive the server to do work for us.

.. code-block:: python

   # Client code, run this in a separate process from the Server
   >>> from blaze import data, by
   >>> t = data('blaze://localhost:6363')  # doctest: +SKIP
   >>> t  # doctest: +SKIP
      sepal_length  sepal_width  petal_length  petal_width      species
   0           5.1          3.5           1.4          0.2  Iris-setosa
   1           4.9          3.0           1.4          0.2  Iris-setosa
   2           4.7          3.2           1.3          0.2  Iris-setosa
   3           4.6          3.1           1.5          0.2  Iris-setosa
   4           5.0          3.6           1.4          0.2  Iris-setosa
   5           5.4          3.9           1.7          0.4  Iris-setosa
   6           4.6          3.4           1.4          0.3  Iris-setosa
   7           5.0          3.4           1.5          0.2  Iris-setosa
   8           4.4          2.9           1.4          0.2  Iris-setosa
   9           4.9          3.1           1.5          0.1  Iris-setosa
   ...

   >>> by(t.species, min=t.petal_length.min(),
   ...               max=t.petal_length.max())  # doctest: +SKIP
              species  max  min
   0   Iris-virginica  6.9  4.5
   1      Iris-setosa  1.9  1.0
   2  Iris-versicolor  5.1  3.0

We interact on the client machine through the data object, but computations on
this object cause communication through the web API, resulting in seamlessly
interactive remote computation.

The Blaze server and client can be configured to support various serialization
formats.  These formats are exposed in the :mod:`blaze.server` module.  The
server and client must both be told to use the same serialization format.  For
example:

.. code-block:: python

   # Server setup.
   >>> from blaze import Server
   >>> from blaze.server import msgpack_format, json_format
   >>> Server(my_data, formats=(msgpack_format, json_format)).run()  # doctest: +SKIP

   # Client code, run this in a separate process from the Server
   >>> from blaze import Client, data
   >>> from blaze.server import msgpack_format, json_format
   >>> msgpack_client = data(Client('localhost', msgpack_format))  # doctest: +SKIP
   >>> json_client = data(Client('localhost', json_format))  # doctest: +SKIP

In this example, ``msgpack_client`` will make requests to the
``/compute.msgpack`` endpoint and will send and receive data using the msgpack
protocol, while ``json_client`` will make requests to the ``/compute.json``
endpoint and will send and receive data using the json protocol.

Using the Python Requests Library
---------------------------------

Moving down the stack, we can interact with the Blaze server at the HTTP
request level using the ``requests`` library.

.. code-block:: python

   # Client code, run this in a separate process from the Server
   >>> import json
   >>> import requests
   >>> query = {'expr': {'op': 'sum',
   ...                   'args': [{'op': 'Field',
   ...                             'args': [':leaf', 'petal_length']}]}}
   >>> r = requests.get('http://localhost:6363/compute.json',
   ...                  data=json.dumps(query),
   ...                  headers={'Content-Type': 'application/vnd.blaze+json'})  # doctest: +SKIP
   >>> json.loads(r.content)  # doctest: +SKIP
   {u'data': 563.8000000000004,
    u'names': ['petal_length_sum'],
    u'datashape': u'{petal_length_sum: float64}'}

Now we use Blaze to generate the query programmatically:

.. code-block:: python

   >>> from blaze import symbol
   >>> from blaze.server import to_tree
   >>> from pprint import pprint

   >>> # Build a symbol like our served iris data
   >>> dshape = """var * {
   ...     sepal_length: float64,
   ...     sepal_width: float64,
   ...     petal_length: float64,
   ...     petal_width: float64,
   ...     species: string
   ... }"""  # matching schema to csv file
   >>> t = symbol('t', dshape)

   >>> expr = t.petal_length.sum()
   >>> d = to_tree(expr, names={t: ':leaf'})
   >>> query = {'expr': d}
   >>> pprint(query)
   {'expr': {'args': [{'args': [':leaf', 'petal_length'], 'op': 'Field'},
                      [0],
                      False],
             'op': 'sum'}}

Alternatively, we build a query to grab a single column:

.. code-block:: python

   >>> pprint(to_tree(t.species, names={t: ':leaf'}))
   {'args': [':leaf', 'species'], 'op': 'Field'}

Using ``curl``
--------------

In fact, any tool capable of sending requests to a server is able to send
computations to a Blaze server.  We can use standard command line tools such
as ``curl`` to interact with the server::

   $ curl \
       -H "Content-Type: application/vnd.blaze+json" \
       -d '{"expr": {"op": "Field", "args": [":leaf", "species"]}}' \
       localhost:6363/compute.json
   {
     "data": [
       "Iris-setosa",
       "Iris-setosa",
       ...
     ],
     "datashape": "var * {species: string}"
   }

   $ curl \
       -H "Content-Type: application/vnd.blaze+json" \
       -d '{"expr": {"op": "sum",
                     "args": [{"op": "Field",
                               "args": [":leaf", "petal_length"]}]}}' \
       localhost:6363/compute.json
   {
     "data": 563.8000000000004,
     "datashape": "{petal_length_sum: float64}"
   }

These queries deconstruct the Blaze expression as nested JSON.  The
``":leaf"`` string is a special case pointing to the base data.  Constructing
these queries by hand can be difficult; fortunately, Blaze can help you build
them.
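For instance, here is a small sketch that reuses ``expr``, ``to_tree``,
``json``, and ``requests`` from the examples above and assumes the iris server
is still running; the programmatically generated tree is submitted exactly
like the hand-written one:

.. code-block:: python

   >>> # Build the query with to_tree and send it over HTTP
   >>> query = {'expr': to_tree(expr, names={t: ':leaf'})}
   >>> r = requests.get('http://localhost:6363/compute.json',
   ...                  data=json.dumps(query),
   ...                  headers={'Content-Type': 'application/vnd.blaze+json'})  # doctest: +SKIP
   >>> json.loads(r.content)['data']  # doctest: +SKIP
   563.8000000000004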
Adding Data to the Server
-------------------------

Data resources can be added to the server from the client by sending a
resource URI to the server.  The data initially on the server must have a
dictionary-like interface to be updated.

.. code-block:: python

   >>> from blaze.utils import example
   >>> query = {'accounts': example('accounts.csv')}
   >>> r = requests.get('http://localhost:6363/add',
   ...                  data=json.dumps(query),
   ...                  headers={'Content-Type': 'application/vnd.blaze+json'})  # doctest: +SKIP

Advanced Use
------------

Blaze servers may host any data that Blaze understands, from a single integer

.. code-block:: python

   >>> server = Server(1)

to a dictionary of several heterogeneous datasets

.. code-block:: python

   >>> server = Server({
   ...     'my-dataframe': df,
   ...     'iris': resource('iris.csv'),
   ...     'baseball': resource('sqlite:///baseball-statistics.db')
   ... })  # doctest: +SKIP

A variety of hosting options are available through the Flask_ project::

   >>> help(server.app.run)  # doctest: +SKIP
   Help on method run in module flask.app:

   run(self, host=None, port=None, debug=None, **options) method of flask.app.Flask instance
       Runs the application on a local development server.

       If the :attr:`debug` flag is set the server will automatically reload
       for code changes and show a debugger in case an exception happened.
       ...

Caching
-------

Caching the results of frequently run queries may significantly improve the
user experience in some cases.  One may wrap a Blaze server in a traditional
web-based caching system like memcached, or use a data-centric solution.  The
Blaze ``CachedDataset`` might be appropriate in some situations.  A cached
dataset holds a normal dataset and a ``dict``-like object.

.. code-block:: python

   >>> dset = {'my-dataframe': df,
   ...         'iris': resource('iris.csv'),
   ...         'baseball': resource('sqlite:///baseball-statistics.db')}  # doctest: +SKIP

   >>> from blaze.cached import CachedDataset  # doctest: +SKIP
   >>> cached = CachedDataset(dset, cache=dict())  # doctest: +SKIP

Queries and results executed against a cached dataset are stored in the cache
(here a normal Python :class:`dict`) for fast future access.

If accumulated results are likely to fill up memory, then other on-disk
``dict``-like structures such as Shove_ or Chest_ can be used.

.. code-block:: python

   >>> from chest import Chest  # doctest: +SKIP
   >>> cached = CachedDataset(dset, cache=Chest())  # doctest: +SKIP

These cached objects can be used anywhere normal objects can be used in Blaze,
including an interactive (and now performance-cached) ``data`` object

.. code-block:: python

   >>> d = data(cached)  # doctest: +SKIP

or a Blaze server

.. code-block:: python

   >>> server = Server(cached)  # doctest: +SKIP

Flask Blueprint
---------------

If you would like to use the Blaze server endpoints from within another Flask
application, you can register the Blaze API blueprint with your application.
For example:

.. code-block:: python

   >>> from blaze.server import api, json_format
   >>> my_app.register_blueprint(api, data=my_data, formats=(json_format,))  # doctest: +SKIP

When registering the API, you must pass the data that the API endpoints will
serve.  You must also pass an iterable of serialization format objects that
the server will respond to.

Profiling
---------

The Blaze server allows users and server administrators to profile
computations run on the server.  This helps developers understand the
performance profile of their computations so they can better tune their
queries or the backend code that executes them.  This profiling also tracks
the time spent serializing the data.

By default, Blaze servers will not allow profiling.  To enable profiling on
the Blaze server, pass ``allow_profiler=True`` to the
:class:`~blaze.server.server.Server` object.
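As a minimal sketch of enabling this at construction time (``my_data`` stands
in for whatever dataset or dictionary of datasets you are serving):

.. code-block:: python

   >>> from blaze import Server  # doctest: +SKIP
   >>> # my_data is a placeholder for the dataset(s) being served
   >>> server = Server(my_data, allow_profiler=True)  # doctest: +SKIP
   >>> server.run(host='0.0.0.0', port=6363)  # doctest: +SKIP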
Now when we compute against this server, we may pass ``profile=True`` to
``compute``.  For example:

.. code-block:: python

   >>> client = Client(...)  # doctest: +SKIP
   >>> compute(expr, client, profile=True)  # doctest: +SKIP

After running the above code, the server will have written a new pstats file
containing the results of the run.  This file will be found at
``profiler_output//``.  We use the md5 hash of the str of the expression so
that users can more easily track down their stats information.  Users can find
the hash of their expression with :func:`~blaze.server.server.expr_md5`.  The
profiler output directory may be configured with the ``profiler_output``
argument to the :class:`~blaze.server.server.Server`.

Clients may also request that the profiling data be sent back in the response
so that analysis may happen on the client.  To do this, we change our call to
``compute`` to look like:

.. code-block:: python

   >>> from io import BytesIO  # doctest: +SKIP
   >>> buf = BytesIO()  # doctest: +SKIP
   >>> compute(expr, client, profile=True, profiler_output=buf)  # doctest: +SKIP

After that computation, ``buf`` will hold the marshalled stats data suitable
for reading with :mod:`pstats`.  This feature is useful when Blaze servers are
run behind a load balancer and we do not want to search all of the servers to
find the output.

.. note::

   Because the data is serialized with :mod:`marshal`, it must be read by the
   same version of Python as the server.  This means that a Python 2 client
   cannot unmarshal the data written by a Python 3 server.  This is to conform
   with the file format expected by :mod:`pstats`, the standard profiling
   output inspection library.

System administrators may also configure all computations to be profiled by
default.  This is useful if the client code cannot be easily changed or
threading arguments through to ``compute`` is hard in an application setting.
This may be set with ``profile_by_default=True`` when constructing the server.

Conclusion
==========

Because this process builds off Blaze expressions it works equally well for
data stored in any format that Blaze understands, including in-memory
DataFrames, SQL/Mongo databases, or even Spark clusters.

.. _Flask: http://flask.pocoo.org/docs/0.10/quickstart/#a-minimal-application
.. _Shove: https://pypi.python.org/pypi/shove/0.5.6
.. _Chest: https://github.com/mrocklin/chest