Flatten Transform
The flatten transform can be used to extract the contents of arrays from data entries. This will not generally be useful for well-structured data within pandas dataframes, but it can be useful for working with data from other sources.
As an example, consider this dataset which uses a common convention in JSON data, a set of fields each containing a list of entries:
import numpy as np

# Seeded generator so the example data is reproducible
rand = np.random.RandomState(0)

def generate_data(N):
    # Draw N samples from a normal distribution with a random mean and spread
    mean = rand.randn()
    std = rand.rand()
    return list(rand.normal(mean, std, N))

data = [
    {'label': 'A', 'values': generate_data(20)},
    {'label': 'B', 'values': generate_data(30)},
    {'label': 'C', 'values': generate_data(40)},
    {'label': 'D', 'values': generate_data(50)},
]
This kind of data structure does not work well in the context of dataframe representations, as we can see by loading this into pandas:
import pandas as pd
df = pd.DataFrame.from_records(data)
df
label values
0 A [2.005252455842496, 0.3967871813856627, 2.5678...
1 B [1.1906228762083413, -1.6927165224630425, -0.5...
2 C [0.3901956756272385, 1.4135072065946024, 0.603...
3 D [1.0035211072316703, 1.1414240499680273, 1.883...
Altair’s flatten transform allows you to extract the contents of these arrays into a column that can be referenced by an encoding:
import altair as alt
alt.Chart(df).transform_flatten(
    ['values']
).mark_tick().encode(
    x='values:Q',
    y='label:N',
)
This can be particularly useful in cleaning up data specified via a JSON URL, without having to first load the data for manipulation in pandas.
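For comparison, when the data is already in pandas, a similar long-form reshaping can be done up front with `DataFrame.explode`. The following is a minimal sketch with a small made-up dataframe (the values are illustrative, not the data generated above); it shows the row-per-element shape that flattening produces, not how Altair implements the transform:

```python
import pandas as pd

# Small illustrative dataset: each row holds a list in the 'values' column
df = pd.DataFrame({
    "label": ["A", "B"],
    "values": [[1.0, 2.0, 3.0], [4.0, 5.0]],
})

# explode() emits one row per list element and repeats the other columns
long_df = df.explode("values").reset_index(drop=True)

# The exploded column has object dtype, so cast it back to float
long_df["values"] = long_df["values"].astype(float)

print(long_df)
```

The difference is where the work happens: `explode` reshapes the data in Python before it reaches the chart, while the flatten transform is part of the chart specification itself.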
Transform Options
The transform_flatten() method is built on the FlattenTransform class, which has the following options:
Property | Type | Description |
---|---|---|
as | array( | The output field names for extracted array values. Default value: The field name of the corresponding array field |
flatten | array( | An array of one or more data fields containing arrays to flatten. If multiple fields are specified, their array values should have a parallel structure, ideally with the same length. If the lengths of parallel arrays do not match, the longest array will be used with |
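The parallel structure described for `flatten` can be illustrated in plain Python. This is a rough sketch of the idea only, not Altair's or Vega-Lite's implementation: `flatten_records` is a hypothetical helper, its `as_` argument mirrors the `as` option (which Altair's `transform_flatten()` accepts as the `as_` keyword, since `as` is reserved in Python), and padding shorter arrays with `None` is an assumption made for this sketch.

```python
from itertools import zip_longest


def flatten_records(records, fields, as_=None):
    """Hypothetical helper: emit one output row per position in the arrays."""
    # By default the output names are the array field names themselves
    names = as_ or fields
    rows = []
    for record in records:
        arrays = [record[field] for field in fields]
        # Walk the arrays in parallel; padding the shorter ones with None
        # is an assumption of this sketch
        for items in zip_longest(*arrays, fillvalue=None):
            row = dict(record)  # copy the record (a choice made for this sketch)
            row.update(zip(names, items))
            rows.append(row)
    return rows


records = [{"label": "A", "x": [1, 2, 3], "y": [10, 20]}]
rows = flatten_records(records, ["x", "y"], as_=["x_val", "y_val"])

for row in rows:
    print(row["label"], row["x_val"], row["y_val"])
```

Each output row pairs the elements found at the same position in `x` and `y`, which is why arrays of equal length are the cleanest case.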