Introduction to Data Structures#
Meerkat provides two data structures, the column and the DataPanel, that together help you build, manage, and explore machine learning datasets. Everything you do with Meerkat will involve one or both of these data structures, so this user guide begins with a high-level introduction to each.
Column#
A column is a sequential data structure (analogous to a Series in Pandas or a vector in R).
Meerkat supports a diverse set of column types (e.g. NumpyArrayColumn, ImageColumn), each intended for a different kind of data. For a list of the core column types and their capabilities, see Overview of Column Types.
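Below we create a simple column to hold a set of images stored on disk. To create it, we pass the filepaths to the ImageColumn constructor, along with a base_dir for the directory that contains them. The variable abs_path_to_img_dir is not defined in the snippet; it is assumed to hold the absolute path to that directory.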
In [1]: import meerkat as mk
In [2]: img_col = mk.ImageColumn(
   ...:     ["img_0.jpg", "img_1.jpg", "img_2.jpg"],
   ...:     base_dir=abs_path_to_img_dir
   ...: )
   ...:
In [3]: img_col
Out[3]: ImageColumn(PandasSeriesC...dtype: object))
| | (ImageColumn) |
|---|---|
| 0 | (image) |
| 1 | (image) |
| 2 | (image) |
All Meerkat columns are subclasses of AbstractColumn and share a common interface, which includes __len__(), __getitem__(), __setitem__(), filter(), map(), and concat(). Below we get the length of the column we just created.
In [4]: len(img_col)
Out[4]: 3
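To make the shared interface more concrete, here is a minimal pure-Python sketch of a list-backed column that supports the same six operations. ToyColumn is a hypothetical class written only for illustration: it is not Meerkat's AbstractColumn, and the real signatures of filter(), map(), and concat() may differ.

```python
class ToyColumn:
    """Hypothetical list-backed column; NOT Meerkat's AbstractColumn."""

    def __init__(self, data):
        self._data = list(data)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        # A slice returns a new column; an integer returns a single element.
        if isinstance(index, slice):
            return ToyColumn(self._data[index])
        return self._data[index]

    def __setitem__(self, index, value):
        self._data[index] = value

    def filter(self, predicate):
        # Keep only the elements for which predicate(element) is truthy.
        return ToyColumn(x for x in self._data if predicate(x))

    def map(self, function):
        # Apply function to every element and collect the results.
        return ToyColumn(function(x) for x in self._data)

    @classmethod
    def concat(cls, columns):
        # Join several columns end to end into one new column.
        merged = []
        for column in columns:
            merged.extend(column._data)
        return cls(merged)

    def __repr__(self):
        return f"ToyColumn({self._data!r})"


col = ToyColumn(["img_0.jpg", "img_1.jpg", "img_2.jpg"])
stems = col.map(lambda path: path.rsplit(".", 1)[0])
later = col.filter(lambda path: path != "img_0.jpg")
both = ToyColumn.concat([col, later])
print(len(col), stems, len(both))
```

Because every operation returns another ToyColumn, calls can be chained, which is the main practical benefit of a common column interface.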
Certain column types may expose additional functionality. For example, NumpyArrayColumn inherits most of the functionality of an ndarray.
In [5]: id_col = mk.NumpyArrayColumn([0, 1, 2])
In [6]: id_col.sum()
Out[6]: 3
In [7]: id_col == 1
Out[7]: NumpyArrayColumn(array([False, True, False]))
To see the full list of methods available to a column type, consult the API documentation for that type.
If you don’t know which column type to use, you can pass a familiar data structure such as a list, np.ndarray, pd.Series, or torch.Tensor to from_data(), and Meerkat will automatically pick an appropriate column type.
In [8]: import torch
In [9]: tensor = torch.tensor([1,2,3])
In [10]: mk.AbstractColumn.from_data(tensor)
Out[10]: TensorColumn(tensor([1, 2, 3]))
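One way to picture this kind of automatic selection is a lookup keyed on the type of the input. The sketch below is an assumption about the general technique, not a description of how from_data() is implemented. The function pick_column_type and the DISPATCH table are hypothetical names, and only two pairings are taken from this guide (a torch.Tensor yields a TensorColumn, and a list of strings ends up as a PandasSeriesColumn); the other two are assumed from the column names.

```python
# Hypothetical lookup table: (top-level module, class name) -> column type name.
DISPATCH = {
    ("torch", "Tensor"): "TensorColumn",         # shown in Out[10] of this guide
    ("numpy", "ndarray"): "NumpyArrayColumn",    # assumed pairing
    ("pandas", "Series"): "PandasSeriesColumn",  # assumed pairing
}


def pick_column_type(data):
    """Return the name of a column type for `data` (illustrative rules only)."""
    cls = type(data)
    # Inspect the module name so that no heavy library has to be imported here.
    key = (cls.__module__.split(".")[0], cls.__name__)
    if key in DISPATCH:
        return DISPATCH[key]
    # A non-empty list of strings, like the "label" column in this guide.
    if isinstance(data, list) and data and all(isinstance(x, str) for x in data):
        return "PandasSeriesColumn"
    # This sketch deliberately has no rule for anything else.
    raise TypeError(f"no rule in this sketch for {cls.__name__}")


print(pick_column_type(["boombox", "truck", "dog"]))
```

Keeping the rules in a table makes the supported inputs easy to audit, and raising for unknown types avoids silently guessing a column type.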
DataPanel#
A DataPanel is a collection of equal-length columns (analogous to a DataFrame in Pandas or R).
DataPanels in Meerkat are used to manage datasets and per-example artifacts (e.g. model predictions and embeddings).
Below we combine the columns we created above into a single DataPanel. We also add an additional column containing labels for the images. Note that we can pass non-Meerkat data structures like list, np.ndarray, pd.Series, and torch.Tensor directly to the DataPanel constructor and Meerkat will infer the column type. We do not need to first convert to a Meerkat column.
In [11]: dp = mk.DataPanel(
   ....:     {
   ....:         "img": img_col,
   ....:         "label": ["boombox", "truck", "dog"],
   ....:         "id": id_col,
   ....:     }
   ....: )
   ....:
In [12]: dp
Out[12]: DataPanel(nrows: 3, ncols: 3)
| | img (ImageColumn) | label (PandasSeriesColumn) | id (NumpyArrayColumn) |
|---|---|---|---|
| 0 | (image) | boombox | 0 |
| 1 | (image) | truck | 1 |
| 2 | (image) | dog | 2 |
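The defining constraint of a DataPanel, that all columns have the same length, can be sketched in a few lines of plain Python. ToyPanel and its row() helper are hypothetical and exist only to illustrate the idea; they are not part of Meerkat, and the real DataPanel does considerably more (for example, inferring column types).

```python
class ToyPanel:
    """Hypothetical container of equal-length columns; NOT Meerkat's DataPanel."""

    def __init__(self, columns):
        lengths = {name: len(values) for name, values in columns.items()}
        # Reject input whose columns disagree on the number of rows.
        if len(set(lengths.values())) > 1:
            raise ValueError(f"columns must have equal length, got {lengths}")
        self._columns = {name: list(values) for name, values in columns.items()}
        self.nrows = next(iter(lengths.values()), 0)
        self.ncols = len(self._columns)

    def __getitem__(self, name):
        # Look up a whole column by its name.
        return self._columns[name]

    def row(self, index):
        # Gather the value at `index` from every column.
        return {name: values[index] for name, values in self._columns.items()}

    def __repr__(self):
        return f"ToyPanel(nrows: {self.nrows}, ncols: {self.ncols})"


panel = ToyPanel(
    {
        "img": ["img_0.jpg", "img_1.jpg", "img_2.jpg"],
        "label": ["boombox", "truck", "dog"],
        "id": [0, 1, 2],
    }
)
print(panel)
print(panel.row(1))
```

Checking lengths once at construction time means every later row lookup can assume that each column has a value at the requested index.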
Read on to learn how we access the data in Columns and DataPanels.