Introduction to Data Structures

Meerkat provides two data structures, the column and the datapanel, that together help you build, manage, and explore machine learning datasets. Everything you do with Meerkat will involve one or both of these data structures, so this user guide begins with a high-level introduction to each.

Column

A column is a sequential data structure (analogous to a Series in Pandas or a Vector in R). Meerkat supports a diverse set of column types (e.g. NumpyArrayColumn, ImageColumn), each intended for a different kind of data. For a list of the core column types and their capabilities, see Overview of Column Types.

Below we create a column to hold a set of images stored on disk by passing their filepaths to the ImageColumn constructor. Here abs_path_to_img_dir stands for the absolute path to the directory that contains the images.

In [1]: import meerkat as mk

In [2]: img_col = mk.ImageColumn(
   ...:     ["img_0.jpg", "img_1.jpg", "img_2.jpg"],
   ...:     base_dir=abs_path_to_img_dir
   ...: )
   ...: 

In [3]: img_col
Out[3]: ImageColumn(PandasSeriesC...dtype: object))
   (ImageColumn)
0  (image)
1  (image)
2  (image)

All Meerkat columns are subclasses of AbstractColumn and share a common interface, which includes __len__(), __getitem__(), __setitem__(), filter(), map(), and concat(). Below we get the length of the column we just created.

In [4]: len(img_col)
Out[4]: 3
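
To make the shared interface concrete, here is a minimal pure-Python sketch of a list-backed column that exposes the same method names. ToyColumn and its method signatures are hypothetical and exist only for illustration; this is not Meerkat's implementation, and the real filter(), map(), and concat() may take different arguments.

```python
from typing import Any, Callable, Iterable, List


class ToyColumn:
    """Hypothetical list-backed column; not part of Meerkat."""

    def __init__(self, data: Iterable[Any]):
        self._data: List[Any] = list(data)

    def __len__(self) -> int:
        return len(self._data)

    def __getitem__(self, index):
        # An integer returns one element; a slice returns a new column.
        if isinstance(index, slice):
            return ToyColumn(self._data[index])
        return self._data[index]

    def __setitem__(self, index: int, value: Any) -> None:
        self._data[index] = value

    def filter(self, predicate: Callable[[Any], bool]) -> "ToyColumn":
        # Keep only the elements for which the predicate is true.
        return ToyColumn(x for x in self._data if predicate(x))

    def map(self, function: Callable[[Any], Any]) -> "ToyColumn":
        # Apply the function to every element.
        return ToyColumn(function(x) for x in self._data)

    @staticmethod
    def concat(columns: Iterable["ToyColumn"]) -> "ToyColumn":
        # Join several columns end to end.
        return ToyColumn(x for col in columns for x in col._data)


col = ToyColumn([0, 1, 2])
col[0] = 10                                 # in-place update via __setitem__
evens = col.filter(lambda x: x % 2 == 0)    # keeps the even elements
doubled = col.map(lambda x: x * 2)          # doubles every element
joined = ToyColumn.concat([col, doubled])   # appends doubled after col
```

Because every method returns another ToyColumn, calls can be chained. A common interface gives the same benefit across real column types: code written against it does not need to know which concrete column it holds.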

Certain column types may expose additional functionality. For example, NumpyArrayColumn inherits most of the functionality of an ndarray.

In [5]: id_col = mk.NumpyArrayColumn([0, 1, 2])

In [6]: id_col.sum()
Out[6]: 3

In [7]: id_col == 1
Out[7]: NumpyArrayColumn(array([False,  True, False]))

To see the full list of methods available to a column type, consult the API documentation for that column class.

If you don’t know which column type to use, you can pass a familiar data structure such as a list, np.ndarray, pd.Series, or torch.Tensor to AbstractColumn.from_data(), and Meerkat will automatically pick an appropriate column type.

In [8]: import torch

In [9]: tensor = torch.tensor([1,2,3])

In [10]: mk.AbstractColumn.from_data(tensor)
Out[10]: TensorColumn(tensor([1, 2, 3]))
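
The idea of choosing a column type from the type of the input can be sketched with type-based dispatch from the standard library. The function pick_column_type and the names "ToyListColumn" and "ToyNumericColumn" are hypothetical; how from_data() actually chooses a column type is not described here, so treat this only as an illustration of the general pattern.

```python
from array import array
from functools import singledispatch


@singledispatch
def pick_column_type(data) -> str:
    # Fallback for inputs with no registered handler.
    raise TypeError(f"unsupported input type: {type(data).__name__}")


@pick_column_type.register
def _(data: list) -> str:
    # Plain Python lists map to a generic (hypothetical) column type.
    return "ToyListColumn"


@pick_column_type.register
def _(data: array) -> str:
    # The stdlib array stands in for a typed numeric container.
    return "ToyNumericColumn"


list_choice = pick_column_type([1, 2, 3])
array_choice = pick_column_type(array("i", [1, 2, 3]))
```

With this structure, supporting a new input type means registering one more handler rather than editing a long chain of isinstance checks.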

DataPanel

A DataPanel is a collection of equal-length columns (analogous to a DataFrame in Pandas or R). DataPanels in Meerkat are used to manage datasets and per-example artifacts (e.g. model predictions and embeddings).

Below we combine the columns we created above into a single DataPanel and add a column containing labels for the images. Note that we can pass non-Meerkat data structures such as list, np.ndarray, pd.Series, and torch.Tensor directly to the DataPanel constructor; Meerkat infers the column type, so there is no need to convert them to Meerkat columns first.

In [11]: dp = mk.DataPanel(
   ....:     {
   ....:         "img": img_col,
   ....:         "label": ["boombox", "truck", "dog"],
   ....:         "id": id_col,
   ....:     }
   ....: )
   ....: 

In [12]: dp
Out[12]: DataPanel(nrows: 3, ncols: 3)
   img (ImageColumn)  label (PandasSeriesColumn)  id (NumpyArrayColumn)
0  (image)            boombox                     0
1  (image)            truck                       1
2  (image)            dog                         2
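
The defining property of a DataPanel, a set of named columns that all have the same length, can be sketched in a few lines of plain Python. ToyPanel, column(), and row() are hypothetical names used only for illustration; they are not Meerkat's API, and the real DataPanel does considerably more.

```python
from typing import Any, Dict, Sequence


class ToyPanel:
    """Hypothetical collection of equal-length columns; not part of Meerkat."""

    def __init__(self, columns: Dict[str, Sequence[Any]]):
        # Record each column's length and reject mismatched inputs.
        lengths = {name: len(col) for name, col in columns.items()}
        if len(set(lengths.values())) > 1:
            raise ValueError(f"columns differ in length: {lengths}")
        self._columns = {name: list(col) for name, col in columns.items()}
        self.nrows = next(iter(lengths.values()), 0)

    def column(self, name: str) -> list:
        # Column access: all values of one field.
        return self._columns[name]

    def row(self, index: int) -> Dict[str, Any]:
        # Row access: one value from every column, keyed by column name.
        return {name: col[index] for name, col in self._columns.items()}


panel = ToyPanel({"label": ["boombox", "truck", "dog"], "id": [0, 1, 2]})
first = panel.row(0)
labels = panel.column("label")
```

Enforcing equal lengths at construction time is what makes row access well defined: every index refers to exactly one value in each column.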

Read on to learn how we access the data in Columns and DataPanels.