Data Selection

As discussed in the Introduction to Data Structures, there are two key data structures in Meerkat: the Column and the DataPanel. In this guide, we’ll demonstrate how to access the data stored within them.

Throughout, we’ll be selecting data from the following DataPanel, which holds the Imagenette dataset, a small subset of the original ImageNet. This DataPanel includes a column holding images, a column holding their labels, and a few others.

In [1]: import meerkat as mk

In [2]: dp = mk.datasets.get("imagenette")
Downloading https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz to /home/docs/.meerkat/datasets/imagenette/imagenette2-160.tgz
Extracting tar archive, this may take a few minutes...

In [3]: dp
Out[3]: DataPanel(nrows: 13394, ncols: 13)
img (ImageColumn) label (PandasSeriesColumn) label_id (PandasSeriesColumn) label_idx (PandasSeriesColumn) split (PandasSeriesColumn) img_path (PandasSeriesColumn)
0 cassette player n02979186 482 train train/n02979186/n02979186_9036.JPEG
1 cassette player n02979186 482 train train/n02979186/n02979186_11957.JPEG
2 cassette player n02979186 482 train train/n02979186/n02979186_9715.JPEG
3 cassette player n02979186 482 train train/n02979186/n02979186_21736.JPEG
4 cassette player n02979186 482 train train/n02979186/ILSVRC2012_val_00046953.JPEG

Selecting Columns

The columns in a DataPanel are uniquely identified by str names. The code below displays the column names in the Imagenette DataPanel we loaded above:

In [4]: dp.columns
Out[4]: 
['path',
 'noisy_labels_0',
 'noisy_labels_1',
 'noisy_labels_5',
 'noisy_labels_25',
 'noisy_labels_50',
 'is_valid',
 'label_id',
 'label',
 'label_idx',
 'split',
 'img_path',
 'img']

Using these column names, we can pull out an individual column or a subset of them as a new DataPanel.

Selecting a Single Column: str -> AbstractColumn

To select a single column, we simply pass its name to the index operator. For example,

In [5]: col = dp["label"]

In [6]: col
Out[6]: PandasSeriesColumn(0        cass... dtype: object)

Passing a str that isn’t among the column names will raise a KeyError.

It may be helpful to think of a DataPanel as a dictionary mapping column names to columns. Indeed, a DataPanel implements other parts of the dict interface including keys(), values(), and items(). Unlike a dictionary, multiple columns in a DataPanel can be selected at once.

Selecting Multiple Columns: Sequence[str] -> DataPanel

You can select multiple columns by passing a list or tuple of column names. Doing so will return a new DataPanel with a subset of the columns in the original. For example,

In [7]: new_dp = dp[["label", "img"]]

In [8]: new_dp.columns
Out[8]: ['label', 'img']

If any str in the sequence isn’t among the column names, a KeyError is raised.

Copy vs. Reference

See Copy vs. View Behavior for more information.

You may be wondering whether the columns returned by indexing are copies of the columns in the original DataPanel. The columns returned by the index operator reference the same columns in the original DataPanel. This means that modifying the columns returned by the index operator will modify the columns in the original DataPanel.
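
The reference semantics described above mirror how a plain Python dict of lists behaves. The sketch below uses an ordinary dict as a stand-in for a DataPanel (an assumption made purely for illustration; it is not Meerkat code) to show that a selected column is the same object as the one stored in the container.

```python
# A plain dict of lists standing in for a DataPanel (illustrative analogy only).
panel = {
    "label": ["cassette player", "cassette player", "parachute"],
    "split": ["train", "train", "valid"],
}

# Selecting a "column" returns a reference to the stored list, not a copy.
col = panel["label"]
col[0] = "church"

# The change is visible through the original container.
print(panel["label"][0])      # church
print(col is panel["label"])  # True
```

If an independent copy is needed, it has to be made explicitly; see Copy vs. View Behavior for how Meerkat handles this.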

Selecting Rows

In Meerkat, the rows of a DataPanel or Column are ordered. This means that rows are uniquely identified by their position in the DataPanel or Column (similar to how the elements of a Python list are uniquely identified by their position in the list).

Row indices range from 0 to the number of rows in the DataPanel or Column minus one. To see how many rows a DataPanel or a Column has, we can use len(). For example,

In [9]: len(dp)
Out[9]: 13394

Above we mentioned how a DataPanel could be viewed as a dictionary mapping column names to columns. Equivalently, it also may be helpful to think of a DataPanel as a list of dictionaries mapping column names to values. The DataPanel interface supports both of these views – under the hood, storage is organized so as to make both column and row accesses fast.
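
The two mental models above can be made concrete with plain Python containers. This is only a sketch of the idea with toy values; it says nothing about how Meerkat organizes storage internally, and the helper to_rows is a hypothetical name introduced for this example.

```python
from typing import Any, Dict, List

# Column-oriented view: column name -> list of cell values (toy data).
columns: Dict[str, List[Any]] = {
    "label": ["cassette player", "parachute", "church"],
    "split": ["train", "train", "valid"],
}


def to_rows(cols: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Convert a dict of equal-length columns into a list of row dicts."""
    names = list(cols)
    return [dict(zip(names, values)) for values in zip(*cols.values())]


# Row-oriented view: one dict per row, mapping column names to values.
rows = to_rows(columns)

# Both views address the same cell; only the order of the two lookups differs.
print(columns["label"][1])  # parachute
print(rows[1]["label"])     # parachute
```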

Selecting a Single Row from a DataPanel: int -> Dict[str, Any]

To select a single row from a DataPanel, we simply pass its position to the index operator. For example,

In [10]: row = dp[2]

In [11]: row
Out[11]: 
{'path': 'train/n02979186/n02979186_9715.JPEG',
 'noisy_labels_0': 'n02979186',
 'noisy_labels_1': 'n02979186',
 'noisy_labels_5': 'n02979186',
 'noisy_labels_25': 'n03417042',
 'noisy_labels_50': 'n03000684',
 'is_valid': False,
 'label_id': 'n02979186',
 'label': 'cassette player',
 'label_idx': 482,
 'split': 'train',
 'img_path': 'train/n02979186/n02979186_9715.JPEG',
 'img': <PIL.Image.Image image mode=RGB size=160x216>}

Passing an int that is less than 0 or greater than or equal to len(dp) will raise an IndexError.

Notice how row contains a full PIL Image. With thousands of images in the dataset, it wouldn’t make sense to hold all the images in memory. Instead, images are only loaded into memory at the moment they are selected.

Lazy Selection

What if we want to select a row without loading the image into memory? Meerkat supports lazy selection through the lz indexer.

In [12]: row = dp.lz[2]

In [13]: row
Out[13]: 
{'path': 'train/n02979186/n02979186_9715.JPEG',
 'noisy_labels_0': 'n02979186',
 'noisy_labels_1': 'n02979186',
 'noisy_labels_5': 'n02979186',
 'noisy_labels_25': 'n03417042',
 'noisy_labels_50': 'n03000684',
 'is_valid': False,
 'label_id': 'n02979186',
 'label': 'cassette player',
 'label_idx': 482,
 'split': 'train',
 'img_path': 'train/n02979186/n02979186_9715.JPEG',
 'img': FileCell.(.../n02979186/n02979186_9715.JPEG, transform=None)}

Notice that instead of holding the image in memory, row holds a FileCell object. This object knows how to load the image into memory, but stops just short of doing so. Later on, when we want to access the image, we can call the get() method on the cell (meerkat.FileCell.get). For example,

In [14]: row["img"].get()
Out[14]: <PIL.Image.Image image mode=RGB size=160x216>

Lazy selection is critical for manipulating and managing DataPanels in Meerkat. It is discussed in more detail in the guide on Lambda Columns and Lazy Selection.
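
To illustrate the idea of a cell that "knows how to load" its data without doing so, here is a minimal sketch. LazyCell and fake_image_loader are hypothetical names invented for this example; this is not the implementation of meerkat.FileCell, only the general deferred-loading pattern.

```python
from typing import Any, Callable, List


class LazyCell:
    """Hypothetical stand-in for a lazily loaded cell (not meerkat.FileCell)."""

    def __init__(self, path: str, loader: Callable[[str], Any]):
        self.path = path
        self.loader = loader

    def get(self) -> Any:
        # The expensive load happens only when get() is called.
        return self.loader(self.path)


load_calls: List[str] = []


def fake_image_loader(path: str) -> str:
    # Records each call so we can see when loading actually happens.
    load_calls.append(path)
    return f"<image loaded from {path}>"


cell = LazyCell("train/example.JPEG", fake_image_loader)
print(len(load_calls))  # 0: constructing the cell loads nothing

image = cell.get()
print(len(load_calls))  # 1: the load ran on demand
```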

The same position-based indexing works for selecting a single cell from a Column.

Selecting a Single Cell from a Column: int -> Any

To select a single cell from a column, we pass its position to the index operator. For example,

In [15]: col = dp["label"]

In [16]: col[2]
Out[16]: 'cassette player'

Passing an int that is less than 0 or greater than or equal to len(dp["label"]) will raise an IndexError.

There are three different ways to select a subset of rows from a DataPanel: via slice, Sequence[int], or Sequence[bool].

Selecting Multiple Rows from a DataPanel: slice -> DataPanel

To select a set of contiguous rows from a DataPanel, we can use an integer slice [start:end]. The subset of rows will be returned as a new DataPanel.

In [17]: new_dp = dp[50:100]

In [18]: new_dp
Out[18]: DataPanel(nrows: 50, ncols: 13)

We can also use integer slices of the form [start:end:step] to select a set of evenly spaced rows from a DataPanel. For example, below we select every tenth row from the first 100 rows in the DataPanel.

In [19]: new_dp = dp[0:100:10]

In [20]: new_dp
Out[20]: DataPanel(nrows: 10, ncols: 13)
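
The row counts above are what standard Python slice semantics would give. Assuming Meerkat slices follow those semantics (which the outputs above are consistent with), the built-in slice.indices method shows exactly which row positions a slice selects:

```python
n_rows = 13394  # number of rows in the Imagenette DataPanel above

# slice.indices(n) resolves a slice to concrete (start, stop, step) for length n.
contiguous = range(*slice(50, 100).indices(n_rows))
strided = range(*slice(0, 100, 10).indices(n_rows))

print(len(contiguous))  # 50
print(list(strided))    # [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
```

Note that the end position is exclusive, so dp[50:100] covers positions 50 through 99.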

Selecting Multiple Rows from a DataPanel: Sequence[int] -> DataPanel

To select multiple rows from a DataPanel we can also pass a list of int.

In [21]: small_dp = dp[[0, 2, 5, 8, 17]]

In [22]: small_dp
Out[22]: DataPanel(nrows: 5, ncols: 13)

Other valid sequences of int that can be used to index are:

  • Tuple[int] - a tuple of integers.

  • np.ndarray[np.integer] - a NumPy ndarray with dtype np.integer.

  • pd.Series[np.integer] - a Pandas Series with dtype np.integer.

  • torch.Tensor[torch.int64] - a PyTorch Tensor with dtype torch.int64.

  • mk.AbstractColumn - a Meerkat column whose cells are int, np.integer, or torch.int64.

This is useful when the rows are neither contiguous nor evenly spaced (otherwise slice indexing, described above, is faster).

Selecting Multiple Rows from a DataPanel: Sequence[bool] -> DataPanel

To select multiple rows from a DataPanel, we can also pass a list of bool the same length as the DataPanel. Below we select the first and last rows from the smaller DataPanel small_dp that we created in the previous section.

In [23]: small_dp[[True, False, False, False, True]]
Out[23]: DataPanel(nrows: 2, ncols: 13)

Other valid sequences of bool that can be used to select are:

  • Tuple[bool] - a tuple of bool.

  • np.ndarray[bool] - a NumPy ndarray with dtype bool.

  • pd.Series[bool] - a Pandas Series with dtype bool.

  • torch.Tensor[torch.bool] - a PyTorch Tensor with dtype torch.bool.

  • mk.AbstractColumn - a Meerkat column whose cells are bool or torch.bool.

This is very useful for quickly selecting a subset of rows that satisfy a predicate (like you might do with a WHERE clause in SQL). For example, say we want to select all rows that have a value of "parachute" in the "label" column. We could do this using the following code, shown here on small_dp, which contains no such rows, so the result is empty:

In [24]: small_dp.lz[small_dp["label"] == "parachute"]
Out[24]: DataPanel(nrows: 0, ncols: 13)
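
The mask-then-select pattern can be spelled out step by step with plain Python lists. This is a sketch of the logic on toy data, not Meerkat code; in Meerkat the comparison on the column produces the mask in a single expression.

```python
# Two toy "columns" of equal length.
labels = ["cassette player", "parachute", "church", "parachute"]
splits = ["train", "train", "valid", "valid"]

# Step 1: evaluate the predicate on one column to get a boolean mask.
mask = [label == "parachute" for label in labels]

# Step 2: keep the rows whose mask entry is True.
# The mask must have exactly one entry per row.
assert len(mask) == len(labels)
selected = [(lab, spl) for lab, spl, keep in zip(labels, splits, mask) if keep]

print(mask)      # [False, True, False, True]
print(selected)  # [('parachute', 'train'), ('parachute', 'valid')]
```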

Copy vs. Reference

See Copy vs. View Behavior for more information.

You may be wondering whether the rows returned by indexing are copies or references of the rows in the original DataPanel. This depends on (1) which of the selection strategies above you use (slice vs. Sequence[int] vs. Sequence[bool]) and (2) the column type (e.g. PandasSeriesColumn, NumpyArrayColumn).

In general, columns inherit the copying behavior of their underlying data structure. For example, a NumpyArrayColumn has the copying behavior of a NumPy array, as described in the NumPy indexing documentation. See a more detailed discussion in Copy vs. View Behavior.
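
As a reminder of the NumPy behavior referred to above, the sketch below works on a bare NumPy array rather than a Meerkat column. Basic slicing returns a view that shares memory with the original, while integer-array and boolean-mask indexing return copies.

```python
import numpy as np

data = np.arange(10)

by_slice = data[2:5]       # basic slicing returns a view
by_ints = data[[2, 3, 4]]  # integer-array indexing returns a copy
by_mask = data[data < 3]   # boolean-mask indexing returns a copy

print(np.shares_memory(data, by_slice))  # True
print(np.shares_memory(data, by_ints))   # False
print(np.shares_memory(data, by_mask))   # False

# Writing through the view changes the original; writing to a copy does not.
by_slice[0] = -1
by_ints[1] = -2
print(data[2], data[3])  # -1 3
```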

For Pandas Users

.iloc and .loc: Pandas users are likely familiar with the .iloc and .loc properties of DataFrames and Series, which select data by integer position and by index label, respectively. Meerkat DataPanels and Columns do not have a designated index object as DataFrames and Series do. In Meerkat, rows are primarily selected by integer position or boolean mask, so there is no need for distinct .iloc and .loc indexers.

Indexing Cells: In Pandas, it’s possible to select a cell directly from a DataFrame with a single index like df.loc[2, "label"]. This is not supported in Meerkat. Instead, you should chain the indexing operators together, for example dp["label"][2]. In general, you should index the column first and then the row. Doing it in the reverse order could be wasteful, since the other cells in the row (such as images) would be loaded for no reason.
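
The cost difference between the two orderings can be sketched with a toy model. The names load_image and get_row below are hypothetical and exist only for this illustration; the point is that materializing a whole row triggers the expensive load, while going through the column first does not.

```python
from typing import Dict, List

loads: List[str] = []


def load_image(path: str) -> str:
    # Stand-in for an expensive image read; records every call.
    loads.append(path)
    return f"<image {path}>"


# Toy panel: "label" holds plain values, "img" holds paths that must be loaded.
columns: Dict[str, List[str]] = {
    "label": ["cassette player", "parachute"],
    "img": ["a.JPEG", "b.JPEG"],
}


def get_row(i: int) -> Dict[str, str]:
    # Row-first access materializes every cell in the row, including the image.
    return {"label": columns["label"][i], "img": load_image(columns["img"][i])}


row_first = get_row(1)["label"]
loads_after_row_first = len(loads)     # one image was loaded just to read a label

column_first = columns["label"][1]
loads_after_column_first = len(loads)  # unchanged: no additional load
```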