Data Selection
As discussed in the Introduction to Data Structures, there are two key data structures in Meerkat: the Column and the DataPanel. In this guide, we’ll demonstrate how to access the data stored within them.
Throughout, we’ll be selecting data from the following DataPanel, which holds the Imagenette dataset, a small subset of the original ImageNet. This DataPanel includes a column holding images, a column holding their labels, and a few others.
In [1]: import meerkat as mk
In [2]: dp = mk.datasets.get("imagenette")
Downloading https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz to /home/docs/.meerkat/datasets/imagenette/imagenette2-160.tgz
Extracting tar archive, this may take a few minutes...
In [3]: dp
Out[3]: DataPanel(nrows: 13394, ncols: 13)
 | img (ImageColumn) | label (PandasSeriesColumn) | label_id (PandasSeriesColumn) | label_idx (PandasSeriesColumn) | split (PandasSeriesColumn) | img_path (PandasSeriesColumn) |
---|---|---|---|---|---|---|
0 | (image) | cassette player | n02979186 | 482 | train | train/n02979186/n02979186_9036.JPEG |
1 | (image) | cassette player | n02979186 | 482 | train | train/n02979186/n02979186_11957.JPEG |
2 | (image) | cassette player | n02979186 | 482 | train | train/n02979186/n02979186_9715.JPEG |
3 | (image) | cassette player | n02979186 | 482 | train | train/n02979186/n02979186_21736.JPEG |
4 | (image) | cassette player | n02979186 | 482 | train | train/n02979186/ILSVRC2012_val_00046953.JPEG |
Selecting Columns
The columns in a DataPanel are uniquely identified by str names. The code below displays the column names in the Imagenette DataPanel we loaded above:
In [4]: dp.columns
Out[4]:
['path',
'noisy_labels_0',
'noisy_labels_1',
'noisy_labels_5',
'noisy_labels_25',
'noisy_labels_50',
'is_valid',
'label_id',
'label',
'label_idx',
'split',
'img_path',
'img']
Using these column names, we can pull out an individual column or a subset of them as a new DataPanel.
Selecting a Single Column: str -> AbstractColumn
To select a single column, we simply pass its name to the index operator. For example,
In [5]: col = dp["label"]
In [6]: col
Out[6]: PandasSeriesColumn(0 cass... dtype: object)
Passing a str that isn’t among the column names will raise a KeyError.
It may be helpful to think of a DataPanel as a dictionary mapping column names to columns.
Indeed, a DataPanel implements other parts of the dict interface including keys(), values(), and items(). Unlike a dictionary, multiple columns in a DataPanel can be selected at once.
Selecting Multiple Columns: Sequence[str] -> DataPanel
You can select multiple columns by passing a list or tuple of column names. Doing so will return a new DataPanel with a subset of the columns in the original. For example,
In [7]: new_dp = dp[["label", "img"]]
In [8]: new_dp.columns
Out[8]: ['label', 'img']
Passing a sequence that contains a str that isn’t among the column names will raise a KeyError.
Copy vs. Reference
See Copy vs. View Behavior for more information.
You may be wondering whether the columns returned by indexing are copies of the columns in the original DataPanel. The columns returned by the index operator reference the same columns in the original DataPanel. This means that modifying the columns returned by the index operator will modify the columns in the original DataPanel.
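The dictionary analogy and the reference behavior described above can be illustrated with a toy stand-in. This is a minimal sketch in plain Python, not Meerkat's implementation: the ToyPanel class and its sample data are hypothetical, and plain lists stand in for columns.

```python
class ToyPanel:
    """Hypothetical stand-in for a DataPanel: maps column names to columns."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getitem__(self, key):
        if isinstance(key, str):
            # str -> a single column; an unknown name raises KeyError.
            return self._columns[key]
        if isinstance(key, (list, tuple)):
            # Sequence[str] -> a new panel that references the same columns.
            return ToyPanel({name: self._columns[name] for name in key})
        raise TypeError(f"unsupported key type: {type(key).__name__}")

    # Parts of the dict interface, delegated to the underlying mapping.
    def keys(self):
        return self._columns.keys()

    def values(self):
        return self._columns.values()

    def items(self):
        return self._columns.items()


panel = ToyPanel({
    "label": ["cassette player", "parachute"],
    "split": ["train", "valid"],
})
subset = panel[["label"]]

# The subset references the original column rather than copying it,
# so an in-place change is visible through both panels.
subset["label"][0] = "church"
print(panel["label"][0])   # church
print(list(panel.keys()))  # ['label', 'split']
```

Because the toy panel hands back the stored list itself, the behavior mirrors the reference semantics described above; a real column type may add its own rules on top.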
Selecting Rows
In Meerkat, the rows of a DataPanel or Column are ordered. This means that rows are uniquely identified by their position in the DataPanel or Column (similar to how the elements of a Python list are uniquely identified by their position in the list).
Row indices range from 0 to the number of rows in the DataPanel or Column minus one. To see how many rows a DataPanel or a Column has, we can use len(). For example,
In [9]: len(dp)
Out[9]: 13394
Above we mentioned how a DataPanel could be viewed as a dictionary mapping column names to columns. Equivalently, it also may be helpful to think of a DataPanel as a list of dictionaries mapping column names to values. The DataPanel interface supports both of these views – under the hood, storage is organized so as to make both column and row accesses fast.
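The two views can be sketched with ordinary Python containers. This is only an illustration with made-up sample data; it says nothing about how Meerkat organizes storage internally.

```python
# Column view: a dictionary mapping column names to columns.
columns = {
    "label": ["cassette player", "parachute", "church"],
    "split": ["train", "train", "valid"],
}

# Look up the column by name, then the row by position.
cell_by_column = columns["label"][1]

# Row view: a list of dictionaries mapping column names to values.
n_rows = len(columns["label"])
rows = [{name: col[i] for name, col in columns.items()} for i in range(n_rows)]

# Look up the row by position, then the value by column name.
cell_by_row = rows[1]["label"]

print(rows[1])                        # {'label': 'parachute', 'split': 'train'}
print(cell_by_column == cell_by_row)  # True
```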
Selecting a Single Row from a DataPanel: int -> Dict[str, Any]
To select a single row from a DataPanel, we simply pass its position to the index operator. For example,
In [10]: row = dp[2]
In [11]: row
Out[11]:
{'path': 'train/n02979186/n02979186_9715.JPEG',
'noisy_labels_0': 'n02979186',
'noisy_labels_1': 'n02979186',
'noisy_labels_5': 'n02979186',
'noisy_labels_25': 'n03417042',
'noisy_labels_50': 'n03000684',
'is_valid': False,
'label_id': 'n02979186',
'label': 'cassette player',
'label_idx': 482,
'split': 'train',
'img_path': 'train/n02979186/n02979186_9715.JPEG',
'img': <PIL.Image.Image image mode=RGB size=160x216>}
Passing an int that is less than 0 or greater than or equal to len(dp) will raise an IndexError.
Notice how row contains a full PIL Image.
With thousands of images in the dataset, it wouldn’t make sense to hold all the images in memory.
Instead, images are only loaded into memory at the moment they are selected.
Lazy Selection
What if we want to select a row without loading the image into memory? Meerkat supports lazy selection through the lz indexer.
In [12]: row = dp.lz[2]
In [13]: row
Out[13]:
{'path': 'train/n02979186/n02979186_9715.JPEG',
'noisy_labels_0': 'n02979186',
'noisy_labels_1': 'n02979186',
'noisy_labels_5': 'n02979186',
'noisy_labels_25': 'n03417042',
'noisy_labels_50': 'n03000684',
'is_valid': False,
'label_id': 'n02979186',
'label': 'cassette player',
'label_idx': 482,
'split': 'train',
'img_path': 'train/n02979186/n02979186_9715.JPEG',
'img': FileCell.(.../n02979186/n02979186_9715.JPEG, transform=None)}
Notice that instead of holding the image in memory, row holds a FileCell object.
This object knows how to load the image into memory, but stops just short of doing so. Later on, when we want to access the image, we can use the get() method on the cell. For example,
In [14]: row["img"].get()
Out[14]: <PIL.Image.Image image mode=RGB size=160x216>
Lazy selection is critical for manipulating and managing DataPanels in Meerkat. It is discussed in more detail in the guide on Lambda Columns and Lazy Selection.
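The idea behind a lazily loaded cell can be sketched in a few lines. The ToyFileCell class and fake_loader function below are hypothetical illustrations, not Meerkat's FileCell; the sketch only assumes that a cell stores what it needs to load its data and defers the work until get() is called.

```python
class ToyFileCell:
    """Hypothetical stand-in for a lazily loaded cell."""

    def __init__(self, path, loader):
        self.path = path
        self._loader = loader

    def get(self):
        # The expensive load happens only when get() is called.
        return self._loader(self.path)

    def __repr__(self):
        return f"ToyFileCell({self.path})"


loads = []


def fake_loader(path):
    # Stands in for decoding an image file; records each call.
    loads.append(path)
    return f"<decoded contents of {path}>"


cell = ToyFileCell("train/example.JPEG", fake_loader)
print(cell)        # ToyFileCell(train/example.JPEG)
print(len(loads))  # 0, nothing has been loaded yet

value = cell.get()
print(len(loads))  # 1, the load ran on demand
```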
The same position-based indexing works for selecting a single cell from a Column.
Selecting a Single Cell from a Column: int -> Any
To select a single cell from a column, we pass its position to the index operator. For example,
In [15]: col = dp["label"]
In [16]: col[2]
Out[16]: 'cassette player'
Passing an int that is less than 0 or greater than or equal to len(dp["label"]) will raise an IndexError.
There are three different ways to select a subset of rows from a DataPanel: via slice, Sequence[int], or Sequence[bool].
Selecting Multiple Rows from a DataPanel: slice -> DataPanel
To select a set of contiguous rows from a DataPanel, we can use an integer slice [start:end].
The subset of rows will be returned as a new DataPanel.
In [17]: new_dp = dp[50:100]
In [18]: new_dp
Out[18]: DataPanel(nrows: 50, ncols: 13)
We can also use an integer slice with a step, [start:end:step], to select a set of evenly spaced rows from a DataPanel. For example, below we select every tenth row from the first 100 rows in the DataPanel.
In [19]: new_dp = dp[0:100:10]
In [20]: new_dp
Out[20]: DataPanel(nrows: 10, ncols: 13)
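The row counts above can be checked with plain Python. This sketch assumes that slices follow standard Python sequence semantics (end exclusive), which is consistent with the outputs shown; the selected_positions helper is a hypothetical name introduced here for illustration.

```python
n_rows = 13394  # len(dp) in the example above


def selected_positions(s, n):
    """Return the row positions that slice s selects from n rows."""
    return range(*s.indices(n))


# dp[50:100] selects positions 50 through 99.
print(len(selected_positions(slice(50, 100), n_rows)))      # 50

# dp[0:100:10] selects every tenth position below 100.
print(list(selected_positions(slice(0, 100, 10), n_rows)))  # [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
```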
Selecting Multiple Rows from a DataPanel: Sequence[int] -> DataPanel
To select multiple rows from a DataPanel we can also pass a list of int.
In [21]: small_dp = dp[[0, 2, 5, 8, 17]]
In [22]: small_dp
Out[22]: DataPanel(nrows: 5, ncols: 13)
Other valid sequences of int that can be used to index are:
Tuple[int] - a tuple of integers.
np.ndarray[np.integer] - a NumPy NDArray with dtype np.integer.
pd.Series[np.integer] - a Pandas Series with dtype np.integer.
torch.Tensor[torch.int64] - a PyTorch Tensor with dtype torch.int64.
mk.AbstractColumn - a Meerkat column whose cells are int, np.integer, or torch.int64.
Indexing with a sequence of int is useful when the rows are neither contiguous nor evenly spaced (otherwise slice indexing, described above, is faster).
Selecting Multiple Rows from a DataPanel: Sequence[bool] -> DataPanel
To select multiple rows from a DataPanel we can also pass a list of bool with the same length as the DataPanel. Below we select the first and last rows from the smaller DataPanel small_dp that we selected in the section above.
In [23]: small_dp[[True, False, False, False, True]]
Out[23]: DataPanel(nrows: 2, ncols: 13)
Other valid sequences of bool that can be used to select are:
Tuple[bool] - a tuple of bool.
np.ndarray[bool] - a NumPy NDArray with dtype bool.
pd.Series[bool] - a Pandas Series with dtype bool.
torch.Tensor[torch.bool] - a PyTorch Tensor with dtype torch.bool.
mk.AbstractColumn - a Meerkat column whose cells are bool or torch.bool.
This is very useful for quickly selecting a subset of rows that satisfy a predicate (like you might do with a WHERE clause in SQL).
For example, say we want to select all rows that have a value of "parachute" in the "label" column. We could do this using the following code:
In [24]: small_dp.lz[small_dp["label"] == "parachute"]
Out[24]: DataPanel(nrows: 0, ncols: 13)
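The mechanics of predicate-based selection can be sketched in plain Python. The sample labels and paths below are made up, and lists stand in for columns; the point is only that a predicate produces one bool per row, and that the resulting mask picks out the same rows as the list of positions where it is True.

```python
from itertools import compress

labels = ["cassette player", "parachute", "church", "parachute"]
paths = ["a.JPEG", "b.JPEG", "c.JPEG", "d.JPEG"]

# Build a boolean mask with one entry per row,
# comparable to WHERE label = 'parachute' in SQL.
mask = [label == "parachute" for label in labels]

# Keep only the rows where the mask is True.
selected_paths = list(compress(paths, mask))

# The same selection expressed as integer positions.
positions = [i for i, keep in enumerate(mask) if keep]

print(mask)            # [False, True, False, True]
print(selected_paths)  # ['b.JPEG', 'd.JPEG']
print(positions)       # [1, 3]
```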
Copy vs. Reference
See Copy vs. View Behavior for more information.
You may be wondering whether the rows returned by indexing are copies or references of the rows in the original DataPanel.
This depends on (1) which of the selection strategies above you use (slice vs. Sequence[int] vs. Sequence[bool]) and (2) the column type (e.g. PandasSeriesColumn, NumpyArrayColumn).
In general, columns inherit the copying behavior of their underlying data structure.
For example, a NumpyArrayColumn has the copying behavior of a NumPy array, as described in the NumPy indexing documentation.
See a more detailed discussion in Copy vs. View Behavior.
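For reference, the NumPy behavior mentioned above looks like this. The sketch exercises NumPy arrays directly, not a NumpyArrayColumn, so it shows only the underlying behavior that such a column is said to inherit.

```python
import numpy as np

data = np.arange(10)

by_slice = data[2:5]       # basic slicing returns a view
by_ints = data[[2, 3, 4]]  # integer-array indexing returns a copy
by_mask = data[data >= 8]  # boolean-mask indexing returns a copy

print(np.shares_memory(data, by_slice))  # True
print(np.shares_memory(data, by_ints))   # False
print(np.shares_memory(data, by_mask))   # False

by_slice[0] = -1  # writes through to data
by_ints[1] = -2   # leaves data untouched
print(data[2], data[3])  # -1 3
```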
For Pandas Users
.iloc and .loc:
Pandas users are likely familiar with the .iloc and .loc properties of DataFrames and Series.
These properties are used to select data by integer position and by label in the index, respectively. Meerkat DataPanels and Columns do not have a designated index object as DataFrames and Series do. The primary way to select rows in Meerkat is by integer position or boolean mask, so there is no need for distinct .iloc and .loc indexers.
Indexing Cells:
In Pandas, it’s possible to select a cell directly from a DataFrame with a single index like df.loc[2, "label"].
This is not supported in Meerkat. Instead you should chain the indexing operators together. For example, dp["label"][2]. In general, you should index the column first and then the row. Doing it in the reverse order could be wasteful, since the other cells in the row would be loaded for no reason.
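The cost difference between the two orders can be made concrete with a toy model. The CountingPanel class below is hypothetical and far simpler than a DataPanel; it only assumes, as described earlier, that selecting a whole row loads the image for that row.

```python
class CountingPanel:
    """Hypothetical panel that counts how many images it has loaded."""

    def __init__(self, labels, img_paths):
        self._columns = {"label": labels, "img_path": img_paths}
        self.image_loads = 0

    def column(self, name):
        # Column access returns stored values; no image is decoded.
        return self._columns[name]

    def row(self, i):
        # Row access materializes every cell, including the image.
        row = {name: col[i] for name, col in self._columns.items()}
        row["img"] = self._load_image(row["img_path"])
        return row

    def _load_image(self, path):
        self.image_loads += 1
        return f"<image loaded from {path}>"


panel = CountingPanel(
    labels=["cassette player", "parachute"],
    img_paths=["a.JPEG", "b.JPEG"],
)

# Column first, then row: only the label is touched.
label = panel.column("label")[1]
print(label, panel.image_loads)  # parachute 0

# Row first, then column: the image is loaded and then discarded.
label = panel.row(1)["label"]
print(label, panel.image_loads)  # parachute 1
```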