Lambda Columns and Lazy Selection

Lambda Columns

If you check out the implementation of ImageColumn, you’ll notice that it’s a super simple subclass of LambdaColumn.

What’s a LambdaColumn? In Meerkat, high-dimensional data types like images and videos are typically stored in a LambdaColumn. A LambdaColumn wraps another column and applies a function to its content as it is indexed.

Consider the following example, where we create a simple Meerkat column…

In [1]: import meerkat as mk

In [2]: col = mk.NumpyArrayColumn([0,1,2])

In [3]: col[1]
Out[3]: 1

…and wrap it in a lambda column.

In [4]: lambda_col = col.to_lambda(function=lambda x: x + 10)

In [5]: lambda_col[1]  # the function is only called at this point!
Out[5]: 11
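To make the deferred evaluation concrete, here is a minimal pure-Python sketch of the idea. LazyColumn and its calls counter are hypothetical names used only for illustration; this is not how Meerkat implements LambdaColumn, just a stand-in that shows the wrapped function running at index time.

```python
class LazyColumn:
    """Hypothetical stand-in for a lambda column (not Meerkat's implementation)."""

    def __init__(self, data, function):
        self.data = list(data)
        self.function = function
        self.calls = 0  # counts how many times `function` has run

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # The wrapped function runs here, at index time,
        # and only on the element that was requested.
        self.calls += 1
        return self.function(self.data[index])


lazy_col = LazyColumn([0, 1, 2], function=lambda x: x + 10)
calls_before = lazy_col.calls  # nothing has been computed yet
value = lazy_col[1]            # the function runs now, exactly once
calls_after = lazy_col.calls
```

Creating the column costs nothing beyond storing the raw data; the work is paid per access, which is what makes the pattern attractive when each element is expensive to materialize.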

Critically, the function inside a lambda column is only called at the time the column is indexed! This is very useful for columns with large data types that we don’t want to load all into memory at once. For example, we could create a LambdaColumn that lazily loads images…

In [6]: from PIL import Image

In [7]: dp = mk.DataPanel(
   ...:     {
   ...:         "filepath": ["/abs/path/to/image0.jpg", ...],
   ...:         "image_id": ["image0", ...]
   ...:     }
   ...: )
   ...: 

In [8]: dp["image"] = dp["filepath"].to_lambda(function=Image.open)

Notice how we provide an absolute path to the images. This makes the column usable from any working directory. However, absolute paths have a drawback: what if we want to share the DataPanel and open it on a different machine? In the section below, we discuss a subclass of LambdaColumn that makes it easy to manage filepaths.

FileColumn

As mentioned above, FileColumn is a simple subclass of LambdaColumn.

The FileColumn constructor takes an additional argument, base_dir, which is the base directory that all file paths are relative to. When base_dir is provided, the paths passed to filepaths should be relative to base_dir:

In [9]: from PIL import Image

In [10]: dp = mk.DataPanel(
   ....:     {
   ....:         "filepath": ["image0.jpg", ...],
   ....:         "image_id": ["image0", ...]
   ....:     }
   ....: )
   ....: 

In [11]: dp["image"] = mk.FileColumn.from_filepaths(
   ....:     filepaths=dp["filepath"],
   ....:     loader=Image.open,
   ....:     base_dir="/abs/path/to",
   ....: )
   ....: 

The base_dir can then be changed at any time, so if we wanted to share the DataPanel with another user, we could instruct them to reset the base_dir using dp["image"].base_dir = "/other/users/abs/path/to". Introducing this additional step isn’t ideal though, so we recommend using the environment variables technique as described below.
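The relationship between base_dir and the relative paths can be sketched in plain Python. RelativeFileColumn and its resolve method are hypothetical names, not part of Meerkat; the sketch assumes that the full path is formed by joining base_dir with each stored path at access time, which is why reassigning base_dir is enough to relocate every file.

```python
import os


class RelativeFileColumn:
    """Hypothetical sketch of a file column; names are illustrative, not Meerkat's API."""

    def __init__(self, filepaths, loader, base_dir=None):
        self.filepaths = list(filepaths)
        self.loader = loader
        self.base_dir = base_dir

    def resolve(self, index):
        # Join lazily so that a later change to base_dir takes effect.
        path = self.filepaths[index]
        if self.base_dir is not None:
            path = os.path.join(self.base_dir, path)
        return path

    def __getitem__(self, index):
        # The loader is only called when an element is requested.
        return self.loader(self.resolve(index))


# An identity loader keeps the sketch runnable without any image files.
file_col = RelativeFileColumn(["image0.jpg"], loader=lambda p: p, base_dir="/abs/path/to")
first = file_col.resolve(0)

file_col.base_dir = "/other/users/abs/path/to"  # e.g. after moving to another machine
second = file_col.resolve(0)
```

Because only the relative paths are stored, the column itself never needs to be rebuilt when the data moves.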

Using Environment Variables in base_dir

Environment variables in the base_dir argument are automatically expanded. For example, if you set the environment variable MEERKAT_BASE_DIR to "/abs", then you can use dp["image"].base_dir = "$MEERKAT_BASE_DIR/path/to", which resolves to "/abs/path/to". This is ideal for sharing DataPanels between different users and machines, since each user only needs to set the variable for their own machine.

Note that the Meerkat dataset registry relies heavily on this technique, using a special environment variable MEERKAT_DATASET_DIR that points to the mk.config.datasets.root_dir.
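As a rough illustration of what expansion means here, the standard library's os.path.expandvars performs the same kind of substitution. Whether Meerkat uses this exact function internally is an assumption; the value "/abs" and the variable MEERKAT_UNSET_EXAMPLE_VAR are chosen purely for the example.

```python
import os

# Set for illustration only; normally each user sets this in their shell.
os.environ["MEERKAT_BASE_DIR"] = "/abs"

base_dir = "$MEERKAT_BASE_DIR/path/to"
expanded = os.path.expandvars(base_dir)           # substitutes the variable
full_path = os.path.join(expanded, "image0.jpg")  # path to a single file

# os.path.expandvars leaves references to unset variables unchanged,
# so a missing variable shows up as a literal "$NAME" in the path.
os.environ.pop("MEERKAT_UNSET_EXAMPLE_VAR", None)
unexpanded = os.path.expandvars("$MEERKAT_UNSET_EXAMPLE_VAR/data")
```

If a file lookup fails with a literal "$" in the path, an unset environment variable is a likely cause.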

An ImageColumn is just a FileColumn like this one, with a few more bells and whistles!

Lazy Selection

Todo

Fill in this stub.