meerkat.columns package

Submodules

meerkat.columns.abstract module

class AbstractColumn(data: Optional[Sequence] = None, collate_fn: Optional[Callable] = None, formatter: Optional[Callable] = None, *args, **kwargs)[source]

Bases: meerkat.mixins.blockable.BlockableMixin, meerkat.mixins.cloneable.CloneableMixin, meerkat.mixins.collate.CollateMixin, meerkat.mixins.io.ColumnIOMixin, meerkat.mixins.inspect_fn.FunctionInspectorMixin, meerkat.mixins.lambdable.LambdaMixin, meerkat.mixins.mapping.MappableMixin, meerkat.mixins.materialize.MaterializationMixin, meerkat.provenance.ProvenanceMixin, abc.ABC

An abstract class for Meerkat columns.

append(column: meerkat.columns.abstract.AbstractColumn) None[source]
argsort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.abstract.AbstractColumn[source]

Return the indices that would sort the column.

Parameters
  • ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A column containing the indices that would sort the column.

Return type

AbstractColumn
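
As a rough illustration of what argsort computes, the following plain-Python sketch finds the positions that would sort a list and then uses them to reorder it. It does not use Meerkat; the helper name argsort_positions is hypothetical.

```python
from typing import List, Sequence


def argsort_positions(values: Sequence, ascending: bool = True) -> List[int]:
    """Return the positions that would sort `values` (hypothetical helper)."""
    # Sort the positions 0..n-1 by the value found at each position.
    return sorted(range(len(values)), key=lambda i: values[i], reverse=not ascending)


values = [30, 10, 20]
order = argsort_positions(values)
# Indexing with the returned positions yields the data in sorted order.
sorted_values = [values[i] for i in order]
```

Here order holds positions, not values, which is the difference between argsort and sort.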

batch(batch_size: int = 1, drop_last_batch: bool = False, collate: bool = True, num_workers: int = 0, materialize: bool = True, *args, **kwargs)[source]

Batch the column.

Parameters
  • batch_size – integer batch size

  • drop_last_batch – drop the last batch if it is smaller than batch_size

  • collate – whether to collate the returned batches

Returns

batches of data
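
The effect of batch_size and drop_last_batch can be sketched in plain Python. This is only an illustration of the batching logic, not Meerkat's implementation; iter_batches is a hypothetical helper and collation is left out.

```python
from typing import Iterator, List, Sequence


def iter_batches(
    data: Sequence, batch_size: int = 1, drop_last_batch: bool = False
) -> Iterator[List]:
    """Yield consecutive slices of `data` of length `batch_size` (hypothetical helper)."""
    for start in range(0, len(data), batch_size):
        batch = list(data[start:start + batch_size])
        # Only the final batch can be short; optionally skip it.
        if drop_last_batch and len(batch) < batch_size:
            return
        yield batch


all_batches = list(iter_batches(list(range(5)), batch_size=2))
full_batches = list(iter_batches(list(range(5)), batch_size=2, drop_last_batch=True))
```

With five elements and a batch size of two, the final batch holds a single element unless drop_last_batch is set.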

static concat(columns: Sequence[meerkat.columns.abstract.AbstractColumn]) None[source]
filter(function: Callable, with_indices=False, input_columns: Optional[Union[str, List[str]]] = None, is_batched_fn: bool = False, batch_size: Optional[int] = 1, drop_last_batch: bool = False, num_workers: Optional[int] = 0, materialize: bool = True, pbar: bool = False, **kwargs) Optional[meerkat.columns.abstract.AbstractColumn][source]

Filter the elements of the column using a function.
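
The sketch below illustrates element-wise versus batched filtering in plain Python. It is not Meerkat code: filter_values is a hypothetical helper, and the contract that a batched function returns one boolean per element is an assumption made for the example.

```python
from typing import Callable, List, Sequence


def filter_values(
    values: Sequence,
    function: Callable,
    is_batched_fn: bool = False,
    batch_size: int = 1,
) -> List:
    """Keep the elements for which `function` is truthy (hypothetical helper)."""
    if not is_batched_fn:
        return [v for v in values if function(v)]
    kept = []
    for start in range(0, len(values), batch_size):
        batch = list(values[start:start + batch_size])
        # Assumption: a batched function returns one boolean per element.
        mask = function(batch)
        kept.extend(v for v, keep in zip(batch, mask) if keep)
    return kept


evens = filter_values([1, 2, 3, 4, 5], lambda v: v % 2 == 0)
evens_batched = filter_values(
    [1, 2, 3, 4, 5],
    lambda batch: [v % 2 == 0 for v in batch],
    is_batched_fn=True,
    batch_size=2,
)
```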

classmethod from_data(data: Union[Columnable, AbstractColumn])[source]

Convert data to a meerkat column using the appropriate Column type.

full_length()[source]
classmethod get_writer(mmap: bool = False, template: Optional[meerkat.columns.abstract.AbstractColumn] = None)[source]
head(n: int = 5) meerkat.columns.abstract.AbstractColumn[source]

Get the first n examples of the column.

is_equal(other: meerkat.columns.abstract.AbstractColumn) bool[source]

Tests whether two columns are equal.

Parameters

other (AbstractColumn) – The column to compare against.

sample(n: Optional[int] = None, frac: Optional[float] = None, replace: bool = False, weights: Optional[Union[str, numpy.ndarray]] = None, random_state: Optional[Union[int, numpy.random.mtrand.RandomState]] = None) meerkat.columns.abstract.AbstractColumn[source]

Select a random sample of rows from the column. Roughly equivalent to sample in Pandas: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html.

Parameters
  • n (int) – Number of samples to draw. If frac is specified, this parameter should not be passed. Defaults to 1 if frac is not passed.

  • frac (float) – Fraction of rows to sample. If n is specified, this parameter should not be passed.

  • replace (bool) – Sample with or without replacement. Defaults to False.

  • weights (np.ndarray) – Weights to use for sampling. If None (default), the rows will be sampled uniformly. If a numpy array, the sample will be weighted accordingly. If weights do not sum to 1 they will be normalized to sum to 1.

  • random_state (Union[int, np.random.RandomState]) – Random state or seed to use for sampling.

Returns

A random sample of rows from the column.

Return type

AbstractColumn
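
How the parameters above interact can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not Meerkat's implementation: sample_indices is a hypothetical helper, it returns row positions rather than a column, and rounding frac with round() is an assumption.

```python
import random
from typing import List, Optional, Sequence


def sample_indices(
    length: int,
    n: Optional[int] = None,
    frac: Optional[float] = None,
    replace: bool = False,
    weights: Optional[Sequence[float]] = None,
    random_state: Optional[int] = None,
) -> List[int]:
    """Pick row positions from range(length) (hypothetical helper)."""
    if n is not None and frac is not None:
        raise ValueError("Pass either n or frac, not both.")
    if n is None:
        # n defaults to 1 when frac is not given.
        n = 1 if frac is None else round(frac * length)
    rng = random.Random(random_state)
    if weights is not None:
        # Normalize so the weights sum to 1.
        total = sum(weights)
        weights = [w / total for w in weights]
    if replace:
        return rng.choices(range(length), weights=weights, k=n)
    if weights is None:
        return rng.sample(range(length), k=n)
    # Weighted sampling without replacement: draw one row at a time
    # and remove it from the pool before the next draw.
    remaining = list(range(length))
    remaining_weights = list(weights)
    picked = []
    for _ in range(n):
        (pos,) = rng.choices(range(len(remaining)), weights=remaining_weights, k=1)
        picked.append(remaining.pop(pos))
        remaining_weights.pop(pos)
    return picked


three_rows = sample_indices(10, n=3, random_state=0)
half_rows = sample_indices(10, frac=0.5, random_state=1)
```

Passing the same random_state twice yields the same positions, which is what makes a sample reproducible.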

sort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.abstract.AbstractColumn[source]

Return a sorted view of the column.

Parameters
  • ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A view of the column with the sorted data.

Return type

AbstractColumn
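
The kind option names match NumPy's sort kinds, where 'stable' and 'mergesort' keep equal elements in their input order while 'quicksort' and 'heapsort' make no such guarantee. Assuming the names carry the same meaning here, the plain-Python sketch below shows what stability means in practice; Python's built-in sorted is used because it is always stable.

```python
records = [("b", 2), ("a", 1), ("c", 2), ("d", 1)]

# Sort by the numeric key only. Ties keep their input order because
# Python's built-in sort is stable.
by_key = sorted(records, key=lambda record: record[1])

# reverse=True also preserves the input order among equal keys.
by_key_descending = sorted(records, key=lambda record: record[1], reverse=True)
```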

streamlit()[source]
tail(n: int = 5) meerkat.columns.abstract.AbstractColumn[source]

Get the last n examples of the column.
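
The head and tail semantics can be sketched with list slicing. The helpers below are hypothetical and operate on plain lists, not on Meerkat columns.

```python
from typing import List, Sequence


def head(data: Sequence, n: int = 5) -> List:
    """Return the first `n` elements (hypothetical helper)."""
    return list(data[:n])


def tail(data: Sequence, n: int = 5) -> List:
    """Return the last `n` elements (hypothetical helper)."""
    # data[-n:] would return everything when n == 0, so compute the start explicitly.
    start = max(len(data) - n, 0)
    return list(data[start:])


first_three = head(list(range(10)), 3)
last_three = tail(list(range(10)), 3)
```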

to_pandas() pandas.core.series.Series[source]
Columnable

alias of Union[Sequence, numpy.ndarray, pandas.core.series.Series, torch.Tensor]

property data

Get the underlying data.

property formatter: Callable
property is_mmap
logdir: pathlib.Path = PosixPath('/home/docs/meerkat')
property metadata

meerkat.columns.arrow_column module

class ArrowArrayColumn(data: Sequence, *args, **kwargs)[source]

Bases: meerkat.columns.abstract.AbstractColumn

block_class

alias of meerkat.block.arrow_block.ArrowBlock

classmethod concat(columns: Sequence[meerkat.columns.arrow_column.ArrowArrayColumn])[source]
is_equal(other: meerkat.columns.abstract.AbstractColumn) bool[source]

Tests whether two columns are equal.

Parameters

other (AbstractColumn) – The column to compare against.

to_numpy()[source]
to_pandas()[source]
to_tensor()[source]

meerkat.columns.audio_column module

class AudioColumn(data: Optional[Sequence[str]] = None, transform: Optional[callable] = None, loader: Optional[callable] = None, base_dir: Optional[str] = None, *args, **kwargs)[source]

Bases: meerkat.columns.file_column.FileColumn

A lambda column where each cell represents an audio file on disk. The underlying data is a PandasSeriesColumn of strings, where each string is the path to an audio file. The column materializes the audio files into memory when indexed. If the column is lazy indexed with the lz indexer, the audio files are not materialized and a FileCell or an AudioColumn is returned instead.

Parameters
  • data (Sequence[str]) – A list of filepaths to audio files.

  • transform (callable) –

    A function that transforms the image (e.g. torchvision.transforms.functional.center_crop).

    Warning

    In order for the column to be serializable, the transform function must be pickleable.

  • loader (callable) –

    A callable with signature def loader(filepath: str) -> PIL.Image:. Defaults to torchvision.datasets.folder.default_loader.

    Warning

    In order for the column to be serializable with write(), the loader function must be pickleable.

  • base_dir (str) – A base directory that the paths in data are relative to. If None, the paths are assumed to be absolute.

collate(batch)[source]

Collate data.

classmethod default_loader(*args, **kwargs)[source]

meerkat.columns.cell_column module

class CellColumn(cells: Optional[Sequence[meerkat.cells.abstract.AbstractCell]] = None, *args, **kwargs)[source]

Bases: meerkat.columns.abstract.AbstractColumn

static concat(columns: Sequence[meerkat.columns.cell_column.CellColumn])[source]
classmethod from_cells(cells: Sequence[meerkat.cells.abstract.AbstractCell], *args, **kwargs)[source]
is_equal(other: meerkat.columns.abstract.AbstractColumn) bool[source]

Tests whether two columns are equal.

Parameters

other (AbstractColumn) – The column to compare against.

property cells

meerkat.columns.file_column module

class Downloader(cache_dir: str, downloader: Optional[callable] = None)[source]

Bases: object

class FileCell(transform: Optional[callable] = None, loader: Optional[callable] = None, data: Optional[str] = None, base_dir: Optional[str] = None)[source]

Bases: meerkat.columns.file_column.FileLoaderMixin, meerkat.columns.lambda_column.LambdaCell

property absolute_path
class FileColumn(data: Optional[Sequence[str]] = None, transform: Optional[callable] = None, loader: Optional[callable] = None, base_dir: Optional[str] = None, *args, **kwargs)[source]

Bases: meerkat.columns.file_column.FileLoaderMixin, meerkat.columns.lambda_column.LambdaColumn

A column where each cell represents a file stored on disk or on the web. The underlying data is a PandasSeriesColumn of strings, where each string is the path to a file. The column materializes the files into memory when indexed. If the column is lazy indexed with the lz indexer, the files are not materialized and a FileCell or a FileColumn is returned instead.

Parameters
  • data (Sequence[str]) – A list of filepaths to images.

  • transform (callable) –

    A function that transforms the image (e.g. torchvision.transforms.functional.center_crop).

    Warning

    In order for the column to be serializable, the transform function must be pickleable.

  • loader (callable) –

    A callable with signature def loader(filepath: str) -> PIL.Image:. Defaults to torchvision.datasets.folder.default_loader.

    Warning

    In order for the column to be serializable with write(), the loader function must be pickleable.

  • base_dir (str) – A base directory that the paths in data are relative to. If None, the paths are assumed to be absolute.
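
The roles of data, loader, transform and base_dir can be sketched in plain Python. LazyFileList and fake_loader below are hypothetical stand-ins written for this illustration, not Meerkat classes; the fake loader returns a string instead of reading from disk so that the example runs without any files.

```python
import os
from typing import Any, Callable, List, Optional, Sequence


class LazyFileList:
    """Hypothetical stand-in for a file-backed column: stores paths, loads on access."""

    def __init__(
        self,
        data: Sequence[str],
        loader: Callable[[str], Any],
        transform: Optional[Callable[[Any], Any]] = None,
        base_dir: Optional[str] = None,
    ):
        self.data = list(data)
        self.loader = loader
        self.transform = transform
        self.base_dir = base_dir

    def absolute_path(self, index: int) -> str:
        # Paths are relative to base_dir when it is given.
        path = self.data[index]
        return path if self.base_dir is None else os.path.join(self.base_dir, path)

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, index: int) -> Any:
        # Nothing is loaded until a cell is indexed.
        value = self.loader(self.absolute_path(index))
        return value if self.transform is None else self.transform(value)


calls: List[str] = []


def fake_loader(filepath: str) -> str:
    """Record the requested path and return placeholder contents."""
    calls.append(filepath)
    return "contents of " + filepath


column = LazyFileList(["a.txt", "b.txt"], loader=fake_loader, transform=str.upper, base_dir="data")
calls_before_indexing = list(calls)
first = column[0]
```

Constructing the object triggers no loads; the loader runs only for the cell that is indexed, and the transform is applied to its result.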

classmethod default_loader(*args, **kwargs)[source]
classmethod from_filepaths(filepaths: Sequence[str], loader: Optional[callable] = None, transform: Optional[callable] = None, base_dir: Optional[str] = None, *args, **kwargs)[source]
is_equal(other: meerkat.columns.abstract.AbstractColumn) bool[source]

Tests whether two columns are equal.

Parameters

other (AbstractColumn) – The column to compare against.

class FileLoaderMixin[source]

Bases: object

fn(filepath: str)[source]
download_image(url: str, cache_dir: str)[source]

meerkat.columns.image_column module

class ImageColumn(data: Optional[Sequence[str]] = None, transform: Optional[callable] = None, loader: Optional[callable] = None, base_dir: Optional[str] = None, *args, **kwargs)[source]

Bases: meerkat.columns.file_column.FileColumn

A column where each cell represents an image stored on disk. The underlying data is a PandasSeriesColumn of strings, where each string is the path to an image. The column materializes the images into memory when indexed. If the column is lazy indexed with the lz indexer, the images are not materialized and an ImageCell or an ImageColumn is returned instead.

Parameters
  • data (Sequence[str]) – A list of filepaths to images.

  • transform (callable) –

    A function that transforms the image (e.g. torchvision.transforms.functional.center_crop).

    Warning

    In order for the column to be serializable, the transform function must be pickleable.

  • loader (callable) –

    A callable with signature def loader(filepath: str) -> PIL.Image:. Defaults to torchvision.datasets.folder.default_loader.

    Warning

    In order for the column to be serializable with write(), the loader function must be pickleable.

  • base_dir (str) – A base directory that the paths in data are relative to. If None, the paths are assumed to be absolute.
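
The warnings about pickleable transform and loader functions can be checked ahead of time. The sketch below uses the standard pickle module and a hypothetical helper, is_picklable; whether Meerkat serializes with pickle itself or with a compatible library is an assumption here.

```python
import functools
import os
import pickle
from typing import Any


def is_picklable(obj: Any) -> bool:
    """Return True if `obj` can be serialized with pickle (hypothetical helper)."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A function importable by name is pickled by reference.
named_ok = is_picklable(os.path.basename)
# functools.partial of an importable callable is picklable too.
partial_ok = is_picklable(functools.partial(int, base=2))
# A lambda has no importable name, so pickling it fails.
lambda_ok = is_picklable(lambda path: path)
```

In practice this means defining the transform or loader as a named module-level function, or building it with functools.partial, rather than passing a lambda.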

classmethod default_loader(*args, **kwargs)[source]

meerkat.columns.lambda_column module

class LambdaCell(fn: Optional[callable] = None, data: Optional[any] = None)[source]

Bases: meerkat.cells.abstract.AbstractCell

get(*args, **kwargs)[source]

Get the object that this cell represents.

property data: object

Get the data associated with this cell.

class LambdaColumn(data: Union[meerkat.datapanel.DataPanel, meerkat.columns.abstract.AbstractColumn], fn: Optional[callable] = None, output_type: Optional[type] = None, *args, **kwargs)[source]

Bases: meerkat.columns.abstract.AbstractColumn

static concat(columns: Sequence[meerkat.columns.lambda_column.LambdaColumn])[source]
fn(data: object)[source]

Subclasses like ImageColumn should be able to implement their own version.
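
The idea of a column that applies fn on access, with subclasses free to override fn, can be sketched as follows. LazyMappedColumn and SquaredColumn are hypothetical classes written for this illustration and do not reproduce Meerkat's LambdaColumn.

```python
from typing import Any, Callable, Optional, Sequence


class LazyMappedColumn:
    """Hypothetical sketch of a column that applies `fn` to its data on access."""

    def __init__(self, data: Sequence, fn: Optional[Callable[[Any], Any]] = None):
        self._data = list(data)
        self._fn = fn

    def fn(self, item: Any) -> Any:
        # Default behaviour: delegate to the function passed at construction.
        if self._fn is None:
            raise NotImplementedError("Pass fn or override fn() in a subclass.")
        return self._fn(item)

    def __len__(self) -> int:
        return len(self._data)

    def __getitem__(self, index: int) -> Any:
        # The function runs only when a cell is requested.
        return self.fn(self._data[index])


class SquaredColumn(LazyMappedColumn):
    """Subclass that supplies its own version of fn."""

    def fn(self, item: Any) -> Any:
        return item * item


doubled = LazyMappedColumn([1, 2, 3], fn=lambda x: 2 * x)
squared = SquaredColumn([1, 2, 3])
```

Both objects store only the raw data; doubled[1] and squared[2] compute their values at access time.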

is_equal(other: meerkat.columns.abstract.AbstractColumn) bool[source]

Tests whether two columns are equal.

Parameters

other (AbstractColumn) – The column to compare against.

meerkat.columns.list_column module

class ListColumn(data: Optional[Sequence] = None, *args, **kwargs)[source]

Bases: meerkat.columns.abstract.AbstractColumn

batch(batch_size: int = 1, drop_last_batch: bool = False, collate: bool = True, *args, **kwargs)[source]

Batch the column.

Parameters
  • batch_size – integer batch size

  • drop_last_batch – drop the last batch if it is smaller than batch_size

  • collate – whether to collate the returned batches

Returns

batches of data

classmethod concat(columns: Sequence[meerkat.columns.list_column.ListColumn])[source]
default_formatter()
classmethod from_list(data: Sequence)[source]
is_equal(other: meerkat.columns.abstract.AbstractColumn) bool[source]

Tests whether two columns are equal.

Parameters

other (AbstractColumn) – The column to compare against.

meerkat.columns.numpy_column module

class NumpyArrayColumn(data: Sequence, *args, **kwargs)[source]

Bases: meerkat.columns.abstract.AbstractColumn, numpy.lib.mixins.NDArrayOperatorsMixin

block_class

alias of meerkat.block.numpy_block.NumpyBlock

argsort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.numpy_column.NumpyArrayColumn[source]

Return the indices that would sort the column.

Parameters
  • ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A column containing the indices that would sort the column.

Return type

NumpyArrayColumn

For now, raises an error when the input array has more than one dimension.

classmethod concat(columns: Sequence[meerkat.columns.numpy_column.NumpyArrayColumn])[source]
classmethod from_array(data: numpy.ndarray, *args, **kwargs)[source]
classmethod from_npy(path, mmap_mode=None, allow_pickle=False, fix_imports=True, encoding='ASCII')[source]
classmethod get_writer(mmap: bool = False, template: Optional[meerkat.columns.abstract.AbstractColumn] = None)[source]
is_equal(other: meerkat.columns.abstract.AbstractColumn) bool[source]

Tests whether two columns are equal.

Parameters

other (AbstractColumn) – The column to compare against.

sort(ascending: Union[bool, List[bool]] = True, axis: int = -1, kind: str = 'quicksort', order: Optional[Union[str, List[str]]] = None) meerkat.columns.numpy_column.NumpyArrayColumn[source]

Return a sorted view of the column.

Parameters
  • ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A view of the column with the sorted data.

Return type

AbstractColumn

to_numpy() numpy.ndarray[source]
to_pandas() pandas.core.series.Series[source]
to_tensor() torch.Tensor[source]

Use column.to_tensor() instead of torch.tensor(column), which is very slow.

property is_mmap
getattr_decorator(fn: Callable)[source]

meerkat.columns.pandas_column module

class PandasSeriesColumn(data: Optional[Sequence] = None, collate_fn: Optional[Callable] = None, formatter: Optional[Callable] = None, *args, **kwargs)[source]

Bases: meerkat.columns.abstract.AbstractColumn, numpy.lib.mixins.NDArrayOperatorsMixin

block_class

alias of meerkat.block.pandas_block.PandasBlock

cat

alias of meerkat.columns.pandas_column._MeerkatCategoricalAccessor

dt

alias of meerkat.columns.pandas_column._MeerkatCombinedDatetimelikeProperties

str

alias of meerkat.columns.pandas_column._MeerkatStringMethods

argsort(ascending: bool = True, kind: str = 'quicksort') meerkat.columns.pandas_column.PandasSeriesColumn[source]

Return the indices that would sort the column.

Parameters
  • ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A column containing the indices that would sort the column.

Return type

PandasSeriesColumn

For now, raises an error when the input array has more than one dimension.

classmethod concat(columns: Sequence[meerkat.columns.pandas_column.PandasSeriesColumn])[source]
classmethod from_array(data: numpy.ndarray, *args, **kwargs)[source]
is_equal(other: meerkat.columns.abstract.AbstractColumn) bool[source]

Tests whether two columns are equal.

Parameters

other (AbstractColumn) – The column to compare against.

sort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.pandas_column.PandasSeriesColumn[source]

Return a sorted view of the column.

Parameters
  • ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A view of the column with the sorted data.

Return type

AbstractColumn

to_numpy() torch.Tensor[source]
to_pandas() pandas.core.series.Series[source]
to_tensor() torch.Tensor[source]

Use column.to_tensor() instead of torch.tensor(column), which is very slow.

getattr_decorator(fn: Callable)[source]

meerkat.columns.spacy_column module

class SpacyColumn(data: Sequence[spacy_tokens.Doc] = None, *args, **kwargs)[source]

Bases: meerkat.columns.list_column.ListColumn

classmethod from_docs(data: Sequence[spacy_tokens.Doc], *args, **kwargs)[source]
classmethod from_texts(texts: Sequence[str], lang: str = 'en_core_web_sm', *args, **kwargs)[source]
classmethod read(path: str, nlp: spacy.language.Language = None, lang: str = None, *args, **kwargs) SpacyColumn[source]
write(path: str, **kwargs) None[source]
property docs
property tokens

meerkat.columns.tensor_column module

class TensorColumn(data: Optional[Sequence] = None, *args, **kwargs)[source]

Bases: numpy.lib.mixins.NDArrayOperatorsMixin, meerkat.columns.abstract.AbstractColumn

block_class

alias of meerkat.block.tensor_block.TensorBlock

argsort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.tensor_column.TensorColumn[source]

Return the indices that would sort the column.

Parameters
  • ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A column containing the indices that would sort the column.

Return type

TensorColumn

For now, raises an error when the input array has more than one dimension.

classmethod concat(columns: Sequence[meerkat.columns.tensor_column.TensorColumn])[source]
classmethod from_data(data: Union[Sequence, numpy.ndarray, pandas.core.series.Series, torch.Tensor, meerkat.columns.abstract.AbstractColumn])[source]

Convert data to a TensorColumn.

classmethod get_writer(mmap: bool = False, template: Optional[meerkat.columns.abstract.AbstractColumn] = None)[source]
is_equal(other: meerkat.columns.abstract.AbstractColumn) bool[source]

Tests whether two columns are equal.

Parameters

other (AbstractColumn) – The column to compare against.

sort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.tensor_column.TensorColumn[source]

Return a sorted view of the column.

Parameters
  • ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A view of the column with the sorted data.

Return type

AbstractColumn

to_numpy() pandas.core.series.Series[source]
to_pandas() pandas.core.series.Series[source]
to_tensor() torch.Tensor[source]
getattr_decorator(fn: Callable)[source]

meerkat.columns.volume_column module

class MedicalVolumeColumn(*args, **kwargs)[source]

Bases: meerkat.columns.cell_column.CellColumn

classmethod from_filepaths(filepaths: Optional[Sequence[str]] = None, loader: Optional[callable] = None, transform: Optional[callable] = None, *args, **kwargs)[source]

Module contents