DataFrame

Meerkat.

class DataFrame(data: Optional[Union[dict, list]] = None, primary_key: Optional[str] = None, *args, **kwargs)[source]

A collection of equal length columns.

Parameters
  • data (Union[dict, list]) – A dictionary of columns or a list of dictionaries.

  • primary_key (str, optional) – The name of the primary key column. Defaults to None.

property data: meerkat.block.manager.BlockManager

Get the underlying data (excluding invisible rows).

To access underlying data with invisible rows, use _data.

property columns

Column names in the DataFrame.

property primary_key: meerkat.columns.abstract.Column

The column acting as the primary key.

property primary_key_name: str

The name of the column acting as the primary key.

set_primary_key(column: str, inplace: bool = False) meerkat.dataframe.DataFrame[source]

Set the DataFrame’s primary key using an existing column. By default (inplace=False), this is an out-of-place operation. For more information on primary keys, see the User Guide.

Parameters

column (str) – The name of an existing column to set as the primary key.

create_primary_key(column: str)[source]

Create a primary key of contiguous integers.

Parameters

column (str) – The name of the column to create.

property nrows

Number of rows in the DataFrame.

property ncols

Number of columns in the DataFrame.

property shape

Shape of the DataFrame (num_rows, num_columns).

add_column(name: str, data: Union[Sequence, numpy.ndarray, pandas.core.series.Series, torch.Tensor], overwrite=False) None[source]

Add a column to the DataFrame.

remove_column(column: str) None[source]

Remove a column from the DataFrame.

append(df: meerkat.dataframe.DataFrame, axis: Union[str, int] = 'rows', suffixes: Tuple[str] = None, overwrite: bool = False) meerkat.dataframe.DataFrame[source]

Append a DataFrame to this DataFrame.

df must have the same columns as this DataFrame (regardless of which columns are visible).

head(n: int = 5) meerkat.dataframe.DataFrame[source]

Get the first n examples of the DataFrame.

tail(n: int = 5) meerkat.dataframe.DataFrame[source]

Get the last n examples of the DataFrame.

classmethod from_batch(batch: Dict[str, Union[List, meerkat.columns.abstract.Column]]) meerkat.dataframe.DataFrame[source]

Convert a batch to a DataFrame.

classmethod from_batches(batches: Sequence[Dict[str, Union[List, meerkat.columns.abstract.Column]]]) meerkat.dataframe.DataFrame[source]

Convert a list of batches to a DataFrame.

classmethod from_pandas(df: pandas.core.frame.DataFrame, index: bool = True, primary_key: Optional[str] = None) meerkat.dataframe.DataFrame[source]

Create a Meerkat DataFrame from a Pandas DataFrame.

Warning

In Meerkat, column names must be strings, so non-string column names in the Pandas DataFrame will be converted.

Parameters
  • df – The Pandas DataFrame to convert.

  • index – Whether to include the index of the Pandas DataFrame as a column in the Meerkat DataFrame.

  • primary_key – The name of the column to use as the primary key. If index is True and primary_key is None, the index will be used as the primary key. If index is False, then no primary key will be set. Optional default is None.

Returns

The Meerkat DataFrame.

Return type

DataFrame

classmethod from_arrow(table: pyarrow.lib.Table)[source]

Create a DataFrame from an Arrow Table.

classmethod from_huggingface(*args, **kwargs)[source]

Load a Huggingface dataset as a DataFrame.

Use this to replace datasets.load_dataset, so

>>> dict_of_datasets = datasets.load_dataset('boolq')

becomes

>>> dict_of_dataframes = DataFrame.from_huggingface('boolq')

classmethod from_csv(filepath: str, primary_key: str = None, *args, **kwargs) meerkat.dataframe.DataFrame[source]

Create a DataFrame from a csv file. All of the columns will be meerkat.ScalarColumn with backend Pandas.

Parameters
  • filepath (str) – The file path or buffer to load from. Same as pandas.read_csv().

  • *args – Argument list for pandas.read_csv().

  • **kwargs – Keyword arguments forwarded to pandas.read_csv().

Returns

The constructed dataframe.

Return type

DataFrame

classmethod from_feather(filepath: str, primary_key: str = None, columns: Optional[Sequence[str]] = None, use_threads: bool = True, **kwargs) meerkat.dataframe.DataFrame[source]

Create a DataFrame from a feather file. All of the columns will be meerkat.ScalarColumn with backend Pandas.

Parameters
  • filepath (str) – The file path or buffer to load from. Same as pandas.read_feather().

  • columns (Optional[Sequence[str]]) – The columns to load. Same as pandas.read_feather().

  • use_threads (bool) – Whether to use threads to read the file. Same as pandas.read_feather().

  • **kwargs – Keyword arguments forwarded to pandas.read_feather().

Returns

The constructed dataframe.

Return type

DataFrame

classmethod from_parquet(filepath: str, primary_key: str = None, engine: str = 'auto', columns: Optional[Sequence[str]] = None, **kwargs) meerkat.dataframe.DataFrame[source]

Create a DataFrame from a parquet file. All of the columns will be meerkat.ScalarColumn with backend Pandas.

Parameters
  • filepath (str) – The file path or buffer to load from. Same as pandas.read_parquet().

  • engine (str) – The parquet engine to use. Same as pandas.read_parquet().

  • columns (Optional[Sequence[str]]) – The columns to load. Same as pandas.read_parquet().

  • **kwargs – Keyword arguments forwarded to pandas.read_parquet().

Returns

The constructed dataframe.

Return type

DataFrame

classmethod from_json(filepath: str, primary_key: str = None, orient: str = 'records', lines: bool = False, **kwargs) meerkat.dataframe.DataFrame[source]

Load a DataFrame from a json file.

By default, data in the JSON file should be a list of dictionaries, each with an entry for each column. This is the orient="records" format. If the data is in a different format in the JSON, you can specify the orient parameter. See pandas.read_json() for more details.

Parameters
  • filepath (str) – The file path or buffer to load from. Same as pandas.read_json().

  • orient (str) – The expected JSON string format. Options are: “split”, “records”, “index”, “columns”, “values”. Same as pandas.read_json().

  • lines (bool) – Whether the json file is a jsonl file. Same as pandas.read_json().

  • **kwargs – Keyword arguments forwarded to pandas.read_json().

Returns

The constructed dataframe.

Return type

DataFrame
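
The two file layouts described above (orient="records", with and without lines) can be illustrated with the standard library alone. This is a minimal sketch of the data layouts, not Meerkat's loading code; the column names and the helper records_to_columns are hypothetical.

```python
import json

# Two rows and three columns in the orient="records" layout:
# a list of dictionaries, each with one entry per column.
records = [
    {"id": 0, "name": "a", "score": 0.5},
    {"id": 1, "name": "b", "score": 0.9},
]

# lines=False: the whole file is a single JSON array of row dictionaries.
as_json = json.dumps(records)

# lines=True: one JSON object per line (the jsonl layout).
as_jsonl = "\n".join(json.dumps(row) for row in records)


def records_to_columns(rows):
    """Pivot a list of row dictionaries into a dict of equal-length columns."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


# Both layouts describe the same table.
from_json = records_to_columns(json.loads(as_json))
from_jsonl = records_to_columns([json.loads(line) for line in as_jsonl.splitlines()])
```

Either layout pivots to the same dictionary of columns, which is the other input form accepted by the DataFrame constructor.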

to_pandas(index: bool = False, allow_objects: bool = False) pandas.core.frame.DataFrame[source]

Convert a Meerkat DataFrame to a Pandas DataFrame.

Parameters

index (bool) – Use the primary key as the index of the Pandas DataFrame. Defaults to False.

Returns

The constructed dataframe.

Return type

pd.DataFrame

to_arrow() pandas.core.frame.DataFrame[source]

Convert a Meerkat DataFrame to an Arrow Table.

Returns

The constructed table.

Return type

pa.Table

to_csv(filepath: str, engine: str = 'auto')[source]

Save a DataFrame to a csv file.

The engine argument selects the library used to write the csv to disk.

Parameters
  • filepath (str) – The file path to save to.

  • engine (str) – The library to use to write the csv. One of [“pandas”, “arrow”, “auto”]. If “auto”, then the library will be chosen based on the column types.

to_feather(filepath: str, engine: str = 'auto')[source]

Save a DataFrame to a feather file.

The engine argument selects the library used to write the feather to disk.

Parameters
  • filepath (str) – The file path to save to.

  • engine (str) – The library to use to write the feather. One of [“pandas”, “arrow”, “auto”]. If “auto”, then the library will be chosen based on the column types.

to_parquet(filepath: str, engine: str = 'auto')[source]

Save a DataFrame to a parquet file.

The engine argument selects the library used to write the parquet to disk.

Parameters
  • filepath (str) – The file path to save to.

  • engine (str) – The library to use to write the parquet. One of [“pandas”, “arrow”, “auto”]. If “auto”, then the library will be chosen based on the column types.

to_json(filepath: str, lines: bool = False, orient: str = 'records') None[source]

Save a DataFrame to a json file.

Parameters
  • filepath (str) – The file path to save to.

  • lines (bool) – Whether to write the json file as a jsonl file.

  • orient (str) – The orientation of the json file. Same as pandas.DataFrame.to_json().

batch(batch_size: int = 1, drop_last_batch: bool = False, num_workers: int = 0, materialize: bool = True, shuffle: bool = False, *args, **kwargs)[source]

Batch the DataFrame.

Parameters
  • batch_size – integer batch size

  • drop_last_batch – drop the last batch if it’s smaller than batch_size

Returns

batches of data
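
The effect of batch_size and drop_last_batch can be sketched on a plain list. This is an illustration of the semantics described above under simple assumptions, not Meerkat's implementation; iter_batches is a hypothetical helper.

```python
from typing import Iterator, List, Sequence


def iter_batches(
    rows: Sequence, batch_size: int = 1, drop_last_batch: bool = False
) -> Iterator[List]:
    """Yield consecutive chunks of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        chunk = list(rows[start:start + batch_size])
        # Only the final chunk can be short; skip it when requested.
        if drop_last_batch and len(chunk) < batch_size:
            return
        yield chunk


rows = list(range(7))
kept = list(iter_batches(rows, batch_size=3))
dropped = list(iter_batches(rows, batch_size=3, drop_last_batch=True))
```

With seven rows and a batch size of three, the final batch holds a single row unless drop_last_batch is set.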

update(function: Optional[Callable] = None, with_indices: bool = False, input_columns: Optional[Union[List[str], str]] = None, is_batched_fn: bool = False, batch_size: Optional[int] = 1, remove_columns: Optional[List[str]] = None, num_workers: int = 0, output_type: Union[type, Dict[str, type]] = None, mmap: bool = False, mmap_path: str = None, materialize: bool = True, pbar: bool = False, **kwargs) meerkat.dataframe.DataFrame[source]

Update the columns of the DataFrame.

filter(function: Optional[Callable] = None, with_indices=False, input_columns: Optional[Union[List[str], str]] = None, is_batched_fn: bool = False, batch_size: Optional[int] = 1, drop_last_batch: bool = False, num_workers: int = 0, materialize: bool = True, pbar: bool = False, **kwargs) Optional[meerkat.dataframe.DataFrame][source]

Filter operation on the DataFrame.

sort(by: Union[str, List[str]], ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.dataframe.DataFrame[source]

Sort the DataFrame by the values in the specified columns. Similar to sort_values in pandas.

Parameters
  • by (Union[str, List[str]]) – The columns to sort by.

  • ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A sorted view of DataFrame.

Return type

DataFrame
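
How a list of columns in by pairs up with a list in ascending can be sketched with plain Python. The helper sort_rows below is hypothetical and operates on a list of row dictionaries; it illustrates the parameter semantics and is not how Meerkat sorts internally.

```python
def sort_rows(rows, by, ascending=True):
    """Sort row dictionaries by one or more keys, each with its own direction."""
    keys = [by] if isinstance(by, str) else list(by)
    orders = [ascending] * len(keys) if isinstance(ascending, bool) else list(ascending)
    if len(orders) != len(keys):
        raise ValueError("ascending must have the same length as by")

    result = list(rows)
    # Python's sort is stable, so sorting by the last key first and the
    # first key last produces a correct multi-key ordering.
    for key, asc in reversed(list(zip(keys, orders))):
        result.sort(key=lambda row, k=key: row[k], reverse=not asc)
    return result


rows = [
    {"team": "b", "score": 1},
    {"team": "a", "score": 2},
    {"team": "a", "score": 3},
    {"team": "b", "score": 4},
]
# team ascending, then score descending within each team
ordered = sort_rows(rows, by=["team", "score"], ascending=[True, False])
pairs = [(row["team"], row["score"]) for row in ordered]
```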

sample(n: Optional[int] = None, frac: Optional[float] = None, replace: bool = False, weights: Optional[Union[str, numpy.ndarray]] = None, random_state: Optional[Union[int, numpy.random.mtrand.RandomState]] = None) meerkat.dataframe.DataFrame[source]

Select a random sample of rows from the DataFrame. Roughly equivalent to sample in Pandas: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html.

Parameters
  • n (int) – Number of samples to draw. If frac is specified, this parameter should not be passed. Defaults to 1 if frac is not passed.

  • frac (float) – Fraction of rows to sample. If n is specified, this parameter should not be passed.

  • replace (bool) – Sample with or without replacement. Defaults to False.

  • weights (Union[str, np.ndarray]) – Weights to use for sampling. If None (default), the rows will be sampled uniformly. If a numpy array, the sample will be weighted accordingly. If a string, the weights will be applied to the rows based on the column with the name specified. If weights do not sum to 1 they will be normalized to sum to 1.

  • random_state (Union[int, np.random.RandomState]) – Random state or seed to use for sampling.

Returns

A random sample of rows from the DataFrame.

Return type

DataFrame
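
The interaction between n, frac, and weights can be sketched with the standard library. The helper sample_indices is hypothetical; in particular, the rounding of frac and the one-at-a-time procedure for sampling without replacement are assumptions made for illustration, not Meerkat's or Pandas' actual algorithm.

```python
import random


def sample_indices(nrows, n=None, frac=None, replace=False, weights=None, random_state=None):
    """Return row indices for a (possibly weighted) random sample."""
    if n is not None and frac is not None:
        raise ValueError("Pass either n or frac, not both")
    if n is None:
        # n defaults to 1 when frac is not given (rounding of frac is an assumption).
        n = 1 if frac is None else int(round(frac * nrows))
    if not replace and n > nrows:
        raise ValueError("Cannot draw more rows than exist without replacement")

    rng = random.Random(random_state)
    if weights is None:
        weights = [1.0] * nrows  # uniform sampling
    total = sum(weights)
    probs = [w / total for w in weights]  # normalize so the weights sum to 1

    if replace:
        return rng.choices(range(nrows), weights=probs, k=n)

    # Without replacement: draw one index at a time from the rows that remain.
    remaining = list(range(nrows))
    chosen = []
    for _ in range(n):
        pick = rng.choices(remaining, weights=[probs[i] for i in remaining], k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


weighted = sample_indices(4, n=2, weights=[0, 0, 2, 2], random_state=0)
half = sample_indices(4, frac=0.5, random_state=1)
```

In the weighted call, rows 0 and 1 have zero weight and are therefore never selected, even though the weights passed in sum to 4 rather than 1.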

rename(mapper: Optional[Union[Dict, Callable]] = None, errors: Literal['ignore', 'raise'] = 'ignore') meerkat.dataframe.DataFrame[source]

Return a new DataFrame with the specified column labels renamed.

Dictionary values must be unique (1-to-1). Labels not specified will be left unchanged. Extra labels will not throw an error.

Parameters
  • mapper (Union[Dict, Callable], optional) – Dict-like or function transformations to apply to the values of the columns. Defaults to None.

  • errors (Literal['ignore', 'raise'], optional) – If ‘raise’, raise a KeyError when the Dict contains labels that do not exist in the DataFrame. If ‘ignore’, extra keys will be ignored. Defaults to ‘ignore’.

Raises

ValueError – _description_

Returns

A new DataFrame with the specified column labels renamed.

Return type

DataFrame
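
The behaviour of mapper and errors can be sketched on a plain list of column names. The helper rename_columns is hypothetical and only mirrors the description above; it is not Meerkat's implementation.

```python
def rename_columns(columns, mapper=None, errors="ignore"):
    """Return new column names after applying a dict or a function."""
    if mapper is None:
        return list(columns)
    if callable(mapper):
        # A function is applied to every column name.
        return [mapper(name) for name in columns]

    missing = [key for key in mapper if key not in columns]
    if missing and errors == "raise":
        raise KeyError(f"Labels not found in columns: {missing}")
    # Labels not present in the dict are left unchanged.
    return [mapper.get(name, name) for name in columns]


columns = ["a", "b"]
with_dict = rename_columns(columns, {"a": "x", "zzz": "y"})  # extra key ignored
with_func = rename_columns(columns, str.upper)

try:
    rename_columns(columns, {"zzz": "y"}, errors="raise")
    raised = False
except KeyError:
    raised = True
```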

drop(columns: Union[str, Collection[str]], check_exists=True) meerkat.dataframe.DataFrame[source]

Return a new DataFrame with the specified columns dropped.

Parameters

columns (Union[str, Collection[str]]) – The columns to drop.

Returns

A new DataFrame with the specified columns dropped.

Return type

DataFrame

classmethod read(path: str, overwrite: bool = False, *args, **kwargs) meerkat.dataframe.DataFrame[source]

Load a DataFrame stored on disk.

write(path: str) None[source]

Save a DataFrame to disk.

class Row[source]

column(data: Sequence) meerkat.columns.abstract.Column[source]

Create a Meerkat column from data.

The Meerkat column type is inferred from the type and structure of the data passed in.

class Column(data: Optional[Sequence] = None, collate_fn: Optional[Callable] = None, formatter: Optional[Callable] = None, *args, **kwargs)[source]

An abstract class for Meerkat columns.

property data

Get the underlying data.

filter(function: Callable, with_indices=False, input_columns: Optional[Union[List[str], str]] = None, is_batched_fn: bool = False, batch_size: Optional[int] = 1, drop_last_batch: bool = False, num_workers: Optional[int] = 0, materialize: bool = True, **kwargs) Optional[meerkat.columns.abstract.Column][source]

Filter the elements of the column using a function.

sort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.abstract.Column[source]

Return a sorted view of the column.

Parameters
  • ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A view of the column with the sorted data.

Return type

AbstractColumn

argsort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.abstract.Column[source]

Return indices that would sort the column.

Parameters
  • ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A view of the column with the sorted data.

Return type

AbstractColumn
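
The relationship between argsort and sort can be shown on a plain list: indexing the values with the argsort result yields the sorted values. The helper below is a hypothetical stand-in written with the standard library, not the column method itself.

```python
def argsort(values, ascending=True):
    """Return the indices that would sort values."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=not ascending)


values = [30, 10, 20]
order = argsort(values)
# Applying the indices to the original values produces the sorted sequence.
sorted_values = [values[i] for i in order]
descending = argsort(values, ascending=False)
```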

sample(n: Optional[int] = None, frac: Optional[float] = None, replace: bool = False, weights: Optional[Union[str, numpy.ndarray]] = None, random_state: Optional[Union[int, numpy.random.mtrand.RandomState]] = None) meerkat.columns.abstract.Column[source]

Select a random sample of rows from the Column. Roughly equivalent to sample in Pandas: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html.

Parameters
  • n (int) – Number of samples to draw. If frac is specified, this parameter should not be passed. Defaults to 1 if frac is not passed.

  • frac (float) – Fraction of rows to sample. If n is specified, this parameter should not be passed.

  • replace (bool) – Sample with or without replacement. Defaults to False.

  • weights (np.ndarray) – Weights to use for sampling. If None (default), the rows will be sampled uniformly. If a numpy array, the sample will be weighted accordingly. If weights do not sum to 1 they will be normalized to sum to 1.

  • random_state (Union[int, np.random.RandomState]) – Random state or seed to use for sampling.

Returns

A random sample of rows from the column.

Return type

AbstractColumn

is_equal(other: meerkat.columns.abstract.Column) bool[source]

Tests whether two columns are equal.

Parameters

other (AbstractColumn) – The column to compare against.

batch(batch_size: int = 1, drop_last_batch: bool = False, collate: bool = True, num_workers: int = 0, materialize: bool = True, *args, **kwargs)[source]

Batch the column.

Parameters
  • batch_size – integer batch size

  • drop_last_batch – drop the last batch if it’s smaller than batch_size

  • collate – whether to collate the returned batches

Returns

batches of data

classmethod from_data(data: Union[Columnable, Column])[source]

Convert data to a meerkat column using the appropriate Column type.

head(n: int = 5) meerkat.columns.abstract.Column[source]

Get the first n examples of the column.

tail(n: int = 5) meerkat.columns.abstract.Column[source]

Get the last n examples of the column.

to_pandas(allow_objects: bool = False) pandas.core.series.Series[source]

Convert the column to a Pandas Series.

If the column cannot be converted to a Pandas Series, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a Pandas Series.

Return type

pd.Series

to_arrow() pyarrow.lib.Array[source]

Convert the column to an Arrow Array.

If the column cannot be converted to an Arrow Array, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as an Arrow Array.

Return type

pa.Array

to_torch() torch.Tensor[source]

Convert the column to a PyTorch Tensor.

If the column cannot be converted to a PyTorch Tensor, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a PyTorch Tensor.

Return type

torch.Tensor

to_numpy() numpy.ndarray[source]

Convert the column to a Numpy array.

If the column cannot be converted to a Numpy array, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a Numpy array.

Return type

np.ndarray

class ObjectColumn(data: Optional[Sequence] = None, *args, **kwargs)[source]

batch(batch_size: int = 1, drop_last_batch: bool = False, collate: bool = True, *args, **kwargs)[source]

Batch the column.

Parameters
  • batch_size – integer batch size

  • drop_last_batch – drop the last batch if it’s smaller than batch_size

  • collate – whether to collate the returned batches

Returns

batches of data

is_equal(other: meerkat.columns.abstract.Column) bool[source]

Tests whether two columns are equal.

Parameters

other (AbstractColumn) – The column to compare against.

to_pandas(allow_objects: bool = False) pandas.core.series.Series[source]

Convert the column to a Pandas Series.

If the column cannot be converted to a Pandas Series, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a Pandas Series.

Return type

pd.Series

class ScalarColumn(data: Optional[Union[numpy.ndarray, torch.TensorType, pandas.core.series.Series, List]] = None, backend: Optional[str] = None)[source]

class PandasScalarColumn(data: Optional[Union[numpy.ndarray, torch.TensorType, pandas.core.series.Series, List]] = None, backend: Optional[str] = None)[source]

str

alias of meerkat.columns.scalar.pandas._MeerkatStringMethods

dt

alias of meerkat.columns.scalar.pandas._MeerkatCombinedDatetimelikeProperties

cat

alias of meerkat.columns.scalar.pandas._MeerkatCategoricalAccessor

sort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.scalar.pandas.PandasScalarColumn[source]

Return a sorted view of the column.

Parameters
  • ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A view of the column with the sorted data.

Return type

AbstractColumn

argsort(ascending: bool = True, kind: str = 'quicksort') meerkat.columns.scalar.pandas.PandasScalarColumn[source]

Return indices that would sort the column.

Parameters
  • ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A view of the column with the sorted data.

Return type

PandasSeriesColumn

For now, this raises an error when the input array has more than one dimension.

to_tensor() torch.Tensor[source]

Use column.to_tensor() instead of torch.tensor(column), which is very slow.

to_numpy() torch.Tensor[source]

Convert the column to a Numpy array.

If the column cannot be converted to a Numpy array, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a Numpy array.

Return type

np.ndarray

to_pandas(allow_objects: bool = False) pandas.core.series.Series[source]

Convert the column to a Pandas Series.

If the column cannot be converted to a Pandas Series, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a Pandas Series.

Return type

pd.Series

to_arrow() pyarrow.lib.Array[source]

Convert the column to an Arrow Array.

If the column cannot be converted to an Arrow Array, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as an Arrow Array.

Return type

pa.Array

is_equal(other: meerkat.columns.abstract.Column) bool[source]

Tests whether two columns are equal.

Parameters

other (AbstractColumn) – The column to compare against.

class ArrowScalarColumn(data: Optional[Union[numpy.ndarray, torch.TensorType, pandas.core.series.Series, List]] = None, backend: Optional[str] = None)[source]

is_equal(other: meerkat.columns.abstract.Column) bool[source]

Tests whether two columns are equal.

Parameters

other (AbstractColumn) – The column to compare against.

to_numpy()[source]

Convert the column to a Numpy array.

If the column cannot be converted to a Numpy array, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a Numpy array.

Return type

np.ndarray

to_pandas(allow_objects: bool = False)[source]

Convert the column to a Pandas Series.

If the column cannot be converted to a Pandas Series, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a Pandas Series.

Return type

pd.Series

to_arrow() pyarrow.lib.Array[source]

Convert the column to an Arrow Array.

If the column cannot be converted to an Arrow Array, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as an Arrow Array.

Return type

pa.Array

class TensorColumn(data: Optional[Union[numpy.ndarray, torch.TensorType]] = None, backend: Optional[str] = None)[source]

class NumPyTensorColumn(data: Optional[Union[numpy.ndarray, torch.TensorType]] = None, backend: Optional[str] = None)[source]

is_equal(other: meerkat.columns.abstract.Column) bool[source]

Tests whether two columns are equal.

Parameters

other (AbstractColumn) – The column to compare against.

sort(ascending: Union[bool, List[bool]] = True, axis: int = - 1, kind: str = 'quicksort', order: Optional[Union[List[str], str]] = None) meerkat.columns.tensor.numpy.NumPyTensorColumn[source]

Return a sorted view of the column.

Parameters
  • ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A view of the column with the sorted data.

Return type

Column

argsort(ascending: bool = True, kind: str = 'quicksort') meerkat.columns.tensor.numpy.NumPyTensorColumn[source]

Return indices that would sort the column.

Parameters
  • ascending (bool) – Whether to sort in ascending or descending order.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A view of the column with the sorted data.

Return type

NumpySeriesColumn

For now, this raises an error when the input array has more than one dimension.

to_torch() torch.Tensor[source]

Convert the column to a PyTorch Tensor.

If the column cannot be converted to a PyTorch Tensor, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a PyTorch Tensor.

Return type

torch.Tensor

to_pandas(allow_objects: bool = False) pandas.core.series.Series[source]

Convert the column to a Pandas Series.

If the column cannot be converted to a Pandas Series, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a Pandas Series.

Return type

pd.Series

to_arrow() pyarrow.lib.Array[source]

Convert the column to an Arrow Array.

If the column cannot be converted to an Arrow Array, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as an Arrow Array.

Return type

pa.Array

to_numpy() numpy.ndarray[source]

Convert the column to a Numpy array.

If the column cannot be converted to a Numpy array, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a Numpy array.

Return type

np.ndarray

class TorchTensorColumn(data: Optional[Union[numpy.ndarray, torch.TensorType]] = None, backend: Optional[str] = None)[source]

classmethod from_data(data: Union[Sequence, numpy.ndarray, pandas.core.series.Series, torch.Tensor, meerkat.columns.abstract.Column])[source]

Convert data to a TorchTensorColumn.

sort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.tensor.torch.TorchTensorColumn[source]

Return a sorted view of the column.

Parameters
  • ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A view of the column with the sorted data.

Return type

Column

argsort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.tensor.torch.TorchTensorColumn[source]

Return indices that would sort the column.

Parameters
  • ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A view of the column with the sorted data.

Return type

TensorColumn

For now, this raises an error when the input array has more than one dimension.

is_equal(other: meerkat.columns.abstract.Column) bool[source]

Tests whether two columns are equal.

Parameters

other (AbstractColumn) – The column to compare against.

to_pandas(allow_objects: bool = False) pandas.core.series.Series[source]

Convert the column to a Pandas Series.

If the column cannot be converted to a Pandas Series, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a Pandas Series.

Return type

pd.Series

to_numpy() pandas.core.series.Series[source]

Convert the column to a Numpy array.

If the column cannot be converted to a Numpy array, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as a Numpy array.

Return type

np.ndarray

to_arrow() pyarrow.lib.Array[source]

Convert the column to an Arrow Array.

If the column cannot be converted to an Arrow Array, this method will raise a ~meerkat.errors.ConversionError.

Returns

The column as an Arrow Array.

Return type

pa.Array

class DeferredColumn(data: Union[meerkat.block.deferred_block.DeferredOp, meerkat.block.abstract.BlockView], output_type: Optional[type] = None, *args, **kwargs)[source]

property fn: Callable

Subclasses like ImageColumn should be able to implement their own version.

is_equal(other: meerkat.columns.abstract.Column) bool[source]

Tests whether two columns are equal.

Parameters

other (AbstractColumn) – The column to compare against.

class FileColumn(data: Sequence[str] = None, loader: callable = None, transform: callable = None, downloader: Union[callable | str] = None, base_dir: str = None, *args, **kwargs)[source]

A column where each cell represents a file stored on disk or the web. The underlying data is a PandasSeriesColumn of strings, where each string is the path to a file. The column materializes the files into memory when indexed. If the column is lazy indexed with the lz indexer, the files are not materialized and a FileCell or a FileColumn is returned instead.

Parameters
  • data (Sequence[str]) – A list of filepaths to images.

  • loader (callable) –

    A callable with signature def loader(filepath: str) -> PIL.Image:. Defaults to torchvision.datasets.folder.default_loader.

    Warning

    In order for the column to be serializable with write(), the loader function must be pickleable.

  • transform (callable) –

    A function that transforms the loaded data (e.g. torchvision.transforms.functional.center_crop).

    Warning

    In order for the column to be serializable, the transform function must be pickleable.

  • base_dir (str) – an absolute path to a directory containing the files. If provided, the filepath to be loaded will be joined with the base_dir. As such, this argument should only be used if the loader will be applied to relative paths. The base_dir can also include environment variables (e.g. $DATA_DIR/images) which will be expanded prior to loading. This is useful when sharing DataFrames between machines.
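
The base_dir behaviour described above (environment variables expanded, then joined with each relative filepath) can be sketched with os.path. The helper resolve_path, the DATA_DIR value, and the file name are hypothetical; this illustrates the described behaviour and is not Meerkat's code.

```python
import os


def resolve_path(filepath, base_dir=None):
    """Expand environment variables in base_dir and join it with filepath."""
    if base_dir is None:
        return filepath
    return os.path.join(os.path.expandvars(base_dir), filepath)


# Each machine sets DATA_DIR to wherever its copy of the data lives.
os.environ["DATA_DIR"] = "/mnt/datasets"  # hypothetical location

resolved = resolve_path("cats/001.jpg", base_dir="$DATA_DIR/images")
unchanged = resolve_path("/abs/path/001.jpg")
```

Because only the relative paths are stored, the same DataFrame can be loaded on another machine by setting DATA_DIR differently there.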

is_equal(other: meerkat.columns.abstract.Column) bool[source]

Tests whether two columns are equal.

Parameters

other (AbstractColumn) – The column to compare against.

class ImageColumn(data: Sequence[str] = None, loader: callable = None, transform: callable = None, downloader: Union[callable | str] = None, base_dir: str = None, *args, **kwargs)[source]

A column where each cell represents an image stored on disk. The underlying data is a PandasSeriesColumn of strings, where each string is the path to an image. The column materializes the images into memory when indexed. If the column is lazy indexed with the lz indexer, the images are not materialized and an ImageCell or an ImageColumn is returned instead.

Parameters
  • data (Sequence[str]) – A list of filepaths to images.

  • transform (callable) –

    A function that transforms the image (e.g. torchvision.transforms.functional.center_crop).

    Warning

    In order for the column to be serializable, the transform function must be pickleable.

  • loader (callable) –

    A callable with signature def loader(filepath: str) -> PIL.Image:. Defaults to torchvision.datasets.folder.default_loader.

    Warning

    In order for the column to be serializable with write(), the loader function must be pickleable.

  • base_dir (str) – A base directory that the paths in data are relative to. If None, the paths are assumed to be absolute.
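
The warnings above require loader and transform to be pickleable. A quick way to check a candidate function is to try pickling it with the standard library; the helper is_pickleable is hypothetical, and os.path.basename merely stands in for any importable, module-level function.

```python
import os
import pickle


def is_pickleable(fn) -> bool:
    """Return True if fn can be serialized with pickle."""
    try:
        pickle.dumps(fn)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# Functions importable by name are pickled by reference and work.
named_ok = is_pickleable(os.path.basename)

# A lambda has no importable name, so pickling it fails.
lambda_ok = is_pickleable(lambda filepath: filepath)
```

In practice this means defining loaders and transforms as named, module-level functions rather than lambdas or nested functions.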

class AudioColumn(data: Sequence[str] = None, loader: callable = None, transform: callable = None, downloader: Union[callable | str] = None, base_dir: str = None, *args, **kwargs)[source]

A lambda column where each cell represents an audio file on disk. The underlying data is a PandasSeriesColumn of strings, where each string is the path to an audio file. The column materializes the audio files into memory when indexed. If the column is lazy indexed with the lz indexer, the audio files are not materialized and a FileCell or an AudioColumn is returned instead.

Parameters
  • data (Sequence[str]) – A list of filepaths to audio files.

  • transform (callable) –

    A function that transforms the image (e.g. torchvision.transforms.functional.center_crop).

    Warning

    In order for the column to be serializable, the transform function must be pickleable.

  • loader (callable) –

    A callable with signature def loader(filepath: str) -> PIL.Image:. Defaults to torchvision.datasets.folder.default_loader.

    Warning

    In order for the column to be serializable with write(), the loader function must be pickleable.

  • base_dir (str) – A base directory that the paths in data are relative to. If None, the paths are assumed to be absolute.

collate(batch)[source]

Collate data.

class AbstractCell(*args, **kwargs)[source]

get(*args, **kwargs) object[source]

Get me the thing that this cell exists for.

property metadata: dict

Get the metadata associated with this cell.

class DeferredCell(data: meerkat.block.deferred_block.DeferredCellOp)[source]
property data: object

Get the data associated with this cell.

get(*args, **kwargs)[source]

Get me the thing that this cell exists for.

class FileCell(data: meerkat.block.deferred_block.DeferredCellOp)[source]
map(data: Union[DataFrame, Column], function: Callable, is_batched_fn: bool = False, batch_size: int = 1, inputs: Union[Mapping[str, str], Sequence[str]] = None, outputs: Union[Mapping[any, str], Sequence[str]] = None, output_type: Union[Mapping[str, type], type] = None, materialize: bool = True, pbar: bool = False, **kwargs)[source]

Create a new Column or DataFrame by applying a function to each row in data.

This function shares nearly the same signature as defer(). The difference is that defer() returns a column that has not yet been computed: a placeholder for a column that will be computed later.

Learn more in the user guide: Mapping: map and defer.

What gets passed to function?

  • If data is a DataFrame, then the function’s signature is inspected to determine which columns to pass as keyword arguments to the function. For example, if the function is lambda age, residence: age > 18 and residence == "NY", then the columns age and residence will be passed to the function. If the columns are not present in the DataFrame, then a ValueError will be raised. The mapping between columns and function arguments can be overridden by passing the inputs argument.

  • If data is a Column then values of the column are passed as a single positional argument to the function. The inputs argument is ignored.
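The signature-inspection rule for DataFrames can be sketched with the standard inspect module. This is an illustration of the documented behavior, not Meerkat's implementation: plain dicts of lists stand in for a DataFrame, and `select_columns` and `map_rows` are hypothetical helpers.

```python
import inspect


def select_columns(function, columns):
    """Pick the columns whose names match the function's parameters."""
    selected = {}
    for name, param in inspect.signature(function).parameters.items():
        if name in columns:
            selected[name] = columns[name]
        elif param.default is inspect.Parameter.empty:
            # A required argument with no matching column is an error.
            raise ValueError(f"Column '{name}' not found in the data.")
    return selected


def map_rows(function, columns):
    """Call function once per row, passing matched columns as keywords."""
    selected = select_columns(function, columns)
    nrows = len(next(iter(columns.values())))
    return [
        function(**{name: values[i] for name, values in selected.items()})
        for i in range(nrows)
    ]


columns = {
    "age": [56, 30, 13],
    "residence": ["MA", "NY", "MA"],
    "name": ["a", "b", "c"],  # ignored: not a parameter of the function
}
eligible = map_rows(lambda age, residence: age >= 18 and residence == "MA", columns)
```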

What gets returned by map?

  • If function returns a single value, then map will return a Column object.

  • If function returns a dictionary, then map will return a DataFrame. The keys of the dictionary are used as column names. The outputs argument can be used to override the column names.

  • If function returns a tuple, then map will return a DataFrame. The column names will be integers. The column names can be overridden by passing a tuple to the outputs argument.

  • If function returns a tuple or a dictionary, then passing "single" to the outputs argument will cause map to return a single ObjectColumn.
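The return rules above can be summarized as a small dispatch on the type of the per-row result. The sketch below is illustrative only: a list stands in for a Column, a dict of lists stands in for a DataFrame, and `collect_outputs` is a hypothetical helper.

```python
def collect_outputs(row_outputs, outputs=None):
    """Arrange per-row results following the documented return rules."""
    first = row_outputs[0]
    # Single values, or outputs="single": one column of objects.
    if outputs == "single" or not isinstance(first, (dict, tuple)):
        return list(row_outputs)
    if isinstance(first, dict):
        # Dictionary keys become column names unless outputs renames them.
        names = outputs if outputs is not None else {key: key for key in first}
        return {names[key]: [row[key] for row in row_outputs] for key in first}
    # Tuples: integer column names unless outputs supplies names.
    names = outputs if outputs is not None else tuple(range(len(first)))
    return {names[i]: [row[i] for row in row_outputs] for i in range(len(first))}


single = collect_outputs([1, 2])
from_dicts = collect_outputs([{"a": 1, "b": 2}, {"a": 3, "b": 4}])
from_tuples = collect_outputs([(1, "x"), (2, "y")])
renamed = collect_outputs([(1, "x"), (2, "y")], outputs=("num", "char"))
as_objects = collect_outputs([(1, "x"), (2, "y")], outputs="single")
```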

Note

This function is also available as a method of DataFrame and Column under the name map.

Parameters
  • data (Union[DataFrame, Column]) – The DataFrame or Column to which the function will be applied.

  • function (Callable) – The function that will be applied to the rows of data.

  • is_batched_fn (bool, optional) – Whether the function must be applied on a batch of rows. Defaults to False.

  • batch_size (int, optional) – The size of the batch. Defaults to 1.

  • inputs (Dict[str, str], optional) – Dictionary mapping column names in data to keyword arguments of function. Ignored if data is a column. When calling function, values from the columns will be fed to the corresponding keyword arguments. Defaults to None, in which case the signature of the function is inspected: columns with the same names as the function's arguments are found in the DataFrame and their values are passed to the function. If the function takes a non-default argument that is not a column in the DataFrame, the operation will raise a ValueError.

  • outputs (Union[Dict[any, str], Tuple[str]], optional) –

    Controls how the output of function is mapped to the output of map(). Defaults to None.

    • If None: the output is inferred from the return type of the function. See explanation above.

    • If "single": a single ObjectColumn is returned.

    • If a Dict[any, str]: then a DataFrame is returned. This is useful when the output of function is a Dict. outputs maps the outputs of function to column names in the resulting DataFrame.

    • If a Tuple[str]: then a DataFrame is returned. This is useful when the output of function is a Tuple. outputs maps the outputs of function to column names in the resulting DataFrame.

  • output_type (Union[Dict[str, type], type], optional) – Coerce the column. Defaults to None.

  • pbar (bool) – Show a progress bar. Defaults to False.
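The interplay of is_batched_fn and batch_size can be sketched as follows. This is an assumption-laden illustration rather than the library's code: a list stands in for a Column, `map_values` is a hypothetical helper, and the reference year is fixed at 2023 so the results are reproducible.

```python
def map_values(values, function, is_batched_fn=False, batch_size=1):
    """Apply function row by row, or to consecutive batches of rows."""
    if not is_batched_fn:
        return [function(value) for value in values]
    results = []
    for start in range(0, len(values), batch_size):
        batch = values[start:start + batch_size]
        # A batched function receives a batch and returns one result per row.
        results.extend(function(batch))
    return results


birth_years = [1967, 1993, 2010, 1985, 2007, 1990, 1943]
seen_batch_sizes = []


def ages_for_batch(batch):
    seen_batch_sizes.append(len(batch))
    return [2023 - year for year in batch]


row_ages = map_values(birth_years, lambda year: 2023 - year)
batch_ages = map_values(birth_years, ages_for_batch, is_batched_fn=True, batch_size=3)
```

Both calls produce the same values; only the number of function invocations differs.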

Returns

A Column or a DataFrame containing the result of the map.

Return type

Union[DataFrame, Column]

Examples

We start with a small DataFrame of voters with two columns: birth_year, which contains the birth year of each person, and residence, which contains the state in which each person lives.

In [1]: import datetime

In [2]: import meerkat as mk

In [3]: df = mk.DataFrame({
   ...:     "birth_year": [1967, 1993, 2010, 1985, 2007, 1990, 1943],
   ...:     "residence": ["MA", "LA", "NY", "NY", "MA", "MA", "LA"]
   ...: })
   ...: 

Single input column. Map a column of birth years to a column of ages.

In [4]: df["age"] = df["birth_year"].map(
   ...:     lambda x: datetime.datetime.now().year - x
   ...: )
   ...: 

In [5]: df["age"]
Out[5]: 
PandasScalarColumn(0    56
1    ..., dtype: int64)

Multiple input columns. Map the age and residence columns to a boolean column, ma_eligible, that is True for MA residents aged 18 or older.

In [6]: df["ma_eligible"] = df.map(
   ...:     lambda age, residence: (residence == "MA") and (age >= 18)
   ...: )
   ...: 

In [7]: df["ma_eligible"]
Out[7]: 
PandasScalarColumn(0     True
1 ...l, dtype: bool)
defer(data: Union[DataFrame, Column], function: Callable, is_batched_fn: bool = False, batch_size: int = 1, inputs: Union[Mapping[str, str], Sequence[str]] = None, outputs: Union[Mapping[any, str], Sequence[str]] = None, output_type: Union[Mapping[str, type], type] = None, materialize: bool = True) Union[DataFrame, DeferredColumn][source]

Create one or more DeferredColumns that lazily applies a function to each row in data.

This function shares nearly the same signature as map(). The difference is that defer() returns a column that has not yet been computed: a placeholder for a column that will be computed later.

Learn more in the user guide: Deferred map and chaining.

What gets passed to function?

  • If data is a DataFrame, then the function’s signature is inspected to determine which columns to pass as keyword arguments to the function. For example, if the function is lambda age, residence: age > 18 and residence == "NY", then the columns age and residence will be passed to the function. If the columns are not present in the DataFrame, then a ValueError will be raised. The mapping between columns and function arguments can be overridden by passing the inputs argument.

  • If data is a Column then values of the column are passed as a single positional argument to the function. The inputs argument is ignored.

What gets returned by defer?

  • If function returns a single value, then defer will return a DeferredColumn object.

  • If function returns a dictionary, then defer will return a DataFrame containing DeferredColumn objects. The keys of the dictionary are used as column names. The outputs argument can be used to override the column names.

  • If function returns a tuple, then defer will return a DataFrame containing DeferredColumn objects. The column names will be integers. The column names can be overridden by passing a tuple to the outputs argument.

  • If function returns a tuple or a dictionary, then passing "single" to the outputs argument will cause defer to return a single DeferredColumn that materializes to an ObjectColumn.

How do you execute the deferred map?

Depending on function and the outputs argument, returns either a DeferredColumn or a DataFrame. Both are callables. To execute the deferred map, simply call the returned object.
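The idea of a callable placeholder can be sketched in a few lines. `DeferredSketch` is a hypothetical stand-in, not Meerkat's DeferredColumn; it only shows that nothing is computed until the object is called. The reference year is fixed at 2023 for reproducibility.

```python
class DeferredSketch:
    """A placeholder that records a function and its inputs, and runs on call."""

    def __init__(self, function, values):
        self.function = function
        self.values = list(values)
        self.num_calls = 0  # counts how many times function has run

    def __call__(self):
        results = []
        for value in self.values:
            self.num_calls += 1
            results.append(self.function(value))
        return results


birth_years = [1967, 1993, 2010]
deferred_ages = DeferredSketch(lambda year: 2023 - year, birth_years)
calls_before = deferred_ages.num_calls  # still zero: nothing has been computed
ages = deferred_ages()                  # calling the placeholder runs the map
```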

Note

This function is also available as a method of DataFrame and Column under the name defer.

Parameters
  • data (Union[DataFrame, Column]) – The DataFrame or Column to which the function will be applied.

  • function (Callable) – The function that will be applied to the rows of data.

  • is_batched_fn (bool, optional) – Whether the function must be applied on a batch of rows. Defaults to False.

  • batch_size (int, optional) – The size of the batch. Defaults to 1.

  • inputs (Dict[str, str], optional) – Dictionary mapping column names in data to keyword arguments of function. Ignored if data is a column. When calling function, values from the columns will be fed to the corresponding keyword arguments. Defaults to None, in which case the signature of the function is inspected: columns with the same names as the function's arguments are found in the DataFrame and their values are passed to the function. If the function takes a non-default argument that is not a column in the DataFrame, the operation will raise a ValueError.

  • outputs (Union[Dict[any, str], Tuple[str]], optional) –

    Controls how the output of function is mapped to the output of defer(). Defaults to None.

    • If None: the output is inferred from the return type of the function. See explanation above.

    • If "single": a single DeferredColumn is returned.

    • If a Dict[any, str]: then a DataFrame containing DeferredColumns is returned. This is useful when the output of function is a Dict. outputs maps the outputs of function to column names in the resulting DataFrame.

    • If a Tuple[str]: then a DataFrame containing DeferredColumns is returned. This is useful when the output of function is a Tuple. outputs maps the outputs of function to column names in the resulting DataFrame.

  • output_type (Union[Dict[str, type], type], optional) – Coerce the column. Defaults to None.

Returns

A DeferredColumn or a DataFrame containing DeferredColumns representing the deferred map.

Return type

Union[DataFrame, DeferredColumn]

Examples

We start with a small DataFrame of voters with two columns: birth_year, which contains the birth year of each person, and residence, which contains the state in which each person lives.

In [1]: import datetime

In [2]: import meerkat as mk

In [3]: df = mk.DataFrame({
   ...:     "birth_year": [1967, 1993, 2010, 1985, 2007, 1990, 1943],
   ...:     "residence": ["MA", "LA", "NY", "NY", "MA", "MA", "LA"]
   ...: })
   ...: 

Single input column. Lazily map a column of birth years to a column of ages.

In [4]: df["age"] = df["birth_year"].defer(
   ...:     lambda x: datetime.datetime.now().year - x
   ...: )
   ...: 

In [5]: df["age"]
Out[5]: DeferredColumn(DeferredOp(ar...rn_index=None))

We can materialize the deferred map (i.e. run it) by calling the column.

In [6]: df["age"]()
Out[6]: 
PandasScalarColumn(0    56
1    ..., dtype: int64)

Multiple input columns. Lazily map the age and residence columns to a boolean column, ma_eligible, that is True for MA residents aged 18 or older.

In [7]: df["ma_eligible"] = df.defer(
   ...:     lambda age, residence: (residence == "MA") and (age >= 18)
   ...: )
   ...: 

In [8]: df["ma_eligible"]()
Out[8]: 
PandasScalarColumn(0     True
1 ...l, dtype: bool)
concat(objs: Union[Sequence[meerkat.dataframe.DataFrame], Sequence[meerkat.columns.abstract.Column]], axis: Union[str, int] = 'rows', suffixes: Tuple[str] = None, overwrite: bool = False) Union[meerkat.dataframe.DataFrame, meerkat.columns.abstract.Column][source]

Concatenate a sequence of columns or a sequence of DataFrames. If the sequence is empty, returns an empty DataFrame.

  • If concatenating columns, all columns must be of the same type (e.g. all ListColumn).

  • If concatenating DataFrames along axis 0 (rows), all DataFrames must have the same set of columns.

  • If concatenating DataFrames along axis 1 (columns), all DataFrames must have the same length and cannot have any of the same column names.
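The row and column rules for concatenating DataFrames can be sketched as explicit checks. This is an illustration under stated assumptions, not the library's implementation: dicts of lists stand in for DataFrames, `concat_frames` is a hypothetical helper, and the suffixes and overwrite parameters are not modeled.

```python
def concat_frames(frames, axis="rows"):
    """Concatenate dict-of-list frames following the documented rules."""
    if not frames:
        return {}  # empty input gives an empty frame
    if axis in ("rows", 0):
        expected = set(frames[0])
        for frame in frames[1:]:
            if set(frame) != expected:
                raise ValueError("All frames must have the same set of columns.")
        return {
            name: [value for frame in frames for value in frame[name]]
            for name in frames[0]
        }
    if axis in ("columns", 1):
        lengths = {len(column) for frame in frames for column in frame.values()}
        if len(lengths) > 1:
            raise ValueError("All frames must have the same length.")
        merged = {}
        for frame in frames:
            for name, column in frame.items():
                if name in merged:
                    raise ValueError(f"Duplicate column name: {name}")
                merged[name] = list(column)
        return merged
    raise ValueError(f"Unknown axis: {axis}")


by_rows = concat_frames([{"x": [1, 2], "y": ["a", "b"]}, {"x": [3], "y": ["c"]}])
by_columns = concat_frames([{"x": [1, 2]}, {"z": [True, False]}], axis="columns")
```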

Parameters
  • objs (Union[Sequence[DataFrame], Sequence[AbstractColumn]]) – sequence of columns or DataFrames.

  • axis (Union[str, int]) – The axis along which to concatenate. Ignored if concatenating columns.

Returns

concatenated DataFrame or column

Return type

Union[DataFrame, AbstractColumn]

merge(left: meerkat.dataframe.DataFrame, right: meerkat.dataframe.DataFrame, how: str = 'inner', on: Union[str, List[str]] = None, left_on: Union[str, List[str]] = None, right_on: Union[str, List[str]] = None, sort: bool = False, suffixes: Sequence[str] = ('_x', '_y'), validate=None) meerkat.dataframe.DataFrame[source]

Perform a database-style join operation between two DataFrames.

Parameters
  • left (DataFrame) – Left DataFrame.

  • right (DataFrame) – Right DataFrame.

  • how (str, optional) – The join type. Defaults to “inner”.

  • on (Union[str, List[str]], optional) – The column(s) to join on. These columns must be ScalarColumn. Defaults to None, in which case the left_on and right_on parameters must be passed.

  • left_on (Union[str, List[str]], optional) – The column(s) in the left DataFrame to join on. These columns must be ScalarColumn. Defaults to None.

  • right_on (Union[str, List[str]], optional) – The column(s) in the right DataFrame to join on. These columns must be ScalarColumn. Defaults to None.

  • sort (bool, optional) – Whether to sort the result DataFrame by the join key(s). Defaults to False.

  • suffixes (Sequence[str], optional) – Suffixes to use in case there are conflicting column names in the result DataFrame. Should be a sequence of length two, with suffixes[0] the suffix for the column from the left DataFrame and suffixes[1] the suffix for the right. Defaults to (“_x”, “_y”).

  • validate (str, optional) –

    The check to perform on the result DataFrame. Defaults to None, in which case no check is performed. Valid options are:

    • “one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.

    • “one_to_many” or “1:m”: check if merge keys are unique in left dataset.

    • “many_to_one” or “m:1”: check if merge keys are unique in right dataset.

    • “many_to_many” or “m:m”: allowed, but does not result in checks.
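The validate options amount to uniqueness checks on the merge keys. The sketch below illustrates those checks on plain lists of keys; `check_merge_keys` is a hypothetical helper, and raising ValueError is an assumption, since the exception actually raised by the library is not stated here.

```python
def check_merge_keys(left_keys, right_keys, validate=None):
    """Raise ValueError if the keys violate the requested relationship."""
    if validate is None or validate in ("many_to_many", "m:m"):
        return  # no check is performed
    left_unique = len(set(left_keys)) == len(left_keys)
    right_unique = len(set(right_keys)) == len(right_keys)
    if validate in ("one_to_one", "1:1"):
        ok = left_unique and right_unique
    elif validate in ("one_to_many", "1:m"):
        ok = left_unique
    elif validate in ("many_to_one", "m:1"):
        ok = right_unique
    else:
        raise ValueError(f"Unknown validate option: {validate}")
    if not ok:
        raise ValueError(f"Merge keys do not satisfy validate={validate!r}.")


def passes(left_keys, right_keys, validate):
    """Convenience wrapper returning True when the check succeeds."""
    try:
        check_merge_keys(left_keys, right_keys, validate)
        return True
    except ValueError:
        return False


left = [1, 2, 3]     # unique keys
right = [1, 1, 2]    # key 1 is repeated
```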

Returns

The merged DataFrame.

Return type

DataFrame

embed(data: Union[meerkat.dataframe.DataFrame, meerkat.columns.abstract.Column, str, PIL.Image.Image], input: Optional[str] = None, encoder: Union[str, meerkat.ops.embed.encoder.Encoder] = 'clip', modality: Optional[str] = None, out_col: Optional[str] = None, device: Union[int, str] = 'auto', mmap_dir: Optional[str] = None, num_workers: int = 0, batch_size: int = 128, pbar: bool = True, **kwargs) Union[meerkat.dataframe.DataFrame, meerkat.columns.abstract.Column][source]

Embed a column of data with an encoder from the encoder registry.

Examples

Suppose you have an Image dataset (e.g. Imagenette, CIFAR-10) loaded into a Meerkat DataFrame. You can embed the images in the dataset with CLIP using a code snippet like:

import meerkat as mk

df = mk.datasets.get("imagenette")

df = mk.embed(
    data=df,
    input="img",
    encoder="clip"
)
Parameters
  • data (Union[mk.DataFrame, mk.AbstractColumn]) – A dataframe or column containing the data to embed.

  • input (str, optional) – If data is a dataframe, the name of the column to embed. If data is a column, then the parameter is ignored. Defaults to None.

  • encoder (Union[str, Encoder], optional) – Name of the encoder to use. List supported encoders with domino.encoders. Defaults to “clip”. Alternatively, pass an Encoder object containing a custom encoder.

  • modality (str, optional) – The modality of the data to be embedded. Defaults to None, in which case the modality is inferred from the type of the input column.

  • out_col (str, optional) – The name of the column where the embeddings are stored. Defaults to None, in which case it is "{encoder}({input})".

  • device (Union[int, str], optional) – The device on which to run the encoder. Defaults to “auto”.

  • mmap_dir (str, optional) – The path to directory where a memory-mapped file containing the embeddings will be written. Defaults to None, in which case the embeddings are not memmapped.

  • num_workers (int, optional) – Number of worker processes used to load the data from disk. Defaults to 0.

  • batch_size (int, optional) – Size of the batches to use. Defaults to 128.

  • **kwargs – Additional keyword arguments are passed to the encoder. To see supported arguments for each encoder, see the encoder documentation (e.g. clip()).

Returns

A view of data with a new column containing the embeddings. This column will be named according to the out_col parameter.

Return type

mk.DataFrame

sort(data: meerkat.dataframe.DataFrame, by: Union[str, List[str]], ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.dataframe.DataFrame[source]

Sort a DataFrame or Column. If a DataFrame, sort by the values in the specified columns. Similar to sort_values in pandas.

Parameters
  • data (Union[DataFrame, AbstractColumn]) – DataFrame or Column to sort.

  • by (Union[str, List[str]]) – The columns to sort by. Ignored if data is a Column.

  • ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.

  • kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.

Returns

A sorted view of DataFrame.

Return type

DataFrame
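Sorting by several columns with a per-column ascending flag can be sketched with repeated stable sorts, applied from the last key to the first. This is an illustration of the semantics only, not Meerkat's implementation: a list of dicts stands in for the rows of a DataFrame, and `sort_rows` is a hypothetical helper.

```python
def sort_rows(rows, by, ascending=True):
    """Sort rows by one or more keys, each with its own direction."""
    keys = [by] if isinstance(by, str) else list(by)
    flags = [ascending] * len(keys) if isinstance(ascending, bool) else list(ascending)
    if len(flags) != len(keys):
        raise ValueError("ascending must be the same length as by.")
    result = list(rows)
    # list.sort is stable, so sorting by the least significant key first
    # and the most significant key last yields a multi-key ordering.
    for column, is_ascending in reversed(list(zip(keys, flags))):
        result.sort(key=lambda row: row[column], reverse=not is_ascending)
    return result


rows = [
    {"state": "MA", "age": 30},
    {"state": "LA", "age": 50},
    {"state": "MA", "age": 56},
    {"state": "LA", "age": 20},
]
by_state_then_age = sort_rows(rows, by=["state", "age"], ascending=[True, False])
by_age = sort_rows(rows, by="age")
```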

sample(data: Union[meerkat.dataframe.DataFrame, meerkat.columns.abstract.Column], n: Optional[int] = None, frac: Optional[float] = None, replace: bool = False, weights: Optional[Union[str, numpy.ndarray]] = None, random_state: Optional[Union[int, numpy.random.mtrand.RandomState]] = None) Union[meerkat.dataframe.DataFrame, meerkat.columns.abstract.Column][source]

Select a random sample of rows from DataFrame or Column. Roughly equivalent to sample in Pandas https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html.

Parameters
  • data (Union[DataFrame, AbstractColumn]) – DataFrame or Column to sample from.

  • n (int) – Number of samples to draw. If frac is specified, this parameter should not be passed. Defaults to 1 if frac is not passed.

  • frac (float) – Fraction of rows to sample. If n is specified, this parameter should not be passed.

  • replace (bool) – Sample with or without replacement. Defaults to False.

  • weights (Union[str, np.ndarray]) – Weights to use for sampling. If None (default), the rows will be sampled uniformly. If a numpy array, the sample will be weighted accordingly. If a string and data is a DataFrame, the weights will be applied to the rows based on the column with the name specified. If weights do not sum to 1 they will be normalized to sum to 1.

  • random_state (Union[int, np.random.RandomState]) – Random state or seed to use for sampling.

Returns

A random sample of rows from DataFrame or Column.

Return type

Union[DataFrame, AbstractColumn]
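The interaction between n, frac, replace, and weights can be sketched with the standard random module. This is a simplified illustration, not the library's sampler: `sample_indices` is a hypothetical helper that returns row indices, and rounding frac * nrows to the nearest integer is an assumption.

```python
import random


def sample_indices(nrows, n=None, frac=None, replace=False, weights=None, seed=None):
    """Draw row indices following the documented n/frac/weights rules."""
    if n is not None and frac is not None:
        raise ValueError("Pass either n or frac, not both.")
    if n is None:
        n = 1 if frac is None else round(frac * nrows)
    if weights is None:
        weights = [1.0] * nrows  # uniform sampling
    total = sum(weights)
    probs = [weight / total for weight in weights]  # normalize to sum to 1
    rng = random.Random(seed)
    if replace:
        return rng.choices(list(range(nrows)), weights=probs, k=n)
    if n > nrows:
        raise ValueError("Cannot draw more rows than exist without replacement.")
    remaining = list(range(nrows))
    chosen = []
    for _ in range(n):
        # Draw one index at a time and remove it so it cannot repeat.
        pick = rng.choices(remaining, weights=[probs[i] for i in remaining], k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


half = sample_indices(8, frac=0.5, seed=0)
weighted = sample_indices(4, n=2, weights=[0, 0, 1, 1], seed=1)
default_draw = sample_indices(5, seed=3)
```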

groupby(data: meerkat.dataframe.DataFrame, by: Optional[Union[str, Sequence[str]]] = None) meerkat.ops.sliceby.groupby.GroupBy[source]

Perform a groupby operation on a DataFrame or Column (similar to the DataFrame.groupby and Series.groupby operations in Pandas).

Parameters
  • data (Union[DataFrame, AbstractColumn]) – The data to group.

  • by (Union[str, Sequence[str]]) – The column(s) to group by. Ignored if data is a Column.

Returns

A GroupBy object.

Return type

Union[DataFrameGroupBy, AbstractColumnGroupBy]
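Conceptually, a groupby partitions row indices by the values of the by column(s), after which each group can be aggregated. The sketch below illustrates that idea with a dict of lists standing in for a DataFrame; `group_indices` is a hypothetical helper and does not reflect the GroupBy object's actual interface.

```python
from collections import defaultdict


def group_indices(columns, by):
    """Map each distinct key in the by column(s) to its row indices."""
    keys = [by] if isinstance(by, str) else list(by)
    nrows = len(columns[keys[0]])
    groups = defaultdict(list)
    for i in range(nrows):
        group_key = tuple(columns[key][i] for key in keys)
        # Use the bare value for a single key, a tuple for several keys.
        groups[group_key if len(keys) > 1 else group_key[0]].append(i)
    return dict(groups)


columns = {
    "birth_year": [1967, 1993, 2010, 1985, 2007, 1990, 1943],
    "residence": ["MA", "LA", "NY", "NY", "MA", "MA", "LA"],
}
groups = group_indices(columns, by="residence")
mean_birth_year = {
    state: sum(columns["birth_year"][i] for i in rows) / len(rows)
    for state, rows in groups.items()
}
```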

clusterby(data: meerkat.dataframe.DataFrame, by: Union[str, Sequence[str]], method: Union[str, sklearn.base.ClusterMixin] = 'KMeans', encoder: str = 'clip', modality: Optional[str] = None, **kwargs) meerkat.ops.sliceby.clusterby.ClusterBy[source]

Perform a clusterby operation on a DataFrame.

Parameters
  • data (DataFrame) – The dataframe to cluster.

  • by (Union[str, Sequence[str]]) – The column(s) to cluster by. These columns will be embedded using the encoder and the resulting embedding will be used.

  • method (Union[str, ClusterMixin]) – The clustering method to use.

  • encoder (str) – The encoder to use for the embedding. Defaults to clip.

  • modality (Union[str, Sequence[str])) – The modality to of the

  • **kwargs – Additional keyword arguments to pass to the clustering method.

Returns

A ClusterBy object.

Return type

ClusterBy

explainby(data: DataFrame, by: Union[str, Sequence[str]], target: Union[str, Mapping[str]], method: Union[str, 'domino.Slicer'] = 'MixtureSlicer', encoder: str = 'clip', modality: str = None, scores: bool = False, use_cache: bool = True, **kwargs) ExplainBy[source]

Perform an explainby operation on a DataFrame.

Parameters
  • data (DataFrame) – The dataframe to cluster.

  • by (Union[str, Sequence[str]]) – The column(s) to cluster by. These columns will be embedded using the encoder and the resulting embedding will be used.

  • method (Union[str, domino.Slicer]) – The clustering method to use.

  • encoder (str) – The encoder to use for the embedding. Defaults to clip.

  • modality (Union[str, Sequence[str])) – The modality to of the

  • **kwargs – Additional keyword arguments to pass to the clustering method.

Returns

An ExplainBy object.

Return type

ExplainBy

cand(*args)[source]

Overloaded and operator.

Use this when you want to use the and operator on reactive values (e.g. Store).

Parameters

*args – The arguments to and together.

Returns

The result of the and operation.

cor(*args)[source]

Overloaded or operator.

Use this when you want to use the or operator on reactive values (e.g. Store).

Parameters

*args – The arguments to or together.

Returns

The result of the or operation.

cnot(x)[source]

Overloaded not operator.

Use this when you want to use the not operator on reactive values (e.g. Store).

Parameters

x – The argument to negate.

Returns

The result of the not operation.
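Python's and, or, and not keywords cannot be overloaded by a class (only bool() of the operand is consulted), which is why function forms are useful for wrapped values. The sketch below shows plain-Python equivalents of the three operations on ordinary values. The `*_sketch` names are hypothetical, and the reactive tracking that the real functions are meant for is not modeled.

```python
from functools import reduce


def cand_sketch(*args):
    """Chain Python's `and` across all arguments, left to right."""
    return reduce(lambda x, y: x and y, args)


def cor_sketch(*args):
    """Chain Python's `or` across all arguments, left to right."""
    return reduce(lambda x, y: x or y, args)


def cnot_sketch(x):
    """Apply Python's `not` to a single argument."""
    return not x


all_true = cand_sketch(True, True, False)   # False
first_falsy = cand_sketch(1, "a", 0)        # 0, the first falsy operand
first_truthy = cor_sketch(0, "", "x")       # "x", the first truthy operand
negated = cnot_sketch(0)                    # True
```

As with the keywords themselves, the chained and/or forms return one of their operands rather than a strict bool.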

from_csv(filepath: str, primary_key: str = None, *args, **kwargs) meerkat.dataframe.DataFrame

Create a DataFrame from a csv file. All of the columns will be meerkat.ScalarColumn with backend Pandas.

Parameters
  • filepath (str) – The file path or buffer to load from. Same as pandas.read_csv().

  • *args – Argument list for pandas.read_csv().

  • **kwargs – Keyword arguments forwarded to pandas.read_csv().

Returns

The constructed dataframe.

Return type

DataFrame

from_json(filepath: str, primary_key: str = None, orient: str = 'records', lines: bool = False, **kwargs) meerkat.dataframe.DataFrame

Load a DataFrame from a json file.

By default, data in the JSON file should be a list of dictionaries, each with an entry for each column. This is the orient="records" format. If the data is in a different format in the JSON, you can specify the orient parameter. See pandas.read_json() for more details.

Parameters
  • filepath (str) – The file path or buffer to load from. Same as pandas.read_json().

  • orient (str) – The expected JSON string format. Options are: “split”, “records”, “index”, “columns”, “values”. Same as pandas.read_json().

  • lines (bool) – Whether the json file is a jsonl file. Same as pandas.read_json().

  • **kwargs – Keyword arguments forwarded to pandas.read_json().

Returns

The constructed dataframe.

Return type

DataFrame
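The orient="records" layout and the lines option can be illustrated with the standard json module. This sketch only shows the two file layouts and how they map to columns; it does not call pandas or Meerkat, `records_to_columns` is a hypothetical helper, and it assumes every record has an entry for each column.

```python
import json


def records_to_columns(records):
    """Turn a list of row dictionaries into a dictionary of columns."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


# orient="records": the file holds one JSON list of dictionaries.
records_text = '[{"name": "a", "age": 1}, {"name": "b", "age": 2}]'

# lines=True: the file holds one JSON dictionary per line (jsonl).
jsonl_text = '{"name": "a", "age": 1}\n{"name": "b", "age": 2}\n'

from_records = records_to_columns(json.loads(records_text))
from_lines = records_to_columns(
    [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
)
```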

from_parquet(filepath: str, primary_key: str = None, engine: str = 'auto', columns: Optional[Sequence[str]] = None, **kwargs) meerkat.dataframe.DataFrame

Create a DataFrame from a parquet file. All of the columns will be meerkat.ScalarColumn with backend Pandas.

Parameters
  • filepath (str) – The file path or buffer to load from. Same as pandas.read_parquet().

  • engine (str) – The parquet engine to use. Same as pandas.read_parquet().

  • columns (Optional[Sequence[str]]) – The columns to load. Same as pandas.read_parquet().

  • **kwargs – Keyword arguments forwarded to pandas.read_parquet().

Returns

The constructed dataframe.

Return type

DataFrame

from_feather(filepath: str, primary_key: str = None, columns: Optional[Sequence[str]] = None, use_threads: bool = True, **kwargs) meerkat.dataframe.DataFrame

Create a DataFrame from a feather file. All of the columns will be meerkat.ScalarColumn with backend Pandas.

Parameters
  • filepath (str) – The file path or buffer to load from. Same as pandas.read_feather().

  • columns (Optional[Sequence[str]]) – The columns to load. Same as pandas.read_feather().

  • use_threads (bool) – Whether to use threads to read the file. Same as pandas.read_feather().

  • **kwargs – Keyword arguments forwarded to pandas.read_feather().

Returns

The constructed dataframe.

Return type

DataFrame

from_pandas(df: pandas.core.frame.DataFrame, index: bool = True, primary_key: Optional[str] = None) meerkat.dataframe.DataFrame

Create a Meerkat DataFrame from a Pandas DataFrame.

Warning

In Meerkat, column names must be strings, so non-string column names in the Pandas DataFrame will be converted.

Parameters
  • df – The Pandas DataFrame to convert.

  • index – Whether to include the index of the Pandas DataFrame as a column in the Meerkat DataFrame.

  • primary_key – The name of the column to use as the primary key. If index is True and primary_key is None, the index will be used as the primary key. If index is False, then no primary key will be set. Defaults to None.

Returns

The Meerkat DataFrame.

Return type

DataFrame

from_arrow(table: pyarrow.lib.Table)

Create a DataFrame from a pyarrow Table.

from_huggingface(*args, **kwargs)

Load a Huggingface dataset as a DataFrame.

Use this to replace datasets.load_dataset, so

>>> dict_of_datasets = datasets.load_dataset('boolq')

becomes

>>> dict_of_dataframes = DataFrame.from_huggingface('boolq')
read(path: str, overwrite: bool = False, *args, **kwargs) meerkat.dataframe.DataFrame

Load a DataFrame stored on disk.

get(name: str, dataset_dir: Optional[str] = None, version: Optional[str] = None, download_mode: str = 'reuse', registry: Optional[str] = None, **kwargs) Union[meerkat.dataframe.DataFrame, Dict[str, meerkat.dataframe.DataFrame]][source]

Load a dataset into a DataFrame.

Parameters
  • name (str) – Name of the dataset.

  • dataset_dir (str) – The directory containing dataset data. Defaults to ~/.meerkat/datasets/{name}.

  • version (str) – The version of the dataset. Defaults to latest.

  • download_mode (str) – The download mode. Options are: “reuse” (default) will download the dataset if it does not exist, “force” will download the dataset even if it exists, “extract” will reuse any downloaded archives but force extracting those archives, and “skip” will not download the dataset if it doesn’t yet exist. Defaults to reuse.

  • registry (str) – The registry to use. If None, then checks each supported registry in turn. Currently, supported registries include meerkat and huggingface.

  • **kwargs – Additional arguments passed to the dataset.

DataPanel

alias of meerkat.dataframe.DataFrame

scalar

alias of meerkat.columns.scalar.abstract.ScalarColumn

tensor

alias of meerkat.columns.tensor.abstract.TensorColumn

deferred

alias of meerkat.columns.deferred.base.DeferredColumn

objects

alias of meerkat.columns.object.base.ObjectColumn

files

alias of meerkat.columns.deferred.file.FileColumn

image

alias of meerkat.columns.deferred.image.ImageColumn

audio

alias of meerkat.columns.deferred.audio.AudioColumn