meerkat package
Subpackages
- meerkat.block package
- meerkat.cells package
- meerkat.columns package
- Submodules
- meerkat.columns.abstract module
- meerkat.columns.arrow_column module
- meerkat.columns.audio_column module
- meerkat.columns.cell_column module
- meerkat.columns.file_column module
- meerkat.columns.image_column module
- meerkat.columns.lambda_column module
- meerkat.columns.list_column module
- meerkat.columns.numpy_column module
- meerkat.columns.pandas_column module
- meerkat.columns.spacy_column module
- meerkat.columns.tensor_column module
- meerkat.columns.volume_column module
- Module contents
- meerkat.datasets package
- Subpackages
- meerkat.datasets.audioset package
- meerkat.datasets.celeba package
- meerkat.datasets.coco package
- meerkat.datasets.dew package
- meerkat.datasets.eeg package
- meerkat.datasets.enron package
- meerkat.datasets.gqa package
- meerkat.datasets.imagenet package
- meerkat.datasets.imagenette package
- meerkat.datasets.inaturalist package
- meerkat.datasets.mimic package
- meerkat.datasets.mir package
- meerkat.datasets.pascal package
- meerkat.datasets.siim_cxr package
- meerkat.datasets.torchaudio package
- meerkat.datasets.torchvision package
- meerkat.datasets.video_corruptions package
- meerkat.datasets.visual_genome package
- meerkat.datasets.waterbirds package
- meerkat.datasets.wilds package
- Submodules
- meerkat.datasets.abstract module
- meerkat.datasets.fsdd module
- meerkat.datasets.info module
- meerkat.datasets.registry module
- meerkat.datasets.utils module
- Module contents
- Subpackages
- meerkat.logging package
- meerkat.mixins package
- Submodules
- meerkat.mixins.blockable module
- meerkat.mixins.cloneable module
- meerkat.mixins.collate module
- meerkat.mixins.file module
- meerkat.mixins.inspect_fn module
- meerkat.mixins.io module
- meerkat.mixins.lambdable module
- meerkat.mixins.mapping module
- meerkat.mixins.materialize module
- Module contents
- meerkat.ml package
- Submodules
- meerkat.ml.activation module
- meerkat.ml.callbacks module
- meerkat.ml.embedding_column module
- meerkat.ml.huggingfacemodel module
- meerkat.ml.instances_column module
- meerkat.ml.metrics module
- meerkat.ml.model module
- meerkat.ml.prediction_column module
- meerkat.ml.segmentation_column module
- meerkat.ml.tensormodel module
- Module contents
- meerkat.ops package
- meerkat.pipelines package
- meerkat.tools package
- meerkat.writers package
Submodules
meerkat.config module
- class DatasetsConfig(root_dir: 'str' = '/home/docs/.meerkat/datasets')[source]
Bases: object
- root_dir: str = '/home/docs/.meerkat/datasets'
- class DisplayConfig(max_rows: 'int' = 10, show_images: 'bool' = True, max_image_height: 'int' = 128, max_image_width: 'int' = 128, show_audio: 'bool' = True)[source]
Bases: object
- max_image_height: int = 128
- max_image_width: int = 128
- max_rows: int = 10
- show_audio: bool = True
- show_images: bool = True
- class MeerkatConfig(display: 'DisplayConfig', datasets: 'DatasetsConfig')[source]
Bases: object
- datasets: meerkat.config.DatasetsConfig
- display: meerkat.config.DisplayConfig
meerkat.datapanel module
DataPanel class.
- class DataPanel(data: Optional[Union[dict, list]] = None, *args, **kwargs)[source]
Bases: meerkat.mixins.cloneable.CloneableMixin, meerkat.mixins.inspect_fn.FunctionInspectorMixin, meerkat.mixins.lambdable.LambdaMixin, meerkat.mixins.mapping.MappableMixin, meerkat.mixins.materialize.MaterializationMixin, meerkat.provenance.ProvenanceMixin
Meerkat DataPanel class.
- add_column(name: str, data: Union[Sequence, numpy.ndarray, pandas.core.series.Series, torch.Tensor], overwrite=False) None[source]
Add a column to the DataPanel.
- append(dp: meerkat.datapanel.DataPanel, axis: Union[str, int] = 'rows', suffixes: Tuple[str] = None, overwrite: bool = False) meerkat.datapanel.DataPanel[source]
Append a batch of data to the dataset.
example_or_batch must have the same columns as the dataset (regardless of what columns are visible).
- batch(batch_size: int = 1, drop_last_batch: bool = False, num_workers: int = 0, materialize: bool = True, shuffle: bool = False, *args, **kwargs)[source]
Batch the dataset.
- Parameters
batch_size – integer batch size
drop_last_batch – drop the last batch if it's smaller than batch_size
- Returns
batches of data
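The effect of batch_size and drop_last_batch can be illustrated with a minimal, stdlib-only sketch. This is not Meerkat's implementation; the helper name iter_batches is hypothetical and only mirrors the behavior described above for a plain list of rows.

```python
from typing import Iterator, List, Sequence


def iter_batches(rows: Sequence, batch_size: int = 1,
                 drop_last_batch: bool = False) -> Iterator[List]:
    """Yield consecutive chunks of `rows` with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(rows), batch_size):
        chunk = list(rows[start:start + batch_size])
        # Only the final chunk can be short; skip it when requested.
        if drop_last_batch and len(chunk) < batch_size:
            break
        yield chunk


rows = list(range(7))
kept = list(iter_batches(rows, batch_size=3))
dropped = list(iter_batches(rows, batch_size=3, drop_last_batch=True))
print(kept)
print(dropped)
```

With seven rows and a batch size of three, the default keeps the trailing single-row batch, while drop_last_batch=True discards it.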
- filter(function: Optional[Callable] = None, with_indices=False, input_columns: Optional[Union[str, List[str]]] = None, is_batched_fn: bool = False, batch_size: Optional[int] = 1, drop_last_batch: bool = False, num_workers: int = 0, materialize: bool = True, pbar: bool = False, **kwargs) Optional[meerkat.datapanel.DataPanel][source]
Filter operation on the DataPanel.
- classmethod from_batch(batch: Dict[str, Union[List, meerkat.columns.abstract.AbstractColumn]]) meerkat.datapanel.DataPanel[source]
Convert a batch to a Dataset.
- classmethod from_batches(batches: Sequence[Dict[str, Union[List, meerkat.columns.abstract.AbstractColumn]]]) meerkat.datapanel.DataPanel[source]
Convert a list of batches to a dataset.
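As a rough illustration of what combining batches means, the sketch below concatenates column-wise batches represented as plain dictionaries of lists. It assumes every batch has the same keys; the helper concat_batches is hypothetical and is not part of Meerkat.

```python
from typing import Dict, List, Sequence


def concat_batches(batches: Sequence[Dict[str, List]]) -> Dict[str, List]:
    """Concatenate batches column by column, preserving batch order."""
    if not batches:
        return {}
    names = list(batches[0])
    merged: Dict[str, List] = {name: [] for name in names}
    for batch in batches:
        # Assumption for this sketch: all batches share the same columns.
        if set(batch) != set(names):
            raise ValueError("every batch must have the same columns")
        for name in names:
            merged[name].extend(batch[name])
    return merged


batches = [
    {"id": [1, 2], "label": ["a", "b"]},
    {"id": [3], "label": ["c"]},
]
merged = concat_batches(batches)
print(merged)
```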
- classmethod from_csv(filepath: str, *args, **kwargs)[source]
Create a Dataset from a csv file.
- Parameters
filepath (str) – The file path or buffer to load from. Same as pandas.read_csv().
*args – Argument list for pandas.read_csv().
**kwargs – Keyword arguments for pandas.read_csv().
- Returns
The constructed datapanel.
- Return type
DataPanel
- classmethod from_dict(d: Dict) meerkat.datapanel.DataPanel[source]
Convert a dictionary to a dataset.
Alias for DataPanel.from_batch(..).
- classmethod from_huggingface(*args, **kwargs)[source]
Load a Huggingface dataset as a DataPanel.
Use this to replace datasets.load_dataset, so
>>> dict_of_datasets = datasets.load_dataset('boolq')
becomes
>>> dict_of_datapanels = DataPanel.from_huggingface('boolq')
- classmethod from_jsonl(json_path: str) meerkat.datapanel.DataPanel[source]
Load a dataset from a .jsonl file on disk, where each line of the json file consists of a single example.
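The one-example-per-line layout can be sketched with the standard library alone. The helper below is hypothetical (it is not how Meerkat parses files) and assumes every example carries the same keys; it turns the lines into a dictionary of columns.

```python
import io
import json
from typing import Dict, List, TextIO


def read_jsonl_columns(handle: TextIO) -> Dict[str, List]:
    """Parse one JSON object per line and collect the values per key."""
    columns: Dict[str, List] = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between examples
        example = json.loads(line)
        for key, value in example.items():
            columns.setdefault(key, []).append(value)
    return columns


# An in-memory stand-in for a .jsonl file on disk.
jsonl_text = (
    '{"question": "is it?", "answer": true}\n'
    '{"question": "was it?", "answer": false}\n'
)
columns = read_jsonl_columns(io.StringIO(jsonl_text))
print(columns)
```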
- classmethod from_pandas(df: pandas.core.frame.DataFrame)[source]
Create a Dataset from a pandas DataFrame.
- head(n: int = 5) meerkat.datapanel.DataPanel[source]
Get the first n examples of the DataPanel.
- map(function: Optional[Callable] = None, with_indices: bool = False, input_columns: Optional[Union[str, List[str]]] = None, is_batched_fn: bool = False, batch_size: Optional[int] = 1, drop_last_batch: bool = False, num_workers: int = 0, output_type: Union[type, Dict[str, type]] = None, mmap: bool = False, mmap_path: str = None, materialize: bool = True, pbar: bool = False, **kwargs) Optional[Union[Dict, List, meerkat.columns.abstract.AbstractColumn]][source]
- mean(*args, **kwargs) meerkat.datapanel.DataPanel[source]
- merge(right: meerkat.datapanel.DataPanel, how: str = 'inner', on: Optional[Union[str, List[str]]] = None, left_on: Optional[Union[str, List[str]]] = None, right_on: Optional[Union[str, List[str]]] = None, sort: bool = False, suffixes: Sequence[str] = ('_x', '_y'), validate=None)[source]
- classmethod read(path: str, *args, **kwargs) meerkat.datapanel.DataPanel[source]
Load a DataPanel stored on disk.
- sample(n: Optional[int] = None, frac: Optional[float] = None, replace: bool = False, weights: Optional[Union[str, numpy.ndarray]] = None, random_state: Optional[Union[int, numpy.random.mtrand.RandomState]] = None) meerkat.datapanel.DataPanel[source]
Select a random sample of rows from the DataPanel. Roughly equivalent to sample in pandas: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html.
- Parameters
n (int) – Number of samples to draw. If frac is specified, this parameter should not be passed. Defaults to 1 if frac is not passed.
frac (float) – Fraction of rows to sample. If n is specified, this parameter should not be passed.
replace (bool) – Sample with or without replacement. Defaults to False.
weights (Union[str, np.ndarray]) – Weights to use for sampling. If None (default), the rows will be sampled uniformly. If a numpy array, the sample will be weighted accordingly. If a string, the weights will be applied to the rows based on the column with the name specified. If weights do not sum to 1 they will be normalized to sum to 1.
random_state (Union[int, np.random.RandomState]) – Random state or seed to use for sampling.
- Returns
A random sample of rows from the DataPanel.
- Return type
DataPanel
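The weight normalization and the replace flag described above can be sketched in plain Python. This is an illustration under stated assumptions, not Meerkat's code: the helpers normalize_weights and sample_indices are hypothetical, weights are given as a list rather than a column name, and Python's random module stands in for NumPy's random state.

```python
import random
from typing import List, Optional, Sequence


def normalize_weights(weights: Sequence[float]) -> List[float]:
    """Rescale non-negative weights so that they sum to 1."""
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [w / total for w in weights]


def sample_indices(nrows: int, n: int = 1,
                   weights: Optional[Sequence[float]] = None,
                   replace: bool = False,
                   random_state: Optional[int] = None) -> List[int]:
    """Draw n row indices, optionally weighted, with or without replacement."""
    if nrows < 1:
        raise ValueError("nrows must be positive")
    rng = random.Random(random_state)
    if weights is None:
        probs = [1.0 / nrows] * nrows  # uniform sampling by default
    else:
        probs = normalize_weights(weights)
    if replace:
        return rng.choices(range(nrows), weights=probs, k=n)
    if n > nrows:
        raise ValueError("cannot draw more rows than exist without replacement")
    remaining = list(range(nrows))
    remaining_probs = list(probs)
    chosen: List[int] = []
    for _ in range(n):
        # Pick one position, then remove it so it cannot be drawn again.
        pos = rng.choices(range(len(remaining)), weights=remaining_probs, k=1)[0]
        chosen.append(remaining.pop(pos))
        remaining_probs.pop(pos)
    return chosen


print(normalize_weights([1, 1, 2]))
print(sample_indices(5, n=3, random_state=0))
```

Passing the same random_state twice yields the same indices, which is the usual reason to set a seed.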
- sort(by: Union[str, List[str]], ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.datapanel.DataPanel[source]
Sort the DataPanel by the values in the specified columns. Similar to sort_values in pandas.
- Parameters
by (Union[str, List[str]]) – The columns to sort by.
ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.
kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.
- Returns
A sorted view of DataPanel.
- Return type
DataPanel
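How by and a per-column ascending list interact can be shown with a small stdlib sketch. It is not Meerkat's implementation: the helper sort_order is hypothetical, operates on a plain dictionary of lists, and returns the row order rather than a sorted view. It relies on the fact that Python's list sort is stable, even with reverse=True.

```python
from typing import Dict, List, Sequence, Union


def sort_order(columns: Dict[str, Sequence],
               by: Union[str, List[str]],
               ascending: Union[bool, List[bool]] = True) -> List[int]:
    """Return the row indices that sort `columns` by the keys in `by`."""
    keys = [by] if isinstance(by, str) else list(by)
    if isinstance(ascending, bool):
        flags = [ascending] * len(keys)
    else:
        flags = list(ascending)
    if len(flags) != len(keys):
        raise ValueError("ascending must have the same length as by")
    order = list(range(len(columns[keys[0]])))
    # Sort by the least significant key first; stability preserves the
    # ordering established by earlier passes among tied rows.
    for key, asc in reversed(list(zip(keys, flags))):
        order.sort(key=lambda i: columns[key][i], reverse=not asc)
    return order


columns = {"group": ["b", "a", "b", "a"], "score": [1, 2, 3, 4]}
order = sort_order(columns, by=["group", "score"], ascending=[True, False])
print(order)
```

Here rows are grouped alphabetically by group, and within each group the higher score comes first.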
- tail(n: int = 5) meerkat.datapanel.DataPanel[source]
Get the last n examples of the DataPanel.
- update(function: Optional[Callable] = None, with_indices: bool = False, input_columns: Optional[Union[str, List[str]]] = None, is_batched_fn: bool = False, batch_size: Optional[int] = 1, remove_columns: Optional[List[str]] = None, num_workers: int = 0, output_type: Union[type, Dict[str, type]] = None, mmap: bool = False, mmap_path: str = None, materialize: bool = True, pbar: bool = False, **kwargs) meerkat.datapanel.DataPanel[source]
Update the columns of the dataset.
- property columns
Column names in the DataPanel.
- property data: meerkat.block.manager.BlockManager
Get the underlying data (excluding invisible rows).
To access underlying data with invisible rows, use _data.
- logdir: pathlib.Path = PosixPath('/home/docs/meerkat')
- property ncols
Number of columns in the DataPanel.
- property nrows
Number of rows in the DataPanel.
- property shape
Shape of the DataPanel (num_rows, num_columns).
meerkat.display module
- lambda_cell_formatter(cell: LambdaCell)[source]
meerkat.errors module
meerkat.provenance module
- class ProvenanceNode[source]
Bases: object
- add_child(node: meerkat.provenance.ProvenanceNode, key: Tuple)[source]
- add_parent(node: meerkat.provenance.ProvenanceNode, key: Tuple)[source]
- property children
- property last_parent
- property parents
- class ProvenanceObjNode(obj: meerkat.provenance.ProvenanceMixin)[source]
- visualize_provenance(obj: Union[meerkat.provenance.ProvenanceObjNode, meerkat.provenance.ProvenanceOpNode], show_columns: bool = False, last_parent_only: bool = False)[source]
meerkat.version module
Module contents
Meerkat.
- class AbstractCell(*args, **kwargs)[source]
Bases: abc.ABC
- property metadata: dict
Get the metadata associated with this cell.
- class AbstractColumn(data: Optional[Sequence] = None, collate_fn: Optional[Callable] = None, formatter: Optional[Callable] = None, *args, **kwargs)[source]
Bases: meerkat.mixins.blockable.BlockableMixin, meerkat.mixins.cloneable.CloneableMixin, meerkat.mixins.collate.CollateMixin, meerkat.mixins.io.ColumnIOMixin, meerkat.mixins.inspect_fn.FunctionInspectorMixin, meerkat.mixins.lambdable.LambdaMixin, meerkat.mixins.mapping.MappableMixin, meerkat.mixins.materialize.MaterializationMixin, meerkat.provenance.ProvenanceMixin, abc.ABC
An abstract class for Meerkat columns.
- append(column: meerkat.columns.abstract.AbstractColumn) None[source]
- argsort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.abstract.AbstractColumn[source]
Return the indices that would sort the column.
- Parameters
ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.
kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.
- Returns
A view of the column with the sorted data.
- Return type
AbstractColumn
- batch(batch_size: int = 1, drop_last_batch: bool = False, collate: bool = True, num_workers: int = 0, materialize: bool = True, *args, **kwargs)[source]
Batch the column.
- Parameters
batch_size – integer batch size
drop_last_batch – drop the last batch if it's smaller than batch_size
collate – whether to collate the returned batches
- Returns
batches of data
- static concat(columns: Sequence[meerkat.columns.abstract.AbstractColumn]) None[source]
- filter(function: Callable, with_indices=False, input_columns: Optional[Union[str, List[str]]] = None, is_batched_fn: bool = False, batch_size: Optional[int] = 1, drop_last_batch: bool = False, num_workers: Optional[int] = 0, materialize: bool = True, pbar: bool = False, **kwargs) Optional[meerkat.columns.abstract.AbstractColumn][source]
Filter the elements of the column using a function.
- classmethod from_data(data: Union[Columnable, AbstractColumn])[source]
Convert data to a meerkat column using the appropriate Column type.
- classmethod get_writer(mmap: bool = False, template: Optional[meerkat.columns.abstract.AbstractColumn] = None)[source]
- head(n: int = 5) meerkat.columns.abstract.AbstractColumn[source]
Get the first n examples of the column.
- is_equal(other: meerkat.columns.abstract.AbstractColumn) bool[source]
Tests whether two columns are equal.
- Parameters
other (AbstractColumn) – The column to compare against.
- sample(n: Optional[int] = None, frac: Optional[float] = None, replace: bool = False, weights: Optional[Union[str, numpy.ndarray]] = None, random_state: Optional[Union[int, numpy.random.mtrand.RandomState]] = None) meerkat.columns.abstract.AbstractColumn[source]
Select a random sample of rows from the column. Roughly equivalent to sample in pandas: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html.
- Parameters
n (int) – Number of samples to draw. If frac is specified, this parameter should not be passed. Defaults to 1 if frac is not passed.
frac (float) – Fraction of rows to sample. If n is specified, this parameter should not be passed.
replace (bool) – Sample with or without replacement. Defaults to False.
weights (np.ndarray) – Weights to use for sampling. If None (default), the rows will be sampled uniformly. If a numpy array, the sample will be weighted accordingly. If weights do not sum to 1 they will be normalized to sum to 1.
random_state (Union[int, np.random.RandomState]) – Random state or seed to use for sampling.
- Returns
A random sample of rows from the column.
- Return type
AbstractColumn
- sort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.abstract.AbstractColumn[source]
Return a sorted view of the column.
- Parameters
ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.
kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.
- Returns
A view of the column with the sorted data.
- Return type
AbstractColumn
- tail(n: int = 5) meerkat.columns.abstract.AbstractColumn[source]
Get the last n examples of the column.
- Columnable
alias of
Union[Sequence,numpy.ndarray,pandas.core.series.Series,torch.Tensor]
- property data
Get the underlying data.
- property formatter: Callable
- property is_mmap
- logdir: pathlib.Path = PosixPath('/home/docs/meerkat')
- property metadata
- class ArrowArrayColumn(data: Sequence, *args, **kwargs)[source]
Bases: meerkat.columns.abstract.AbstractColumn
- block_class
alias of
meerkat.block.arrow_block.ArrowBlock
- classmethod concat(columns: Sequence[meerkat.columns.arrow_column.ArrowArrayColumn])[source]
- is_equal(other: meerkat.columns.abstract.AbstractColumn) bool[source]
Tests whether two columns are equal.
- Parameters
other (AbstractColumn) – The column to compare against.
- class AudioColumn(data: Optional[Sequence[str]] = None, transform: Optional[callable] = None, loader: Optional[callable] = None, base_dir: Optional[str] = None, *args, **kwargs)[source]
Bases: meerkat.columns.file_column.FileColumn
A lambda column where each cell represents an audio file on disk. The underlying data is a PandasSeriesColumn of strings, where each string is the path to an image. The column materializes the images into memory when indexed. If the column is lazy indexed with the lz indexer, the images are not materialized and a FileCell or an AudioColumn is returned instead.
- Parameters
data (Sequence[str]) – A list of filepaths to images.
transform (callable) – A function that transforms the image (e.g. torchvision.transforms.functional.center_crop).
Warning: In order for the column to be serializable, the transform function must be pickleable.
loader (callable) – A callable with signature def loader(filepath: str) -> PIL.Image:. Defaults to torchvision.datasets.folder.default_loader.
Warning: In order for the column to be serializable with write(), the loader function must be pickleable.
base_dir (str) – A base directory that the paths in data are relative to. If None, the paths are assumed to be absolute.
- class CellColumn(cells: Optional[Sequence[meerkat.cells.abstract.AbstractCell]] = None, *args, **kwargs)[source]
Bases: meerkat.columns.abstract.AbstractColumn
- static concat(columns: Sequence[meerkat.columns.cell_column.CellColumn])[source]
- classmethod from_cells(cells: Sequence[meerkat.cells.abstract.AbstractCell], *args, **kwargs)[source]
- is_equal(other: meerkat.columns.abstract.AbstractColumn) bool[source]
Tests whether two columns are equal.
- Parameters
other (AbstractColumn) – The column to compare against.
- property cells
- class DataPanel(data: Optional[Union[dict, list]] = None, *args, **kwargs)[source]
Bases: meerkat.mixins.cloneable.CloneableMixin, meerkat.mixins.inspect_fn.FunctionInspectorMixin, meerkat.mixins.lambdable.LambdaMixin, meerkat.mixins.mapping.MappableMixin, meerkat.mixins.materialize.MaterializationMixin, meerkat.provenance.ProvenanceMixin
Meerkat DataPanel class.
- add_column(name: str, data: Union[Sequence, numpy.ndarray, pandas.core.series.Series, torch.Tensor], overwrite=False) None[source]
Add a column to the DataPanel.
- append(dp: meerkat.datapanel.DataPanel, axis: Union[str, int] = 'rows', suffixes: Tuple[str] = None, overwrite: bool = False) meerkat.datapanel.DataPanel[source]
Append a batch of data to the dataset.
example_or_batch must have the same columns as the dataset (regardless of what columns are visible).
- batch(batch_size: int = 1, drop_last_batch: bool = False, num_workers: int = 0, materialize: bool = True, shuffle: bool = False, *args, **kwargs)[source]
Batch the dataset.
- Parameters
batch_size – integer batch size
drop_last_batch – drop the last batch if it's smaller than batch_size
- Returns
batches of data
- filter(function: Optional[Callable] = None, with_indices=False, input_columns: Optional[Union[str, List[str]]] = None, is_batched_fn: bool = False, batch_size: Optional[int] = 1, drop_last_batch: bool = False, num_workers: int = 0, materialize: bool = True, pbar: bool = False, **kwargs) Optional[meerkat.datapanel.DataPanel][source]
Filter operation on the DataPanel.
- classmethod from_batch(batch: Dict[str, Union[List, meerkat.columns.abstract.AbstractColumn]]) meerkat.datapanel.DataPanel[source]
Convert a batch to a Dataset.
- classmethod from_batches(batches: Sequence[Dict[str, Union[List, meerkat.columns.abstract.AbstractColumn]]]) meerkat.datapanel.DataPanel[source]
Convert a list of batches to a dataset.
- classmethod from_csv(filepath: str, *args, **kwargs)[source]
Create a Dataset from a csv file.
- Parameters
filepath (str) – The file path or buffer to load from. Same as pandas.read_csv().
*args – Argument list for pandas.read_csv().
**kwargs – Keyword arguments for pandas.read_csv().
- Returns
The constructed datapanel.
- Return type
DataPanel
- classmethod from_dict(d: Dict) meerkat.datapanel.DataPanel[source]
Convert a dictionary to a dataset.
Alias for DataPanel.from_batch(..).
- classmethod from_huggingface(*args, **kwargs)[source]
Load a Huggingface dataset as a DataPanel.
Use this to replace datasets.load_dataset, so
>>> dict_of_datasets = datasets.load_dataset('boolq')
becomes
>>> dict_of_datapanels = DataPanel.from_huggingface('boolq')
- classmethod from_jsonl(json_path: str) meerkat.datapanel.DataPanel[source]
Load a dataset from a .jsonl file on disk, where each line of the json file consists of a single example.
- classmethod from_pandas(df: pandas.core.frame.DataFrame)[source]
Create a Dataset from a pandas DataFrame.
- head(n: int = 5) meerkat.datapanel.DataPanel[source]
Get the first n examples of the DataPanel.
- map(function: Optional[Callable] = None, with_indices: bool = False, input_columns: Optional[Union[str, List[str]]] = None, is_batched_fn: bool = False, batch_size: Optional[int] = 1, drop_last_batch: bool = False, num_workers: int = 0, output_type: Union[type, Dict[str, type]] = None, mmap: bool = False, mmap_path: str = None, materialize: bool = True, pbar: bool = False, **kwargs) Optional[Union[Dict, List, meerkat.columns.abstract.AbstractColumn]][source]
- mean(*args, **kwargs) meerkat.datapanel.DataPanel[source]
- merge(right: meerkat.datapanel.DataPanel, how: str = 'inner', on: Optional[Union[str, List[str]]] = None, left_on: Optional[Union[str, List[str]]] = None, right_on: Optional[Union[str, List[str]]] = None, sort: bool = False, suffixes: Sequence[str] = ('_x', '_y'), validate=None)[source]
- classmethod read(path: str, *args, **kwargs) meerkat.datapanel.DataPanel[source]
Load a DataPanel stored on disk.
- sample(n: Optional[int] = None, frac: Optional[float] = None, replace: bool = False, weights: Optional[Union[str, numpy.ndarray]] = None, random_state: Optional[Union[int, numpy.random.mtrand.RandomState]] = None) meerkat.datapanel.DataPanel[source]
Select a random sample of rows from the DataPanel. Roughly equivalent to sample in pandas: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html.
- Parameters
n (int) – Number of samples to draw. If frac is specified, this parameter should not be passed. Defaults to 1 if frac is not passed.
frac (float) – Fraction of rows to sample. If n is specified, this parameter should not be passed.
replace (bool) – Sample with or without replacement. Defaults to False.
weights (Union[str, np.ndarray]) – Weights to use for sampling. If None (default), the rows will be sampled uniformly. If a numpy array, the sample will be weighted accordingly. If a string, the weights will be applied to the rows based on the column with the name specified. If weights do not sum to 1 they will be normalized to sum to 1.
random_state (Union[int, np.random.RandomState]) – Random state or seed to use for sampling.
- Returns
A random sample of rows from the DataPanel.
- Return type
DataPanel
- sort(by: Union[str, List[str]], ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.datapanel.DataPanel[source]
Sort the DataPanel by the values in the specified columns. Similar to sort_values in pandas.
- Parameters
by (Union[str, List[str]]) – The columns to sort by.
ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.
kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.
- Returns
A sorted view of DataPanel.
- Return type
DataPanel
- tail(n: int = 5) meerkat.datapanel.DataPanel[source]
Get the last n examples of the DataPanel.
- update(function: Optional[Callable] = None, with_indices: bool = False, input_columns: Optional[Union[str, List[str]]] = None, is_batched_fn: bool = False, batch_size: Optional[int] = 1, remove_columns: Optional[List[str]] = None, num_workers: int = 0, output_type: Union[type, Dict[str, type]] = None, mmap: bool = False, mmap_path: str = None, materialize: bool = True, pbar: bool = False, **kwargs) meerkat.datapanel.DataPanel[source]
Update the columns of the dataset.
- property columns
Column names in the DataPanel.
- property data: meerkat.block.manager.BlockManager
Get the underlying data (excluding invisible rows).
To access underlying data with invisible rows, use _data.
- logdir: pathlib.Path = PosixPath('/home/docs/meerkat')
- property ncols
Number of columns in the DataPanel.
- property nrows
Number of rows in the DataPanel.
- property shape
Shape of the DataPanel (num_rows, num_columns).
- class FileCell(transform: Optional[callable] = None, loader: Optional[callable] = None, data: Optional[str] = None, base_dir: Optional[str] = None)[source]
Bases: meerkat.columns.file_column.FileLoaderMixin, meerkat.columns.lambda_column.LambdaCell
- property absolute_path
- class FileColumn(data: Optional[Sequence[str]] = None, transform: Optional[callable] = None, loader: Optional[callable] = None, base_dir: Optional[str] = None, *args, **kwargs)[source]
Bases: meerkat.columns.file_column.FileLoaderMixin, meerkat.columns.lambda_column.LambdaColumn
A column where each cell represents a file stored on disk or the web. The underlying data is a PandasSeriesColumn of strings, where each string is the path to a file. The column materializes the files into memory when indexed. If the column is lazy indexed with the lz indexer, the files are not materialized and a FileCell or a FileColumn is returned instead.
- Parameters
data (Sequence[str]) – A list of filepaths to images.
transform (callable) – A function that transforms the image (e.g. torchvision.transforms.functional.center_crop).
Warning: In order for the column to be serializable, the transform function must be pickleable.
loader (callable) – A callable with signature def loader(filepath: str) -> PIL.Image:. Defaults to torchvision.datasets.folder.default_loader.
Warning: In order for the column to be serializable with write(), the loader function must be pickleable.
base_dir (str) – A base directory that the paths in data are relative to. If None, the paths are assumed to be absolute.
- classmethod from_filepaths(filepaths: Sequence[str], loader: Optional[callable] = None, transform: Optional[callable] = None, base_dir: Optional[str] = None, *args, **kwargs)[source]
- is_equal(other: meerkat.columns.abstract.AbstractColumn) bool[source]
Tests whether two columns are equal.
- Parameters
other (AbstractColumn) – The column to compare against.
- class ImageColumn(data: Optional[Sequence[str]] = None, transform: Optional[callable] = None, loader: Optional[callable] = None, base_dir: Optional[str] = None, *args, **kwargs)[source]
Bases: meerkat.columns.file_column.FileColumn
A column where each cell represents an image stored on disk. The underlying data is a PandasSeriesColumn of strings, where each string is the path to an image. The column materializes the images into memory when indexed. If the column is lazy indexed with the lz indexer, the images are not materialized and an ImageCell or an ImageColumn is returned instead.
- Parameters
data (Sequence[str]) – A list of filepaths to images.
transform (callable) – A function that transforms the image (e.g. torchvision.transforms.functional.center_crop).
Warning: In order for the column to be serializable, the transform function must be pickleable.
loader (callable) – A callable with signature def loader(filepath: str) -> PIL.Image:. Defaults to torchvision.datasets.folder.default_loader.
Warning: In order for the column to be serializable with write(), the loader function must be pickleable.
base_dir (str) – A base directory that the paths in data are relative to. If None, the paths are assumed to be absolute.
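The pickleability warnings above apply to any callable passed as transform or loader. A quick way to check a candidate before building a column is to try pickling it with the standard library. The sketch below is only an illustration: is_pickleable is a hypothetical helper, and the numeric round callables stand in for real image transforms.

```python
import functools
import pickle


def is_pickleable(obj) -> bool:
    """Return True if the standard pickle module can serialize `obj`."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# A partial of an importable function pickles by reference to that function.
round_to_one_place = functools.partial(round, ndigits=1)

# A lambda has no importable name, so pickle cannot serialize it.
inline_transform = lambda value: round(value, 1)

print(is_pickleable(round_to_one_place))
print(is_pickleable(inline_transform))
```

In practice this means preferring module-level functions or functools.partial objects over lambdas and locally defined closures for transform and loader.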
- class LambdaCell(fn: Optional[callable] = None, data: Optional[any] = None)[source]
Bases: meerkat.cells.abstract.AbstractCell
- property data: object
Get the data associated with this cell.
- class LambdaColumn(data: Union[meerkat.datapanel.DataPanel, meerkat.columns.abstract.AbstractColumn], fn: Optional[callable] = None, output_type: Optional[type] = None, *args, **kwargs)[source]
Bases: meerkat.columns.abstract.AbstractColumn
- static concat(columns: Sequence[meerkat.columns.lambda_column.LambdaColumn])[source]
- fn(data: object)[source]
Subclasses like ImageColumn should be able to implement their own version.
- is_equal(other: meerkat.columns.abstract.AbstractColumn) bool[source]
Tests whether two columns are equal.
- Parameters
other (AbstractColumn) – The column to compare against.
- class ListColumn(data: Optional[Sequence] = None, *args, **kwargs)[source]
Bases: meerkat.columns.abstract.AbstractColumn
- batch(batch_size: int = 1, drop_last_batch: bool = False, collate: bool = True, *args, **kwargs)[source]
Batch the column.
- Parameters
batch_size – integer batch size
drop_last_batch – drop the last batch if it's smaller than batch_size
collate – whether to collate the returned batches
- Returns
batches of data
- classmethod concat(columns: Sequence[meerkat.columns.list_column.ListColumn])[source]
- default_formatter()
- is_equal(other: meerkat.columns.abstract.AbstractColumn) bool[source]
Tests whether two columns are equal.
- Parameters
other (AbstractColumn) – The column to compare against.
- class MedicalVolumeCell(paths: Union[str, pathlib.Path, os.PathLike, Sequence[Union[str, pathlib.Path, os.PathLike]]], loader: Optional[Callable] = None, transform: Optional[Callable] = None, cache_metadata: bool = False, *args, **kwargs)[source]
Bases: meerkat.mixins.file.PathsMixin, meerkat.cells.abstract.AbstractCell
Interface for loading medical volume data.
Examples
# Specify xray dicoms with default orientation ("SI", "AP"):
>>> cell = MedicalVolumeCell("/path/to/xray.dcm", loader=DicomReader(group_by=None, default_ornt=("SI", "AP")))
# Load multi-echo MRI volumes
>>> cell = MedicalVolumeCell("/path/to/mri/scan/dir", loader=DicomReader(group_by="EchoNumbers"))
- get(*args, cache_metadata: Optional[bool] = None, **kwargs)[source]
Get me the thing that this cell exists for.
- class NumpyArrayColumn(data: Sequence, *args, **kwargs)[source]
Bases: meerkat.columns.abstract.AbstractColumn, numpy.lib.mixins.NDArrayOperatorsMixin
- block_class
alias of
meerkat.block.numpy_block.NumpyBlock
- argsort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.numpy_column.NumpyArrayColumn[source]
Return the indices that would sort the column.
- Parameters
ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.
kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.
- Returns
A view of the column with the sorted data.
- Return type
NumpyArrayColumn
For now, raises an error when the input array has more than one dimension.
- classmethod concat(columns: Sequence[meerkat.columns.numpy_column.NumpyArrayColumn])[source]
- classmethod from_npy(path, mmap_mode=None, allow_pickle=False, fix_imports=True, encoding='ASCII')[source]
- classmethod get_writer(mmap: bool = False, template: Optional[meerkat.columns.abstract.AbstractColumn] = None)[source]
- is_equal(other: meerkat.columns.abstract.AbstractColumn) bool[source]
Tests whether two columns are equal.
- Parameters
other (AbstractColumn) – The column to compare against.
- sort(ascending: Union[bool, List[bool]] = True, axis: int = -1, kind: str = 'quicksort', order: Optional[Union[str, List[str]]] = None) meerkat.columns.numpy_column.NumpyArrayColumn[source]
Return a sorted view of the column.
- Parameters
ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.
kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.
- Returns
A view of the column with the sorted data.
- Return type
NumpyArrayColumn
- to_tensor() torch.Tensor[source]
Use column.to_tensor() instead of torch.tensor(column), which is very slow.
- property is_mmap
- class PandasSeriesColumn(data: Optional[Sequence] = None, collate_fn: Optional[Callable] = None, formatter: Optional[Callable] = None, *args, **kwargs)[source]
Bases: meerkat.columns.abstract.AbstractColumn, numpy.lib.mixins.NDArrayOperatorsMixin
- block_class
- cat
alias of
meerkat.columns.pandas_column._MeerkatCategoricalAccessor
- dt
alias of
meerkat.columns.pandas_column._MeerkatCombinedDatetimelikeProperties
- str
alias of
meerkat.columns.pandas_column._MeerkatStringMethods
- argsort(ascending: bool = True, kind: str = 'quicksort') meerkat.columns.pandas_column.PandasSeriesColumn[source]
Return the indices that would sort the column.
- Parameters
ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.
kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.
- Returns
A view of the column with the sorted data.
- Return type
PandasSeriesColumn
Note: for now, an error is raised when the input array has more than one dimension.
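As a rough illustration of what the returned indices mean, here is a plain-Python sketch for a one-dimensional sequence. It is not Meerkat's implementation, and `argsort_indices` is a hypothetical helper introduced only for this example.

```python
from typing import List, Sequence


def argsort_indices(values: Sequence, ascending: bool = True) -> List[int]:
    """Return the positions that would put `values` in sorted order."""
    # Sort the positions 0..n-1 by the value stored at each position.
    return sorted(
        range(len(values)),
        key=lambda i: values[i],
        reverse=not ascending,
    )


values = [30, 10, 20]
order = argsort_indices(values)
# Indexing the original data with the returned positions yields sorted data.
sorted_values = [values[i] for i in order]
```

Passing `ascending=False` to the helper reverses the order of the returned positions.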
- classmethod concat(columns: Sequence[meerkat.columns.pandas_column.PandasSeriesColumn])[source]
- is_equal(other: meerkat.columns.abstract.AbstractColumn) bool[source]
Tests whether two columns are equal.
- Parameters
other (AbstractColumn) – The column to compare against.
- sort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.pandas_column.PandasSeriesColumn[source]
Return a sorted view of the column.
- Parameters
ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.
kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.
- Returns
A view of the column with the sorted data.
- Return type
- class SpacyColumn(data: Sequence[spacy_tokens.Doc] = None, *args, **kwargs)[source]
Bases:
meerkat.columns.list_column.ListColumn
- classmethod from_texts(texts: Sequence[str], lang: str = 'en_core_web_sm', *args, **kwargs)[source]
- classmethod read(path: str, nlp: spacy.language.Language = None, lang: str = None, *args, **kwargs) SpacyColumn[source]
- property docs
- property tokens
- class TensorColumn(data: Optional[Sequence] = None, *args, **kwargs)[source]
Bases:
numpy.lib.mixins.NDArrayOperatorsMixin, meerkat.columns.abstract.AbstractColumn
- block_class
- argsort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.tensor_column.TensorColumn[source]
Return the indices that would sort the column.
- Parameters
ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.
kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.
- Returns
A view of the column with the sorted data.
- Return type
Note: for now, an error is raised when the input array has more than one dimension.
- classmethod concat(columns: Sequence[meerkat.columns.tensor_column.TensorColumn])[source]
- classmethod from_data(data: Union[Sequence, numpy.ndarray, pandas.core.series.Series, torch.Tensor, meerkat.columns.abstract.AbstractColumn])[source]
Convert data to an EmbeddingColumn.
- classmethod get_writer(mmap: bool = False, template: Optional[meerkat.columns.abstract.AbstractColumn] = None)[source]
- is_equal(other: meerkat.columns.abstract.AbstractColumn) bool[source]
Tests whether two columns are equal.
- Parameters
other (AbstractColumn) – The column to compare against.
- sort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.tensor_column.TensorColumn[source]
Return a sorted view of the column.
- Parameters
ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.
kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.
- Returns
A view of the column with the sorted data.
- Return type
- concat(objs: Union[Sequence[meerkat.datapanel.DataPanel], Sequence[meerkat.columns.abstract.AbstractColumn]], axis: Union[str, int] = 'rows', suffixes: Tuple[str] = None, overwrite: bool = False) Union[meerkat.datapanel.DataPanel, meerkat.columns.abstract.AbstractColumn][source]
Concatenate a sequence of columns or a sequence of DataPanels. If the sequence is empty, returns an empty DataPanel.
- If concatenating columns, all columns must be of the same type (e.g. all ListColumn).
- If concatenating DataPanels along axis 0 (rows), all DataPanels must have the same set of columns.
- If concatenating DataPanels along axis 1 (columns), all DataPanels must have the same length and cannot have any of the same column names.
- Parameters
objs (Union[Sequence[DataPanel], Sequence[AbstractColumn]]) – sequence of columns or DataPanels.
axis (Union[str, int]) – The axis along which to concatenate. Ignored if concatenating columns.
- Returns
concatenated DataPanel or column
- Return type
Union[DataPanel, AbstractColumn]
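The row and column rules above can be sketched in plain Python. This is only an illustration under stated assumptions, not Meerkat's code: a `Panel` here is a hypothetical stand-in (a dict mapping column names to lists), `concat_panels` is an invented helper, and accepting the string `"columns"` for axis 1 is an assumption.

```python
from typing import Dict, Sequence, Union

# Hypothetical stand-in for a DataPanel: column name -> list of values.
Panel = Dict[str, list]


def _num_rows(panel: Panel) -> int:
    """Length of the first column, or 0 for a panel with no columns."""
    return len(next(iter(panel.values()))) if panel else 0


def concat_panels(panels: Sequence[Panel], axis: Union[str, int] = "rows") -> Panel:
    if not panels:
        return {}  # empty input gives an empty panel
    if axis in ("rows", 0):
        # Row-wise: every panel must have the same set of columns.
        columns = set(panels[0])
        if any(set(p) != columns for p in panels):
            raise ValueError("all panels must have the same set of columns")
        return {name: [v for p in panels for v in p[name]] for name in panels[0]}
    if axis in ("columns", 1):
        # Column-wise: same length, and no shared column names.
        if len({_num_rows(p) for p in panels}) > 1:
            raise ValueError("all panels must have the same length")
        merged: Panel = {}
        for p in panels:
            if merged.keys() & p.keys():
                raise ValueError("panels cannot share column names")
            merged.update(p)
        return merged
    raise ValueError(f"unknown axis: {axis!r}")


stacked = concat_panels([{"a": [1], "b": [2]}, {"a": [3], "b": [4]}])
widened = concat_panels([{"a": [1, 2]}, {"b": [3, 4]}], axis="columns")
```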
- embed(data: meerkat.datapanel.DataPanel, input: str, encoder: Union[str, meerkat.ops.embed.encoder.Encoder] = 'clip', modality: Optional[str] = None, out_col: Optional[str] = None, device: Union[int, str] = 'cpu', mmap_dir: Optional[str] = None, num_workers: int = 4, batch_size: int = 128, **kwargs) meerkat.datapanel.DataPanel[source]
Embed a column of data with an encoder from the encoder registry.
Examples
Suppose you have an Image dataset (e.g. Imagenette, CIFAR-10) loaded into a Meerkat DataPanel. You can embed the images in the dataset with CLIP using a code snippet like:
import meerkat as mk

dp = mk.datasets.get("imagenette")
dp = mk.embed(
    data=dp,
    input_col="img",
    encoder="clip"
)
- Parameters
data (mk.DataPanel) – A DataPanel containing the data to embed.
input_col (str) – The name of the column to embed.
encoder (Union[str, Encoder], optional) – Name of the encoder to use. List supported encoders with domino.encoders. Defaults to “clip”. Alternatively, pass an Encoder object containing a custom encoder.
modality (str, optional) – The modality of the data to be embedded. Defaults to None, in which case the modality is inferred from the type of the input column.
out_col (str, optional) – The name of the column where the embeddings are stored. Defaults to None, in which case it is "{encoder}({input_col})".
device (Union[int, str], optional) – The device on which to run the encoder. Defaults to “cpu”.
mmap_dir (str, optional) – The path to directory where a memory-mapped file containing the embeddings will be written. Defaults to None, in which case the embeddings are not memmapped.
num_workers (int, optional) – Number of worker processes used to load the data from disk. Defaults to 4.
batch_size (int, optional) – Size of the batches to use. Defaults to 128.
**kwargs – Additional keyword arguments are passed to the encoder. To see supported arguments for each encoder, see the encoder documentation (e.g. clip()).
- Returns
A view of data with a new column containing the embeddings. This column will be named according to the out_col parameter.
- Return type
mk.DataPanel
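The default name of the output column follows the "{encoder}({input_col})" pattern described above. A minimal sketch of that rule, assuming the encoder is given by name as a string (`resolve_out_col` is a hypothetical helper, not part of Meerkat):

```python
from typing import Optional


def resolve_out_col(encoder: str, input_col: str, out_col: Optional[str] = None) -> str:
    """Return the explicit out_col, or fall back to the documented default pattern."""
    if out_col is not None:
        return out_col
    return f"{encoder}({input_col})"


default_name = resolve_out_col("clip", "img")
custom_name = resolve_out_col("clip", "img", out_col="emb")
```

With the values from the example above, the embeddings would therefore be expected under a column named `clip(img)` unless out_col is passed.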
- get(name: str, dataset_dir: Optional[str] = None, version: Optional[str] = None, download_mode: str = 'reuse', registry: Optional[str] = None, **kwargs) Union[meerkat.datapanel.DataPanel, Dict[str, meerkat.datapanel.DataPanel]][source]
Load a dataset into a DataPanel.
- Parameters
name (str) – Name of the dataset.
dataset_dir (str) – The directory containing dataset data. Defaults to ~/.meerkat/datasets/{name}.
version (str) – The version of the dataset. Defaults to latest.
download_mode (str) – The download mode. Options are: “reuse” (default) will download the dataset if it does not exist, “force” will download the dataset even if it exists, “extract” will reuse any downloaded archives but force extracting those archives, and “skip” will not download the dataset if it doesn’t yet exist. Defaults to reuse.
**kwargs – Additional arguments passed to the dataset.
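One way to read the four download_mode options is as a small decision table. The sketch below is an interpretation of the descriptions above, not Meerkat's actual logic: `plan_download` and `Plan` are hypothetical names, and the behavior when an archive is missing in "extract" mode, as well as the extraction step in "reuse" mode, are assumptions.

```python
from typing import NamedTuple

VALID_MODES = ("reuse", "force", "extract", "skip")


class Plan(NamedTuple):
    download: bool  # fetch the archive(s) from the remote source
    extract: bool   # unpack the archive(s) into the dataset directory


def plan_download(mode: str, dataset_exists: bool, archive_exists: bool) -> Plan:
    if mode not in VALID_MODES:
        raise ValueError(f"download_mode must be one of {VALID_MODES}, got {mode!r}")
    if mode == "force":
        # Download even if the dataset already exists.
        return Plan(download=True, extract=True)
    if mode == "extract":
        # Reuse a downloaded archive but always extract again.
        # Assumption: a missing archive is downloaded first.
        return Plan(download=not archive_exists, extract=True)
    if mode == "skip":
        # Never download, even if the dataset is missing.
        return Plan(download=False, extract=False)
    # "reuse" (default): download only if the dataset does not exist.
    # Assumption: extraction happens only alongside a fresh download.
    return Plan(download=not dataset_exists, extract=not dataset_exists)


plan = plan_download("reuse", dataset_exists=True, archive_exists=True)
```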
- merge(left: meerkat.datapanel.DataPanel, right: meerkat.datapanel.DataPanel, how: str = 'inner', on: Union[str, List[str]] = None, left_on: Union[str, List[str]] = None, right_on: Union[str, List[str]] = None, sort: bool = False, suffixes: Sequence[str] = ('_x', '_y'), validate=None)[source]
- sample(data: Union[meerkat.datapanel.DataPanel, meerkat.columns.abstract.AbstractColumn], n: Optional[int] = None, frac: Optional[float] = None, replace: bool = False, weights: Optional[Union[str, numpy.ndarray]] = None, random_state: Optional[Union[int, numpy.random.mtrand.RandomState]] = None) Union[meerkat.datapanel.DataPanel, meerkat.columns.abstract.AbstractColumn][source]
Select a random sample of rows from a DataPanel or Column. Roughly equivalent to sample in pandas: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html.
- Parameters
data (Union[DataPanel, AbstractColumn]) – DataPanel or Column to sample from.
n (int) – Number of samples to draw. If frac is specified, this parameter should not be passed. Defaults to 1 if frac is not passed.
frac (float) – Fraction of rows to sample. If n is specified, this parameter should not be passed.
replace (bool) – Sample with or without replacement. Defaults to False.
weights (Union[str, np.ndarray]) – Weights to use for sampling. If None (default), the rows will be sampled uniformly. If a numpy array, the sample will be weighted accordingly. If a string and data is a DataPanel, the weights will be applied to the rows based on the column with the name specified. If weights do not sum to 1 they will be normalized to sum to 1.
random_state (Union[int, np.random.RandomState]) – Random state or seed to use for sampling.
- Returns
A random sample of rows from the DataPanel or Column.
- Return type
Union[DataPanel, AbstractColumn]
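The interaction between n, frac, replace and weights can be sketched with the standard library. This is a hedged illustration of the documented semantics on row indices, not Meerkat's implementation. All function names are hypothetical, rounding `frac * num_rows` to the nearest integer is an assumption, and the `seed` argument stands in for random_state.

```python
import random
from typing import List, Optional, Sequence


def resolve_sample_size(num_rows: int, n: Optional[int] = None,
                        frac: Optional[float] = None) -> int:
    """n and frac are mutually exclusive; n defaults to 1 when neither is given."""
    if n is not None and frac is not None:
        raise ValueError("pass either n or frac, not both")
    if frac is not None:
        return int(round(frac * num_rows))  # assumption: round to nearest
    return 1 if n is None else n


def normalize_weights(weights: Sequence[float]) -> List[float]:
    """Rescale weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def sample_indices(num_rows: int, n: Optional[int] = None, frac: Optional[float] = None,
                   replace: bool = False, weights: Optional[Sequence[float]] = None,
                   seed: Optional[int] = None) -> List[int]:
    rng = random.Random(seed)
    size = resolve_sample_size(num_rows, n, frac)
    probs = normalize_weights(weights) if weights is not None else None
    rows = list(range(num_rows))
    if replace:
        return rng.choices(rows, weights=probs, k=size)
    if size > num_rows:
        raise ValueError("cannot draw more rows than exist without replacement")
    if probs is None:
        return rng.sample(rows, size)
    # Weighted sampling without replacement: draw one row at a time
    # and remove it (and its weight) from the pool.
    chosen: List[int] = []
    remaining_w = list(probs)
    for _ in range(size):
        i = rng.choices(range(len(rows)), weights=remaining_w, k=1)[0]
        chosen.append(rows.pop(i))
        remaining_w.pop(i)
    return chosen


picked = sample_indices(5, n=3, seed=0)
```

The returned indices could then be used to select rows from the DataPanel or Column being sampled.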
- sort(data: meerkat.datapanel.DataPanel, by: Union[str, List[str]], ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.datapanel.DataPanel[source]
Sort a DataPanel or Column. If a DataPanel, sort by the values in the specified columns. Similar to sort_values in pandas.
- Parameters
data (Union[DataPanel, AbstractColumn]) – DataPanel or Column to sort.
by (Union[str, List[str]]) – The columns to sort by. Ignored if data is a Column.
ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.
kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.
- Returns
A sorted view of DataPanel.
- Return type