meerkat package#
Meerkat.
- class AbstractCell(*args, **kwargs)[source]#
Bases: ABC
- property metadata: dict#
Get the metadata associated with this cell.
- class AbstractColumn(data: Sequence | None = None, collate_fn: Callable | None = None, formatter: Callable | None = None, *args, **kwargs)[source]#
Bases: BlockableMixin, CloneableMixin, CollateMixin, ColumnIOMixin, FunctionInspectorMixin, LambdaMixin, MappableMixin, MaterializationMixin, ProvenanceMixin, ABC
An abstract class for Meerkat columns.
- append(column: AbstractColumn) None [source]#
- batch(batch_size: int = 1, drop_last_batch: bool = False, collate: bool = True, num_workers: int = 0, materialize: bool = True, *args, **kwargs)[source]#
Batch the column.
- Parameters:
batch_size – integer batch size
drop_last_batch – drop the last batch if it is smaller than batch_size
collate – whether to collate the returned batches
- Returns:
batches of data
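The batching semantics above can be sketched in plain Python. This is an illustrative helper (the name batch_sequence is hypothetical, not part of Meerkat), showing how batch_size and drop_last_batch interact:

```python
from typing import Iterator, List, Sequence

def batch_sequence(
    data: Sequence, batch_size: int = 1, drop_last_batch: bool = False
) -> Iterator[List]:
    """Yield successive batches of `data`, mirroring the semantics documented above."""
    for start in range(0, len(data), batch_size):
        chunk = list(data[start : start + batch_size])
        # A trailing batch smaller than batch_size is dropped when requested.
        if drop_last_batch and len(chunk) < batch_size:
            return
        yield chunk

batches = list(batch_sequence(range(10), batch_size=4))
# Without drop_last_batch, the final batch may be smaller (sizes 4, 4, 2).
```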
- static concat(columns: Sequence[AbstractColumn]) None [source]#
- filter(function: Callable, with_indices=False, input_columns: str | List[str] | None = None, is_batched_fn: bool = False, batch_size: int | None = 1, drop_last_batch: bool = False, num_workers: int | None = 0, materialize: bool = True, pbar: bool = False, **kwargs) AbstractColumn | None [source]#
Filter the elements of the column using a function.
- classmethod from_data(data: Columnable | AbstractColumn)[source]#
Convert data to a meerkat column using the appropriate Column type.
- classmethod get_writer(mmap: bool = False, template: AbstractColumn | None = None)[source]#
- head(n: int = 5) AbstractColumn [source]#
Get the first n examples of the column.
- is_equal(other: AbstractColumn) bool [source]#
Tests whether two columns are equal.
- Parameters:
other (AbstractColumn) – The column to compare against.
- tail(n: int = 5) AbstractColumn [source]#
Get the last n examples of the column.
- Columnable#
alias of Union[Sequence, ndarray, Series, Tensor]
- property data#
Get the underlying data.
- property formatter: Callable#
- property is_mmap#
- logdir: Path = PosixPath('/home/docs/meerkat')#
- property metadata#
- class ArrowArrayColumn(data: Sequence, *args, **kwargs)[source]#
Bases: AbstractColumn
- block_class#
alias of ArrowBlock
- classmethod concat(columns: Sequence[ArrowArrayColumn])[source]#
- is_equal(other: AbstractColumn) bool [source]#
Tests whether two columns are equal.
- Parameters:
other (AbstractColumn) – The column to compare against.
- class AudioColumn(data: Sequence[str] | None = None, transform: callable | None = None, loader: callable | None = None, base_dir: str | None = None, *args, **kwargs)[source]#
Bases: FileColumn
A lambda column where each cell represents an audio file on disk. The underlying data is a PandasSeriesColumn of strings, where each string is the path to an audio file. The column materializes the audio into memory when indexed. If the column is lazily indexed with the lz indexer, the audio is not materialized and a FileCell or an AudioColumn is returned instead.
- Parameters:
data (Sequence[str]) – A list of filepaths to audio files.
transform (callable) – A function that transforms the loaded data (e.g. torchvision.transforms.functional.center_crop).
Warning
In order for the column to be serializable, the transform function must be pickleable.
loader (callable) – A callable with signature def loader(filepath: str) -> PIL.Image:. Defaults to torchvision.datasets.folder.default_loader.
Warning
In order for the column to be serializable with write(), the loader function must be pickleable.
base_dir (str) – A base directory that the paths in data are relative to. If None, the paths are assumed to be absolute.
- class CellColumn(cells: Sequence[AbstractCell] | None = None, *args, **kwargs)[source]#
Bases: AbstractColumn
- static concat(columns: Sequence[CellColumn])[source]#
- classmethod from_cells(cells: Sequence[AbstractCell], *args, **kwargs)[source]#
- is_equal(other: AbstractColumn) bool [source]#
Tests whether two columns are equal.
- Parameters:
other (AbstractColumn) – The column to compare against.
- property cells#
- class DataPanel(data: dict | list | Dataset | None = None, *args, **kwargs)[source]#
Bases: CloneableMixin, FunctionInspectorMixin, LambdaMixin, MappableMixin, MaterializationMixin, ProvenanceMixin
Meerkat DataPanel class.
- add_column(name: str, data: Sequence | ndarray | Series | Tensor, overwrite=False) None [source]#
Add a column to the DataPanel.
- append(dp: DataPanel, axis: str | int = 'rows', suffixes: Tuple[str] = None, overwrite: bool = False) DataPanel [source]#
Append a batch of data to the dataset.
dp must have the same columns as the dataset (regardless of which columns are visible).
- batch(batch_size: int = 1, drop_last_batch: bool = False, num_workers: int = 0, materialize: bool = True, shuffle: bool = False, *args, **kwargs)[source]#
Batch the dataset.
- Parameters:
batch_size – integer batch size
drop_last_batch – drop the last batch if it is smaller than batch_size
- Returns:
batches of data
- filter(function: Callable | None = None, with_indices=False, input_columns: str | List[str] | None = None, is_batched_fn: bool = False, batch_size: int | None = 1, drop_last_batch: bool = False, num_workers: int = 0, materialize: bool = True, pbar: bool = False, **kwargs) DataPanel | None [source]#
Filter operation on the DataPanel.
- classmethod from_batch(batch: Dict[str, List | AbstractColumn]) DataPanel [source]#
Convert a batch to a Dataset.
- classmethod from_batches(batches: Sequence[Dict[str, List | AbstractColumn]]) DataPanel [source]#
Convert a list of batches to a dataset.
- classmethod from_csv(filepath: str, *args, **kwargs)[source]#
Create a Dataset from a csv file.
- Parameters:
filepath (str) – The file path or buffer to load from. Same as pandas.read_csv().
*args – Argument list for pandas.read_csv().
**kwargs – Keyword arguments for pandas.read_csv().
- Returns:
The constructed datapanel.
- Return type:
DataPanel
Convert a dictionary to a dataset.
Alias for Dataset.from_batch(..).
- classmethod from_huggingface(*args, **kwargs)[source]#
Load a Huggingface dataset as a DataPanel.
Use this to replace datasets.load_dataset, so
>>> dict_of_datasets = datasets.load_dataset('boolq')
becomes
>>> dict_of_datapanels = DataPanel.from_huggingface('boolq')
- classmethod from_jsonl(json_path: str) DataPanel [source]#
Load a dataset from a .jsonl file on disk, where each line of the json file consists of a single example.
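The .jsonl layout above (one JSON object per line, one example per object) maps naturally onto a dict of columns, the shape from_batch() expects. A stdlib-only sketch (read_jsonl_as_batch is a hypothetical helper, not a Meerkat function):

```python
import json
import os
import tempfile
from collections import defaultdict

def read_jsonl_as_batch(path: str) -> dict:
    """Read a .jsonl file (one JSON object per line) into a dict of columns."""
    columns = defaultdict(list)
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            for key, value in json.loads(line).items():
                columns[key].append(value)
    return dict(columns)

# Demo: write two examples and read them back as columns.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write('{"text": "a", "label": 0}\n{"text": "b", "label": 1}\n')
    path = f.name
batch = read_jsonl_as_batch(path)
os.remove(path)
```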
- map(function: Callable | None = None, with_indices: bool = False, input_columns: str | List[str] | None = None, is_batched_fn: bool = False, batch_size: int | None = 1, drop_last_batch: bool = False, num_workers: int = 0, output_type: type | Dict[str, type] = None, mmap: bool = False, mmap_path: str = None, materialize: bool = True, pbar: bool = False, **kwargs) Dict | List | AbstractColumn | None [source]#
- merge(right: DataPanel, how: str = 'inner', on: str | List[str] | None = None, left_on: str | List[str] | None = None, right_on: str | List[str] | None = None, sort: bool = False, suffixes: Sequence[str] = ('_x', '_y'), validate=None)[source]#
- update(function: Callable | None = None, with_indices: bool = False, input_columns: str | List[str] | None = None, is_batched_fn: bool = False, batch_size: int | None = 1, remove_columns: List[str] | None = None, num_workers: int = 0, output_type: type | Dict[str, type] = None, mmap: bool = False, mmap_path: str = None, materialize: bool = True, pbar: bool = False, **kwargs) DataPanel [source]#
Update the columns of the dataset.
- property columns#
Column names in the DataPanel.
- property data: BlockManager#
Get the underlying data (excluding invisible rows).
To access underlying data with invisible rows, use _data.
- logdir: Path = PosixPath('/home/docs/meerkat')#
- property ncols#
Number of columns in the DataPanel.
- property nrows#
Number of rows in the DataPanel.
- property shape#
Shape of the DataPanel (num_rows, num_columns).
- class FileCell(transform: callable | None = None, loader: callable | None = None, data: str | None = None, base_dir: str | None = None)[source]#
Bases: FileLoaderMixin, LambdaCell
- property absolute_path#
- class FileColumn(data: Sequence[str] | None = None, transform: callable | None = None, loader: callable | None = None, base_dir: str | None = None, *args, **kwargs)[source]#
Bases: FileLoaderMixin, LambdaColumn
A column where each cell represents a file stored on disk or the web. The underlying data is a PandasSeriesColumn of strings, where each string is the path to a file. The column materializes the files into memory when indexed. If the column is lazily indexed with the lz indexer, the files are not materialized and a FileCell or a FileColumn is returned instead.
- Parameters:
data (Sequence[str]) – A list of filepaths.
transform (callable) – A function that transforms the loaded data (e.g. torchvision.transforms.functional.center_crop).
Warning
In order for the column to be serializable, the transform function must be pickleable.
loader (callable) – A callable with signature def loader(filepath: str) -> PIL.Image:. Defaults to torchvision.datasets.folder.default_loader.
Warning
In order for the column to be serializable with write(), the loader function must be pickleable.
base_dir (str) – A base directory that the paths in data are relative to. If None, the paths are assumed to be absolute.
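The deferred-loading pattern described above (store paths, materialize a cell only when it is indexed) can be sketched without Meerkat. SimpleFileColumn is a hypothetical stand-in, using text files rather than images so the sketch stays self-contained:

```python
import os
import tempfile
from typing import Callable, Optional, Sequence

class SimpleFileColumn:
    """Minimal sketch of the FileColumn pattern: store paths, load on access."""

    def __init__(self, data: Sequence[str], loader: Callable[[str], object],
                 transform: Optional[Callable] = None, base_dir: Optional[str] = None):
        self.data = list(data)
        self.loader = loader
        self.transform = transform
        self.base_dir = base_dir

    def __getitem__(self, idx: int):
        path = self.data[idx]
        if self.base_dir is not None:
            path = os.path.join(self.base_dir, path)  # paths are relative to base_dir
        out = self.loader(path)  # materialized only on access
        return self.transform(out) if self.transform is not None else out

# Demo: a text file stands in for an image or audio file.
d = tempfile.mkdtemp()
with open(os.path.join(d, "a.txt"), "w") as f:
    f.write("hello")
col = SimpleFileColumn(["a.txt"], loader=lambda p: open(p).read(),
                       transform=str.upper, base_dir=d)
```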
- classmethod from_filepaths(filepaths: Sequence[str], loader: callable | None = None, transform: callable | None = None, base_dir: str | None = None, *args, **kwargs)[source]#
- is_equal(other: AbstractColumn) bool [source]#
Tests whether two columns are equal.
- Parameters:
other (AbstractColumn) – The column to compare against.
- class ImageColumn(data: Sequence[str] | None = None, transform: callable | None = None, loader: callable | None = None, base_dir: str | None = None, *args, **kwargs)[source]#
Bases: FileColumn
A column where each cell represents an image stored on disk. The underlying data is a PandasSeriesColumn of strings, where each string is the path to an image. The column materializes the images into memory when indexed. If the column is lazily indexed with the lz indexer, the images are not materialized and an ImageCell or an ImageColumn is returned instead.
- Parameters:
data (Sequence[str]) – A list of filepaths to images.
transform (callable) – A function that transforms the image (e.g. torchvision.transforms.functional.center_crop).
Warning
In order for the column to be serializable, the transform function must be pickleable.
loader (callable) – A callable with signature def loader(filepath: str) -> PIL.Image:. Defaults to torchvision.datasets.folder.default_loader.
Warning
In order for the column to be serializable with write(), the loader function must be pickleable.
base_dir (str) – A base directory that the paths in data are relative to. If None, the paths are assumed to be absolute.
- class LambdaCell(fn: callable | None = None, data: any | None = None)[source]#
Bases: AbstractCell
- property data: object#
Get the data associated with this cell.
- class LambdaColumn(data: DataPanel | AbstractColumn, fn: callable | None = None, output_type: type | None = None, *args, **kwargs)[source]#
Bases: AbstractColumn
- static concat(columns: Sequence[LambdaColumn])[source]#
- fn(data: object)[source]#
Subclasses like ImageColumn should be able to implement their own version.
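The idea behind LambdaColumn (wrap existing data and apply fn only when a cell is indexed, rather than eagerly) can be sketched in a few lines. SimpleLambdaColumn is a hypothetical illustration, not Meerkat's implementation:

```python
from typing import Callable, Sequence

class SimpleLambdaColumn:
    """Minimal sketch of the LambdaColumn idea: apply `fn` lazily on indexing."""

    def __init__(self, data: Sequence, fn: Callable):
        self.data = data
        self.fn = fn

    def __getitem__(self, idx: int):
        # Computed on access; results are never stored in the column.
        return self.fn(self.data[idx])

    def __len__(self) -> int:
        return len(self.data)

squares = SimpleLambdaColumn([1, 2, 3], fn=lambda x: x * x)
```

A subclass would override fn rather than pass a callable, which is the pattern the docstring above alludes to for ImageColumn.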
- is_equal(other: AbstractColumn) bool [source]#
Tests whether two columns are equal.
- Parameters:
other (AbstractColumn) – The column to compare against.
- class ListColumn(data: Sequence | None = None, *args, **kwargs)[source]#
Bases: AbstractColumn
- batch(batch_size: int = 1, drop_last_batch: bool = False, collate: bool = True, *args, **kwargs)[source]#
Batch the column.
- Parameters:
batch_size – integer batch size
drop_last_batch – drop the last batch if it is smaller than batch_size
collate – whether to collate the returned batches
- Returns:
batches of data
- classmethod concat(columns: Sequence[ListColumn])[source]#
- default_formatter()#
- is_equal(other: AbstractColumn) bool [source]#
Tests whether two columns are equal.
- Parameters:
other (AbstractColumn) – The column to compare against.
- class MedicalVolumeCell(paths: str | Path | PathLike | Sequence[str | Path | PathLike], loader: Callable | None = None, transform: Callable | None = None, cache_metadata: bool = False, *args, **kwargs)[source]#
Bases: PathsMixin, AbstractCell
Interface for loading medical volume data.
Examples
# Specify xray dicoms with default orientation ("SI", "AP"):
>>> cell = MedicalVolumeCell("/path/to/xray.dcm", loader=DicomReader(group_by=None, default_ornt=("SI", "AP")))
# Load multi-echo MRI volumes
>>> cell = MedicalVolumeCell("/path/to/mri/scan/dir", loader=DicomReader(group_by="EchoNumbers"))
- get(*args, cache_metadata: bool | None = None, **kwargs)[source]#
Load and return the data that this cell represents.
- class MedicalVolumeColumn(*args, **kwargs)[source]#
Bases: CellColumn
- class NumpyArrayColumn(data: Sequence, *args, **kwargs)[source]#
Bases: AbstractColumn, NDArrayOperatorsMixin
- block_class#
alias of NumpyBlock
- classmethod concat(columns: Sequence[NumpyArrayColumn])[source]#
- classmethod from_npy(path, mmap_mode=None, allow_pickle=False, fix_imports=True, encoding='ASCII')[source]#
- classmethod get_writer(mmap: bool = False, template: AbstractColumn | None = None)[source]#
- is_equal(other: AbstractColumn) bool [source]#
Tests whether two columns are equal.
- Parameters:
other (AbstractColumn) – The column to compare against.
- to_tensor() Tensor [source]#
Convert the column to a torch.Tensor. Use column.to_tensor() instead of torch.tensor(column), which is very slow.
- property is_mmap#
- class PandasSeriesColumn(data: Sequence | None = None, collate_fn: Callable | None = None, formatter: Callable | None = None, *args, **kwargs)[source]#
Bases: AbstractColumn, NDArrayOperatorsMixin
- block_class#
alias of PandasBlock
- cat#
alias of _MeerkatCategoricalAccessor
- dt#
alias of _MeerkatCombinedDatetimelikeProperties
- str#
alias of _MeerkatStringMethods
- classmethod concat(columns: Sequence[PandasSeriesColumn])[source]#
- is_equal(other: AbstractColumn) bool [source]#
Tests whether two columns are equal.
- Parameters:
other (AbstractColumn) – The column to compare against.
- class SpacyColumn(data: Sequence[spacy_tokens.Doc] = None, *args, **kwargs)[source]#
Bases: ListColumn
- classmethod from_texts(texts: Sequence[str], lang: str = 'en_core_web_sm', *args, **kwargs)[source]#
- classmethod read(path: str, nlp: spacy.language.Language = None, lang: str = None, *args, **kwargs) SpacyColumn [source]#
- property docs#
- property tokens#
- class TensorColumn(data: Sequence | None = None, *args, **kwargs)[source]#
Bases: NDArrayOperatorsMixin, AbstractColumn
- block_class#
alias of TensorBlock
- classmethod concat(columns: Sequence[TensorColumn])[source]#
- classmethod from_data(data: Sequence | ndarray | Series | Tensor | AbstractColumn)[source]#
Convert data to a TensorColumn.
- classmethod get_writer(mmap: bool = False, template: AbstractColumn | None = None)[source]#
- is_equal(other: AbstractColumn) bool [source]#
Tests whether two columns are equal.
- Parameters:
other (AbstractColumn) – The column to compare against.
- class VideoColumn(*args, **kwargs)[source]#
Bases: CellColumn
Interface for creating a CellColumn from VideoCell objects.
- concat(objs: Sequence[DataPanel] | Sequence[AbstractColumn], axis: str | int = 'rows', suffixes: Tuple[str] = None, overwrite: bool = False) DataPanel | AbstractColumn [source]#
Concatenate a sequence of columns or a sequence of DataPanels. If the sequence is empty, returns an empty DataPanel.
- If concatenating columns, all columns must be of the same type (e.g. all ListColumn).
- If concatenating DataPanels along axis 0 (rows), all DataPanels must have the same set of columns.
- If concatenating DataPanels along axis 1 (columns), all DataPanels must have the same length and cannot share any column names.
- Parameters:
objs (Union[Sequence[DataPanel], Sequence[AbstractColumn]]) – sequence of columns or DataPanels.
axis (Union[str, int]) – The axis along which to concatenate. Ignored if concatenating columns.
- Returns:
concatenated DataPanel or column
- Return type:
Union[DataPanel, AbstractColumn]
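The two concatenation modes and their constraints can be sketched with plain dicts of lists standing in for DataPanels (concat_rows and concat_columns are hypothetical helpers, not Meerkat functions):

```python
from typing import Dict, Sequence

def concat_rows(dps: Sequence[Dict[str, list]]) -> Dict[str, list]:
    """Axis 0 (rows): every panel must have the same set of columns."""
    if not dps:
        return {}
    cols = set(dps[0])
    assert all(set(dp) == cols for dp in dps), "same set of columns required"
    return {c: [v for dp in dps for v in dp[c]] for c in cols}

def concat_columns(dps: Sequence[Dict[str, list]]) -> Dict[str, list]:
    """Axis 1 (columns): equal lengths, no shared column names."""
    lengths = {len(next(iter(dp.values()))) for dp in dps if dp}
    assert len(lengths) <= 1, "all panels must have the same length"
    out: Dict[str, list] = {}
    for dp in dps:
        assert not (set(dp) & set(out)), "column names must be disjoint"
        out.update(dp)
    return out
```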
- merge(left: DataPanel, right: DataPanel, how: str = 'inner', on: str | List[str] = None, left_on: str | List[str] = None, right_on: str | List[str] = None, sort: bool = False, suffixes: Sequence[str] = ('_x', '_y'), validate=None)[source]#
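The how, on, and suffixes parameters above follow pandas-style merge semantics. A pure-Python sketch of the how='inner' case on a single key column (inner_merge is a hypothetical helper for illustration only):

```python
from typing import Dict

def inner_merge(left: Dict[str, list], right: Dict[str, list], on: str,
                suffixes=("_x", "_y")) -> Dict[str, list]:
    """Inner merge on one key column; overlapping non-key names get suffixes."""
    overlap = (set(left) & set(right)) - {on}

    def name(col: str, suffix: str) -> str:
        return col + suffix if col in overlap else col

    out_cols = ([on]
                + [name(c, suffixes[0]) for c in left if c != on]
                + [name(c, suffixes[1]) for c in right if c != on])
    out = {c: [] for c in out_cols}
    for i in range(len(left[on])):
        for j in range(len(right[on])):
            if left[on][i] != right[on][j]:
                continue  # inner join keeps only matching keys
            out[on].append(left[on][i])
            for c in left:
                if c != on:
                    out[name(c, suffixes[0])].append(left[c][i])
            for c in right:
                if c != on:
                    out[name(c, suffixes[1])].append(right[c][j])
    return out
```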
Subpackages#
- meerkat.block package
- Submodules
- meerkat.block.abstract module
- meerkat.block.arrow_block module
- meerkat.block.manager module
BlockManager
BlockManager.add_column()
BlockManager.apply()
BlockManager.consolidate()
BlockManager.copy()
BlockManager.from_dict()
BlockManager.get_block_ref()
BlockManager.read()
BlockManager.remove()
BlockManager.reorder()
BlockManager.update()
BlockManager.view()
BlockManager.write()
BlockManager.ncols
BlockManager.nrows
- meerkat.block.numpy_block module
- meerkat.block.pandas_block module
- meerkat.block.ref module
- meerkat.block.tensor_block module
- meerkat.cells package
- meerkat.columns package
- Submodules
- meerkat.columns.abstract module
AbstractColumn
AbstractColumn.append()
AbstractColumn.batch()
AbstractColumn.concat()
AbstractColumn.filter()
AbstractColumn.from_data()
AbstractColumn.full_length()
AbstractColumn.get_writer()
AbstractColumn.head()
AbstractColumn.is_equal()
AbstractColumn.streamlit()
AbstractColumn.tail()
AbstractColumn.to_pandas()
AbstractColumn.Columnable
AbstractColumn.data
AbstractColumn.formatter
AbstractColumn.is_mmap
AbstractColumn.logdir
AbstractColumn.metadata
- meerkat.columns.arrow_column module
- meerkat.columns.cell_column module
- meerkat.columns.image_column module
- meerkat.columns.lambda_column module
- meerkat.columns.list_column module
- meerkat.columns.numpy_column module
- meerkat.columns.pandas_column module
- meerkat.columns.spacy_column module
- meerkat.columns.tensor_column module
- meerkat.columns.video_column module
- meerkat.columns.volume_column module
- meerkat.contrib package
- Subpackages
- Submodules
- meerkat.contrib.celeba module
- meerkat.contrib.dew module
- meerkat.contrib.imagenet module
- meerkat.contrib.imagenette module
- meerkat.contrib.registry module
- meerkat.contrib.siim_cxr module
- meerkat.contrib.visual_genome module
- meerkat.logging package
- meerkat.mixins package
- meerkat.ml package
- Submodules
- meerkat.ml.activation module
- meerkat.ml.callbacks module
- meerkat.ml.embedding_column module
- meerkat.ml.huggingfacemodel module
- meerkat.ml.instances_column module
- meerkat.ml.metrics module
- meerkat.ml.model module
- meerkat.ml.prediction_column module
- meerkat.ml.segmentation_column module
- meerkat.ml.tensormodel module
- meerkat.ops package
- meerkat.pipelines package
- meerkat.tools package
- meerkat.writers package
Submodules#
meerkat.config module#
meerkat.datapanel module#
DataPanel class.
meerkat.errors module#
meerkat.provenance module#
- class ProvenanceNode[source]#
Bases: object
- add_child(node: ProvenanceNode, key: Tuple)[source]#
- add_parent(node: ProvenanceNode, key: Tuple)[source]#
- property children#
- property last_parent#
- property parents#
- class ProvenanceObjNode(obj: ProvenanceMixin)[source]#
Bases: ProvenanceNode
- class ProvenanceOpNode(fn: callable, inputs: dict, outputs: object, captured_args: dict)[source]#
Bases: ProvenanceNode
- visualize_provenance(obj: ProvenanceObjNode | ProvenanceOpNode, show_columns: bool = False, last_parent_only: bool = False)[source]#