meerkat.columns package
Submodules
meerkat.columns.abstract module
- class AbstractColumn(data: Optional[Sequence] = None, collate_fn: Optional[Callable] = None, formatter: Optional[Callable] = None, *args, **kwargs)[source]
Bases:
meerkat.mixins.blockable.BlockableMixin, meerkat.mixins.cloneable.CloneableMixin, meerkat.mixins.collate.CollateMixin, meerkat.mixins.io.ColumnIOMixin, meerkat.mixins.inspect_fn.FunctionInspectorMixin, meerkat.mixins.lambdable.LambdaMixin, meerkat.mixins.mapping.MappableMixin, meerkat.mixins.materialize.MaterializationMixin, meerkat.provenance.ProvenanceMixin, abc.ABC
An abstract class for Meerkat columns.
- append(column: meerkat.columns.abstract.AbstractColumn) None[source]
- argsort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.abstract.AbstractColumn[source]
Return the indices that would sort the column.
- Parameters
ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.
kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.
- Returns
A view of the column with the sorted data.
- Return type
AbstractColumn
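The relationship between argsort and sort can be illustrated in plain Python. This is a minimal sketch of the semantics described above, not Meerkat's implementation; the helper name `argsort_list` is hypothetical and only the `ascending` flag is modelled.

```python
from typing import List, Sequence


def argsort_list(values: Sequence, ascending: bool = True) -> List[int]:
    """Return the indices that would sort `values` (hypothetical helper)."""
    # sorted() orders the positions 0..len-1 by the value stored at each position.
    return sorted(range(len(values)), key=lambda i: values[i], reverse=not ascending)


values = [30, 10, 20]
order = argsort_list(values)                      # positions in ascending value order
sorted_values = [values[i] for i in order]        # indexing with `order` sorts the data
descending_order = argsort_list(values, ascending=False)
```

Indexing the original data with the returned indices yields the sorted data, which is why an argsort result can be reused to reorder other columns of the same length.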
- batch(batch_size: int = 1, drop_last_batch: bool = False, collate: bool = True, num_workers: int = 0, materialize: bool = True, *args, **kwargs)[source]
Batch the column.
- Parameters
batch_size – integer batch size
drop_last_batch – drop the last batch if it's smaller than batch_size
collate – whether to collate the returned batches
- Returns
batches of data
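The effect of batch_size and drop_last_batch can be sketched with a small generator. This is an illustration in plain Python under the assumption that batches are consecutive slices of the data; `iter_batches` is a hypothetical helper, and collation, workers and materialization are not modelled.

```python
from typing import Iterator, List, Sequence


def iter_batches(
    data: Sequence, batch_size: int = 1, drop_last_batch: bool = False
) -> Iterator[List]:
    """Yield consecutive batches of `data` (hypothetical helper)."""
    for start in range(0, len(data), batch_size):
        batch = list(data[start:start + batch_size])
        # Skip a trailing batch that is smaller than batch_size if requested.
        if drop_last_batch and len(batch) < batch_size:
            continue
        yield batch


all_batches = list(iter_batches([1, 2, 3, 4, 5], batch_size=2))
full_batches = list(iter_batches([1, 2, 3, 4, 5], batch_size=2, drop_last_batch=True))
```

With five elements and a batch size of two, the first call keeps the final single-element batch and the second call drops it.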
- static concat(columns: Sequence[meerkat.columns.abstract.AbstractColumn]) None[source]
- filter(function: Callable, with_indices=False, input_columns: Optional[Union[str, List[str]]] = None, is_batched_fn: bool = False, batch_size: Optional[int] = 1, drop_last_batch: bool = False, num_workers: Optional[int] = 0, materialize: bool = True, pbar: bool = False, **kwargs) Optional[meerkat.columns.abstract.AbstractColumn][source]
Filter the elements of the column using a function.
- classmethod from_data(data: Union[Columnable, AbstractColumn])[source]
Convert data to a meerkat column using the appropriate Column type.
- classmethod get_writer(mmap: bool = False, template: Optional[meerkat.columns.abstract.AbstractColumn] = None)[source]
- head(n: int = 5) meerkat.columns.abstract.AbstractColumn[source]
Get the first n examples of the column.
- is_equal(other: meerkat.columns.abstract.AbstractColumn) bool[source]
Tests whether two columns are equal.
- Parameters
other (AbstractColumn) – [description]
- sample(n: Optional[int] = None, frac: Optional[float] = None, replace: bool = False, weights: Optional[Union[str, numpy.ndarray]] = None, random_state: Optional[Union[int, numpy.random.mtrand.RandomState]] = None) meerkat.columns.abstract.AbstractColumn[source]
Select a random sample of rows from Column. Roughly equivalent to sample in Pandas https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html.
- Parameters
n (int) – Number of samples to draw. If frac is specified, this parameter should not be passed. Defaults to 1 if frac is not passed.
frac (float) – Fraction of rows to sample. If n is specified, this parameter should not be passed.
replace (bool) – Sample with or without replacement. Defaults to False.
weights (np.ndarray) – Weights to use for sampling. If None (default), the rows will be sampled uniformly. If a numpy array, the sample will be weighted accordingly. If weights do not sum to 1 they will be normalized to sum to 1.
random_state (Union[int, np.random.RandomState]) – Random state or seed to use for sampling.
- Returns
A random sample of rows from the column.
- Return type
AbstractColumn
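The interplay of n, frac and weights described above can be sketched with the standard library. This is not Meerkat's implementation: `sample_indices` is a hypothetical helper, it only covers sampling without replacement, and truncating `frac * num_rows` to an integer is an assumption.

```python
import random
from typing import List, Optional, Sequence


def sample_indices(
    num_rows: int,
    n: Optional[int] = None,
    frac: Optional[float] = None,
    weights: Optional[Sequence[float]] = None,
    seed: Optional[int] = None,
) -> List[int]:
    """Draw row indices without replacement (hypothetical helper)."""
    if n is not None and frac is not None:
        raise ValueError("Pass either n or frac, not both.")
    if frac is not None:
        n = int(frac * num_rows)  # assumption: truncate to a whole number of rows
    elif n is None:
        n = 1  # documented default when frac is not passed
    if n > num_rows:
        raise ValueError("Cannot draw more rows than exist without replacement.")
    rng = random.Random(seed)
    if weights is None:
        return rng.sample(range(num_rows), n)  # uniform sampling
    total = sum(weights)
    probs = [w / total for w in weights]  # normalize so the weights sum to 1
    remaining = list(range(num_rows))
    chosen = []
    for _ in range(n):
        # Draw one index at a time, then remove it so it cannot be drawn again.
        pick = rng.choices(remaining, weights=[probs[i] for i in remaining], k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


weighted = sample_indices(4, n=2, weights=[0, 0, 3, 1], seed=0)
half = sample_indices(10, frac=0.5, seed=0)
```

Rows with zero weight are never selected, and passing the same seed reproduces the same draw.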
- sort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.abstract.AbstractColumn[source]
Return a sorted view of the column.
- Parameters
ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.
kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.
- Returns
A view of the column with the sorted data.
- Return type
AbstractColumn
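The kind options listed above match the names NumPy uses for its sorting algorithms; in NumPy, 'mergesort' and 'stable' preserve the relative order of equal elements, while 'quicksort' and 'heapsort' do not guarantee it. Whether Meerkat forwards kind to NumPy unchanged is an assumption here. The sketch below shows what stability means, using Python's built-in sorted(), which is itself stable.

```python
# Each record is (label, key); two pairs of records share the same key.
records = [("b", 2), ("a", 1), ("c", 2), ("d", 1)]

# A stable sort keeps tied records in their input order:
# "a" stays before "d", and "b" stays before "c".
by_key = sorted(records, key=lambda record: record[1])
labels_in_order = [label for label, _ in by_key]
```

Stability matters when a column is sorted after an earlier ordering that should be preserved among ties.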
- tail(n: int = 5) meerkat.columns.abstract.AbstractColumn[source]
Get the last n examples of the column.
- Columnable
alias of
Union[Sequence,numpy.ndarray,pandas.core.series.Series,torch.Tensor]
- property data
Get the underlying data.
- property formatter: Callable
- property is_mmap
- logdir: pathlib.Path = PosixPath('/home/docs/meerkat')
- property metadata
meerkat.columns.arrow_column module
- class ArrowArrayColumn(data: Sequence, *args, **kwargs)[source]
Bases:
meerkat.columns.abstract.AbstractColumn
- block_class
alias of
meerkat.block.arrow_block.ArrowBlock
- classmethod concat(columns: Sequence[meerkat.columns.arrow_column.ArrowArrayColumn])[source]
- is_equal(other: meerkat.columns.abstract.AbstractColumn) bool[source]
Tests whether two columns are equal.
- Parameters
other (AbstractColumn) – [description]
meerkat.columns.audio_column module
- class AudioColumn(data: Optional[Sequence[str]] = None, transform: Optional[callable] = None, loader: Optional[callable] = None, base_dir: Optional[str] = None, *args, **kwargs)[source]
Bases:
meerkat.columns.file_column.FileColumn
A lambda column where each cell represents an audio file on disk. The underlying data is a PandasSeriesColumn of strings, where each string is the path to an image. The column materializes the images into memory when indexed. If the column is lazy indexed with the lz indexer, the images are not materialized and a FileCell or an AudioColumn is returned instead.
- Parameters
data (Sequence[str]) – A list of filepaths to images.
transform (callable) – A function that transforms the image (e.g. torchvision.transforms.functional.center_crop).
Warning
In order for the column to be serializable, the transform function must be pickleable.
loader (callable) – A callable with signature def loader(filepath: str) -> PIL.Image:. Defaults to torchvision.datasets.folder.default_loader.
Warning
In order for the column to be serializable with write(), the loader function must be pickleable.
base_dir (str) – A base directory that the paths in data are relative to. If None, the paths are assumed to be absolute.
meerkat.columns.cell_column module
- class CellColumn(cells: Optional[Sequence[meerkat.cells.abstract.AbstractCell]] = None, *args, **kwargs)[source]
Bases:
meerkat.columns.abstract.AbstractColumn
- static concat(columns: Sequence[meerkat.columns.cell_column.CellColumn])[source]
- classmethod from_cells(cells: Sequence[meerkat.cells.abstract.AbstractCell], *args, **kwargs)[source]
- is_equal(other: meerkat.columns.abstract.AbstractColumn) bool[source]
Tests whether two columns are equal.
- Parameters
other (AbstractColumn) – [description]
- property cells
meerkat.columns.file_column module
- class FileCell(transform: Optional[callable] = None, loader: Optional[callable] = None, data: Optional[str] = None, base_dir: Optional[str] = None)[source]
Bases:
meerkat.columns.file_column.FileLoaderMixin, meerkat.columns.lambda_column.LambdaCell
- property absolute_path
- class FileColumn(data: Optional[Sequence[str]] = None, transform: Optional[callable] = None, loader: Optional[callable] = None, base_dir: Optional[str] = None, *args, **kwargs)[source]
Bases:
meerkat.columns.file_column.FileLoaderMixin, meerkat.columns.lambda_column.LambdaColumn
A column where each cell represents a file stored on disk or the web. The underlying data is a PandasSeriesColumn of strings, where each string is the path to a file. The column materializes the files into memory when indexed. If the column is lazy indexed with the lz indexer, the files are not materialized and a FileCell or a FileColumn is returned instead.
- Parameters
data (Sequence[str]) – A list of filepaths to images.
transform (callable) – A function that transforms the image (e.g. torchvision.transforms.functional.center_crop).
Warning
In order for the column to be serializable, the transform function must be pickleable.
loader (callable) – A callable with signature def loader(filepath: str) -> PIL.Image:. Defaults to torchvision.datasets.folder.default_loader.
Warning
In order for the column to be serializable with write(), the loader function must be pickleable.
base_dir (str) – A base directory that the paths in data are relative to. If None, the paths are assumed to be absolute.
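The behaviour described for file-backed columns (store only paths, resolve them against base_dir, and call the loader and transform when a cell is indexed) can be sketched in plain Python. The class `LazyFileList` and the function `read_text` are hypothetical stand-ins written for this illustration; they are not part of Meerkat and do not use its API.

```python
import os
import tempfile
from typing import Callable, List, Optional


class LazyFileList:
    """Hypothetical stand-in for a file-backed column: stores paths, loads on index."""

    def __init__(
        self,
        paths: List[str],
        loader: Callable[[str], object],
        transform: Optional[Callable] = None,
        base_dir: Optional[str] = None,
    ):
        self.paths = paths
        self.loader = loader
        self.transform = transform
        self.base_dir = base_dir

    def absolute_path(self, index: int) -> str:
        # Paths are relative to base_dir when it is given, absolute otherwise.
        path = self.paths[index]
        return os.path.join(self.base_dir, path) if self.base_dir is not None else path

    def __getitem__(self, index: int):
        # Indexing materializes the file: load it, then apply the transform.
        value = self.loader(self.absolute_path(index))
        return self.transform(value) if self.transform is not None else value


def read_text(filepath: str) -> str:
    """Example loader: read a whole text file into memory."""
    with open(filepath) as f:
        return f.read()


with tempfile.TemporaryDirectory() as base_dir:
    with open(os.path.join(base_dir, "a.txt"), "w") as f:
        f.write("hello")
    column = LazyFileList(["a.txt"], loader=read_text, transform=str.upper, base_dir=base_dir)
    stored = column.paths[0]   # only the relative path is held in memory
    loaded = column[0]         # the file is read and transformed on access
```

Keeping only paths in memory means the cost of loading is paid per access rather than at construction time.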
- classmethod from_filepaths(filepaths: Sequence[str], loader: Optional[callable] = None, transform: Optional[callable] = None, base_dir: Optional[str] = None, *args, **kwargs)[source]
- is_equal(other: meerkat.columns.abstract.AbstractColumn) bool[source]
Tests whether two columns are equal.
- Parameters
other (AbstractColumn) – [description]
meerkat.columns.image_column module
- class ImageColumn(data: Optional[Sequence[str]] = None, transform: Optional[callable] = None, loader: Optional[callable] = None, base_dir: Optional[str] = None, *args, **kwargs)[source]
Bases:
meerkat.columns.file_column.FileColumn
A column where each cell represents an image stored on disk. The underlying data is a PandasSeriesColumn of strings, where each string is the path to an image. The column materializes the images into memory when indexed. If the column is lazy indexed with the lz indexer, the images are not materialized and an ImageCell or an ImageColumn is returned instead.
- Parameters
data (Sequence[str]) – A list of filepaths to images.
transform (callable) – A function that transforms the image (e.g. torchvision.transforms.functional.center_crop).
Warning
In order for the column to be serializable, the transform function must be pickleable.
loader (callable) – A callable with signature def loader(filepath: str) -> PIL.Image:. Defaults to torchvision.datasets.folder.default_loader.
Warning
In order for the column to be serializable with write(), the loader function must be pickleable.
base_dir (str) – A base directory that the paths in data are relative to. If None, the paths are assumed to be absolute.
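The pickleability warnings above come down to how Python's pickle module handles functions: a function that can be imported by name is pickled by reference, while a lambda cannot be pickled. The sketch below uses only the standard library; `operator.mul` stands in for a real image transform, and functools.partial is shown as one way to bind extra arguments without a lambda.

```python
import operator
import pickle
from functools import partial

# A partial over an importable function can be pickled and restored.
picklable_transform = partial(operator.mul, 2.0)
restored = pickle.loads(pickle.dumps(picklable_transform))
doubled = restored(3.0)

# A lambda has no importable name, so pickling it fails.
try:
    pickle.dumps(lambda value: value * 2.0)
    lambda_is_picklable = True
except (pickle.PicklingError, AttributeError):
    lambda_is_picklable = False
```

In practice this suggests defining transforms and loaders as module-level functions, or as partials over them, rather than as lambdas or nested functions.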
meerkat.columns.lambda_column module
- class LambdaCell(fn: Optional[callable] = None, data: Optional[any] = None)[source]
Bases:
meerkat.cells.abstract.AbstractCell
- property data: object
Get the data associated with this cell.
- class LambdaColumn(data: Union[meerkat.datapanel.DataPanel, meerkat.columns.abstract.AbstractColumn], fn: Optional[callable] = None, output_type: Optional[type] = None, *args, **kwargs)[source]
Bases:
meerkat.columns.abstract.AbstractColumn
- static concat(columns: Sequence[meerkat.columns.lambda_column.LambdaColumn])[source]
- fn(data: object)[source]
Subclasses like ImageColumn should be able to implement their own version.
- is_equal(other: meerkat.columns.abstract.AbstractColumn) bool[source]
Tests whether two columns are equal.
- Parameters
other (AbstractColumn) – [description]
meerkat.columns.list_column module
- class ListColumn(data: Optional[Sequence] = None, *args, **kwargs)[source]
Bases:
meerkat.columns.abstract.AbstractColumn
- batch(batch_size: int = 1, drop_last_batch: bool = False, collate: bool = True, *args, **kwargs)[source]
Batch the column.
- Parameters
batch_size – integer batch size
drop_last_batch – drop the last batch if it's smaller than batch_size
collate – whether to collate the returned batches
- Returns
batches of data
- classmethod concat(columns: Sequence[meerkat.columns.list_column.ListColumn])[source]
- default_formatter()
- is_equal(other: meerkat.columns.abstract.AbstractColumn) bool[source]
Tests whether two columns are equal.
- Parameters
other (AbstractColumn) – [description]
meerkat.columns.numpy_column module
- class NumpyArrayColumn(data: Sequence, *args, **kwargs)[source]
Bases:
meerkat.columns.abstract.AbstractColumn, numpy.lib.mixins.NDArrayOperatorsMixin
- block_class
alias of
meerkat.block.numpy_block.NumpyBlock
- argsort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.numpy_column.NumpyArrayColumn[source]
Return the indices that would sort the column.
- Parameters
ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.
kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.
- Returns
A view of the column with the sorted data.
- Return type
NumpyArrayColumn
For now, raises an error when the input array has more than one dimension.
- classmethod concat(columns: Sequence[meerkat.columns.numpy_column.NumpyArrayColumn])[source]
- classmethod from_npy(path, mmap_mode=None, allow_pickle=False, fix_imports=True, encoding='ASCII')[source]
- classmethod get_writer(mmap: bool = False, template: Optional[meerkat.columns.abstract.AbstractColumn] = None)[source]
- is_equal(other: meerkat.columns.abstract.AbstractColumn) bool[source]
Tests whether two columns are equal.
- Parameters
other (AbstractColumn) – [description]
- sort(ascending: Union[bool, List[bool]] = True, axis: int = -1, kind: str = 'quicksort', order: Optional[Union[str, List[str]]] = None) meerkat.columns.numpy_column.NumpyArrayColumn[source]
Return a sorted view of the column.
- Parameters
ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.
kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.
- Returns
A view of the column with the sorted data.
- Return type
NumpyArrayColumn
- to_tensor() torch.Tensor[source]
Use column.to_tensor() instead of torch.tensor(column), which is very slow.
- property is_mmap
meerkat.columns.pandas_column module
- class PandasSeriesColumn(data: Optional[Sequence] = None, collate_fn: Optional[Callable] = None, formatter: Optional[Callable] = None, *args, **kwargs)[source]
Bases:
meerkat.columns.abstract.AbstractColumn, numpy.lib.mixins.NDArrayOperatorsMixin
- block_class
- cat
alias of
meerkat.columns.pandas_column._MeerkatCategoricalAccessor
- dt
alias of
meerkat.columns.pandas_column._MeerkatCombinedDatetimelikeProperties
- str
alias of
meerkat.columns.pandas_column._MeerkatStringMethods
- argsort(ascending: bool = True, kind: str = 'quicksort') meerkat.columns.pandas_column.PandasSeriesColumn[source]
Return the indices that would sort the column.
- Parameters
ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.
kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.
- Returns
A view of the column with the sorted data.
- Return type
PandasSeriesColumn
For now, raises an error when the input array has more than one dimension.
- classmethod concat(columns: Sequence[meerkat.columns.pandas_column.PandasSeriesColumn])[source]
- is_equal(other: meerkat.columns.abstract.AbstractColumn) bool[source]
Tests whether two columns are equal.
- Parameters
other (AbstractColumn) – [description]
- sort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.pandas_column.PandasSeriesColumn[source]
Return a sorted view of the column.
- Parameters
ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.
kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.
- Returns
A view of the column with the sorted data.
- Return type
PandasSeriesColumn
meerkat.columns.spacy_column module
- class SpacyColumn(data: Sequence[spacy_tokens.Doc] = None, *args, **kwargs)[source]
Bases:
meerkat.columns.list_column.ListColumn
- classmethod from_texts(texts: Sequence[str], lang: str = 'en_core_web_sm', *args, **kwargs)[source]
- classmethod read(path: str, nlp: spacy.language.Language = None, lang: str = None, *args, **kwargs) SpacyColumn[source]
- property docs
- property tokens
meerkat.columns.tensor_column module
- class TensorColumn(data: Optional[Sequence] = None, *args, **kwargs)[source]
Bases:
numpy.lib.mixins.NDArrayOperatorsMixin, meerkat.columns.abstract.AbstractColumn
- block_class
- argsort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.tensor_column.TensorColumn[source]
Return the indices that would sort the column.
- Parameters
ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.
kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.
- Returns
A view of the column with the sorted data.
- Return type
TensorColumn
For now, raises an error when the input array has more than one dimension.
- classmethod concat(columns: Sequence[meerkat.columns.tensor_column.TensorColumn])[source]
- classmethod from_data(data: Union[Sequence, numpy.ndarray, pandas.core.series.Series, torch.Tensor, meerkat.columns.abstract.AbstractColumn])[source]
Convert data to a TensorColumn.
- classmethod get_writer(mmap: bool = False, template: Optional[meerkat.columns.abstract.AbstractColumn] = None)[source]
- is_equal(other: meerkat.columns.abstract.AbstractColumn) bool[source]
Tests whether two columns are equal.
- Parameters
other (AbstractColumn) – [description]
- sort(ascending: Union[bool, List[bool]] = True, kind: str = 'quicksort') meerkat.columns.tensor_column.TensorColumn[source]
Return a sorted view of the column.
- Parameters
ascending (Union[bool, List[bool]]) – Whether to sort in ascending or descending order. If a list, must be the same length as by. Defaults to True.
kind (str) – The kind of sort to use. Defaults to ‘quicksort’. Options include ‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’.
- Returns
A view of the column with the sorted data.
- Return type
TensorColumn