pasteur.extras.datasets.adult.AdultDataset

class pasteur.extras.datasets.adult.AdultDataset(**_)

Attributes

catalog

A kedro catalog that represents the dataset's sources.

deps

Defines the tables of the dataset and their dependencies.

folder_name

Specifies the name of the folder in the raw directory that will be used for the dataset's raw sources.

key_deps

Provides the table dependencies (Table, not raw) that are used to create the keys of the dataset.

name

raw_sources

A raw source that can be used to download the dataset.

raw_tables

Returns the raw dependency names of the dataset.

tables

Returns the table names of the dataset.

Methods

bootstrap(raw, dst)

ingest(name, **tables)

Creates the table <name> using the tables provided based on the dependencies.

keys(**tables)

Returns a DataFrame containing only the index column of table "table".

bootstrap(raw, dst)
catalog: dict[str, Any] | str | None = '/home/docs/checkouts/readthedocs.org/user_builds/pasteur/checkouts/latest/src/pasteur/extras/datasets/adult/catalog.yml'

A kedro catalog that represents the dataset’s sources. It can be provided as a dictionary, which is used as is, or as a filepath, in which case the file is loaded and processed by replacing its paths with appropriate ones based on the raw directory and folder name.
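The path replacement described above can be pictured with a small sketch. It is only one possible reading of that step, not Pasteur's actual implementation: the helper `resolve_catalog`, the `filepath` key handling, the sample entries, and the `/data/raw` directory are all assumptions made for illustration.

```python
import posixpath
from typing import Any


def resolve_catalog(
    catalog: dict[str, Any], raw_dir: str, folder_name: str
) -> dict[str, Any]:
    """Hypothetical helper: prefix each entry's relative filepath with
    <raw_dir>/<folder_name>. The input catalog is left unmodified."""
    resolved = {}
    for ds_name, entry in catalog.items():
        entry = dict(entry)  # shallow copy so the caller's dict is untouched
        if "filepath" in entry:
            entry["filepath"] = posixpath.join(
                raw_dir, folder_name, entry["filepath"]
            )
        resolved[ds_name] = entry
    return resolved


# Illustrative entries only; the real catalog.yml may look different.
catalog = {
    "train": {"type": "pandas.CSVDataset", "filepath": "adult.data"},
    "test": {"type": "pandas.CSVDataset", "filepath": "adult.test"},
}
resolved = resolve_catalog(catalog, "/data/raw", "adult")
```

Passing a dictionary directly skips this kind of processing, since a dictionary is used as is.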

deps: dict[str, list[str]] = {'table': ['train', 'test']}

Defines the tables of the dataset and their dependencies, e.g.:

`{"table1": ["raw1", "raw2"], "table2": ["raw3", "raw4"]}`
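As a rough illustration of how such a mapping relates table names to raw dependency names, the sketch below derives both lists from a `deps` dictionary. The helpers `table_names` and `raw_names` are hypothetical and are not claimed to match how the `tables` and `raw_tables` properties are implemented.

```python
def table_names(deps: dict[str, list[str]]) -> list[str]:
    """Hypothetical helper: the tables are the keys of the mapping."""
    return list(deps)


def raw_names(deps: dict[str, list[str]]) -> list[str]:
    """Hypothetical helper: collect every raw dependency once,
    keeping the order in which it first appears."""
    seen = dict.fromkeys(raw for raws in deps.values() for raw in raws)
    return list(seen)


# The value documented for AdultDataset.deps.
adult_deps = {"table": ["train", "test"]}
adult_tables = table_names(adult_deps)
adult_raws = raw_names(adult_deps)
```

Under these assumptions, the Adult dataset has a single table, `table`, built from the raw sources `train` and `test`.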

folder_name: str | None = 'adult'

Specifies the name of the folder in the raw directory that will be used for the dataset’s raw sources. If the folder does not exist, the dataset is disabled (used for packaging).

ingest(name, **tables)

Creates the table <name> using the tables provided based on the dependencies.

The dependencies may be anything and should be defined in the catalog. The raw tables of a dataset are the only kedro datasets explicitly defined by the user.

Can return a DataFrame, a callable that produces a DataFrame, or a dict of callables or DataFrames. If a dict is returned, the table is partitioned using the dict keys.

Warning: all partitioned tables must have the same partitions. Tables may also be left unpartitioned.
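The partitioned return shape can be sketched as follows. This is a minimal illustration that assumes the callables take no arguments, which the text above does not state; `partition_by_split` and `have_same_partitions` are hypothetical helpers, not part of Pasteur.

```python
from typing import Callable

import pandas as pd

# Assumed shape of a partitioned table: partition name -> lazy loader.
Partitioned = dict[str, Callable[[], pd.DataFrame]]


def partition_by_split(train: pd.DataFrame, test: pd.DataFrame) -> Partitioned:
    """Return one zero-argument callable per partition, so each
    partition is only materialized when it is called."""
    return {"train": lambda: train, "test": lambda: test}


def have_same_partitions(*tables: Partitioned) -> bool:
    """Check the constraint from the warning: every partitioned table
    exposes the same set of partition keys."""
    key_sets = [set(t) for t in tables]
    return all(keys == key_sets[0] for keys in key_sets[1:])


train_df = pd.DataFrame({"age": [39, 50]})
test_df = pd.DataFrame({"age": [25]})
parts = partition_by_split(train_df, test_df)
```

Returning callables rather than DataFrames defers the work for each partition until it is needed.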

Tip: use a match statement to fork based on table name to per-table functions.

key_deps: list[str] = ['table']

Provides the table dependencies (Table, not raw) that are used to create the keys of the dataset.

keys(**tables)

Returns a DataFrame containing only the index column of table “table”.

Return type:

DataFrame
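The behaviour described for keys can be approximated in pandas as below. This is a sketch under the assumption that "only the index column" means a DataFrame that keeps the index of `table` and has no data columns; it is written as a free function and is not the library's source.

```python
import pandas as pd


def keys(**tables: pd.DataFrame) -> pd.DataFrame:
    """Return a column-less DataFrame that carries only the index of
    the table named "table" (the single entry in key_deps)."""
    return pd.DataFrame(index=tables["table"].index)


# Hypothetical sample data with a named index.
table = pd.DataFrame(
    {"age": [39, 50, 25]},
    index=pd.RangeIndex(3, name="id"),
)
key_frame = keys(table=table)
```

The result has one row per row of `table` and no data columns, so it identifies rows without carrying any attribute values.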

name: str = 'adult'
raw_sources: dict[str, RawSource] | RawSource | None = (['https://archive.ics.uci.edu/static/public/2/adult.zip'], None, False, None)

A raw source that can be used to download the dataset.

Optionally, multiple sources can be supplied and downloaded with pasteur download <name>.

property raw_tables

Returns the raw dependency names of the dataset.

property tables

Returns the table names of the dataset.