pasteur.amalgam.synth.AmalgamSynth#

class pasteur.amalgam.synth.AmalgamSynth(pgm_cls, pgm={'etotal': 2.0}, marginal={'min_chunk': 100, 'mode': 'out_of_core', 'worker_mult': 1}, prompt='', model={'filename': 'Qwen3-8B-Q4_K_M.gguf', 'n_ctx': 40960, 'n_gpu_layers': -1, 'repo_id': 'Qwen/Qwen3-8B-GGUF', 'type': 'hf', 'workers': 1}, rebalance={'fixed': [4, 9, 18, 32], 'u': 7.0, 'unbounded_dp': True}, samples=None, **kwargs)[source]#

Attributes

Methods

bake(data)

Bakes the model based on the data provided (such as creating and modeling a Bayesian network on the data).

fit(data)

Fits the model based on the provided data.

get_factory(*args, **kwargs)

Returns a factory that registers this module to the system.

preprocess(meta, data)

Runs any preprocessing required, such as domain reduction.

sample([n, data, _llm])

Samples n rows, split across the number of partitions set by the partitions attribute.

sample_partition(*, n[, i])

Returns synthetic data in the same format in which the input data was provided.

bake(data)[source]#

Bakes the model based on the data provided (such as creating and modeling a Bayesian network on the data).

Attributes provide context about the data columns, including hierarchical relationships, NA values, etc.

fit(data)[source]#

Fits the model based on the provided data.

Data and Ids are dictionaries containing the dataframes with the data.

classmethod get_factory(*args, **kwargs)#

Returns a factory that registers this module to the system.

Any *args and **kwargs passed to this function will be saved and passed to the module’s __init__() method when calling build().
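The deferred construction described above can be sketched as follows. This is a minimal illustration and not pasteur's implementation: `SketchFactory` and `SketchModule` are hypothetical names, and the only behaviour taken from the text is that `get_factory()` saves its arguments and `build()` later forwards them to `__init__()`.

```python
from typing import Any


class SketchFactory:
    """Hypothetical factory: remembers a class and its constructor arguments."""

    def __init__(self, cls: type, *args: Any, **kwargs: Any) -> None:
        self.cls = cls
        self.args = args
        self.kwargs = kwargs

    def build(self) -> Any:
        # The saved arguments reach __init__ only at this point.
        return self.cls(*self.args, **self.kwargs)


class SketchModule:
    """Hypothetical module standing in for a synthesizer class."""

    name = "sketch"

    def __init__(self, etotal: float = 2.0, **kwargs: Any) -> None:
        self.etotal = etotal
        self.extra = kwargs

    @classmethod
    def get_factory(cls, *args: Any, **kwargs: Any) -> SketchFactory:
        # Nothing is instantiated here; the arguments are only stored.
        return SketchFactory(cls, *args, **kwargs)


factory = SketchModule.get_factory(etotal=1.0, samples=100)
module = factory.build()
```

Separating registration from construction in this way lets a system collect module configurations up front and instantiate them only when they are needed.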

in_sample: bool = True#

in_types: list[str] | None = ['json', 'flat']#

name: str = 'amalgam'#

partitions = 1#

preprocess(meta, data)[source]#

Runs any preprocessing required, such as domain reduction.

sample(n=None, data=None, _llm=None)[source]#

Samples n rows, split across the number of partitions set by the partitions attribute.

The return value should be finalized to dict[str, Any], which matches the format of the data provided to the fitting function.

A default implementation is provided that wraps sample_partition() so that pasteur can sample and save partitions in parallel.

Return type:

AmalgamInput
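One way to picture how a default sample() could package sample_partition() for parallel execution is sketched below. Everything here is an assumption made for illustration: the stand-in `sample_partition`, the helper `package_partitions`, the even split of rows, and the use of a thread pool are not taken from pasteur.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from typing import Any, Callable


def sample_partition(*, n: int, i: int = 0) -> dict[str, Any]:
    """Hypothetical stand-in: returns n placeholder rows tagged with partition i."""
    return {"table": [{"partition": i, "row": r} for r in range(n)]}


def package_partitions(n: int, partitions: int) -> list[Callable[[], dict[str, Any]]]:
    """Split n rows across partitions; return one zero-argument task per partition."""
    base, extra = divmod(n, partitions)
    # The first `extra` partitions take one additional row so the sizes sum to n.
    sizes = [base + (1 if i < extra else 0) for i in range(partitions)]
    return [partial(sample_partition, n=size, i=i) for i, size in enumerate(sizes)]


tasks = package_partitions(n=10, partitions=3)

# Each task is independent, so a pool can run them concurrently.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda task: task(), tasks))
```

Because every task already carries its own n and i, a scheduler only has to call it and save the result, with no shared state between partitions.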

sample_partition(*, n, i=0)#

Returns synthetic data in the same format in which the input data was provided.

n sets how many rows should be sampled. Warning: not setting n technically violates differential privacy (DP) for DP-aware algorithms.

i is the partition number. It can be used to vary the random state between partitions, since deterministic sampling with an unchanged state would always return the same data.

Return type:

dict[str, Any]
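The role of i can be illustrated with a small sketch that derives the random state from a fixed seed and the partition number. The names `BASE_SEED` and this stand-in `sample_partition` are hypothetical, and the seed-mixing formula is an assumption chosen for illustration rather than what pasteur does.

```python
import random
from typing import Any

BASE_SEED = 1234  # hypothetical fixed seed shared by all partitions


def sample_partition(*, n: int, i: int = 0) -> dict[str, Any]:
    """Hypothetical stand-in that draws n integers for partition i."""
    # Mixing i into the seed keeps each partition reproducible while
    # giving different partitions different random streams.
    rng = random.Random(BASE_SEED * 1_000_003 + i)
    return {"values": [rng.randrange(1_000_000) for _ in range(n)]}


first = sample_partition(n=5, i=0)
again = sample_partition(n=5, i=0)
other = sample_partition(n=5, i=1)
```

Calling the function twice with the same i reproduces the same rows, while a different i yields a different draw, which is what prevents parallel partitions from being copies of one another.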

type = 'json'#