pasteur.marginal.numpy.expand_table


pasteur.marginal.numpy.expand_table(attrs={}, tables={}, *, prealloc=None)

Takes the raw index-encoded table and precalculates all column-height combinations of hierarchical attributes, including special versions for calculating marginals with attributes that have an NA value.

Returns:

cols: A dictionary of lists, accessed as cols[name][height], that gives each row’s group for column <name> at height <height>.

cols_noncommon: A second version of cols, offset by 1 or 2 depending on whether the parent attribute has NA values and/or unknown values (+1 for each).

domain: The same structure, containing the domain of each <name>, <height> combination.

Return type: tuple[dict[tuple[str | tuple[str, int] | None, str, bool], list[ndarray]], CalculationInfo]

It is then possible to calculate the marginal of an attribute with columns a, b, c at heights d, e, f and NA values as follows:

```
# Flat group index for every row (Horner form over the three columns)
groups = cols[a][d] + domain[a][d]*(cols_noncommon[b][e] + (domain[b][e]-1)*cols_noncommon[c][f])
# Count the rows that fall into each group
np.bincount(groups, minlength=domain[a][d]*(domain[b][e] - 1)*(domain[c][f] - 1))
```

The above expression only requires one vector multiplication and one vector addition per attribute added to the marginal, with bincount() scaling linearly with dataset size n.
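The flat-index technique can be tried on its own with synthetic data. The following is a minimal sketch that does not call pasteur: `columns` and `dom` are hypothetical stand-ins for the `cols[name][height]` arrays and `domain[name][height]` values, and `flat_marginal` is an illustrative helper, not part of the library.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Hypothetical stand-ins for cols[name][height] and domain[name][height];
# these arrays are synthetic and are not produced by pasteur.
dom = [4, 3, 5]
columns = [rng.integers(0, d, size=n) for d in dom]


def flat_marginal(columns, dom):
    """Count joint occurrences using a mixed-radix flat index and np.bincount."""
    groups = np.zeros(len(columns[0]), dtype=np.int64)
    stride = 1
    for col, d in zip(columns, dom):
        # One vector multiplication and one vector addition per column.
        groups += stride * col
        stride *= d
    # After the loop, stride equals the product of all domain sizes.
    return np.bincount(groups, minlength=stride)


counts = flat_marginal(columns, dom)

# The first column varies fastest in the flat index, so reshaping with the
# reversed domain sizes gives a table indexed as table[c, b, a].
table = counts.reshape(dom[::-1])
```

The expanded form `a + dom_a*b + dom_a*dom_b*c` used here is algebraically the same as the nested form in the expression above; the nested form just avoids keeping a running stride.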

For a dataset of size n = 500k with 6 columns used in the marginal, it takes 1.3 ms of wall time, compared to 30 ms for np.histogramdd.
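To check that the two approaches produce the same counts, the sketch below compares a flat-index `np.bincount` with `np.histogramdd` on synthetic integer columns. The names `cols_demo` and `dom` are hypothetical and the data is random; no timing figures are reproduced here, since they depend on hardware and data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Synthetic integer-encoded columns and their domain sizes (assumed values).
dom = [3, 4, 2]
cols_demo = [rng.integers(0, d, size=n) for d in dom]

# Flat-index approach, written in the nested (Horner) form.
groups = cols_demo[0] + dom[0] * (cols_demo[1] + dom[1] * cols_demo[2])
flat = np.bincount(groups, minlength=dom[0] * dom[1] * dom[2])

# np.histogramdd with one unit-width bin centred on each integer value.
sample = np.stack(cols_demo, axis=1)  # shape (n, 3)
edges = [np.arange(d + 1) - 0.5 for d in dom]
hist, _ = np.histogramdd(sample, bins=edges)

# hist is indexed as hist[a, b, c]; the flat index lets the first column vary
# fastest, which corresponds to Fortran-order flattening.
same = np.array_equal(flat, hist.ravel(order="F").astype(np.int64))
```

Wrapping both computations in `time.perf_counter()` calls is one way to measure the difference on a given machine.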