pasteur.marginal.numpy.expand_table
- pasteur.marginal.numpy.expand_table(attrs={}, tables={}, *, prealloc=None)
Takes the raw index-encoded table and precalculates every column-height combination of the hierarchical attributes, with special versions for calculating marginals over attributes that have an NA value.
Returns:
cols: A dictionary of lists, accessed as cols[name][height], that gives each row’s group for column <name> at height <height>.
cols_noncommon: A second version of cols, offset by 1 or 2 depending on whether the parent attribute has NA values and/or unknown values (+1 for each).
domain: The same structure, containing the domain of each <name>, <height> combination.
Return type: tuple[dict[tuple[str | tuple[str, int] | None, str, bool], list[ndarray]], CalculationInfo]
It is then possible to calculate the marginal of an attribute with columns a, b, c, heights d, e, f, and NA values as follows:
```
groups = cols[a][d] + domain[a][d] * (cols_noncommon[b][e] + (domain[b][e] - 1) * cols_noncommon[c][f])
np.bincount(groups, minlength=domain[a][d] * (domain[b][e] - 1) * (domain[c][f] - 1))
```
The above expression only requires one vector multiplication and one vector addition per attribute added to the marginal, with bincount() scaling linearly with dataset size n.
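The expression above is a mixed-radix (Horner-style) encoding of several index columns into one flat group id. The following is a minimal sketch of that idea on synthetic data, not the library's implementation: `radices` and `columns` are hypothetical stand-ins for the `domain` and `cols` entries, and the NA offsets of `cols_noncommon` are left out for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Hypothetical domains: number of distinct groups each column can take.
radices = [3, 4, 2]
# Hypothetical index-encoded columns, one integer group id per row.
columns = [rng.integers(0, r, size=n) for r in radices]

# Horner-style encoding: groups = x0 + r0 * (x1 + r1 * x2).
# Each extra column costs one vector multiplication and one vector addition.
groups = columns[-1].copy()
for col, radix in zip(reversed(columns[:-1]), reversed(radices[:-1])):
    groups = col + radix * groups

# One pass over the n rows counts every combination.
counts = np.bincount(groups, minlength=int(np.prod(radices)))

# Column 0 varies fastest in the flat id, so reshape in Fortran order
# to recover the table indexed as marginal[x0, x1, x2].
marginal = counts.reshape(radices, order="F")
```

Passing `minlength` keeps the output size fixed even when the highest-numbered combinations never occur in the data.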
For a dataset of size n=500k with 6 columns in the marginal, this takes 1.3 ms of wall time, compared with 30 ms for np.histogramdd.
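A comparison of this kind can be reproduced roughly with the sketch below. It is an illustration under assumptions: the six domains in `radices` are invented, the columns are random, and the helper names `marginal_bincount` and `marginal_histogramdd` are hypothetical. Absolute timings depend on hardware, dtypes, and how much is precomputed, so they should not be expected to match the figures quoted above.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
n = 500_000
radices = [4, 3, 5, 2, 3, 4]  # hypothetical domains of 6 columns
columns = [rng.integers(0, r, size=n) for r in radices]


def marginal_bincount():
    # Flat group id via Horner-style mixed-radix encoding, then one bincount.
    groups = columns[-1].copy()
    for col, radix in zip(reversed(columns[:-1]), reversed(radices[:-1])):
        groups = col + radix * groups
    counts = np.bincount(groups, minlength=int(np.prod(radices)))
    return counts.reshape(radices, order="F")


def marginal_histogramdd():
    # Same table via the general-purpose N-dimensional histogram.
    sample = np.stack(columns, axis=1)
    hist, _ = np.histogramdd(
        sample, bins=radices, range=[(0, r) for r in radices]
    )
    return hist


# Both approaches should produce the same contingency table.
assert np.array_equal(marginal_bincount(), marginal_histogramdd())

t_bin = timeit.timeit(marginal_bincount, number=5) / 5
t_hist = timeit.timeit(marginal_histogramdd, number=5) / 5
print(f"bincount: {t_bin * 1e3:.2f} ms, histogramdd: {t_hist * 1e3:.2f} ms")
```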