Compositional data analysis utilities tailored for nuee.
This module provides NumPy-only adaptations of scikit-bio’s composition
utilities so they can be used in environments where SciPy is unavailable
(for example, Pyodide). The implementations are derived from the scikit-bio
project (Modified BSD License) with minimal adjustments for integration in
nuee.
Multiplicative zero replacement for compositional data.
Replaces each zero with delta times the component's detection limit
(or, when no limits are given, the column minimum of the strictly positive
values) and scales the non-zero entries down multiplicatively so that each
row sum is preserved exactly.
Parameters:
X (array-like or DataFrame, shape (n, D)) – Compositional data matrix. Zeros mark below-detection-limit values.
detection_limits (array-like of shape (D,), optional) – Per-component detection limits. When None, the column-wise minimum
of strictly positive values is used as a proxy.
delta (float, optional) – Fraction of the detection limit used as the replacement value.
Default is 0.65 (Martín-Fernández et al. 2003).
Returns:
Data with zeros replaced. Row sums match the input exactly.
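The replacement rule described above can be sketched with NumPy alone. This is a minimal illustration, not nuee's actual implementation: the function name `multiplicative_replacement_sketch` is hypothetical, and the sketch assumes every row has a positive sum and every column has at least one positive value.

```python
import numpy as np

def multiplicative_replacement_sketch(X, detection_limits=None, delta=0.65):
    """Hypothetical helper illustrating multiplicative zero replacement."""
    X = np.asarray(X, dtype=float)
    zero = X == 0
    if detection_limits is None:
        # Proxy: column-wise minimum of the strictly positive values.
        detection_limits = np.where(X > 0, X, np.inf).min(axis=0)
    dl = np.asarray(detection_limits, dtype=float)
    if not np.all(np.isfinite(dl)):
        raise ValueError("every column needs a finite detection limit")
    # Replacement value for each zero cell, 0 elsewhere.
    repl = np.where(zero, delta * dl, 0.0)
    row_sum = X.sum(axis=1, keepdims=True)
    # Shrink non-zero entries by the share of the row handed to the zeros,
    # which keeps each row sum unchanged.
    shrink = 1.0 - repl.sum(axis=1, keepdims=True) / row_sum
    return np.where(zero, repl, X * shrink)

X = np.array([[0.0, 0.2, 0.8],
              [0.1, 0.0, 0.9],
              [0.3, 0.3, 0.4]])
out = multiplicative_replacement_sketch(X)
```

Rows without zeros pass through unchanged, because their shrink factor is exactly 1.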
Impute missing values in compositional data using the lrEM algorithm.
Uses the ALR (additive log-ratio) EM algorithm of Palarea-Albaladejo &
Martín-Fernández (2008), matching the approach in R’s zCompositions
package. Observed values are preserved exactly in the output.
Ideally one column should be fully observed (no NaN values) to serve
as the ALR denominator. When no column is complete, the column with the
fewest missing values is chosen and its gaps are pre-filled using
row-proportional estimation from column-mean ratios before running EM.
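To make the role of the ALR denominator concrete, the sketch below picks the column with the fewest NaN values and applies the additive log-ratio transform and its inverse. The helper names (`pick_alr_denominator`, `alr`, `alr_inverse`) are hypothetical and not part of nuee, and the pre-filling of an incomplete denominator column is deliberately left out.

```python
import numpy as np

def pick_alr_denominator(X):
    """Index of the column with the fewest NaNs (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    return int(np.isnan(X).sum(axis=0).argmin())

def alr(X, denom):
    """log(x_j / x_denom) for every column j other than denom."""
    X = np.asarray(X, dtype=float)
    others = np.delete(X, denom, axis=1)
    # NaNs in the numerator propagate, marking the cells still to impute.
    return np.log(others / X[:, [denom]])

def alr_inverse(Y, denom_values):
    """Map ALR coordinates back to the scale of the denominator column."""
    return np.exp(Y) * np.asarray(denom_values, dtype=float)[:, None]

X = np.array([[0.2, np.nan, 0.5],
              [0.1, 0.3, 0.6],
              [np.nan, 0.4, 0.4]])
d = pick_alr_denominator(X)
Y = alr(X, d)
back = alr_inverse(Y, X[:, d])
```

Because the inverse multiplies by the row's own denominator value, any value filled in on the ALR scale comes back on the same scale as that row's observed components.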
Parameters:
X (array-like or DataFrame, shape (n, D)) – Compositional data matrix. NaN marks missing components.
Observed (non-NaN) values must be strictly positive.
method ({"lrEM", "lrDA"}, default "lrEM") – "lrEM" returns the conditional expectation (deterministic).
"lrDA" adds noise from the conditional covariance for multiple
imputation / data augmentation.
max_iter (int, default 100) – Maximum number of EM iterations.
tol (float, default 1e-4) – Convergence tolerance on the relative change of the log-likelihood.
random_state (int, optional) – Seed for the random number generator (only used when method="lrDA").
Returns:
Completed data. Observed values are unchanged; imputed values are
scaled consistently with the original observed components.
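The difference between "lrEM" and "lrDA" comes down to how a missing ALR coordinate is filled once a mean vector and covariance matrix are available. The sketch below shows that single step for one row using the standard conditional distribution of a multivariate normal. It is an assumption-laden illustration rather than nuee's code: `conditional_normal` and `impute_row` are hypothetical names, `mu` and `sigma` are taken as given, and the row is assumed to have at least one observed and one missing coordinate.

```python
import numpy as np

def conditional_normal(y, mu, sigma):
    """Conditional mean and covariance of the missing part of y (hypothetical)."""
    y = np.asarray(y, dtype=float)
    miss = np.isnan(y)
    obs = ~miss
    # Regression coefficients Sigma_mo @ inv(Sigma_oo), via a linear solve.
    coef = np.linalg.solve(sigma[np.ix_(obs, obs)], sigma[np.ix_(obs, miss)]).T
    cond_mean = mu[miss] + coef @ (y[obs] - mu[obs])
    cond_cov = sigma[np.ix_(miss, miss)] - coef @ sigma[np.ix_(obs, miss)]
    return miss, cond_mean, cond_cov

def impute_row(y, mu, sigma, method="lrEM", rng=None):
    """Fill NaNs in one row of ALR coordinates (hypothetical helper)."""
    miss, mean, cov = conditional_normal(y, mu, sigma)
    out = np.array(y, dtype=float)
    if method == "lrDA":
        # Stochastic fill: draw from the conditional distribution.
        rng = np.random.default_rng() if rng is None else rng
        out[miss] = rng.multivariate_normal(mean, cov)
    else:
        # Deterministic fill: conditional expectation.
        out[miss] = mean
    return out

mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
y = np.array([2.0, np.nan])
filled_em = impute_row(y, mu, sigma, method="lrEM")
filled_da = impute_row(y, mu, sigma, method="lrDA", rng=np.random.default_rng(0))
```

In both modes the observed coordinate is copied through untouched, which is the row-level counterpart of observed values being preserved in the output.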