User Guide
==========

This guide provides detailed information on using nuee for community ecology analysis.

.. contents::
   :local:
   :depth: 2

Overview
--------

nuee is designed to provide a Pythonic interface to community ecology analyses,
following the conventions of the R vegan package while leveraging the power of
the scientific Python ecosystem.

Data Format
-----------

Community Data
~~~~~~~~~~~~~~

Community data should be provided as a matrix where:

* Rows represent samples (sites, plots, etc.)
* Columns represent species (taxa, OTUs, etc.)
* Values represent abundances (counts, biomass, etc.)

nuee accepts data in several formats:

.. doctest::

   >>> import nuee
   >>> import numpy as np
   >>> import pandas as pd

   >>> # NumPy array
   >>> data_array = np.random.rand(10, 20)  # 10 samples, 20 species

   >>> # Pandas DataFrame (recommended)
   >>> data_df = pd.DataFrame(
   ...     data_array,
   ...     index=[f"Site{i}" for i in range(10)],
   ...     columns=[f"Species{i}" for i in range(20)]
   ... )

Environmental Data
~~~~~~~~~~~~~~~~~~

Environmental data should have the same number of rows as the community data:

.. testsetup::

   import pandas as pd
   import numpy as np
   import nuee
   data_df = pd.DataFrame(np.random.rand(10))

.. doctest::

   >>> env_data = pd.DataFrame({
   ...     "Temperature": np.random.rand(10),
   ...     "pH": np.random.rand(10),
   ...     "Moisture": np.random.rand(10)
   ... }, index=data_df.index)

Distance Matrices
~~~~~~~~~~~~~~~~~

Distance matrices should be square, symmetric matrices:

.. doctest::

   >>> # Calculate distances
   >>> distances = nuee.vegdist(data_df, method="bray")

Workflow Examples
-----------------

Basic Ordination Workflow
~~~~~~~~~~~~~~~~~~~~~~~~~

1. Load and prepare data
2. Choose an ordination method
3. Fit the model
4. Visualize results
5. Interpret

.. doctest::

   >>> import nuee
   >>> import matplotlib.pyplot as plt

   >>> # 1. Load data
   >>> species = nuee.datasets.varespec()
   >>> env = nuee.datasets.varechem()

   >>> # 2. Choose method (NMDS)
   >>> # 3. Fit the model
   >>> nmds_result = nuee.metaMDS(species, k=2)

   >>> # 4. Visualize
   >>> fig = nuee.plot_ordination(nmds_result)
   >>> plt.show()

   >>> # 5. Interpret stress value
   >>> print(f"Stress: {nmds_result.stress:.3f}")
   Stress: 0.1...

   >>> # Stress < 0.05: excellent
   >>> # Stress < 0.10: good
   >>> # Stress < 0.20: acceptable
   >>> # Stress > 0.20: poor

.. note::

   ``nuee.metaMDS`` follows vegan's data transformations and SMACOF
   optimisation, but the underlying implementation is still evolving. Recent
   regression tests show small differences in the reported stress compared to
   ``vegan::metaMDS``. This does not invalidate the ordination, but if you
   require vegan-identical results you should re-run the analysis in R for the
   time being.

Diversity Analysis Workflow
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. doctest::

   >>> import nuee
   >>> import pandas as pd

   >>> # Load data
   >>> species = nuee.datasets.BCI()

   >>> # Calculate multiple diversity indices
   >>> diversity_df = pd.DataFrame({
   ...     "Shannon": nuee.shannon(species).values,
   ...     "Gini-Simpson": nuee.simpson(species).values,
   ...     "Richness": nuee.specnumber(species).values,
   ...     "Fisher": nuee.fisher_alpha(species).values
   ... })

   >>> # Summary statistics
   >>> print(diversity_df.describe())
   [...]

   >>> # Compare groups
   >>> # If you have grouping information
   >>> # diversity_by_group = diversity_df.groupby(groups).mean()

Hypothesis Testing Workflow
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. doctest::

   >>> import nuee

   >>> # Load data
   >>> species = nuee.datasets.dune()
   >>> env = nuee.datasets.dune_env()

   >>> # Calculate distances
   >>> dist = nuee.vegdist(species, method="bray")

   >>> # Test for group differences (PERMANOVA)
   >>> perm_result = nuee.adonis2(dist, env['Management'])
   >>> print(f"R^2: {perm_result.R2.iloc[0]:.3f}")
   R^2: 0.342
   >>> print(f"p-value: {perm_result['Pr(>F)'].iloc[0]:.3f}")
   p-value: ...
   >>> # Test for homogeneity of dispersions
   >>> betadisp = nuee.betadisper(dist, env['Management'])
   >>> print(betadisp)  # doctest: +SKIP

Tips and Best Practices
-----------------------

Choosing an Ordination Method
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* **NMDS**: Robust, works with any distance metric, no linearity assumptions
* **RDA**: Linear relationships, environmental variables available
* **CCA**: Unimodal relationships, long environmental gradients
* **PCA**: Quick exploration, linear relationships

Choosing a Distance Metric
~~~~~~~~~~~~~~~~~~~~~~~~~~

* **Bray-Curtis**: General purpose, abundance data
* **Jaccard**: Presence/absence data
* **Euclidean**: Environmental data, PCA
* **Hellinger**: Before RDA, avoids double-zero problem

Data Transformation
~~~~~~~~~~~~~~~~~~~

.. doctest::

   >>> import numpy as np

   >>> # Hellinger transformation (for RDA); x is a 2-D NumPy array (sites x species)
   >>> def hellinger(x):
   ...     row_sums = x.sum(axis=1, keepdims=True)
   ...     return np.sqrt(x / row_sums)

   >>> # Wisconsin double standardization
   >>> def wisconsin(x):
   ...     # By species maxima
   ...     x_std = x / x.max(axis=0)
   ...     # By site totals
   ...     x_std = x_std / x_std.sum(axis=1, keepdims=True)
   ...     return x_std

Compositional Data Workflows
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``nuee.composition`` brings compositional data analysis tools into the package
without requiring SciPy. These utilities are NumPy-only ports of scikit-bio's
composition module.

.. doctest::

   >>> from nuee import composition
   >>> import numpy as np

   >>> # Raw counts with zeros
   >>> counts = np.array([[0, 5, 10], [3, 0, 9]])

   >>> # Replace zeros and apply closure
   >>> replaced = composition.multiplicative_replacement(counts)
   >>> closed = composition.closure(replaced)

   >>> # Transform to log-ratio space
   >>> clr_coords = composition.clr(closed)
   >>> ilr_coords = composition.ilr(closed)

   >>> # Invert transforms if required
   >>> recovered = composition.ilr_inv(ilr_coords)

Mathematical Definitions
------------------------

The following formulas summarise the core quantities computed by nuee.

Shannon Diversity
~~~~~~~~~~~~~~~~~

.. math::

   H' = -\sum_{i=1}^{S} p_i \ln p_i

where :math:`p_i = \frac{x_i}{\sum_{j=1}^{S} x_j}` is the relative abundance of
species :math:`i` in a community of :math:`S` species.

Gini-Simpson Diversity
~~~~~~~~~~~~~~~~~~~~~~

.. math::

   D = 1 - \sum_{i=1}^{S} p_i^2

which measures the probability that two individuals drawn at random belong to
different species.

Bray-Curtis Dissimilarity
~~~~~~~~~~~~~~~~~~~~~~~~~

.. math::

   BC_{ij} = 1 - \frac{2 \sum_{k=1}^{S} \min(x_{ik}, x_{jk})}{\sum_{k=1}^{S} x_{ik} + \sum_{k=1}^{S} x_{jk}}

where :math:`x_{ik}` and :math:`x_{jk}` denote the abundances of species
:math:`k` in sites :math:`i` and :math:`j`.

Hellinger Transformation
~~~~~~~~~~~~~~~~~~~~~~~~

.. math::

   h_{ik} = \sqrt{\frac{x_{ik}}{\sum_{j=1}^{S} x_{ij}}}

which stabilises variances prior to linear ordination methods such as RDA.

PERMANOVA F-statistic
~~~~~~~~~~~~~~~~~~~~~

.. math::

   F = \frac{SS_{\text{between}} / (g - 1)}{SS_{\text{within}} / (N - g)}

where :math:`g` is the number of groups and :math:`N` is the number of
observations. Permutation p-values are obtained by recalculating :math:`F`
across random group assignments.

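
The permutation procedure for the PERMANOVA :math:`F`-statistic can be sketched
in plain NumPy. This is an illustrative sketch for a single grouping factor, not
nuee's ``adonis2`` implementation; the names ``pseudo_f`` and
``permutation_p_value`` are hypothetical. It assumes the usual distance-based
partition, in which each sum of squares is the sum of squared pairwise distances
divided by the number of observations involved, and
:math:`SS_{\text{between}} = SS_{\text{total}} - SS_{\text{within}}`.

.. code-block:: python

   import numpy as np


   def pseudo_f(dist, groups):
       """Pseudo-F from a square distance matrix and one group label per row."""
       dist = np.asarray(dist, dtype=float)
       groups = np.asarray(groups)
       n = dist.shape[0]
       labels = np.unique(groups)
       g = len(labels)
       d2 = dist ** 2
       # Total sum of squares: squared distances over all pairs, divided by N
       ss_total = d2[np.triu_indices(n, k=1)].sum() / n
       # Within-group sum of squares: same quantity computed inside each group
       ss_within = 0.0
       for label in labels:
           idx = np.flatnonzero(groups == label)
           sub = d2[np.ix_(idx, idx)]
           ss_within += sub[np.triu_indices(len(idx), k=1)].sum() / len(idx)
       ss_between = ss_total - ss_within
       return (ss_between / (g - 1)) / (ss_within / (n - g))


   def permutation_p_value(dist, groups, n_perm=999, seed=0):
       """Observed pseudo-F and its permutation p-value."""
       rng = np.random.default_rng(seed)
       groups = np.asarray(groups)
       f_obs = pseudo_f(dist, groups)
       # Shuffle the labels and count how often the permuted F reaches f_obs
       exceed = sum(
           pseudo_f(dist, rng.permutation(groups)) >= f_obs
           for _ in range(n_perm)
       )
       # The observed arrangement counts as one permutation
       return f_obs, (exceed + 1) / (n_perm + 1)


   # Toy data: two tight groups of three points on a line
   points = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
   toy_dist = np.abs(points[:, None] - points[None, :])
   toy_groups = np.array(["a", "a", "a", "b", "b", "b"])
   f_obs, p_value = permutation_p_value(toy_dist, toy_groups, n_perm=199)

With only six observations there are few distinct label arrangements, so the
smallest attainable p-value is bounded well above zero however large
:math:`F` is. Real analyses should rely on ``nuee.adonis2``.
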
Interpretation Guidelines
-------------------------

NMDS Stress Values
~~~~~~~~~~~~~~~~~~

* < 0.05: Excellent representation
* 0.05 - 0.10: Good representation
* 0.10 - 0.20: Acceptable
* > 0.20: Poor (try different k or method)

RDA/CCA Interpretation
~~~~~~~~~~~~~~~~~~~~~~

* Eigenvalues: Variance explained by each axis
* Species scores: Optimal position for each species
* Site scores: Position of each site
* Environmental vectors: Direction and strength of correlation

PERMANOVA Results
~~~~~~~~~~~~~~~~~

* R^2: Proportion of variance explained
* F-statistic: Ratio of between-group to within-group variance
* p-value: Significance (typically alpha = 0.05)
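
For a single grouping factor, R^2 and the F-statistic carry the same
information once the number of groups and observations is known. The sketch
below assumes the standard definition
:math:`R^2 = SS_{\text{between}} / SS_{\text{total}}`, which gives
:math:`F = \frac{R^2 / (g - 1)}{(1 - R^2) / (N - g)}`. The helper names
``f_from_r2`` and ``r2_from_f`` are hypothetical and not part of nuee. The
example uses the R^2 of 0.342 printed in the hypothesis testing workflow for
the dune data (20 sites, 4 management classes).

.. code-block:: python

   def f_from_r2(r2, n_groups, n_obs):
       """Pseudo-F implied by R^2 for a one-factor design."""
       return (r2 / (n_groups - 1)) / ((1.0 - r2) / (n_obs - n_groups))


   def r2_from_f(f_stat, n_groups, n_obs):
       """R^2 implied by a pseudo-F for a one-factor design."""
       scaled = f_stat * (n_groups - 1)
       return scaled / (scaled + (n_obs - n_groups))


   # R^2 as printed in the dune example; N = 20 sites, g = 4 management classes
   f_stat = f_from_r2(0.342, n_groups=4, n_obs=20)
   r2_back = r2_from_f(f_stat, n_groups=4, n_obs=20)
   print(f"F = {f_stat:.2f}, R^2 = {r2_back:.3f}")

A moderate R^2 can therefore pair with a modest F when there are few
observations per group, which is why the p-value should be read alongside both.
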