| Title: | Single-Cell Exploratory Compositional Data Analysis |
|---|---|
| Description: | The scECODA R package provides a complete workflow for the analysis and visualization of compositional data, primarily focusing on cell type proportions derived from single-cell data. It implements specialized methods, such as the Centered Log-Ratio (CLR) transformation, to properly analyze proportional data while avoiding the biases introduced by the compositional constraint. The package encapsulates data management, transformation, and analysis into a single SummarizedExperiment object, offering downstream tools for dimensionality reduction via PCA, calculating critical metrics like the Adjusted Rand Index (ARI) and Modularity to quantify sample grouping quality, and generating high-quality visualizations like heatmaps and scatter plots. |
| Authors: | Christian Halter [aut, cre] (ORCID: <https://orcid.org/0009-0009-5479-2246>), Massimo Andreatta [aut] (ORCID: <https://orcid.org/0000-0002-8036-2647>), Santiago Carmona [aut] (ORCID: <https://orcid.org/0000-0002-2495-0671>), Swiss Cancer Research Foundation [fnd] |
| Maintainer: | Christian Halter <[email protected]> |
| License: | GPL-3 + file LICENSE |
| Version: | 1.1.5 |
| Built: | 2026-05-28 14:37:59 UTC |
| Source: | https://github.com/carmonalab/scECODA |
Calculates the ANOSIM R-statistic to test whether there is
significant separation between two or more groups (defined by
labels) based on the sample distance matrix (dist_mat).
calc_anosim(dist_mat, labels, permutations = 99, parallel = 1, digits = 3)calc_anosim(dist_mat, labels, permutations = 99, parallel = 1, digits = 3)
dist_mat |
Sample distance matrix. |
labels |
Vector of factors or character strings. The grouping variable (the known cluster assignments) for each row in the matrix. |
permutations |
Integer (default: |
parallel |
Integer (optional, default: |
digits |
Integer (default: |
ANOSIM compares the mean of rank dissimilarities between groups to the mean of rank dissimilarities within groups. The R-statistic ranges from -1 to 1:
An R value close to 1 indicates clear separation of groups.
An R value close to 0 indicates that the separation is no greater than expected by chance (i.e., poor separation).
An R value close to -1 indicates that within-group dissimilarities are greater than between-group dissimilarities (a very rare result).
A numeric value representing the ANOSIM R-statistic.
library(SummarizedExperiment) data(example_data) se <- ecoda( data = example_data$Gongsharma_full$cell_counts$authors_HR, metadata = example_data$Gongsharma_full$metadata ) # Extract necessary components dist_mat <- dist(t(assay(se, "clr"))) labels <- colData(se)$subject.cmv # Run the calculation calc_anosim(dist_mat, labels, parallel = 1)library(SummarizedExperiment) data(example_data) se <- ecoda( data = example_data$Gongsharma_full$cell_counts$authors_HR, metadata = example_data$Gongsharma_full$metadata ) # Extract necessary components dist_mat <- dist(t(assay(se, "clr"))) labels <- colData(se)$subject.cmv # Run the calculation calc_anosim(dist_mat, labels, parallel = 1)
Calculates the Adjusted Rand Index (ARI) to measure the agreement between the
known cluster assignments (labels) and the cluster assignments derived
from two unsupervised clustering methods: Hierarchical Clustering
(hclust) and Partitioning Around Medoids (pam).
calc_ari(dist_mat, labels, nclusts = NULL, digits = 3, return_mean = TRUE)calc_ari(dist_mat, labels, nclusts = NULL, digits = 3, return_mean = TRUE)
dist_mat |
Sample distance matrix. |
labels |
Vector of factors or character strings. The true or known cluster assignments for each row in the matrix. |
nclusts |
Integer (optional, default: |
digits |
Integer (default: |
return_mean |
Logical (default: |
A numeric value (mean ARI) or a list of two ARI scores. ARI ranges from -1 (disagreement) to +1 (perfect agreement).
library(SummarizedExperiment) data(example_data) se <- ecoda( data = example_data$Gongsharma_full$cell_counts$authors_HR, metadata = example_data$Gongsharma_full$metadata ) # Extract necessary components dist_mat <- dist(t(assay(se, "clr"))) labels <- colData(se)$subject.cmv # Run the calculation calc_ari(dist_mat, labels)library(SummarizedExperiment) data(example_data) se <- ecoda( data = example_data$Gongsharma_full$cell_counts$authors_HR, metadata = example_data$Gongsharma_full$metadata ) # Extract necessary components dist_mat <- dist(t(assay(se, "clr"))) labels <- colData(se)$subject.cmv # Run the calculation calc_ari(dist_mat, labels)
The CLR transformation is a common technique used for compositional data analysis, mapping the data from the simplex to Euclidean space. It is defined as the log of the ratio between each component and the geometric mean of all components in that sample. Note: This function assumes the input data is strictly positive (i.e., does not contain zeros). Use an imputation or pseudocount method prior to this function if zeros are present.
calc_clr(df)calc_clr(df)
df |
A data frame or matrix of strictly positive relative frequencies or counts (samples/observations as columns, components as rows). |
A data frame of the same dimensions containing the CLR-transformed values.
freq_imp <- data.frame(A = c(10.1, 89.9), B = c(50.1, 49.9)) calc_clr(freq_imp)freq_imp <- data.frame(A = c(10.1, 89.9), B = c(50.1, 49.9)) calc_clr(freq_imp)
This function takes a matrix or data frame of counts and transforms each column (assumed to be a sample or observation) into relative frequencies, expressing each component as a percentage of the column's total sum.
calc_freq(df)calc_freq(df)
df |
A data frame or matrix of counts (samples/observations as columns, components as rows). Must contain only numeric, non-negative values. |
A data frame of the same dimensions, where each column sums to 100.
counts <- data.frame(A = c(10, 90), B = c(50, 50)) calc_freq(counts)counts <- data.frame(A = c(10, 90), B = c(50, 50)) calc_freq(counts)
Calculates the Modularity score for a given clustering (labels) based
on a Shared Nearest Neighbor (SNN) graph.The score is adjusted by the
theoretical maximum modularity for the number of groups to always be between
-0.5 (poor clustering) and +1 (excellent clustering)).
calc_modularity(dist_mat, labels, knn_k = 3, digits = 3)calc_modularity(dist_mat, labels, knn_k = 3, digits = 3)
dist_mat |
Sample distance matrix. |
labels |
Vector of factors or character strings. The cluster assignments
for each row in |
knn_k |
Integer (optional, default: |
digits |
Integer (default: |
A numeric value representing the adjusted modularity score, ranging from -0.5 to 1.0.
Brandes, Ulrik, et al. "On finding graph clusterings with maximum modularity." European Symposium on Algorithms. Springer, Berlin, Heidelberg, 2007.
library(SummarizedExperiment) data(example_data) se <- ecoda( data = example_data$Gongsharma_full$cell_counts$authors_HR, metadata = example_data$Gongsharma_full$metadata ) # Extract necessary components dist_mat <- dist(t(assay(se, "clr"))) labels <- colData(se)$subject.cmv # Run the calculation ## Not run: calc_modularity(dist_mat, labels) ## End(Not run)library(SummarizedExperiment) data(example_data) se <- ecoda( data = example_data$Gongsharma_full$cell_counts$authors_HR, metadata = example_data$Gongsharma_full$metadata ) # Extract necessary components dist_mat <- dist(t(assay(se, "clr"))) labels <- colData(se)$subject.cmv # Run the calculation ## Not run: calc_modularity(dist_mat, labels) ## End(Not run)
Calculates the average silhouette width for a given feature matrix, using
pre-defined cluster assignments (labels). This metric assesses the
quality of the clustering.
calc_sil(dist_mat, labels, digits = 3)calc_sil(dist_mat, labels, digits = 3)
dist_mat |
Sample distance matrix. |
labels |
Vector of factors or character strings. The cluster assignments
for each row in |
digits |
Integer (default: |
A numeric value representing the mean silhouette width, typically ranging from -1 (poor clustering) to +1 (excellent clustering).
library(SummarizedExperiment) data(example_data) se <- ecoda( data = example_data$Gongsharma_full$cell_counts$authors_HR, metadata = example_data$Gongsharma_full$metadata ) # Extract necessary components dist_mat <- dist(t(assay(se, "clr"))) labels <- colData(se)$subject.cmv # Run the calculation calc_sil(dist_mat, labels)library(SummarizedExperiment) data(example_data) se <- ecoda( data = example_data$Gongsharma_full$cell_counts$authors_HR, metadata = example_data$Gongsharma_full$metadata ) # Extract necessary components dist_mat <- dist(t(assay(se, "clr"))) labels <- colData(se)$subject.cmv # Run the calculation calc_sil(dist_mat, labels)
This function aggregates single-cell count data into a "pseudobulk" matrix by summing the counts for all cells belonging to the same sample ID. It is robust to both dense and sparse count matrices. It also includes filtering logic to exclude samples that do not meet a minimum cell count threshold.
calculate_pseudobulk(count_matrix, sample_ids, min_cells = 10)calculate_pseudobulk(count_matrix, sample_ids, min_cells = 10)
count_matrix |
A gene x cell count matrix (can be a dense matrix or sparse Matrix). Gene identifiers should be row names and cell barcodes should be column names. |
sample_ids |
A vector of sample identifiers, one for each column (cell)
in |
min_cells |
Minimum number of cells required per sample (default: 1). Samples with fewer cells than this threshold will be excluded from the final pseudobulk matrix. |
A gene x sample pseudobulk count matrix. The columns correspond to the unique sample IDs, and the rows correspond to the genes.
nrow <- 100 ncol <- 100 vals <- nrow * ncol mat <- round(matrix(exp(rpois(vals, lambda = 3)), nrow = nrow, ncol = ncol)) rownames(mat) <- paste0("Gene", seq_len(ncol)) colnames(mat) <- paste0("Cell", seq_len(nrow)) ids <- rep(c("Sample1", "Sample2"), each = nrow / 2) pb <- calculate_pseudobulk(count_matrix = mat, sample_ids = ids)nrow <- 100 ncol <- 100 vals <- nrow * ncol mat <- round(matrix(exp(rpois(vals, lambda = 3)), nrow = nrow, ncol = ncol)) rownames(mat) <- paste0("Gene", seq_len(ncol)) colnames(mat) <- paste0("Cell", seq_len(nrow)) ids <- rep(c("Sample1", "Sample2"), each = nrow / 2) pb <- calculate_pseudobulk(count_matrix = mat, sample_ids = ids)
Identifies the indices of the K-nearest neighbors for each sample based on a pre-computed distance matrix.
compute_KNN_from_dist(dist_mat, knn_k)compute_KNN_from_dist(dist_mat, knn_k)
dist_mat |
A square distance matrix or an object of class |
knn_k |
Integer. The number of neighbors ( |
A matrix with nrow(dist_mat) rows and knn_k columns,
where each row contains the indices of the nearest neighbors.
Constructs a graph where nodes represent samples and edges are weighted by the number of shared nearest neighbors (SNN) between them.
compute_snn_graph(knn)compute_snn_graph(knn)
knn |
A matrix of nearest neighbor indices, typically generated by
|
An igraph graph object where edge weights correspond to the
count of shared neighbors between nodes.
This function takes either the relative abundance (freq) or
CLR-transformed abundance (clr) matrix from an
SummarizedExperiment
object, converts it from a wide (cell types x samples) to a long (sample,
cell type, value) format for plotting, and optionally joins it with a
specified column from the sample metadata.
create_long_data( se, assay = c("clr", "freq", "asin_sqrt", "clr_hvc"), label_col = NULL )create_long_data( se, assay = c("clr", "freq", "asin_sqrt", "clr_hvc"), label_col = NULL )
se |
An initialized SummarizedExperiment object. |
assay |
Character string (default: |
label_col |
Character string (optional, default: |
A tidy, long format data frame with columns:
sample_id: Sample identifier.
celltype: Cell type name.
rel_abundance or clr_abundance:
The quantitative value depending on the chosen
data_slot.
...: Additional column specified by
label_col (if provided).
This function normalizes a gene x sample pseudobulk count matrix using the
Variance Stabilizing Transformation (VST) from the DESeq2 package. It
estimates size factors and variance for all genes, performs the VST, and then
subsets the results to include only highly variable genes (HVCs), either
specified by the user or automatically selected based on variance.
deseq2_normalize( pb, metadata = NULL, deseq2_design = NULL, hvg = NULL, nvar_genes = 2000 )deseq2_normalize( pb, metadata = NULL, deseq2_design = NULL, hvg = NULL, nvar_genes = 2000 )
pb |
A gene x sample pseudobulk count matrix (Genes as rows, Samples as columns). Must contain non-negative integer counts. |
metadata |
A data.frame with sample-level metadata. Row names must
exactly match the column names of |
deseq2_design |
Optional: A |
hvg |
Optional character vector of gene names to use as highly variable genes. If provided, only these genes will be returned after VST. |
nvar_genes |
Number of top variable genes to select (default: 2000).
Only used if |
Note: If deseq2_design is not specified, the VST is performed using a
minimal design formula (~ 1) for size factor and variance estimation.
A normalized expression matrix (VST-transformed) with\ samples as columns and genes as rows.
nrow <- 100 ncol <- 100 vals <- nrow * ncol mat <- round(matrix(exp(rpois(vals, lambda = 3)), nrow = nrow, ncol = ncol)) rownames(mat) <- paste0("Gene", seq_len(ncol)) colnames(mat) <- paste0("Cell", seq_len(nrow)) ids <- rep(c("Sample1", "Sample2"), each = nrow / 2) pb <- calculate_pseudobulk(count_matrix = mat, sample_ids = ids) pb_norm <- deseq2_normalize(pb)nrow <- 100 ncol <- 100 vals <- nrow * ncol mat <- round(matrix(exp(rpois(vals, lambda = 3)), nrow = nrow, ncol = ncol)) rownames(mat) <- paste0("Gene", seq_len(ncol)) colnames(mat) <- paste0("Cell", seq_len(nrow)) ids <- rep(c("Sample1", "Sample2"), each = nrow / 2) pb <- calculate_pseudobulk(count_matrix = mat, sample_ids = ids) pb_norm <- deseq2_normalize(pb)
This is a smart constructor function used to initialize an
SummarizedExperiment
object. It handles the processing of single-cell objects (Seurat or
SingleCellExperiment) or raw data frames to extract cell type counts,
calculate sample metadata, and optionally generate DESeq2-normalized
pseudobulk data.
ecoda( data = NULL, data_is_freq = FALSE, metadata = NULL, sample_col = NULL, celltype_col = NULL, get_pb = FALSE, variance_explained = 0.5, top_n_hvcs = NULL, pseudo_count = 0.5, pseudo_frac_min = 2/3, add_to = NULL )ecoda( data = NULL, data_is_freq = FALSE, metadata = NULL, sample_col = NULL, celltype_col = NULL, get_pb = FALSE, variance_explained = 0.5, top_n_hvcs = NULL, pseudo_count = 0.5, pseudo_frac_min = 2/3, add_to = NULL )
data |
The primary input, which can be:
|
data_is_freq |
Logical (default: |
metadata |
A data frame containing sample-level metadata. Required if
|
sample_col |
The metadata column name defining unique sample IDs.
Required if |
celltype_col |
The metadata column name defining cell type annotations.
Required if |
get_pb |
Logical (default: |
variance_explained |
Numeric (default: 0.5). Proportion of variance used to determine the number of highly variable cell types (HVCs). |
top_n_hvcs |
Integer (optional). If provided, specifies the exact number
of top HVCs to select, overriding |
pseudo_count |
Numeric (default: 0.5). Used if |
pseudo_frac_min |
Numeric (default: 2/3). The multiplier for the minimum
non-zero value when |
add_to |
Character string. Should the value be added to |
A new SummarizedExperiment object.
data(example_data) Zhang <- example_data$Zhang counts <- Zhang$cell_counts$authors_LR freq <- calc_freq(counts) metadata <- Zhang$metadata se <- ecoda(data = counts, metadata = metadata) se <- ecoda(data = freq, metadata = metadata)data(example_data) Zhang <- example_data$Zhang counts <- Zhang$cell_counts$authors_LR freq <- calc_freq(counts) metadata <- Zhang$metadata se <- ecoda(data = counts, metadata = metadata) se <- ecoda(data = freq, metadata = metadata)
This is the internal engine that initializes an SummarizedExperiment object. It performs zero-imputation, Centered Log-Ratio (CLR) transformation, calculates sample distances, and identifies highly variable cell types (HVCs).
ecoda_helper( data = NULL, data_is_freq, variance_explained = 0.5, top_n_hvcs = NULL, pseudo_count = 0.5, pseudo_frac_min = 2/3, add_to = c("all", "zeros") )ecoda_helper( data = NULL, data_is_freq, variance_explained = 0.5, top_n_hvcs = NULL, pseudo_count = 0.5, pseudo_frac_min = 2/3, add_to = c("all", "zeros") )
data |
A matrix or data frame where columns are samples and rows are cell types. |
data_is_freq |
Logical. If |
variance_explained |
Numeric (default: 0.5). Target variance for HVCs. |
top_n_hvcs |
Integer (optional). Number of top HVCs to select. |
pseudo_count |
Numeric (default: 0.5). Used if |
pseudo_frac_min |
Numeric (default: 2/3). The multiplier for the minimum
non-zero value when |
add_to |
Character string. Should the value be added to |
A fully initialized SummarizedExperiment object.
A collection of single-cell compositional data from multiple studies, structured to demonstrate the full scECODA workflow.
data(example_data)data(example_data)
A named list where each element corresponds to a different
study (e.g., $Adams). Each study element is a list containing
the following:
cell_counts_highresolution: Data frame of cell type counts at
high resolution annotation (samples as rows, cell types as columns).
cell_counts_lowresolution: Data frame of cell type counts at
low resolution annotation (samples as rows, cell types as columns).
metadata: Data frame of sample-level metadata (samples as
rows).
main_biologicalcondition_columnname: Character string,
indicating the main biological condition column name in the metadata.
Data originally derived from multiple public single-cell datasets and re-processed as described in the accompanying scECODA manuscript (Halter et al., in preparation).
This function calculates the variance of each cell type across all samples based on the Centered Log-Ratio (CLR) transformed data. It then selects the subset of highly variable cell types (HVCs) that collectively explain a specified proportion of the total variance, or a fixed number of top cell types.
find_hvcs(se, variance_explained = 0.5, top_n_hvcs = NULL)find_hvcs(se, variance_explained = 0.5, top_n_hvcs = NULL)
se |
An initialized
SummarizedExperiment
object containing the CLR-transformed data in the |
variance_explained |
Numeric (default: 0.5). The target cumulative proportion of total variance to be explained by the selected HVCs. The function stops selecting cell types once this threshold is met. |
top_n_hvcs |
Integer (optional). If provided, this overrides
|
The updated SummarizedExperiment object with the following metadata populated:
celltype_variances: Data frame of cell type variances.
variance_explained: The exact variance proportion
captured by the selected HVCs.
top_n_hvcs: The number of HVCs selected.
hvcs: Character vector of the names
of the selected HVCs.
SummarizedExperiment,
get_celltype_variances, get_hvcs
data(example_data) se <- ecoda( data = example_data$Gongsharma_full$cell_counts$authors_HR, metadata = example_data$Gongsharma_full$metadata ) # Select HVCs that explain 75% of the total variance se <- find_hvcs(se, variance_explained = 0.75) # Select exactly the top 5 most variable cell types se <- find_hvcs(se, top_n_hvcs = 5)data(example_data) se <- ecoda( data = example_data$Gongsharma_full$cell_counts$authors_HR, metadata = example_data$Gongsharma_full$metadata ) # Select HVCs that explain 75% of the total variance se <- find_hvcs(se, variance_explained = 0.75) # Select exactly the top 5 most variable cell types se <- find_hvcs(se, top_n_hvcs = 5)
Get the cell type counts from a long data frame (e.g. seurat object metadata) where each cell is a row.
get_celltype_counts(cell_data_df, sample_col, celltype_col)get_celltype_counts(cell_data_df, sample_col, celltype_col)
cell_data_df |
A data frame where each row represents a single cell, and columns contain cell-level information, including sample ID and cell type annotation. |
sample_col |
The column that defines the sample ID for each cell |
celltype_col |
The column that defines the cell type annotation for each cell |
A data frame with cell types as rows and samples as columns, containing the count of each cell type per sample.
# Create example data frame cell_data_df <- data.frame( Cell_ID = paste0("C", seq(10)), Sample_Name = c(rep("S1", 5), rep("S2", 5)), Cluster_Annotation = factor(c( "B_Cell", "T_Cell", "B_Cell", NA, "T_Cell", "T_Cell", "Macrophage", "B_Cell", "T_Cell", "T_Cell" )) ) # # Calculate cell type counts per sample celltype_counts <- get_celltype_counts( cell_data_df = cell_data_df, sample_col = "Sample_Name", celltype_col = "Cluster_Annotation" ) # print(celltype_counts) # Note how the NA cell type count is handled and renamed to "NA".# Create example data frame cell_data_df <- data.frame( Cell_ID = paste0("C", seq(10)), Sample_Name = c(rep("S1", 5), rep("S2", 5)), Cluster_Annotation = factor(c( "B_Cell", "T_Cell", "B_Cell", NA, "T_Cell", "T_Cell", "Macrophage", "B_Cell", "T_Cell", "T_Cell" )) ) # # Calculate cell type counts per sample celltype_counts <- get_celltype_counts( cell_data_df = cell_data_df, sample_col = "Sample_Name", celltype_col = "Cluster_Annotation" ) # print(celltype_counts) # Note how the NA cell type count is handled and renamed to "NA".
This function takes the Centered Log-Ratio (CLR) transformed cell type abundance data from an SummarizedExperiment object, calculates the mean CLR abundance and variance for each cell type. It also calculates the cumulative variance explained by the cell types when ranked by variance.
get_celltype_variances(se, descending = TRUE)get_celltype_variances(se, descending = TRUE)
se |
An initialized
SummarizedExperiment
object containing CLR-transformed data in the |
descending |
Logical (default: |
A data frame where each row is a cell type, containing:
celltype: The name of the cell type.
avg_clr_abundance: The mean CLR value for that
cell type across all samples.
Variance: The variance of the CLR values for that
cell type across all samples.
cumulative_variance: Cumulative sum of the variance,
ranked by variance.
variance_exp: The proportion of total variance
explained cumulatively.
SummarizedExperiment,
plot_varmean
data(example_data) se <- ecoda( data = example_data$Gongsharma_full$cell_counts$authors_HR, metadata = example_data$Gongsharma_full$metadata ) # Calculate variances without plotting, sorted by ascending variance df_var_asc <- get_celltype_variances( se, descending = FALSE )data(example_data) se <- ecoda( data = example_data$Gongsharma_full$cell_counts$authors_HR, metadata = example_data$Gongsharma_full$metadata ) # Calculate variances without plotting, sorted by ascending variance df_var_asc <- get_celltype_variances( se, descending = FALSE )
Because not all data can be stored in assays (e.g. pseudobulk and clr_hvc have different number of features (rows) than counts etc.)
get_ecoda_assay(se, assay)get_ecoda_assay(se, assay)
se |
A SummarizedExperiment object. |
assay |
Character string, the name of the assay. |
A data frame containing the assay data.
This is a utility function that selects the most variable cell types from a variance-ranked data frame. Selection can be controlled either by defining the cumulative variance to be explained or by specifying a fixed number of cell types.
get_hvcs(df_var, variance_explained = 0.5, top_n_hvcs = NULL)get_hvcs(df_var, variance_explained = 0.5, top_n_hvcs = NULL)
df_var |
A data frame (typically the output of
|
variance_explained |
Numeric (default: 0.5). The cumulative proportion
of total variance that the selected HVCs must explain. This parameter is
ignored if |
top_n_hvcs |
Integer or Numeric (optional).
|
A character vector containing the names of the selected highly variable cell types. The function ensures that at least two cell types are always returned.
find_hvcs, get_celltype_variances
data(example_data) ecoda_object <- ecoda( data = example_data$Gongsharma_full$cell_counts$authors_HR, metadata = example_data$Gongsharma_full$metadata ) # Calculate variances df_var <- get_celltype_variances(ecoda_object) # Select cell types explaining at least 60% of variance: hvcs_60 <- get_hvcs(df_var, variance_explained = 0.6) # Select the top 10 cell types: hvcs_top10 <- get_hvcs(df_var, top_n_hvcs = 10) # Select the top 5% of cell types: hvcs_prop <- get_hvcs(df_var, top_n_hvcs = 0.05)data(example_data) ecoda_object <- ecoda( data = example_data$Gongsharma_full$cell_counts$authors_HR, metadata = example_data$Gongsharma_full$metadata ) # Calculate variances df_var <- get_celltype_variances(ecoda_object) # Select cell types explaining at least 60% of variance: hvcs_60 <- get_hvcs(df_var, variance_explained = 0.6) # Select the top 10 cell types: hvcs_top10 <- get_hvcs(df_var, top_n_hvcs = 10) # Select the top 5% of cell types: hvcs_prop <- get_hvcs(df_var, top_n_hvcs = 0.05)
This function identifies columns in a cell-level metadata data frame that have a constant value for all cells belonging to the same sample. It aggregates this constant information, returning a new data frame where each row represents a unique sample. Columns that vary within any sample are excluded from the output.
get_sample_metadata(cell_data_df, sample_col)get_sample_metadata(cell_data_df, sample_col)
cell_data_df |
A data frame containing cell-level metadata. This should include the sample ID column and all potential metadata columns. |
sample_col |
A character string specifying the name of the column that defines the unique sample ID for each cell (e.g., "Sample_ID"). |
A new data frame where:
Each row corresponds to a unique sample from the input data.
The row names are set to the values of the input
sample_col.
Columns contain only the metadata fields that were constant
across all cells within each sample. The sample_col
itself is excluded from the final columns but used for
row names.
# Assuming you have a data frame 'cell_df' cell_df <- data.frame( Cell_ID = paste0("C", seq(10)), Sample_ID = c(rep("S1", 5), rep("S2", 5)), Age = c(rep(30, 5), rep(45, 5)), Gender = c(rep("M", 5), rep("F", 5)), Cell_Type = c(rep("A", 3), rep("B", 2), rep("A", 3), rep("B", 2)) ) # The 'Age' and 'Gender' columns are constant within each sample (S1 and S2). # The 'Cell_ID' and 'Cell_Type' columns vary within sample S1 and/or S2. sample_meta <- get_sample_metadata(cell_df, "Sample_ID") print(sample_meta) # Output will have 'Age' and 'Gender' as columns, # with row names 'S1' and 'S2'.# Assuming you have a data frame 'cell_df' cell_df <- data.frame( Cell_ID = paste0("C", seq(10)), Sample_ID = c(rep("S1", 5), rep("S2", 5)), Age = c(rep(30, 5), rep(45, 5)), Gender = c(rep("M", 5), rep("F", 5)), Cell_Type = c(rep("A", 3), rep("B", 2), rep("A", 3), rep("B", 2)) ) # The 'Age' and 'Gender' columns are constant within each sample (S1 and S2). # The 'Cell_ID' and 'Cell_Type' columns vary within sample S1 and/or S2. sample_meta <- get_sample_metadata(cell_df, "Sample_ID") print(sample_meta) # Output will have 'Age' and 'Gender' as columns, # with row names 'S1' and 'S2'.
This function visualizes the relative abundance (composition) of cell types across samples or across aggregated groups. It automatically handles data preparation and ordering based on the provided parameters.
plot_barplot( se, label_col = NULL, plot_by = c("sample", "group"), custom_sample_order = NULL, title = "", facet_by_label_col = TRUE )plot_barplot( se, label_col = NULL, plot_by = c("sample", "group"), custom_sample_order = NULL, title = "", facet_by_label_col = TRUE )
se |
A
SummarizedExperiment
object containing cell type relative frequencies in the |
label_col |
Character string (optional, default: |
plot_by |
Character string (default: |
custom_sample_order |
Character vector (optional, default: |
title |
Character string (default: |
facet_by_label_col |
Logical (default: |
A ggplot object representing the stacked bar plot.
create_long_data,
SummarizedExperiment
data(example_data) se <- ecoda( data = example_data$Zhang$cell_counts$authors_LR, metadata = example_data$Zhang$metadata, ) plot_barplot(se) # Plotting average cell type abundance by experimental group plot_barplot( se, label_col = "Tissue", plot_by = "group", title = "Mean Relative Abundance by Condition" ) # Plotting cell type abundance for each sample separately plot_barplot( se, label_col = "Tissue", plot_by = "sample", title = "Relative Abundance for Each Sample" )data(example_data) se <- ecoda( data = example_data$Zhang$cell_counts$authors_LR, metadata = example_data$Zhang$metadata, ) plot_barplot(se) # Plotting average cell type abundance by experimental group plot_barplot( se, label_col = "Tissue", plot_by = "group", title = "Mean Relative Abundance by Condition" ) # Plotting cell type abundance for each sample separately plot_barplot( se, label_col = "Tissue", plot_by = "sample", title = "Relative Abundance for Each Sample" )
This function visualizes the distribution of CLR-transformed abundance for each cell type using boxplots, optionally splitting the data by a sample metadata column and performing statistical tests for comparison between groups.
plot_boxplot( se, assay = c("clr", "clr_hvc", "asin_sqrt"), label_col = NULL, selected_celltypes = NULL, title = "", stat_method = "wilcox.test", paired = FALSE, signif_label = c("p.signif", "p.format") )plot_boxplot( se, assay = c("clr", "clr_hvc", "asin_sqrt"), label_col = NULL, selected_celltypes = NULL, title = "", stat_method = "wilcox.test", paired = FALSE, signif_label = c("p.signif", "p.format") )
se |
An
SummarizedExperiment
object containing the CLR-transformed abundances in the |
assay |
Character string (default: |
label_col |
Character string (optional, default: |
selected_celltypes |
Specify selected celltypes you want to plot instead of plotting boxplots for all. Note: this does not recalculate CLR-values on subset. |
title |
Character string (default: |
stat_method |
Character string (default: |
paired |
Logical (default: |
signif_label |
Character string (default: |
Statistical Test Logic:
If the number of groups (label_col levels) is 2, the
function uses the specified stat_method
(default: "wilcox.test") and displays pairwise significance labels
or stars.
If the number of groups is 3 or more, and
stat_method is "wilcox.test" or "t.test", the function
automatically switches to the global non-parametric test,
"kruskal.test", and displays the overall p-value
for the comparison across all groups.
A ggplot object (enhanced by ggpubr) representing the
CLR abundance boxplot, including jittered data points and dynamic
significance information (pairwise stars for 2 groups, overall p-value for
3+ groups).
create_long_data,
SummarizedExperiment
data(example_data) se <- ecoda( data = example_data$Zhang$cell_counts$authors_LR, metadata = example_data$Zhang$metadata, ) # 1. Boxplots for CLR abundance without grouping (no stats calculated): plot_boxplot(se) # 2. Boxplots grouped by 'Treatment' (2 groups) and applying Wilcoxon test: plot_boxplot( se, label_col = "Tissue", stat_method = "wilcox.test", title = "CLR Abundance by Tissue (with Wilcoxon Test)" )data(example_data) se <- ecoda( data = example_data$Zhang$cell_counts$authors_LR, metadata = example_data$Zhang$metadata, ) # 1. Boxplots for CLR abundance without grouping (no stats calculated): plot_boxplot(se) # 2. Boxplots grouped by 'Treatment' (2 groups) and applying Wilcoxon test: plot_boxplot( se, label_col = "Tissue", stat_method = "wilcox.test", title = "CLR Abundance by Tissue (with Wilcoxon Test)" )
Calculates the pairwise Pearson correlation matrix for all cell
types (columns) using the Centered Log-Ratio (CLR) transformed abundances
stored in assay(se, "clr"). It then visualizes this matrix as a
heatmap using corrplot::corrplot.
plot_corr( se, assay = c("clr", "counts", "counts_imp", "freq", "freq_imp", "asin_sqrt", "clr_hvc", "pb"), order = "hclust", hclust.method = "ward.D2", ... )plot_corr( se, assay = c("clr", "counts", "counts_imp", "freq", "freq_imp", "asin_sqrt", "clr_hvc", "pb"), order = "hclust", hclust.method = "ward.D2", ... )
se |
A SummarizedExperiment object. |
assay |
Character string (default: |
order |
Character string (default:
|
hclust.method |
Character string (default: |
... |
Additional arguments passed to |
The function uses the CLR matrix, where high correlation between two cell types suggests they vary together across samples, indicating potential co-occurrence or co-regulation.
A plot object generated by corrplot::corrplot, which is a base
R plot or a grid object depending on the corrplot version and options.
data(example_data) # Example of a large cohort with 868 samples and 69 cell types se <- ecoda( data = example_data$Gongsharma_full$cell_counts$authors_HR, metadata = example_data$Gongsharma_full$metadata ) plot_corr(se)data(example_data) # Example of a large cohort with 868 samples and 69 cell types se <- ecoda( data = example_data$Gongsharma_full$cell_counts$authors_HR, metadata = example_data$Gongsharma_full$metadata ) plot_corr(se)
This function visualizes a data matrix from a specified slot (e.g., CLR-transformed, frequency, or pseudobulk data) after mean-centering. It supports optional filtering to only Highly Variable Cell Types (HVCs) and includes a sample annotation sidebar based on a specified metadata column.
plot_heatmap( se, assay = c("clr", "counts", "counts_imp", "freq", "freq_imp", "asin_sqrt", "clr_hvc", "pb"), label_col, cluster_rows = TRUE, cluster_cols = TRUE, scale = "none", clustering_method = "ward.D2", angle_col = "90", ... )plot_heatmap( se, assay = c("clr", "counts", "counts_imp", "freq", "freq_imp", "asin_sqrt", "clr_hvc", "pb"), label_col, cluster_rows = TRUE, cluster_cols = TRUE, scale = "none", clustering_method = "ward.D2", angle_col = "90", ... )
se |
A SummarizedExperiment object. |
assay |
Character string (default: |
label_col |
Character string. The name of the column in
|
cluster_rows |
Logical (default: |
cluster_cols |
Logical (default: |
scale |
Character string (default: |
clustering_method |
Character string (default: |
angle_col |
Character string (default: |
... |
Additional arguments passed directly to the |
A pheatmap object, which is typically visualized automatically
when called interactively.
SummarizedExperiment,
pheatmap
# Example for a simple dataset: data(example_data) se <- ecoda( data = example_data$Zhang$cell_counts$authors_LR, metadata = example_data$Zhang$metadata, ) plot_heatmap(se, label_col = c("Clinical.efficacy.", "Tissue")) plot_heatmap( se, label_col = c("Clinical.efficacy.", "Tissue"), # Additional arguments for pheatmap: cutree_rows = 3, cutree_cols = 3 ) # Example of a large cohort with 868 samples and 69 cell types se <- ecoda( data = example_data$Gongsharma_full$cell_counts$authors_HR, metadata = example_data$Gongsharma_full$metadata ) plot_heatmap( se, label_col = c("subject.cmv", "age_group"), cutree_rows = 3, cutree_cols = 5, show_colnames = FALSE ) # Using only the most highly variable cell types (HVCs) plot_heatmap( se, assay = "clr_hvc", label_col = c("subject.cmv", "age_group"), cutree_rows = 3, cutree_cols = 4, show_colnames = FALSE )# Example for a simple dataset: data(example_data) se <- ecoda( data = example_data$Zhang$cell_counts$authors_LR, metadata = example_data$Zhang$metadata, ) plot_heatmap(se, label_col = c("Clinical.efficacy.", "Tissue")) plot_heatmap( se, label_col = c("Clinical.efficacy.", "Tissue"), # Additional arguments for pheatmap: cutree_rows = 3, cutree_cols = 3 ) # Example of a large cohort with 868 samples and 69 cell types se <- ecoda( data = example_data$Gongsharma_full$cell_counts$authors_HR, metadata = example_data$Gongsharma_full$metadata ) plot_heatmap( se, label_col = c("subject.cmv", "age_group"), cutree_rows = 3, cutree_cols = 5, show_colnames = FALSE ) # Using only the most highly variable cell types (HVCs) plot_heatmap( se, assay = "clr_hvc", label_col = c("subject.cmv", "age_group"), cutree_rows = 3, cutree_cols = 4, show_colnames = FALSE )
Performs Principal Component Analysis (PCA) on a selected data
matrix from the SummarizedExperiment object (default:
CLR-transformed abundances, clr) and visualizes the results. It can
also calculate and display several metrics to evaluate the separation of
groups defined by label_col. It uses
factoextra.
plot_pca( se, assay = c("clr", "counts", "counts_imp", "freq", "freq_imp", "asin_sqrt", "clr_hvc", "pb"), label_col = NULL, scale. = FALSE, title = NULL, title_show_n_features = TRUE, legend_title = "Group", show_label_samples = FALSE, score_digits = 3, cluster_score = FALSE, mod_score = FALSE, sil_score = FALSE, anosim_score = TRUE, anosim_permutations = 99, anosim_parallel = 1, ari_nclusts = NULL, knn_k = 3, pointsize = 3, labelsize = 4, coord_equal = TRUE, axes = c(1, 2), invisible = c("var", "quali"), geom = "point", n_hv_feat_show = Inf, repel = TRUE )plot_pca( se, assay = c("clr", "counts", "counts_imp", "freq", "freq_imp", "asin_sqrt", "clr_hvc", "pb"), label_col = NULL, scale. = FALSE, title = NULL, title_show_n_features = TRUE, legend_title = "Group", show_label_samples = FALSE, score_digits = 3, cluster_score = FALSE, mod_score = FALSE, sil_score = FALSE, anosim_score = TRUE, anosim_permutations = 99, anosim_parallel = 1, ari_nclusts = NULL, knn_k = 3, pointsize = 3, labelsize = 4, coord_equal = TRUE, axes = c(1, 2), invisible = c("var", "quali"), geom = "point", n_hv_feat_show = Inf, repel = TRUE )
se |
A SummarizedExperiment object. |
assay |
Character string (default: |
label_col |
Character string (optional, default: |
scale. |
Logical (default: |
title |
Character string (optional, default: |
title_show_n_features |
Logical (optional, default: |
legend_title |
Character string (default: |
show_label_samples |
Logical (default: |
score_digits |
Integer (default: |
cluster_score |
Logical (default: |
mod_score |
Logical (default: |
sil_score |
Logical (default: |
anosim_score |
Logical (default: |
anosim_permutations |
Integer (default: |
anosim_parallel |
Integer (default: |
ari_nclusts |
Integer (optional, default: |
knn_k |
Integer (optional, default: |
pointsize |
Numeric (default: |
labelsize |
Numeric (default: |
coord_equal |
Logical (default: |
axes |
Numeric vector (default: |
invisible |
Character vector (default: |
geom |
Character string or vector (default:
The default |
n_hv_feat_show |
Integer (default: |
repel |
Logical (default: |
The clustering metrics (ARI, Modularity, Silhouette, ANOSIM) assess
how well the sample groupings (labels) align with the underlying
data structure in the feature space defined by the PCA.
A ggplot object via factoextra visualizing the PCA
results.
data(example_data) se <- ecoda( data = example_data$Zhang$cell_counts$authors_LR, metadata = example_data$Zhang$metadata, ) plot_pca( se, label_col = "Tissue", title = "PCA based on cell type composition", anosim_parallel = 1, n_hv_feat_show = 5 # Shows the most highly variable features (cell types) ) # Using only the most highly variable cell types plot_pca( se, assay = "clr_hvc", label_col = "Tissue", title = "PCA based on highly variable cell types", anosim_parallel = 1, n_hv_feat_show = nrow(S4Vectors::metadata(se)$clr_hvc) )data(example_data) se <- ecoda( data = example_data$Zhang$cell_counts$authors_LR, metadata = example_data$Zhang$metadata, ) plot_pca( se, label_col = "Tissue", title = "PCA based on cell type composition", anosim_parallel = 1, n_hv_feat_show = 5 # Shows the most highly variable features (cell types) ) # Using only the most highly variable cell types plot_pca( se, assay = "clr_hvc", label_col = "Tissue", title = "PCA based on highly variable cell types", anosim_parallel = 1, n_hv_feat_show = nrow(S4Vectors::metadata(se)$clr_hvc) )
Performs Principal Component Analysis (PCA) on a selected data
matrix from the ECODA object (default: CLR-transformed abundances,
clr) and visualizes the results in 3D colored by label_col.
plot_pca3d( se, assay = c("clr", "counts", "counts_imp", "freq", "freq_imp", "asin_sqrt", "clr_hvc", "pb"), label_col = NULL, scale. = FALSE )plot_pca3d( se, assay = c("clr", "counts", "counts_imp", "freq", "freq_imp", "asin_sqrt", "clr_hvc", "pb"), label_col = NULL, scale. = FALSE )
se |
A SummarizedExperiment object. |
assay |
Character string (default: |
label_col |
Character string (optional, default: |
scale. |
Logical (default: |
An interactive 3D plotly object visualizing the PCA results.
data(example_data) se <- ecoda( data = example_data$Zhang$cell_counts$authors_LR, metadata = example_data$Zhang$metadata, ) plot_pca3d(se, label_col = "Tissue")data(example_data) se <- ecoda( data = example_data$Zhang$cell_counts$authors_LR, metadata = example_data$Zhang$metadata, ) plot_pca3d(se, label_col = "Tissue")
This function visualizes the relationship between the mean abundance (CLR) and the variance (CLR) for each cell type, which is typically used to identify Highly Variable Cell Types (HVCs).
plot_varmean( se, plot_title = "", highlight_hvcs = TRUE, labels = c("only_hvc", "all", "none"), plot_fit_line = FALSE, smooth_method = "lm" )plot_varmean( se, plot_title = "", highlight_hvcs = TRUE, labels = c("only_hvc", "all", "none"), plot_fit_line = FALSE, smooth_method = "lm" )
se |
A
SummarizedExperiment
object containing pre-calculated cell type variances in
|
plot_title |
Character string (default: ""). The title for the plot. |
highlight_hvcs |
Logical (default: |
labels |
Character (default: "only_hvc"). Options: "all" (label every point), "none" (no labels), or "only_hvc" (label only the highly variable cell types). |
plot_fit_line |
Logical (default: |
smooth_method |
Character string (default: "lm"). The smoothing method
to use for the regression line if |
If highlight_hvcs is TRUE, cell types previously
identified and stored in metadata(se)$hvcs will be highlighted in
red on the plot.
A ggplot object representing the mean-variance plot.
SummarizedExperiment,
get_celltype_variances
data(example_data) se <- ecoda( data = example_data$Gongsharma_full$cell_counts$authors_HR, metadata = example_data$Gongsharma_full$metadata ) # 1. Generate the plot, highlighting HVCs (default) plot_varmean(se, plot_title = "HVCs on Mean-Variance Plot") # 2. Generate the plot without highlighting HVCs and add a fit line plot_varmean(se, highlight_hvcs = FALSE, plot_fit_line = TRUE, smooth_method = "loess" )data(example_data) se <- ecoda( data = example_data$Gongsharma_full$cell_counts$authors_HR, metadata = example_data$Gongsharma_full$metadata ) # 1. Generate the plot, highlighting HVCs (default) plot_varmean(se, plot_title = "HVCs on Mean-Variance Plot") # 2. Generate the plot without highlighting HVCs and add a fit line plot_varmean(se, highlight_hvcs = FALSE, plot_fit_line = TRUE, smooth_method = "loess" )
This function replaces zero values in a data frame or matrix to facilitate downstream transformations that require strictly positive values, such as the Centered Log-Ratio (CLR) transformation.
replace_zeros( df, rep_method = c("counts", "frac_min"), pseudo_count = 0.5, pseudo_frac_min = 2/3, add_to = NULL )replace_zeros( df, rep_method = c("counts", "frac_min"), pseudo_count = 0.5, pseudo_frac_min = 2/3, add_to = NULL )
df |
A data frame or matrix where zeros need to be replaced. |
rep_method |
Character string specifying the replacement strategy. One
of |
pseudo_count |
Numeric (default: 0.5). Used if |
pseudo_frac_min |
Numeric (default: 2/3). The multiplier for the minimum
non-zero value when |
add_to |
Character string. Should the value be added to |
The function provides two methods for handling zeros:
counts: Adds a fixed value (pseudo_count).
frac_min: Replaces zeros with a fraction
(pseudo_frac_min) of the smallest observed non-zero value in
the dataset.
A data frame or matrix of the same dimensions as df with all
zero values replaced by the calculated replacement value.
# Replace zeros in a count matrix counts_df <- data.frame(A = c(10, 0, 5), B = c(20, 10, 0)) replace_zeros(counts_df, rep_method = "counts", add_to = "all") # Replace zeros in a frequency matrix freq_df <- data.frame(A = c(0.5, 0.5), B = c(0.2, 0.8), C = c(0.0, 1.0)) replace_zeros(freq_df, rep_method = "frac_min", add_to = "zeros")# Replace zeros in a count matrix counts_df <- data.frame(A = c(10, 0, 5), B = c(20, 10, 0)) replace_zeros(counts_df, rep_method = "counts", add_to = "all") # Replace zeros in a frequency matrix freq_df <- data.frame(A = c(0.5, 0.5), B = c(0.2, 0.8), C = c(0.0, 1.0)) replace_zeros(freq_df, rep_method = "frac_min", add_to = "zeros")