
Build Differentially Methylated Regions (DMRs) from Differentially Methylated Positions (seeds)
Source:R/dmrs_builder.R
buildDMRs.RdThis function assembles DMRs from a given set of seeds and a beta value file. It operates in three main stages:
Seed Connectivity: It builds a connectivity array based on the correlation of beta values between seeds and their proximal sites, connecting seeds into preliminary DMRs based on significant correlations.
DMR Expansion: It expands the preliminary DMRs by including nearby sites that show significant correlation with the seeds, allowing for a specified delta-beta threshold to connect sites that may not meet the correlation p-value cutoff but have a strong effect size.
DMR Merging and Filtering: It merges overlapping extended DMRs into final DMRs and applies filtering based on the number of seeds and sites, as well as optional adjustments for array-based analyses.
Usage
buildDMRs(
beta,
seeds,
pheno,
seeds_id_col = NULL,
sample_group_col = "Sample_Group",
casecontrol_col = NULL,
covariates = NULL,
ext_site_delta_beta = 0.2,
array = c("450K", "27K", "EPIC", "EPICv2", "NULL"),
genome = NULL,
max_pval = 0.05,
entanglement = c("weak", "strong"),
testing_mode = c("auto", "parametric", "empirical"),
empirical_strategy = c("auto", "montecarlo", "permutations"),
ntries = 200L,
mid_p = FALSE,
max_lookup_dist = 10000,
expansion_window = "auto",
max_bridge_seeds_gaps = 1L,
max_bridge_extension_gaps = 1L,
min_seeds = 2,
min_adj_seeds = NULL,
min_sites = 3,
aggfun = c("median", "mean"),
ignored_sample_groups = NULL,
output_prefix = NULL,
njobs = getOption("CMEnt.njobs", .defaultNJobs()),
beta_row_names_file = NULL,
annotate_with_genes = TRUE,
.score_dmrs = TRUE,
extract_motifs = TRUE,
bed_provided = FALSE,
bed_chrom_col = "chrom",
bed_start_col = "start",
verbose = getOption("CMEnt.verbose", 1),
.load_debug = FALSE
)Arguments
- beta
Character. Path to the beta value file, or a tabix file, or a beta matrix, or a BetaHandler object, or a bed file. If a bed file is provided, it must at least contain bed_chrom_col and bed_chrom_start, followed by samples names in the given pheno, with corresponging beta values, and it will be converted to a tabix-indexed beta file internall, with the locations separately saved and queried as a DelayedDataFrame. object.
- seeds
Character. Path to the seeds (seeds, etc.) TSV file or the seeds dataframe, in a format like the one produced by dmpFinder. If a
pval,P.Value,p.value, orp_valuecolumn is present, DMR-levelpvalis computed from supporting seed p-values using Stouffer's method andqvalis FDR-corrected globally.- pheno
Character. Path to the phenotype TSV file or the phenotype dataframe, containing sample information including group labels and optionally covariates.
- seeds_id_col
Character. Column name or index for Seed identifiers in the seeds TSV file. Default is NULL, which corresponds to the rows names if existing, or the first column if not.
- sample_group_col
Character. Column name for sample group information in the phenotype data. Default is NULL.
- casecontrol_col
Boolean Column in pheno for case (TRUE/1) / control (FALSE/0) status . If NULL, controls will be assumed to be the first level of sample_group_col. Default is NULL.
- covariates
Character vector of column names in pheno to adjust for (e.g. "age", "sex"). When provided, correlations are computed on residuals after regressing M-values on these covariates within each group
- ext_site_delta_beta
Numeric. Minimum absolute delta beta value that will force proximal sites to be treated as connected during Stage 2 expansion, regardless of their correlation p-value. Set to
NA,NULL, orInfto disable this shortcut. A value of0means any proximal site with a non-missing case-control delta beta can be force-connected. Default is 0.2.- array
Character. Type of array used (e.g., "450K", "EPIC", "EPICv2", "27K"). Ignored if using a mouse genome. Also ignored if the beta file is provided as a beta values BED file. Default is "450K".
- genome
Character. Genome version. Default is NULL and inferred as "hg19" for 450K, 27K, and EPIC arrays, otherwise "hg38".
- max_pval
Numeric. Maximum p-value to assume seeds correlation is significant. Default is 0.05.
- entanglement
Character. "weak" (default) requires at least one group to show significant correlation; "strong" requires all groups to show significant correlation for connectivity.
- testing_mode
Character. "auto" (default) selects between t-based correlation p-values and empirical p-values per sample group using data diagnostics. You can also force "parametric" for t-based correlation p-values or "empirical" for permutation-based p-values.
- empirical_strategy
Character. When testing_mode = "empirical": "auto" (default) uses Monte Carlo for groups with <6 samples and permutations for groups with >=6 samples; "montecarlo" always uses Monte Carlo; "permutations" always uses permutations.
- ntries
Integer. Number of permutations when testing_mode = "empirical". Default is 0 (disabled).
- mid_p
Logical. Whether to use mid-p values for empirical correlation tests. Default is FALSE.
- max_lookup_dist
Numeric. Maximum distance to look up for adjacent seeds belonging to the same DMR during Stage 1. Default is 10000 (10 kb).
- expansion_window
Numeric. Stage 2 connectivity is computed only in windows centered on seed-derived Stage 1 DMR neighborhoods, with this total window width in bp. This value sets a maximum effective size of a DMR after stage 2. Set <=0 for genome-wide connectivity. Default is -1 for microarrays and 10000 (10 kb) for NGS datasets.
- max_bridge_seeds_gaps
Integer. Maximum number of consecutive failed seed-to-seed edges to bridge during Stage 1 when both flanking edges are connected and failures are p-value driven. Set to 0 to disable. Default is 1.
- max_bridge_extension_gaps
Integer. Maximum gap size to consider during Stage 2 extension. Default is 1 (i.e., at most 1 consecutive failing site to bridge).
- min_seeds
Numeric. Minimum number of connected seeds in a DMR. Minimum is 1. Default is 2.
- min_adj_seeds
Numeric. Minimum number of seeds, adjusted by array site density, in a DMR after extension. It serves as a less stringent cutoff for arrays with variable site density, allowing regions in sparse areas to be retained if they have enough seeds relative to the local site density. Default is NULL (disabled).
- min_sites
Numeric. Minimum number of sites in a DMR after extension, including the seeds. Minimum is 2. Default is 3.
- aggfun
Function or character. Aggregation function to use when calculating delta beta values and p-values of DMRs. Can be "median", "mean", or a function (e.g., median, mean). Default is "median".
- ignored_sample_groups
Character vector. Sample groups to ignore during connection and expansion, separated by commas. Can also be "case" or "control". Default is NULL.
- output_prefix
Character. Identifier for the output files. If not provided, no output will be saved. Default is NULL.
- njobs
Numeric. Number of parallel jobs to use. Default is the number of available cores.
- beta_row_names_file
Character. Path to a file containing row names for the beta values. If not provided, row names will be read from the beta file. Default is NULL.
- annotate_with_genes
Logical. Whether to annotate DMRs with overlapping genes. Default is TRUE.
- .score_dmrs
Logical. Whether to add complementary cross-validated SVM discrimination scores to DMRs. Default is TRUE.
- extract_motifs
Logical. Whether to compute DMRs seeds motifs. Default is TRUE.
- bed_provided
Logical. Whether the beta file is provided as a BED file. Default is FALSE. In case the input has a .bed extension, this will be set to TRUE automatically.
- bed_chrom_col
Character. Column name for chromosome in the BED file. Default is "chrom".
- bed_start_col
Character. Column name for start position in the BED file. Default is "start".
- verbose
Numeric. Level of verbosity for logging messages, from 0 (not verbose) to 5 (very very verbose). Default is retrieved from option "CMEnt.verbose".
- .load_debug
Logical. If TRUE, enables debug mode for loading beta files. Default is FALSE.
Value
GRanges object of identified DMRs with metadata including DMR-level
pval and FDR-adjusted qval when seed p-values are available.
Note on Input Data
Do not apply heavy filtering to your seeds prior to using this function, particularly based on
beta values or effect sizes. The function works by expanding regions around seeds
and connecting nearby sites into larger regions. Filtering out seeds with smaller effect sizes
may remove important sites that could serve as "bridges" to connect more seeds into
larger, biologically meaningful DMRs. For optimal results, include all statistically
significant seeds (e.g., adjusted p-value < 0.05) and let the function handle region expansion
and letting the function reconnect proximal sites during expansion using the
ext_site_delta_beta parameter if needed. The p-value adjustment can be done using combinePvalues() or other methods,
but avoid filtering based on beta value thresholds or effect size cutoffs before running this function. For BSSeq data,
findDMPsBSSeq() performs seed finding using DSS.
Examples
loadExampleInputDataChr21And22("beta", "dmps", "pheno", "array_type")
# \donttest{
dmrs <- buildDMRs(
beta = beta,
seeds = dmps,
pheno = pheno,
array = array_type,
sample_group_col = "Sample_Group"
)
# }