Build Differentially Methylated Regions (DMRs) from Differentially Methylated Positions (seeds)

This function assembles DMRs from a given set of seeds and a beta value file. It operates in three main stages:

Seed Connectivity: It builds a connectivity array based on the correlation of beta values between seeds and their proximal sites, connecting seeds into preliminary DMRs based on significant correlations.
DMR Expansion: It expands the preliminary DMRs by including nearby sites that show significant correlation with the seeds, allowing for a specified delta-beta threshold to connect sites that may not meet the correlation p-value cutoff but have a strong effect size.
DMR Merging and Filtering: It merges overlapping extended DMRs into final DMRs and applies filtering based on the number of seeds and sites, as well as optional adjustments for array-based analyses.

Usage

buildDMRs(
  beta,
  seeds,
  pheno,
  seeds_id_col = NULL,
  sample_group_col = "Sample_Group",
  casecontrol_col = NULL,
  covariates = NULL,
  ext_site_delta_beta = 0.2,
  array = c("450K", "27K", "EPIC", "EPICv2", "NULL"),
  genome = NULL,
  max_pval = 0.05,
  entanglement = c("weak", "strong"),
  testing_mode = c("auto", "parametric", "empirical"),
  empirical_strategy = c("auto", "montecarlo", "permutations"),
  ntries = 200L,
  mid_p = FALSE,
  max_lookup_dist = 10000,
  expansion_window = "auto",
  max_bridge_seeds_gaps = 1L,
  max_bridge_extension_gaps = 1L,
  min_seeds = 2,
  min_adj_seeds = NULL,
  min_sites = 3,
  aggfun = c("median", "mean"),
  ignored_sample_groups = NULL,
  output_prefix = NULL,
  njobs = getOption("CMEnt.njobs", .defaultNJobs()),
  beta_row_names_file = NULL,
  annotate_with_genes = TRUE,
  .score_dmrs = TRUE,
  extract_motifs = TRUE,
  bed_provided = FALSE,
  bed_chrom_col = "chrom",
  bed_start_col = "start",
  verbose = getOption("CMEnt.verbose", 1),
  .load_debug = FALSE
)

Arguments

beta: Character. Path to the beta value file, or a tabix file, or a beta matrix, or a BetaHandler object, or a bed file. If a bed file is provided, it must at least contain bed_chrom_col and bed_chrom_start, followed by samples names in the given pheno, with corresponging beta values, and it will be converted to a tabix-indexed beta file internall, with the locations separately saved and queried as a DelayedDataFrame. object.
seeds: Character. Path to the seeds (seeds, etc.) TSV file or the seeds dataframe, in a format like the one produced by dmpFinder. If a pval, P.Value, p.value, or p_value column is present, DMR-level pval is computed from supporting seed p-values using Stouffer's method and qval is FDR-corrected globally.
pheno: Character. Path to the phenotype TSV file or the phenotype dataframe, containing sample information including group labels and optionally covariates.
seeds_id_col: Character. Column name or index for Seed identifiers in the seeds TSV file. Default is NULL, which corresponds to the rows names if existing, or the first column if not.
sample_group_col: Character. Column name for sample group information in the phenotype data. Default is NULL.
casecontrol_col: Boolean Column in pheno for case (TRUE/1) / control (FALSE/0) status . If NULL, controls will be assumed to be the first level of sample_group_col. Default is NULL.
covariates: Character vector of column names in pheno to adjust for (e.g. "age", "sex"). When provided, correlations are computed on residuals after regressing M-values on these covariates within each group
ext_site_delta_beta: Numeric. Minimum absolute delta beta value that will force proximal sites to be treated as connected during Stage 2 expansion, regardless of their correlation p-value. Set to NA, NULL, or Inf to disable this shortcut. A value of 0 means any proximal site with a non-missing case-control delta beta can be force-connected. Default is 0.2.
array: Character. Type of array used (e.g., "450K", "EPIC", "EPICv2", "27K"). Ignored if using a mouse genome. Also ignored if the beta file is provided as a beta values BED file. Default is "450K".
genome: Character. Genome version. Default is NULL and inferred as "hg19" for 450K, 27K, and EPIC arrays, otherwise "hg38".
max_pval: Numeric. Maximum p-value to assume seeds correlation is significant. Default is 0.05.
entanglement: Character. "weak" (default) requires at least one group to show significant correlation; "strong" requires all groups to show significant correlation for connectivity.
testing_mode: Character. "auto" (default) selects between t-based correlation p-values and empirical p-values per sample group using data diagnostics. You can also force "parametric" for t-based correlation p-values or "empirical" for permutation-based p-values.
empirical_strategy: Character. When testing_mode = "empirical": "auto" (default) uses Monte Carlo for groups with <6 samples and permutations for groups with >=6 samples; "montecarlo" always uses Monte Carlo; "permutations" always uses permutations.
ntries: Integer. Number of permutations when testing_mode = "empirical". Default is 0 (disabled).
mid_p: Logical. Whether to use mid-p values for empirical correlation tests. Default is FALSE.
max_lookup_dist: Numeric. Maximum distance to look up for adjacent seeds belonging to the same DMR during Stage 1. Default is 10000 (10 kb).
expansion_window: Numeric. Stage 2 connectivity is computed only in windows centered on seed-derived Stage 1 DMR neighborhoods, with this total window width in bp. This value sets a maximum effective size of a DMR after stage 2. Set <=0 for genome-wide connectivity. Default is -1 for microarrays and 10000 (10 kb) for NGS datasets.
max_bridge_seeds_gaps: Integer. Maximum number of consecutive failed seed-to-seed edges to bridge during Stage 1 when both flanking edges are connected and failures are p-value driven. Set to 0 to disable. Default is 1.
max_bridge_extension_gaps: Integer. Maximum gap size to consider during Stage 2 extension. Default is 1 (i.e., at most 1 consecutive failing site to bridge).
min_seeds: Numeric. Minimum number of connected seeds in a DMR. Minimum is 1. Default is 2.
min_adj_seeds: Numeric. Minimum number of seeds, adjusted by array site density, in a DMR after extension. It serves as a less stringent cutoff for arrays with variable site density, allowing regions in sparse areas to be retained if they have enough seeds relative to the local site density. Default is NULL (disabled).
min_sites: Numeric. Minimum number of sites in a DMR after extension, including the seeds. Minimum is 2. Default is 3.
aggfun: Function or character. Aggregation function to use when calculating delta beta values and p-values of DMRs. Can be "median", "mean", or a function (e.g., median, mean). Default is "median".
ignored_sample_groups: Character vector. Sample groups to ignore during connection and expansion, separated by commas. Can also be "case" or "control". Default is NULL.
output_prefix: Character. Identifier for the output files. If not provided, no output will be saved. Default is NULL.
njobs: Numeric. Number of parallel jobs to use. Default is the number of available cores.
beta_row_names_file: Character. Path to a file containing row names for the beta values. If not provided, row names will be read from the beta file. Default is NULL.
annotate_with_genes: Logical. Whether to annotate DMRs with overlapping genes. Default is TRUE.
.score_dmrs: Logical. Whether to add complementary cross-validated SVM discrimination scores to DMRs. Default is TRUE.
extract_motifs: Logical. Whether to compute DMRs seeds motifs. Default is TRUE.
bed_provided: Logical. Whether the beta file is provided as a BED file. Default is FALSE. In case the input has a .bed extension, this will be set to TRUE automatically.
bed_chrom_col: Character. Column name for chromosome in the BED file. Default is "chrom".
bed_start_col: Character. Column name for start position in the BED file. Default is "start".
verbose: Numeric. Level of verbosity for logging messages, from 0 (not verbose) to 5 (very very verbose). Default is retrieved from option "CMEnt.verbose".
.load_debug: Logical. If TRUE, enables debug mode for loading beta files. Default is FALSE.

Value

GRanges object of identified DMRs with metadata including DMR-level pval and FDR-adjusted qval when seed p-values are available.

Note on Input Data

Do not apply heavy filtering to your seeds prior to using this function, particularly based on beta values or effect sizes. The function works by expanding regions around seeds and connecting nearby sites into larger regions. Filtering out seeds with smaller effect sizes may remove important sites that could serve as "bridges" to connect more seeds into larger, biologically meaningful DMRs. For optimal results, include all statistically significant seeds (e.g., adjusted p-value < 0.05) and let the function handle region expansion and letting the function reconnect proximal sites during expansion using the ext_site_delta_beta parameter if needed. The p-value adjustment can be done using combinePvalues() or other methods, but avoid filtering based on beta value thresholds or effect size cutoffs before running this function. For BSSeq data, findDMPsBSSeq() performs seed finding using DSS.

Examples

loadExampleInputDataChr21And22("beta", "dmps", "pheno", "array_type")
# \donttest{
dmrs <- buildDMRs(
  beta = beta,
  seeds = dmps,
  pheno = pheno,
  array = array_type,
  sample_group_col = "Sample_Group"
)
# }