Skip to contents

This function assembles DMRs from a given set of seeds and a beta value file. It operates in three main stages:

  1. Seed Connectivity: It builds a connectivity array based on the correlation of beta values between seeds and their proximal sites, connecting seeds into preliminary DMRs based on significant correlations.

  2. DMR Expansion: It expands the preliminary DMRs by including nearby sites that show significant correlation with the seeds, allowing for a specified delta-beta threshold to connect sites that may not meet the correlation p-value cutoff but have a strong effect size.

  3. DMR Merging and Filtering: It merges overlapping extended DMRs into final DMRs and applies filtering based on the number of seeds and sites, as well as optional adjustments for array-based analyses.

Usage

buildDMRs(
  beta,
  seeds,
  pheno,
  seeds_id_col = NULL,
  sample_group_col = "Sample_Group",
  casecontrol_col = NULL,
  covariates = NULL,
  ext_site_delta_beta = 0.2,
  array = c("450K", "27K", "EPIC", "EPICv2", "NULL"),
  genome = NULL,
  max_pval = 0.05,
  entanglement = c("weak", "strong"),
  testing_mode = c("auto", "parametric", "empirical"),
  empirical_strategy = c("auto", "montecarlo", "permutations"),
  ntries = 200L,
  mid_p = FALSE,
  max_lookup_dist = 10000,
  expansion_window = "auto",
  max_bridge_seeds_gaps = 1L,
  max_bridge_extension_gaps = 1L,
  min_seeds = 2,
  min_adj_seeds = NULL,
  min_sites = 3,
  aggfun = c("median", "mean"),
  ignored_sample_groups = NULL,
  output_prefix = NULL,
  njobs = getOption("CMEnt.njobs", .defaultNJobs()),
  beta_row_names_file = NULL,
  annotate_with_genes = TRUE,
  .score_dmrs = TRUE,
  extract_motifs = TRUE,
  bed_provided = FALSE,
  bed_chrom_col = "chrom",
  bed_start_col = "start",
  verbose = getOption("CMEnt.verbose", 1),
  .load_debug = FALSE
)

Arguments

beta

Character. Path to the beta value file, or a tabix file, or a beta matrix, or a BetaHandler object, or a bed file. If a bed file is provided, it must at least contain bed_chrom_col and bed_chrom_start, followed by samples names in the given pheno, with corresponging beta values, and it will be converted to a tabix-indexed beta file internall, with the locations separately saved and queried as a DelayedDataFrame. object.

seeds

Character. Path to the seeds (seeds, etc.) TSV file or the seeds dataframe, in a format like the one produced by dmpFinder. If a pval, P.Value, p.value, or p_value column is present, DMR-level pval is computed from supporting seed p-values using Stouffer's method and qval is FDR-corrected globally.

pheno

Character. Path to the phenotype TSV file or the phenotype dataframe, containing sample information including group labels and optionally covariates.

seeds_id_col

Character. Column name or index for Seed identifiers in the seeds TSV file. Default is NULL, which corresponds to the rows names if existing, or the first column if not.

sample_group_col

Character. Column name for sample group information in the phenotype data. Default is NULL.

casecontrol_col

Boolean Column in pheno for case (TRUE/1) / control (FALSE/0) status . If NULL, controls will be assumed to be the first level of sample_group_col. Default is NULL.

covariates

Character vector of column names in pheno to adjust for (e.g. "age", "sex"). When provided, correlations are computed on residuals after regressing M-values on these covariates within each group

ext_site_delta_beta

Numeric. Minimum absolute delta beta value that will force proximal sites to be treated as connected during Stage 2 expansion, regardless of their correlation p-value. Set to NA, NULL, or Inf to disable this shortcut. A value of 0 means any proximal site with a non-missing case-control delta beta can be force-connected. Default is 0.2.

array

Character. Type of array used (e.g., "450K", "EPIC", "EPICv2", "27K"). Ignored if using a mouse genome. Also ignored if the beta file is provided as a beta values BED file. Default is "450K".

genome

Character. Genome version. Default is NULL and inferred as "hg19" for 450K, 27K, and EPIC arrays, otherwise "hg38".

max_pval

Numeric. Maximum p-value to assume seeds correlation is significant. Default is 0.05.

entanglement

Character. "weak" (default) requires at least one group to show significant correlation; "strong" requires all groups to show significant correlation for connectivity.

testing_mode

Character. "auto" (default) selects between t-based correlation p-values and empirical p-values per sample group using data diagnostics. You can also force "parametric" for t-based correlation p-values or "empirical" for permutation-based p-values.

empirical_strategy

Character. When testing_mode = "empirical": "auto" (default) uses Monte Carlo for groups with <6 samples and permutations for groups with >=6 samples; "montecarlo" always uses Monte Carlo; "permutations" always uses permutations.

ntries

Integer. Number of permutations when testing_mode = "empirical". Default is 0 (disabled).

mid_p

Logical. Whether to use mid-p values for empirical correlation tests. Default is FALSE.

max_lookup_dist

Numeric. Maximum distance to look up for adjacent seeds belonging to the same DMR during Stage 1. Default is 10000 (10 kb).

expansion_window

Numeric. Stage 2 connectivity is computed only in windows centered on seed-derived Stage 1 DMR neighborhoods, with this total window width in bp. This value sets a maximum effective size of a DMR after stage 2. Set <=0 for genome-wide connectivity. Default is -1 for microarrays and 10000 (10 kb) for NGS datasets.

max_bridge_seeds_gaps

Integer. Maximum number of consecutive failed seed-to-seed edges to bridge during Stage 1 when both flanking edges are connected and failures are p-value driven. Set to 0 to disable. Default is 1.

max_bridge_extension_gaps

Integer. Maximum gap size to consider during Stage 2 extension. Default is 1 (i.e., at most 1 consecutive failing site to bridge).

min_seeds

Numeric. Minimum number of connected seeds in a DMR. Minimum is 1. Default is 2.

min_adj_seeds

Numeric. Minimum number of seeds, adjusted by array site density, in a DMR after extension. It serves as a less stringent cutoff for arrays with variable site density, allowing regions in sparse areas to be retained if they have enough seeds relative to the local site density. Default is NULL (disabled).

min_sites

Numeric. Minimum number of sites in a DMR after extension, including the seeds. Minimum is 2. Default is 3.

aggfun

Function or character. Aggregation function to use when calculating delta beta values and p-values of DMRs. Can be "median", "mean", or a function (e.g., median, mean). Default is "median".

ignored_sample_groups

Character vector. Sample groups to ignore during connection and expansion, separated by commas. Can also be "case" or "control". Default is NULL.

output_prefix

Character. Identifier for the output files. If not provided, no output will be saved. Default is NULL.

njobs

Numeric. Number of parallel jobs to use. Default is the number of available cores.

beta_row_names_file

Character. Path to a file containing row names for the beta values. If not provided, row names will be read from the beta file. Default is NULL.

annotate_with_genes

Logical. Whether to annotate DMRs with overlapping genes. Default is TRUE.

.score_dmrs

Logical. Whether to add complementary cross-validated SVM discrimination scores to DMRs. Default is TRUE.

extract_motifs

Logical. Whether to compute DMRs seeds motifs. Default is TRUE.

bed_provided

Logical. Whether the beta file is provided as a BED file. Default is FALSE. In case the input has a .bed extension, this will be set to TRUE automatically.

bed_chrom_col

Character. Column name for chromosome in the BED file. Default is "chrom".

bed_start_col

Character. Column name for start position in the BED file. Default is "start".

verbose

Numeric. Level of verbosity for logging messages, from 0 (not verbose) to 5 (very very verbose). Default is retrieved from option "CMEnt.verbose".

.load_debug

Logical. If TRUE, enables debug mode for loading beta files. Default is FALSE.

Value

GRanges object of identified DMRs with metadata including DMR-level pval and FDR-adjusted qval when seed p-values are available.

Note on Input Data

Do not apply heavy filtering to your seeds prior to using this function, particularly based on beta values or effect sizes. The function works by expanding regions around seeds and connecting nearby sites into larger regions. Filtering out seeds with smaller effect sizes may remove important sites that could serve as "bridges" to connect more seeds into larger, biologically meaningful DMRs. For optimal results, include all statistically significant seeds (e.g., adjusted p-value < 0.05) and let the function handle region expansion and letting the function reconnect proximal sites during expansion using the ext_site_delta_beta parameter if needed. The p-value adjustment can be done using combinePvalues() or other methods, but avoid filtering based on beta value thresholds or effect size cutoffs before running this function. For BSSeq data, findDMPsBSSeq() performs seed finding using DSS.

Examples

loadExampleInputDataChr21And22("beta", "dmps", "pheno", "array_type")
# \donttest{
dmrs <- buildDMRs(
  beta = beta,
  seeds = dmps,
  pheno = pheno,
  array = array_type,
  sample_group_col = "Sample_Group"
)
# }