Overview
This vignette provides a comprehensive guide to the configuration parameters available in CMEnt. Understanding these parameters will help you optimize the package for your specific analysis needs.
Function Parameters
Core Input Parameters
beta
Type: Character, matrix, BetaHandler object, or BED file
Required: Yes
Description: Input methylation data. Can be:
- Path to a beta value file (tab-separated). The file is loaded into memory if its size is below getOption("CMEnt.beta_in_mem_threshold_mb") megabytes; otherwise it is genomically sorted and converted to tabix for faster access, if samtools tabix is installed.
- Path to a tabix-indexed file (.bed.gz with .tbi index)
- A beta matrix with site IDs as row names and sample IDs as column names.
- A BetaHandler object (see ?BetaHandler)
- A BED file with columns bed_chrom_col and bed_start_col, followed by sample columns whose names exist as row names in the provided pheno data
Example:
loadExampleInputDataChr5And11("beta")
seeds
Type: Character or data.frame
Required: Yes
Description: Sites to use as seeds for DMR detection. Can be:
- Path to a file with line-separated site IDs
- A data.frame with DMP information
Format requirements:
- Row names, the first column, or the column given by seeds_id_col should contain site IDs
- The site IDs must match those in the beta data, either as Illumina IDs or genomic coordinates (chr:start). The latter is required when using BED files for beta.
Example:
loadExampleInputDataChr5And11("dmps")
seeds <- dmps
pheno
Type: Data.frame
Required: Yes
Description: Sample phenotype information.
Format requirements:
- Row names should match column names in beta data
- Must contain sample group information column (specified by sample_group_col)
- May contain case/control status column (specified by casecontrol_col)
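The format requirements above can be checked before running an analysis. The following is a minimal sketch in base R on toy data; the object names and the checks themselves are illustrative assumptions, not CMEnt's internal validation.

```r
# Toy inputs; in a real analysis these come from your own data.
beta <- matrix(
  c(0.10, 0.80, 0.15, 0.75,
    0.20, 0.90, 0.25, 0.85),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("cg0001", "cg0002"), c("S1", "S2", "S3", "S4"))
)
pheno <- data.frame(
  Sample_Group = c("control", "case", "control", "case"),
  row.names = c("S1", "S2", "S3", "S4")
)

# Samples present in one input but not the other.
missing_in_pheno <- setdiff(colnames(beta), rownames(pheno))
missing_in_beta <- setdiff(rownames(pheno), colnames(beta))

# The grouping column must exist before it can be passed as sample_group_col.
has_group_col <- "Sample_Group" %in% colnames(pheno)

inputs_ok <- length(missing_in_pheno) == 0 &&
  length(missing_in_beta) == 0 &&
  has_group_col
```

If inputs_ok is FALSE, the two setdiff() results show which sample IDs need to be reconciled.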
Example:
loadExampleInputDataChr5And11("pheno")
Sample Grouping Parameters
sample_group_col
Type: Character
Default: "Sample_Group"
Description: Column name in pheno that
specifies sample groups (e.g., “case” vs “control”, “treated” vs
“untreated”). More than two groups are supported.
casecontrol_col
Type: Character
Default: NULL
Description: Column name in pheno for case (TRUE/1) vs control (FALSE/0) status, used for delta beta computations. If NULL, controls are assumed to be the first level found in sample_group_col.
Array and Genome Parameters
Filtering Parameters
ext_site_delta_beta
Type: Numeric
Default: 0.2
Range: 0 to 1, or NA to disable
Description: Minimum absolute delta beta for a neighboring site to be included in a DMR during the second stage of DMR extension, without considering correlation.
Recommendation: Keep 0.2 for balanced precision/recall. Use NA to disable the shortcut entirely. Set to 0 only when you intentionally want any proximal site with a non-missing case/control delta beta to be eligible for force-connection.
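The effect of the threshold can be illustrated with a small base R sketch. The helper below is hypothetical and not part of CMEnt; it assumes delta beta is the difference between the case and control group means and that the comparison is inclusive (>=), neither of which is confirmed by this guide.

```r
# Hypothetical helper, not part of CMEnt: flags sites whose absolute
# case-minus-control delta beta reaches the threshold.
eligible_by_delta_beta <- function(beta, is_case, threshold = 0.2) {
  # NA disables the shortcut: no site is force-connected.
  if (is.na(threshold)) {
    return(rep(FALSE, nrow(beta)))
  }
  delta <- rowMeans(beta[, is_case, drop = FALSE], na.rm = TRUE) -
    rowMeans(beta[, !is_case, drop = FALSE], na.rm = TRUE)
  # Sites with a missing delta beta are never eligible.
  !is.na(delta) & abs(delta) >= threshold
}

# Three sites (rows), two controls followed by two cases (columns).
toy_beta <- matrix(
  c(0.1, 0.2, 0.5, 0.6,
    0.4, 0.4, 0.5, 0.5,
    0.9, 0.8, 0.3, 0.2),
  nrow = 3, byrow = TRUE
)
is_case <- c(FALSE, FALSE, TRUE, TRUE)

at_default <- eligible_by_delta_beta(toy_beta, is_case, threshold = 0.2)
at_zero <- eligible_by_delta_beta(toy_beta, is_case, threshold = 0)
disabled <- eligible_by_delta_beta(toy_beta, is_case, threshold = NA)
```

With the default, only the first and third toy sites qualify; at 0 every site with a non-missing delta beta qualifies; with NA none do.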
min_seeds
Type: Integer
Default: 1
Description: Minimum number of connected seeds required
in a DMR.
Recommendation: Increase this value (e.g., to 3 or 4) for higher-confidence DMRs.
Region Building Parameters
max_lookup_dist
Type: Integer
Default: 10000
Unit: Base pairs
Description: Maximum genomic distance between adjacent
seeds to be considered part of the same DMR.
Recommendation:
- 1,000-5,000 bp for tightly connected regions
- 10,000 bp (default) for moderate spacing
- 20,000+ bp for broader regions
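To see how the distance cutoff alone shapes candidate regions, consider the sketch below. It uses a hypothetical helper on made-up positions from a single chromosome and ignores the correlation testing that CMEnt also applies; treating a gap equal to max_lookup_dist as still connected is an assumption.

```r
# Hypothetical helper: assigns a cluster index to seed positions on one
# chromosome, starting a new cluster whenever the gap to the previous
# seed exceeds max_lookup_dist.
group_seeds_by_distance <- function(positions, max_lookup_dist = 10000) {
  positions <- sort(positions)
  gaps <- diff(positions)
  cumsum(c(TRUE, gaps > max_lookup_dist))
}

seed_pos <- c(1000, 4000, 9000, 30000, 36000, 90000)

tight <- group_seeds_by_distance(seed_pos, max_lookup_dist = 5000)
# tight is 1 1 1 2 3 4: four candidate regions
moderate <- group_seeds_by_distance(seed_pos, max_lookup_dist = 10000)
# moderate is 1 1 1 2 2 3: three candidate regions
```

Raising the cutoff merges the seeds at 30,000 and 36,000 into one candidate region, which is the behavior the recommendations above trade on.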
expansion_window
Type: Numeric
Default: 1e6
Description: Stage 2 connectivity is computed only
around seed-derived Stage 1 neighborhoods, using this total window width
in base pairs.
Set to <= 0 to compute connectivity genome-wide.
Statistical Parameters
max_pval
Type: Numeric
Default: 0.05
Range: 0 to 1
Description: Maximum p-value threshold for considering correlation between seeds as significant during the first stage of connectivity testing, and between proximal sites during the second stage of DMR extension. Under strong entanglement, a Bonferroni correction is applied based on the number of sample groups (number of tests per site).
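As a rough illustration of the Bonferroni step, the sketch below divides max_pval by the number of sample groups. The variable names and p-values are made up, and the exact way CMEnt applies the correction internally is not specified here.

```r
max_pval <- 0.05
sample_groups <- c("case", "case", "control", "control", "treated", "treated")

# One correlation test per sample group for each pair of sites.
n_tests <- length(unique(sample_groups))

# Bonferroni: divide the significance level by the number of tests.
adjusted_threshold <- max_pval / n_tests

# Equivalent view: inflate the p-values and compare to the original level.
raw_pvals <- c(0.01, 0.02, 0.004)
passes_by_threshold <- raw_pvals <= adjusted_threshold
passes_by_adjustment <- p.adjust(raw_pvals, method = "bonferroni") <= max_pval
```

Both views give the same decisions; with three groups a raw p-value must fall at or below roughly 0.0167 to count as significant.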
entanglement
Type: Character
Options: "strong", "weak"
Default: "strong"
Description: Strategy for determining connectivity
between sites across sample groups:
- "strong": Requires all sample groups to show significant correlation for two sites to be considered connected. This is more conservative and ensures consistent methylation patterns across all groups.
- "weak": Requires at least one sample group to show significant correlation. This is more permissive and may identify DMRs that are specific to certain groups.
Recommendation: Use "strong" (default)
for most cases to ensure robust, reproducible DMRs. Use
"weak" when you want to capture group-specific methylation
patterns or when working with heterogeneous sample groups.
Example:
dmrs <- buildDMRs(
  beta = beta,
  seeds = seeds,
  pheno = pheno,
  entanglement = "weak"
)
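The difference between the two strategies comes down to requiring all groups versus any group to pass. The helper below is a hypothetical sketch with made-up p-values; it omits the Bonferroni adjustment applied under strong entanglement and assumes an inclusive comparison against max_pval.

```r
# Hypothetical helper: decides whether two sites are connected given the
# correlation p-value observed in each sample group.
is_connected <- function(group_pvals, max_pval = 0.05,
                         entanglement = c("strong", "weak")) {
  entanglement <- match.arg(entanglement)
  significant <- group_pvals <= max_pval
  if (entanglement == "strong") all(significant) else any(significant)
}

# Correlated in cases only.
group_pvals <- c(case = 0.001, control = 0.20)

strong_result <- is_connected(group_pvals, entanglement = "strong")  # FALSE
weak_result <- is_connected(group_pvals, entanglement = "weak")      # TRUE
```

A site pair that is correlated in cases only is dropped under "strong" and kept under "weak", which is why "weak" surfaces group-specific patterns.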
testing_mode
Type: Character
Options: "parametric", "empirical", "auto"
Default: "auto"
Description: Method for calculating p-values during
connectivity testing:
- "parametric": Uses t-based correlation p-values (faster, assumes normal distribution)
- "empirical": Uses permutation-based p-values (slower, no distribution assumptions)
- "auto": Evaluates correlation test assumptions per sample group and chooses "parametric" only when diagnostics are acceptable; otherwise switches to "empirical"
Recommendation: Use "auto" when you
want robust defaults across heterogeneous datasets. Use
"parametric" when assumptions are known to hold and runtime
is critical, and "empirical" when assumptions are clearly
questionable.
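To make the distinction concrete, the sketch below computes both kinds of p-value for one pair of sites in base R. The data are made up, and the permutation scheme and the (count + 1) / (ntries + 1) estimator are common conventions assumed here, not a description of CMEnt's implementation.

```r
set.seed(1)  # the permutation p-value depends on the random shuffles
x <- c(0.12, 0.25, 0.31, 0.44, 0.52, 0.68, 0.71, 0.85)
y <- c(0.10, 0.30, 0.28, 0.50, 0.49, 0.70, 0.66, 0.90)

# Parametric: t-based p-value for the Pearson correlation.
p_parametric <- cor.test(x, y, method = "pearson")$p.value

# Empirical: shuffle y to break the pairing and count how often the
# permuted correlation is at least as extreme as the observed one.
observed <- abs(cor(x, y))
ntries <- 200
permuted <- replicate(ntries, abs(cor(x, sample(y))))

# The +1 terms keep the estimate away from an impossible p-value of 0.
p_empirical <- (sum(permuted >= observed) + 1) / (ntries + 1)
```

The empirical estimate can never be smaller than 1 / (ntries + 1), which is one reason the number of tries matters when strict thresholds are used.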
empirical_strategy
Type: Character
Options: "auto", "montecarlo", "permutations"
Default: "auto"
Description: Strategy for empirical p-value calculation
(only applies when testing_mode = "empirical"):
- "auto": Uses Monte Carlo for groups <6 samples, permutations for groups ≥6 samples
- "montecarlo": Always uses Monte Carlo simulation
- "permutations": Always uses exact permutations
ntries
Type: Integer
Default: 200
Description: Number of permutations/simulations when
testing_mode = "empirical". The number has an upper bound
of factorial(n) where n is the size of the
smallest sample group. If ntries exceeds this bound, it
will be reduced to factorial(n).
Recommendation:
- 100-500: Faster, less precise
- 1,000-10,000: Slower, more precise
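The factorial(n) bound described above can be sketched as follows. The helper name is hypothetical; it only mirrors the stated rule and is not taken from the package source.

```r
# Hypothetical helper: caps the requested number of tries at the number of
# distinct orderings available in the smallest sample group.
effective_ntries <- function(ntries, group_sizes) {
  min(ntries, factorial(min(group_sizes)))
}

small_groups <- effective_ntries(200, group_sizes = c(4, 9))  # 24, since 4! = 24
larger_groups <- effective_ntries(200, group_sizes = c(6, 9)) # 200, since 6! = 720
```

With a smallest group of four samples, asking for more than 24 tries has no effect, so very large ntries values only pay off when every group is reasonably sized.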
Performance Parameters
njobs
Type: Integer
Default: getOption("CMEnt.njobs", .defaultNJobs())
Description: Number of parallel jobs to use for
computation.
Recommendation:
- Use -1 to automatically use all available cores minus 1
- Limit the value to avoid overwhelming system resources
- Consider memory requirements when increasing parallelization
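One way to reason about the -1 convention is the sketch below. resolve_njobs() is a hypothetical helper, not the package's .defaultNJobs(); capping the result at the number of available cores is an added assumption.

```r
# Hypothetical helper: turns a requested njobs value into a usable
# positive worker count.
resolve_njobs <- function(njobs,
                          available = parallel::detectCores(logical = TRUE)) {
  # detectCores() can return NA on some platforms; fall back to one core.
  if (is.na(available)) available <- 1L
  if (njobs == -1) njobs <- available - 1L
  # Stay within [1, available].
  as.integer(max(1L, min(njobs, available)))
}

all_but_one <- resolve_njobs(-1, available = 8)   # 7
capped <- resolve_njobs(16, available = 8)        # 8
single_core <- resolve_njobs(-1, available = 1)   # 1
```

Leaving one core free keeps the machine responsive, and the lower bound of 1 avoids asking for zero workers on single-core systems.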
Input/Output Parameters
seeds_id_col
Type: Character or integer
Default: NULL
Description: Column name or index for seed identifiers
in the seeds file. If NULL, uses row names if present,
otherwise the first column.
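The fallback order for a data.frame of seeds can be sketched in base R as shown below. get_seed_ids() is a hypothetical helper written for illustration; it assumes that "row names if present" means explicitly set row names rather than R's automatic 1, 2, 3 numbering.

```r
# Hypothetical helper: extracts seed IDs from a data.frame following the
# fallback order described above.
get_seed_ids <- function(seeds, seeds_id_col = NULL) {
  if (!is.null(seeds_id_col)) {
    return(as.character(seeds[[seeds_id_col]]))
  }
  # .row_names_info() is positive only for explicitly set row names.
  if (.row_names_info(seeds) > 0) {
    return(rownames(seeds))
  }
  as.character(seeds[[1]])
}

with_rownames <- data.frame(
  pval = c(0.01, 0.02),
  row.names = c("cg0001", "cg0002")
)
without_rownames <- data.frame(
  id = c("cg0003", "cg0004"),
  pval = c(0.03, 0.04),
  stringsAsFactors = FALSE
)

ids_from_rownames <- get_seed_ids(with_rownames)
ids_from_first_col <- get_seed_ids(without_rownames)
ids_from_named_col <- get_seed_ids(without_rownames, seeds_id_col = "id")
```

Passing seeds_id_col explicitly removes any ambiguity when a table has both meaningful row names and an ID column.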
BED File Parameters
Annotation Parameters
annotate_with_genes
Type: Logical
Default: TRUE
Description: Whether to annotate DMRs with overlapping
genes.
.score_dmrs
Type: Logical
Default: TRUE
Description: Whether to add complementary SVM-based
discrimination scores to DMRs. When enabled, each DMR is evaluated for
its ability to separate sample groups using stratified k-fold
cross-validation with an RBF kernel SVM. The resulting
score and cv_accuracy values summarize
sample-level discriminative strength and should be read alongside DMR
pval, qval, and effect-size columns, not as
replacements for them.
Details:
- Uses stratified k-fold cross-validation (default: 5-fold)
- Number of folds can be controlled with options(CMEnt.scoring_nfold = 5)
- Reproducible fold assignments can be obtained with set.seed(...) before calling scoreDMRs()
- Higher score and cv_accuracy values indicate stronger discriminative power
- Requires the e1071 package for SVM classification
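Stratified fold assignment, and why set.seed() makes it reproducible, can be sketched in base R without fitting an SVM. The helper below is hypothetical and only illustrates the general technique; how CMEnt builds its folds is not described in this guide.

```r
# Hypothetical helper: assigns each sample to one of k folds so that every
# group is spread as evenly as possible across folds.
stratified_folds <- function(groups, k = 5) {
  folds <- integer(length(groups))
  for (g in unique(groups)) {
    idx <- which(groups == g)
    # Shuffle within the group, then deal samples out to folds in turn.
    shuffled <- idx[sample.int(length(idx))]
    folds[shuffled] <- rep_len(seq_len(k), length(idx))
  }
  folds
}

groups <- rep(c("case", "control"), each = 10)

set.seed(42)  # fixes the shuffle, making fold assignments reproducible
folds_a <- stratified_folds(groups, k = 5)
set.seed(42)
folds_b <- stratified_folds(groups, k = 5)

fold_table <- table(groups, folds_a)
```

Each fold here contains two cases and two controls, and repeating the call with the same seed yields identical assignments.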
Global Package Options
CMEnt uses several global options that can be set using the
options() function. These persist across function calls in
your R session.
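Because these options are ordinary R options, the usual set, read, and restore pattern applies. The sketch below uses option names from this guide plus one deliberately hypothetical name to show the fallback argument of getOption().

```r
# options() returns the previous values, which makes restoring them easy.
old <- options(CMEnt.njobs = 2, CMEnt.verbose = 0)

njobs_now <- getOption("CMEnt.njobs")  # 2

# For an option that is not set, getOption() returns the supplied default.
# "CMEnt.some_unset_option" is a made-up name used only for illustration.
fallback <- getOption("CMEnt.some_unset_option", 10)  # 10

# Restore whatever was set before.
options(old)
```

Restoring the previous values is useful inside scripts or functions that should not leave changed settings behind for the rest of the session.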
Parallelism
Option: CMEnt.njobs
Type: Integer
Default: min(8, parallel::detectCores(logical = TRUE) - 1)
Description: Number of parallel jobs (defaults to the
minimum of 8 and one less than the number of available CPU cores).
options("CMEnt.njobs" = 4)
Verbosity
Option: CMEnt.verbose
Type: Integer
Default: 1
Description: Default verbosity level.
options("CMEnt.verbose" = 2)
Memory Management
Option: CMEnt.beta_in_mem_threshold_mb
Type: Integer
Default: 500
Description: Maximum size (in Megabytes) of beta files
to load into memory. Files larger than this will be processed using
disk-based methods.
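The size check implied by this option can be approximated as shown below. fits_in_memory() is a hypothetical helper; treating a megabyte as 1024^2 bytes and using a strict "below" comparison are assumptions.

```r
# Hypothetical helper: TRUE when a file is small enough to load into memory.
fits_in_memory <- function(
    path,
    threshold_mb = getOption("CMEnt.beta_in_mem_threshold_mb", 500)) {
  size_mb <- file.size(path) / 1024^2
  size_mb < threshold_mb
}

# Demonstrate on a small temporary file.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("id\tS1\tS2", "cg0001\t0.1\t0.9"), tmp)

small_ok <- fits_in_memory(tmp)                       # tiny file: TRUE
forced_disk <- fits_in_memory(tmp, threshold_mb = 0)  # FALSE at threshold 0

unlink(tmp)
```

Lowering the threshold is therefore a simple way to push even moderately sized files onto the disk-based path when memory is tight.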
options("CMEnt.beta_in_mem_threshold_mb" = 200)
Caching
Option: CMEnt.use_annotation_cache
Type: Logical
Default: TRUE
Description: Enable caching of gene annotations.
options("CMEnt.use_annotation_cache" = TRUE)
Option: CMEnt.annotation_cache_dir
Type: Character
Default: USER_CACHE_DIR/R/CMEnt/annotation_cache
Description: Directory for annotation cache.
options("CMEnt.annotation_cache_dir" = "/path/to/cache")
Option: CMEnt.jaspar_cache_dir
Type: Character
Default: USER_CACHE_DIR/R/CMEnt/jaspar_cache
Description: Directory for JASPAR motif database cache.
options("CMEnt.jaspar_cache_dir" = "/path/to/cache")
Motif Analysis
Option: CMEnt.jaspar_version
Type: Integer
Default: 2024
Description: JASPAR database version to use for motif
analysis.
options("CMEnt.jaspar_version" = 2024)
Option: CMEnt.jaspar_tax_group
Type: Character
Default: "vertebrates"
Description: Taxonomic group for JASPAR motif
filtering.
options("CMEnt.jaspar_tax_group" = "vertebrates")
Option: CMEnt.min_motif_similarity
Type: Numeric
Default: 0.8
Description: Minimum motif similarity threshold for DMR interaction analysis.
options("CMEnt.min_motif_similarity" = 0.75)
Option: CMEnt.jaspar_corr_threshold
Type: Numeric
Default: 0.9
Description: Correlation threshold for JASPAR motif
similarity.
options("CMEnt.jaspar_corr_threshold" = 0.85)
Option: CMEnt.make_debug_dir
Type: Logical
Default: FALSE
Description: Create debug directory for
troubleshooting.
options("CMEnt.make_debug_dir" = TRUE)
DMR scoring
Option: CMEnt.scoring_nfold
Type: Integer
Default: 5
Description: Number of folds for cross-validation when
scoring DMRs.
options("CMEnt.scoring_nfold" = 3)
Configuration Examples
Example 1: High-Confidence DMRs with Strict Filtering
dmrs <- buildDMRs(
  beta = beta,
  seeds = seeds,
  pheno = pheno,
  sample_group_col = "Sample_Group",
  array = "EPIC",
  genome = "hg38",
  ext_site_delta_beta = 0.2,
  min_seeds = 3,
  min_sites = 5,
  max_lookup_dist = 5000,
  max_pval = 0.01,
  njobs = 4
)
Example 2: Broad Region Detection with Relaxed Parameters
dmrs <- buildDMRs(
  beta = beta,
  seeds = seeds,
  pheno = pheno,
  sample_group_col = "Sample_Group",
  min_seeds = 2,
  min_sites = 3,
  max_lookup_dist = 20000,
  max_pval = 0.05,
  njobs = 8
)
Example 3: Empirical P-values for Small Sample Sizes
dmrs <- buildDMRs(
  beta = beta,
  seeds = seeds,
  pheno = pheno,
  sample_group_col = "Sample_Group",
  testing_mode = "empirical",
  empirical_strategy = "montecarlo",
  ntries = 5000,
  mid_p = TRUE,
  njobs = 4
)
Best Practices
- Start with default parameters and adjust based on your specific needs.
- For array data (450K, EPIC), use lower min_sites values (3-5) since site coverage is sparse.
- For WGBS data, keep min_sites higher (50+) to ensure robust regions.
- Avoid heavy pre-filtering of seeds based on effect size. Let CMEnt handle filtering internally.
- Use empirical p-values for small sample sizes (<10 per group) or when normality assumptions are questionable.
- Use parallel processing (njobs > 1) for faster computation, but be mindful of memory requirements.
- Save intermediate results using output_prefix for large analyses.
- Document your configuration by saving parameter settings for reproducibility.
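One lightweight way to document a configuration is to keep the settings in a named list and store it with the results. The sketch below uses only base R; the file name and the choice of parameters are illustrative.

```r
# Collect the settings for a run in one named list.
config <- list(
  sample_group_col = "Sample_Group",
  min_seeds = 3,
  max_lookup_dist = 5000,
  max_pval = 0.01,
  testing_mode = "auto"
)

# Save next to the results; the file name is only an example.
config_path <- file.path(tempdir(), "cment_config.rds")
saveRDS(config, config_path)

# Reload later to check or reuse exactly the same settings.
restored <- readRDS(config_path)
same_settings <- identical(config, restored)
```

A list like this can also be combined with the data arguments and passed through do.call(), so the stored file and the executed call cannot drift apart.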
Troubleshooting
Issue: Out of Memory Errors
Solution:
- Decrease njobs
- Decrease the CMEnt.beta_in_mem_threshold_mb option (default 500) to enable disk-based processing
- Use tabix-indexed files for very large datasets
- Enable caching options
Issue: DMRs Too Small
Solution:
- Increase max_lookup_dist. This will allow seeds that are farther apart to be connected, leading to larger DMRs.
- Increase max_pval. This will make connectivity testing less stringent, allowing more sites to be connected and thus larger DMRs.
- Decrease ext_site_delta_beta. This will allow more sites to be included in DMRs during the second stage of extension, leading to larger DMRs.
Issue: Too Many DMRs
Solution:
- Increase min_seeds. This will require more seeds to be connected to form a DMR, leading to fewer total DMRs.
- Increase min_sites. This will require more sites to be included in a DMR, leading to fewer total DMRs.
- Decrease max_pval. This will make connectivity testing more stringent, leading to fewer connected sites and thus fewer DMRs.
- Increase max_lookup_dist. This will join more seeds into the same DMRs, reducing the total number of DMRs.
- Decrease ext_site_delta_beta. This will allow more sites to be included in DMRs during the second stage of extension, leading to more merging of nearby DMRs and thus fewer total DMRs.
Session Info
## R version 4.6.0 (2026-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Brussels
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] DMRsegaldata_1.0.0 ExperimentHub_3.2.0 AnnotationHub_4.2.0
## [4] BiocFileCache_3.2.0 dbplyr_2.5.2 BiocGenerics_0.58.1
## [7] generics_0.1.4 CMEnt_0.99.0 BiocStyle_2.40.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 jsonlite_2.0.0
## [3] shape_1.4.6.1 magrittr_2.0.5
## [5] farver_2.1.2 rmarkdown_2.31
## [7] GlobalOptions_0.1.4 fs_2.1.0
## [9] BiocIO_1.22.0 ragg_1.5.2
## [11] vctrs_0.7.3 memoise_2.0.1
## [13] Rsamtools_2.28.0 DelayedMatrixStats_1.34.0
## [15] RCurl_1.98-1.19 htmltools_0.5.9
## [17] S4Arrays_1.12.0 lambda.r_1.2.4
## [19] curl_7.1.0 Rhdf5lib_2.0.0
## [21] SparseArray_1.12.2 rhdf5_2.56.0
## [23] strex_2.0.1 sass_0.4.10
## [25] bslib_0.11.0 htmlwidgets_1.6.4
## [27] desc_1.4.3 bsseq_1.48.0
## [29] testthat_3.3.2 httr2_1.2.2
## [31] futile.options_1.0.1 cachem_1.1.0
## [33] GenomicAlignments_1.48.0 lifecycle_1.0.5
## [35] pkgconfig_2.0.3 Matrix_1.7-5
## [37] R6_2.6.1 fastmap_1.2.0
## [39] MatrixGenerics_1.24.0 digest_0.6.39
## [41] colorspace_2.1-2 AnnotationDbi_1.74.0
## [43] S4Vectors_0.50.1 textshaping_1.0.5
## [45] GenomicRanges_1.64.0 RSQLite_3.53.1
## [47] beachmat_2.28.0 filelock_1.0.3
## [49] httr_1.4.8 abind_1.4-8
## [51] compiler_4.6.0 bit64_4.8.2
## [53] withr_3.0.2 backports_1.5.1
## [55] bedr_1.1.5 BiocParallel_1.46.0
## [57] DBI_1.3.0 HDF5Array_1.40.0
## [59] R.utils_2.13.0 rappdirs_0.3.4
## [61] DelayedArray_0.38.2 rjson_0.2.23
## [63] gtools_3.9.5 permute_0.9-10
## [65] tools_4.6.0 otel_0.2.0
## [67] R.oo_1.27.1 glue_1.8.1
## [69] VennDiagram_1.8.2 h5mread_1.4.0
## [71] restfulr_0.0.16 rhdf5filters_1.24.0
## [73] grid_4.6.0 checkmate_2.3.4
## [75] BSgenome_1.80.0 R.methodsS3_1.8.2
## [77] data.table_1.18.4 XVector_0.52.0
## [79] stringr_1.6.0 BiocVersion_3.23.1
## [81] pillar_1.11.1 limma_3.68.4
## [83] circlize_0.4.18 dplyr_1.2.1
## [85] lattice_0.22-9 rtracklayer_1.72.0
## [87] bit_4.6.0 tidyselect_1.2.1
## [89] locfit_1.5-9.12 Biostrings_2.80.1
## [91] knitr_1.51 bookdown_0.46
## [93] IRanges_2.46.0 Seqinfo_1.2.0
## [95] SummarizedExperiment_1.42.0 stats4_4.6.0
## [97] futile.logger_1.4.9 xfun_0.58
## [99] Biobase_2.72.0 statmod_1.5.2
## [101] brio_1.1.5 matrixStats_1.5.0
## [103] DT_0.34.0 stringi_1.8.7
## [105] UCSC.utils_1.8.0 yaml_2.3.12
## [107] evaluate_1.0.5 codetools_0.2-20
## [109] cigarillo_1.2.0 tibble_3.3.1
## [111] BiocManager_1.30.27 cli_3.6.6
## [113] systemfonts_1.3.2 jquerylib_0.1.4
## [115] dichromat_2.0-0.1 Rcpp_1.1.1-1.1
## [117] GenomeInfoDb_1.48.0 png_0.1-9
## [119] XML_3.99-0.23 parallel_4.6.0
## [121] pkgdown_2.2.0 blob_1.3.0
## [123] sparseMatrixStats_1.24.0 bitops_1.0-9
## [125] scales_1.4.0 purrr_1.2.2
## [127] crayon_1.5.3 rlang_1.2.0
## [129] KEGGREST_1.52.0 formatR_1.14
