Skip to contents

This simulator is inspired by dmrseq::simDMRs(), but adds differential signal on the logit methylation scale. The same regional perturbation engine is used across methylation assays by operating on methylated-count and coverage-count representations. For microarray-style beta values, pseudo counts are constructed first, then the same simulation mechanism is applied.

Usage

simulateDMRs(
  beta,
  num_dmrs = 3000L,
  delta_max0 = 0.4,
  groups = NULL,
  case_group = NULL,
  samplesheet = NULL,
  samplesheet_sep = "\t",
  sample_group_col = "Sample_Group",
  covariates = NULL,
  max_gap = 500L,
  min_sites = 5L,
  max_sites = 500L,
  truth_min_delta_beta = 0,
  delta_jitter = 1/3,
  expected_correlation = 0.7,
  neighbor_window = 5L,
  profile_degree = 4L,
  flank_fraction = 0.2,
  resample_counts = TRUE,
  array = c("450K", "27K", "EPIC", "EPICv2"),
  genome = c("hg38", "hg19", "hs1", "mm10", "mm39"),
  sorted_locs = NULL,
  beta_row_names_file = NULL,
  chrom_col = "#chrom",
  start_col = "start",
  njobs = getOption("CMEnt.njobs", .defaultNJobs()),
  verbose = getOption("CMEnt.verbose", 1)
)

Arguments

beta

A BSseq object, a BetaHandler object, a beta matrix/data frame, a path to a beta/tabix file, or a path to an .rds file containing a BSseq object.

num_dmrs

Number of DMRs to spike in.

delta_max0

Baseline maximum beta-scale effect size near the center of each DMR.

groups

Optional sample group vector. If NULL, the first half of samples are assigned to Condition1 and the remaining samples to Condition2.

case_group

Group receiving the differential shift. Defaults to the second group level.

samplesheet

Optional data frame or path to a tab-delimited sample sheet with covariates. Row names, a Sample_ID column, or an unnamed first column must identify samples.

samplesheet_sep

Separator for samplesheet files. Default is tab.

sample_group_col

Name of the phenotype column added to the returned samplesheet. Values are "case" for perturbed samples and "control" for all other samples.

covariates

Optional covariate columns in samplesheet to regress out on the logit methylation scale before DMRs are simulated. The per-site baseline is retained.

max_gap

Maximum gap, in bp, used to form candidate site segments.

min_sites

Minimum number of sites per candidate DMR segment.

max_sites

Maximum number of sites per candidate DMR segment.

truth_min_delta_beta

Minimum intended beta-scale perturbation for a site to define the reported truth interval. The default, 0, reports the full selected segment so simulated flanks are not treated as false positives during benchmarking.

delta_jitter

Width of the random effect-size jitter around delta_max0.

expected_correlation

Minimum within-DMR expected correlation target. The per-DMR target is the larger of this value and the estimated local background correlation, capped to the interval [0, 0.99].

neighbor_window

Number of neighboring sites used to smooth the index-based correlated random effect inside each DMR. A value of 5 uses up to two upstream and two downstream neighboring sites.

profile_degree

Degree used by the triweight profile.

flank_fraction

Fraction of the selected segment width added on both sides before evaluating the triweight profile.

resample_counts

If TRUE, methylated counts in touched regions are redrawn from shifted probabilities and coverage. If FALSE, counts are rounded deterministically.

array

Array platform type. Used only when beta is not a BSseq object or a self-contained BetaHandler.

genome

Reference genome. Used only when beta is not a BSseq object or a self-contained BetaHandler.

sorted_locs

Optional genomic locations with chr, start, and optionally end columns. Used only for non-BSseq inputs.

beta_row_names_file

Optional file with beta row names. Used only for non-BSseq file inputs.

chrom_col

Chromosome column name for tabix inputs.

start_col

Start column name for tabix inputs.

njobs

Number of parallel jobs for loading non-BSseq inputs.

verbose

Numeric. Logging verbosity level. Default comes from getOption("CMEnt.verbose").

Value

A list with simulated output (simulated), optional genomic locations for non-BSseq inputs (beta_locs), phenotype data (pheno), and dmrseq-like metadata: gr.dmrs, dmr.mncov, dmr.L, delta, truth, selected_regions, groups, and case_group. For reproducible simulations, call set.seed() before simulateDMRs().

Details

All inputs are first represented as methylated counts M and coverage counts C, with beta values estimated as (M + 0.5) / (C + 1). Non-BSseq beta matrices are converted to pseudo-counts with fixed coverage 100. If covariates are supplied, their linear effects are removed on the logit methylation scale while preserving each site's mean signal.

Candidate regions are contiguous genomic site segments: adjacent sites remain in the same segment when they are on the same chromosome and no more than max_gap bp apart. Eligible segments have between min_sites and max_sites sites. Segment sampling is mildly weighted toward regions whose median methylation is near 0.5, reducing floor and ceiling effects.

For each selected DMR, a smooth regional effect profile is evaluated over the selected positions after expanding the segment by flank_fraction on both sides. With center c, expanded width W, and d = profile_degree, the site weight is w_i = (1 - min(|pos_i - c| / (W / 2), 1)^d)^d. A signed maximum effect is then drawn as s * delta_max * w_i, where s is chosen as hyper- or hypomethylating and is flipped when needed to avoid the beta-scale boundary.

Case-sample probabilities are shifted through the logit scale. For baseline probability p_i, intended beta-scale shift a_i, and correlated site-level noise e_i, the simulated case probability is plogis(qlogis(p_i) + qlogis(clamp(p_i + a_i)) - qlogis(p_i) + e_i). The noise term is generated by smoothing independent normal variates across neighboring sites defined by neighbor_window. Its standard deviation is calibrated from a small lookup table so that the induced adjacent-site correlation approaches the larger of expected_correlation and the fitted local background correlation, capped at 0.99.

Finally, touched methylated counts are regenerated from the shifted probabilities and original coverage. With resample_counts = TRUE, counts are drawn as Binomial(C, p); otherwise they are deterministically rounded as round(C * p). Reported truth intervals are based on the intended beta-scale perturbation and truth_min_delta_beta, with a fallback to at least min_sites strongest sites per DMR.

Examples

beta <- matrix(seq(0.1, 0.9, length.out = 48), nrow = 12)
rownames(beta) <- paste0("cg", seq_len(nrow(beta)))
colnames(beta) <- paste0("sample", seq_len(ncol(beta)))
locs <- data.frame(
    chr = "chr1",
    start = seq(100, by = 100, length.out = nrow(beta)),
    row.names = rownames(beta)
)
set.seed(1)
simulated <- simulateDMRs(
    beta, num_dmrs = 1, min_sites = 5, max_sites = 20,
    sorted_locs = locs, verbose = 0
)