
Simulate DMRs while preserving local methylation structure
Source:R/simulate_dmrs.R
simulateDMRs.RdThis simulator is inspired by dmrseq::simDMRs(), but adds differential
signal on the logit methylation scale. The same regional perturbation engine
is used across methylation assays by operating on methylated-count and
coverage-count representations. For microarray-style beta values, pseudo
counts are constructed first, then the same simulation mechanism is applied.
Usage
simulateDMRs(
beta,
num_dmrs = 3000L,
delta_max0 = 0.4,
groups = NULL,
case_group = NULL,
samplesheet = NULL,
samplesheet_sep = "\t",
sample_group_col = "Sample_Group",
covariates = NULL,
max_gap = 500L,
min_sites = 5L,
max_sites = 500L,
truth_min_delta_beta = 0,
delta_jitter = 1/3,
expected_correlation = 0.7,
neighbor_window = 5L,
profile_degree = 4L,
flank_fraction = 0.2,
resample_counts = TRUE,
array = c("450K", "27K", "EPIC", "EPICv2"),
genome = c("hg38", "hg19", "hs1", "mm10", "mm39"),
sorted_locs = NULL,
beta_row_names_file = NULL,
chrom_col = "#chrom",
start_col = "start",
njobs = getOption("CMEnt.njobs", .defaultNJobs()),
verbose = getOption("CMEnt.verbose", 1)
)Arguments
- beta
A BSseq object, a BetaHandler object, a beta matrix/data frame, a path to a beta/tabix file, or a path to an
.rdsfile containing a BSseq object.- num_dmrs
Number of DMRs to spike in.
- delta_max0
Baseline maximum beta-scale effect size near the center of each DMR.
- groups
Optional sample group vector. If
NULL, the first half of samples are assigned toCondition1and the remaining samples toCondition2.- case_group
Group receiving the differential shift. Defaults to the second group level.
- samplesheet
Optional data frame or path to a tab-delimited sample sheet with covariates. Row names, a
Sample_IDcolumn, or an unnamed first column must identify samples.- samplesheet_sep
Separator for samplesheet files. Default is tab.
- sample_group_col
Name of the phenotype column added to the returned samplesheet. Values are
"case"for perturbed samples and"control"for all other samples.- covariates
Optional covariate columns in
samplesheetto regress out on the logit methylation scale before DMRs are simulated. The per-site baseline is retained.- max_gap
Maximum gap, in bp, used to form candidate site segments.
- min_sites
Minimum number of sites per candidate DMR segment.
- max_sites
Maximum number of sites per candidate DMR segment.
- truth_min_delta_beta
Minimum intended beta-scale perturbation for a site to define the reported truth interval. The default,
0, reports the full selected segment so simulated flanks are not treated as false positives during benchmarking.- delta_jitter
Width of the random effect-size jitter around
delta_max0.- expected_correlation
Minimum within-DMR expected correlation target. The per-DMR target is the larger of this value and the estimated local background correlation, capped to the interval
[0, 0.99].- neighbor_window
Number of neighboring sites used to smooth the index-based correlated random effect inside each DMR. A value of
5uses up to two upstream and two downstream neighboring sites.- profile_degree
Degree used by the triweight profile.
- flank_fraction
Fraction of the selected segment width added on both sides before evaluating the triweight profile.
- resample_counts
If
TRUE, methylated counts in touched regions are redrawn from shifted probabilities and coverage. IfFALSE, counts are rounded deterministically.- array
Array platform type. Used only when
betais not a BSseq object or a self-contained BetaHandler.- genome
Reference genome. Used only when
betais not a BSseq object or a self-contained BetaHandler.- sorted_locs
Optional genomic locations with
chr,start, and optionallyendcolumns. Used only for non-BSseq inputs.- beta_row_names_file
Optional file with beta row names. Used only for non-BSseq file inputs.
- chrom_col
Chromosome column name for tabix inputs.
- start_col
Start column name for tabix inputs.
- njobs
Number of parallel jobs for loading non-BSseq inputs.
- verbose
Numeric. Logging verbosity level. Default comes from
getOption("CMEnt.verbose").
Value
A list with simulated output (simulated), optional genomic
locations for non-BSseq inputs (beta_locs), phenotype data (pheno),
and dmrseq-like metadata:
gr.dmrs, dmr.mncov, dmr.L, delta, truth, selected_regions,
groups, and case_group. For reproducible simulations, call
set.seed() before simulateDMRs().
Details
All inputs are first represented as methylated counts M and coverage
counts C, with beta values estimated as (M + 0.5) / (C + 1). Non-BSseq
beta matrices are converted to pseudo-counts with fixed coverage 100. If
covariates are supplied, their linear effects are removed on the logit
methylation scale while preserving each site's mean signal.
Candidate regions are contiguous genomic site segments: adjacent sites remain
in the same segment when they are on the same chromosome and no more than
max_gap bp apart. Eligible segments have between min_sites and
max_sites sites. Segment sampling is mildly weighted toward regions whose
median methylation is near 0.5, reducing floor and ceiling effects.
For each selected DMR, a smooth regional effect profile is evaluated over the
selected positions after expanding the segment by flank_fraction on both
sides. With center c, expanded width W, and d = profile_degree, the
site weight is
w_i = (1 - min(|pos_i - c| / (W / 2), 1)^d)^d. A signed maximum effect is
then drawn as s * delta_max * w_i, where s is chosen as hyper- or
hypomethylating and is flipped when needed to avoid the beta-scale boundary.
Case-sample probabilities are shifted through the logit scale. For baseline
probability p_i, intended beta-scale shift a_i, and correlated
site-level noise e_i, the simulated case probability is
plogis(qlogis(p_i) + qlogis(clamp(p_i + a_i)) - qlogis(p_i) + e_i).
The noise term is generated by smoothing independent normal variates across
neighboring sites defined by neighbor_window. Its standard deviation is
calibrated from a small lookup table so that the induced adjacent-site
correlation approaches the larger of expected_correlation and the fitted
local background correlation, capped at 0.99.
Finally, touched methylated counts are regenerated from the shifted
probabilities and original coverage. With resample_counts = TRUE, counts
are drawn as Binomial(C, p); otherwise they are deterministically rounded
as round(C * p). Reported truth intervals are based on the intended
beta-scale perturbation and truth_min_delta_beta, with a fallback to at
least min_sites strongest sites per DMR.
Examples
beta <- matrix(seq(0.1, 0.9, length.out = 48), nrow = 12)
rownames(beta) <- paste0("cg", seq_len(nrow(beta)))
colnames(beta) <- paste0("sample", seq_len(ncol(beta)))
locs <- data.frame(
chr = "chr1",
start = seq(100, by = 100, length.out = nrow(beta)),
row.names = rownames(beta)
)
set.seed(1)
simulated <- simulateDMRs(
beta, num_dmrs = 1, min_sites = 5, max_sites = 20,
sorted_locs = locs, verbose = 0
)