Generate synthetic samples while preserving both per-site coverage and
methylation marginals and the local correlation structure of neighboring sites.
Coverage and methylation are first fit with per-site Poisson/Beta marginals,
then synthetic samples are drawn through chromosome-local Gaussian copula
fields whose dependence is estimated from adjacent sites in the observed data.
For reproducible synthetic samples, call set.seed() before augmentBSSeq().
Usage
augmentBSSeq(
bs,
n_new_samples,
min_samples = 2,
calibrate_correlation = TRUE,
calibration_iterations = 8,
calibration_samples = NULL
)Arguments
- bs
A BSseq object
- n_new_samples
Number of new synthetic samples to generate
- min_samples
Minimum number of samples with coverage required per site
- calibrate_correlation
Logical. If
TRUE, iteratively adjusts the latent Gaussian length scales so adjacent-site correlations are matched after transforming back through the observed coverage and methylation sampling layers.- calibration_iterations
Maximum number of bisection iterations used for each correlation calibration.
- calibration_samples
Number of synthetic samples used internally for correlation calibration. If
NULL, a capped conservative default is chosen from the input and requested output sample sizes.
Details
The input is first sorted, duplicate loci are collapsed by summing
methylated and coverage counts, and sites are retained only when at least
min_samples samples have positive coverage. Synthetic coverage and
methylation are then modeled separately, but with genomic dependence restored
through latent Gaussian fields.
For coverage, each site i receives a Poisson mean
lambda_i = mean(C_ij | C_ij > 0), with a fallback of 1. A chromosome-local
latent Gaussian field Z^C is converted to uniforms by Phi(Z^C_i) and then
to coverage counts by qpois(Phi(Z^C_i), lambda_i), with a lower bound of 1
for synthetic samples.
For methylation, covered observations are smoothed as
p_ij = (M_ij + 0.5) / (C_ij + 1). The site mean is estimated with the same
Jeffreys-style pseudocounts, and the across-sample variance of p_ij is
converted to a Beta concentration
kappa_i = p_i * (1 - p_i) / var(p_ij) - 1. Site concentrations are shrunk
toward an empirical prior, then bounded to avoid extreme under- or
over-dispersion. This gives alpha_i = p_i * kappa_i and
beta_i = (1 - p_i) * kappa_i, with small positive lower bounds.
Local correlation is represented by an exponential Gaussian copula. For
adjacent sites separated by gap g, the latent correlation is
phi = exp(-g / ell), so within each chromosome
Z_i = phi_i * Z_{i-1} + sqrt(1 - phi_i^2) * epsilon_i. Separate length
scales ell are estimated for coverage from adjacent-site correlations of
log1p(C) and for methylation from adjacent-site correlations of
qlogis(p_ij).
If calibrate_correlation = TRUE, the initial length scales are refined by
bisection against the observed median adjacent-site correlation after passing
simulated latent pairs through the same Poisson, Beta, and Binomial sampling
layers used for final augmentation. This compensates for correlation
attenuation caused by the marginal sampling steps.
Final synthetic samples are generated as
C_i ~ Poisson(lambda_i), theta_i ~ Beta(alpha_i, beta_i), and
M_i ~ Binomial(C_i, theta_i), with both C_i and theta_i driven by their
calibrated chromosome-local latent fields. The returned object combines the
retained original sites and samples with the synthetic samples.
Examples
if (requireNamespace("bsseqData", quietly = TRUE)) {
data(BS.cancer.ex, package = "bsseqData")
BS.cancer.ex <- BS.cancer.ex[seq_len(100), ]
# \donttest{
set.seed(123)
augmented_bs <- augmentBSSeq(BS.cancer.ex, n_new_samples = 2)
# }
}