Skip to contents

Generate synthetic samples while preserving both per-site coverage and methylation marginals and the local correlation structure of neighboring sites. Coverage and methylation are first fit with per-site Poisson/Beta marginals, then synthetic samples are drawn through chromosome-local Gaussian copula fields whose dependence is estimated from adjacent sites in the observed data. For reproducible synthetic samples, call set.seed() before augmentBSSeq().

Usage

augmentBSSeq(
  bs,
  n_new_samples,
  min_samples = 2,
  calibrate_correlation = TRUE,
  calibration_iterations = 8,
  calibration_samples = NULL
)

Arguments

bs

A BSseq object

n_new_samples

Number of new synthetic samples to generate

min_samples

Minimum number of samples with coverage required per site

calibrate_correlation

Logical. If TRUE, iteratively adjusts the latent Gaussian length scales so adjacent-site correlations are matched after transforming back through the observed coverage and methylation sampling layers.

calibration_iterations

Maximum number of bisection iterations used for each correlation calibration.

calibration_samples

Number of synthetic samples used internally for correlation calibration. If NULL, a capped conservative default is chosen from the input and requested output sample sizes.

Value

A BSseq object with original and synthetic samples

Details

The input is first sorted, duplicate loci are collapsed by summing methylated and coverage counts, and sites are retained only when at least min_samples samples have positive coverage. Synthetic coverage and methylation are then modeled separately, but with genomic dependence restored through latent Gaussian fields.

For coverage, each site i receives a Poisson mean lambda_i = mean(C_ij | C_ij > 0), with a fallback of 1. A chromosome-local latent Gaussian field Z^C is converted to uniforms by Phi(Z^C_i) and then to coverage counts by qpois(Phi(Z^C_i), lambda_i), with a lower bound of 1 for synthetic samples.

For methylation, covered observations are smoothed as p_ij = (M_ij + 0.5) / (C_ij + 1). The site mean is estimated with the same Jeffreys-style pseudocounts, and the across-sample variance of p_ij is converted to a Beta concentration kappa_i = p_i * (1 - p_i) / var(p_ij) - 1. Site concentrations are shrunk toward an empirical prior, then bounded to avoid extreme under- or over-dispersion. This gives alpha_i = p_i * kappa_i and beta_i = (1 - p_i) * kappa_i, with small positive lower bounds.

Local correlation is represented by an exponential Gaussian copula. For adjacent sites separated by gap g, the latent correlation is phi = exp(-g / ell), so within each chromosome Z_i = phi_i * Z_{i-1} + sqrt(1 - phi_i^2) * epsilon_i. Separate length scales ell are estimated for coverage from adjacent-site correlations of log1p(C) and for methylation from adjacent-site correlations of qlogis(p_ij).

If calibrate_correlation = TRUE, the initial length scales are refined by bisection against the observed median adjacent-site correlation after passing simulated latent pairs through the same Poisson, Beta, and Binomial sampling layers used for final augmentation. This compensates for correlation attenuation caused by the marginal sampling steps.

Final synthetic samples are generated as C_i ~ Poisson(lambda_i), theta_i ~ Beta(alpha_i, beta_i), and M_i ~ Binomial(C_i, theta_i), with both C_i and theta_i driven by their calibrated chromosome-local latent fields. The returned object combines the retained original sites and samples with the synthetic samples.

Examples

if (requireNamespace("bsseqData", quietly = TRUE)) {
    data(BS.cancer.ex, package = "bsseqData")
    BS.cancer.ex <- BS.cancer.ex[seq_len(100), ]
    # \donttest{
    set.seed(123)
    augmented_bs <- augmentBSSeq(BS.cancer.ex, n_new_samples = 2)
    # }
}