Add Complementary Classification Scores to DMRs

Scores Differentially Methylated Regions (DMRs) based on their ability to discriminate between sample groups using cross-validated Support Vector Machine (SVM) classification. For each DMR, this function performs stratified k-fold cross-prediction using an RBF kernel SVM and computes a margin-sensitive classification score based on decision values, which serves as a complementary measure of the DMR's discriminative power. Use this score alongside DMR-level pval, qval, and effect-size columns rather than as a replacement for statistical evidence. The scores are then smoothed along the genome using a Gaussian-kNN approach, and piecewise-linear segments are detected using the PELT algorithm, expecting a rising->plateau->decreasing pattern. Finally, DMRs are assigned to localized blocks based on the smoothed score profiles and specified gap rules.

Usage

scoreDMRs(
  dmrs,
  beta,
  pheno,
  covariates = NULL,
  genome = "hg38",
  array = "450K",
  sorted_locs = NULL,
  sample_group_col = "Sample_Group",
  block_gap_mode = c("adaptive", "fixed", "none"),
  block_gap_fixed_bp = NULL,
  block_gap_quantile = 0.95,
  block_gap_multiplier = 1.5,
  block_gap_min_bp = 2500,
  block_gap_max_bp = 50000,
  njobs = getOption("CMEnt.njobs", .defaultNJobs()),
  verbose = getOption("CMEnt.verbose", 1L)
)

Arguments

dmrs: Data frame or GRanges object containing DMR coordinates and metadata
beta: Beta values, supplied either as a character path to a beta value file, tabix file, or bed file, or as an in-memory beta matrix or BetaHandler object
pheno: Data frame. Phenotype data containing sample group information
covariates: Character vector of covariate columns in pheno to regress out before scoring. Default is NULL.
genome: Character. Genome version (e.g., "hg38", "hg19", "hs1", "mm10"). Default is "hg38"
array: Character. Array platform type (e.g., "450K", "EPIC", "EPICv2"). Default is "450K"
sorted_locs: Data frame. Optional pre-computed sorted genomic locations. Default is NULL
sample_group_col: Character. Column name in pheno containing sample group information. Default is "Sample_Group"
block_gap_mode: Character. Distance rule for block construction: "adaptive" (default), "fixed", or "none".
block_gap_fixed_bp: Numeric. Maximum allowed midpoint gap (bp) when block_gap_mode = "fixed". Ignored otherwise.
block_gap_quantile: Numeric in (0, 1). Quantile of chromosome DMR midpoint gaps used in adaptive thresholding. Default is 0.95.
block_gap_multiplier: Numeric > 0. Multiplier applied to the adaptive gap quantile. Default is 1.5.
block_gap_min_bp: Numeric >= 0. Lower clamp for adaptive gap threshold (bp). Default is 2500.
block_gap_max_bp: Numeric >= block_gap_min_bp. Upper clamp for adaptive gap threshold (bp). Default is 50000.
njobs: Integer. Number of parallel jobs used for cross-validated scoring. Default comes from getOption("CMEnt.njobs"), falling back to .defaultNJobs() when the option is unset.
verbose: Numeric. Logging verbosity level. Default comes from getOption("CMEnt.verbose"), falling back to 1 when the option is unset.
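
The adaptive gap arguments can be read as a small formula: take a quantile of the midpoint gaps on a chromosome, scale it, and clamp the result. The sketch below illustrates that reading in base R. The helper adaptive_gap_threshold() is hypothetical and not part of CMEnt, and the exact way the package combines these arguments is an assumption; the defaults mirror the Usage signature.

```r
# Hypothetical helper, NOT part of CMEnt: one plausible reading of the
# adaptive rule (gap quantile x multiplier, clamped to [min_bp, max_bp]).
adaptive_gap_threshold <- function(midpoints,
                                   gap_quantile = 0.95,
                                   gap_multiplier = 1.5,
                                   min_bp = 2500,
                                   max_bp = 50000) {
  stopifnot(length(midpoints) >= 2)
  gaps <- diff(sort(midpoints))                       # gaps between adjacent DMRs
  raw <- unname(quantile(gaps, probs = gap_quantile)) * gap_multiplier
  min(max(raw, min_bp), max_bp)                       # clamp to the allowed range
}

# Toy DMR midpoints on one chromosome (already sorted)
midpoints <- c(1000, 3000, 6000, 10000, 200000)
threshold <- adaptive_gap_threshold(midpoints)

# Start a new block wherever the gap to the previous DMR exceeds the threshold
block <- cumsum(c(TRUE, diff(midpoints) > threshold))
print(threshold)
print(block)
```

In this toy input the single very large gap pushes the scaled quantile above the upper clamp, so the threshold settles at max_bp and only the last DMR is split into its own block.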

Value

GRanges object with DMRs ordered by complementary classification score and additional metadata columns:

score: Margin-sensitive cross-validated classification score for the DMR
cv_accuracy: Raw cross-validated classification accuracy for the DMR
score_smoothed: Gaussian-kNN smoothed score trajectory per chromosome
segment_id: Piecewise-linear segment index estimated with PELT
segment_slope: Estimated slope of the segment that each DMR belongs to
block_id: Localized DMR block label (NA for DMRs not assigned to a block)
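
As a rough illustration of what a Gaussian-kNN smoothed trajectory such as score_smoothed might look like, the base R sketch below averages each DMR's score with its k nearest neighbours along the chromosome, weighted by a Gaussian kernel. The function gaussian_knn_smooth(), the choice of k, and the bandwidth rule (distance to the farthest of the k neighbours) are all assumptions made for this example and may differ from what scoreDMRs() does internally.

```r
# Hypothetical helper, NOT CMEnt's implementation: Gaussian-weighted mean over
# the k nearest DMRs (by midpoint distance), including the DMR itself.
gaussian_knn_smooth <- function(pos, score, k = 3L) {
  k <- min(k, length(pos))
  vapply(seq_along(pos), function(i) {
    d <- abs(pos - pos[i])
    nn <- order(d)[seq_len(k)]         # indices of the k nearest DMRs
    bw <- max(d[nn], 1)                # assumed bandwidth: farthest neighbour
    w <- exp(-0.5 * (d[nn] / bw)^2)    # Gaussian kernel weights
    sum(w * score[nn]) / sum(w)
  }, numeric(1))
}

pos <- c(100, 200, 300, 400, 500)      # toy DMR midpoints (bp)
score <- c(0.5, 0.5, 1.0, 0.5, 0.5)    # one isolated high-scoring DMR
smoothed <- gaussian_knn_smooth(pos, score)
print(round(smoothed, 3))
```

The isolated peak is pulled toward its neighbours while staying above them, which is the behaviour one would want before fitting piecewise-linear segments to the profile.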

Details

The function uses stratified k-fold cross-prediction to ensure balanced representation of sample groups in each fold. The number of folds can be controlled using the option "CMEnt.scoring_nfold" (default is 5). An RBF (Radial Basis Function) kernel SVM is trained on the beta values of the sites within each DMR. Because fold assignment is random, call set.seed() before scoreDMRs() to obtain reproducible folds.
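
To make the fold settings concrete, the sketch below sets the fold-count option named above and shows one simple way to build stratified folds in base R. The helper stratified_folds() is hypothetical and only illustrates the idea of stratification; it is not CMEnt's internal fold assignment.

```r
# Option name taken from the text above; 4 is an arbitrary example value
options(CMEnt.scoring_nfold = 4)

# Hypothetical helper, NOT CMEnt's internal code: spread each group's samples
# as evenly as possible across k folds.
stratified_folds <- function(groups, k = 5L) {
  folds <- integer(length(groups))
  for (g in unique(groups)) {
    idx <- which(groups == g)
    idx <- idx[sample.int(length(idx))]              # shuffle within the group
    folds[idx] <- rep_len(seq_len(k), length(idx))   # deal out to folds 1..k
  }
  folds
}

set.seed(42)   # fixes the shuffle, and therefore the fold assignment
groups <- rep(c("case", "control"), times = c(12, 8))
folds <- stratified_folds(groups, k = getOption("CMEnt.scoring_nfold"))
print(table(groups, folds))   # every fold holds 3 cases and 2 controls
```

Re-running the last three lines with the same seed reproduces the same folds, which is why seeding before scoreDMRs() matters when scores need to be compared across runs.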

The score combines classification correctness and margin confidence, making it more sensitive than plain cross-validated accuracy when many DMRs classify perfectly. It is a complementary ranking and diagnostic measure, especially useful for sample-level separation. The cv_accuracy column stores the raw cross-validated accuracy for reference. Blocks are detected from smoothed score profiles and split at large midpoint gaps using the selected block_gap_mode.
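
The difference between a margin-sensitive score and plain accuracy can be shown with a toy calculation. The formula in margin_score() below is an assumption chosen for illustration (signed SVM decision values clipped to [-1, 1] and rescaled to [0, 1]); the exact definition used by scoreDMRs() is not given here and may differ.

```r
# Hypothetical scoring rule, NOT necessarily the one used by scoreDMRs().
# y: true labels coded as -1 / +1; decision: held-out SVM decision values.
margin_score <- function(y, decision) {
  signed_margin <- y * decision                 # > 0 means correctly classified
  clipped <- pmax(-1, pmin(1, signed_margin))   # cap confidence at the margin
  mean((clipped + 1) / 2)                       # rescale to [0, 1]
}
cv_accuracy <- function(y, decision) mean(y * decision > 0)

y <- c(1, 1, -1, -1)
confident <- c(1.8, 1.2, -1.5, -2.0)    # all samples far from the boundary
borderline <- c(0.2, 0.1, -0.3, -0.2)   # all correct, but only just

print(c(cv_accuracy(y, confident), cv_accuracy(y, borderline)))   # 1 and 1
print(c(margin_score(y, confident), margin_score(y, borderline))) # 1 and 0.6
```

Both toy DMRs classify every sample correctly, so accuracy cannot rank them, whereas the margin-based score separates the confident DMR from the borderline one.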

Examples

# Load example data
loadExampleInputDataChr5And11()

# Load pre-computed DMRs
dmrs <- readRDS(system.file("extdata", "example_outputChr5And11.rds", package = "CMEnt"))

# Score the first DMR against the example beta values and phenotype data
scoring_dmrs <- scoreDMRs(
    dmrs = dmrs[1],
    beta = beta,
    pheno = pheno,
    sample_group_col = "Sample_Group"
)