Read and Process Custom Methylation BED Data

Reads methylation data from a custom BED file, converts it to a tabix-indexed file for efficient random access, and builds a genomic location index. The function handles custom methylation array data or sequencing-based methylation data in BED format, so that either can be used in the CMEnt workflow.

Usage

readCustomMethylationBedData(
  bed_file,
  pheno,
  genome = "hg38",
  chrom_col = "#chrom",
  start_col = "start",
  output_dir = NULL,
  chunk_size = 50000,
  output_prefix = NULL
)

Arguments

bed_file: Character. Path to the input BED file containing methylation data. The file should have chromosome and position columns, plus one column of methylation values per sample. Can be gzipped. Required; there is no default
pheno: Data frame. Phenotype data with sample IDs as rownames. Only samples present in both the pheno rownames and BED file header will be processed
genome: Character. Genome version to use (e.g., "hg38", "hg19", "hs1") (default: "hg38")
chrom_col: Character. Name of the chromosome column in the BED file (default: "#chrom")
start_col: Character. Name of the start position column in the BED file (default: "start")
output_dir: Character. Directory for caching processed files. If NULL, uses a temporary working directory unless output_prefix is provided (default: NULL)
chunk_size: Integer. Number of rows to process in each chunk for memory efficiency (default: 50000)
output_prefix: Character. Optional prefix used to persist derived BED/tabix artifacts next to analysis outputs (default: NULL)
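
The sample matching described for pheno can be illustrated with a minimal sketch. It assumes the header is tab-separated and that every column other than chrom_col and start_col is a sample column; the variable names are hypothetical and this is not the package's internal code.

```r
# Hypothetical inputs: a BED header line and a phenotype data frame.
header_line <- "#chrom\tstart\tSample1\tSample2\tSample3"
pheno <- data.frame(
    sample_group = c("case", "control"),
    row.names = c("Sample1", "Sample4")
)

# Split the header and drop the coordinate columns (assumed layout).
header <- strsplit(header_line, "\t", fixed = TRUE)[[1]]
sample_cols <- setdiff(header, c("#chrom", "start"))

# Keep only samples present in both the pheno rownames and the header.
shared_samples <- intersect(rownames(pheno), sample_cols)
shared_samples
```

Here Sample4 has no column in the file and Sample2 and Sample3 have no phenotype row, so only Sample1 would be carried forward.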

Value

A list with two elements:

tabix_file: Character path to the created tabix-indexed BED file
locations: Disk-backed genomic location registry

Details

The function performs the following workflow:

Validates that tabix and bgzip are available in the system PATH
Checks the BED file header for required columns and sample IDs
Processes the BED file in chunks to minimize memory usage
Normalizes each record to the standard BED6 column layout (#chrom, start, end, id, score, strand)
Converts chromosomes to integer factors for efficient sorting
Creates a tabix-indexed compressed file for fast random access
Persists derived artifacts under output_prefix when provided
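
The normalization and sorting steps above can be sketched for a single chunk as follows. This is an illustration under stated assumptions, not the function's actual implementation: the end coordinate (start + 1), the id scheme (chrom:start), the placeholder score and strand values, and the chromosome ordering are all hypothetical choices.

```r
# A small unsorted chunk, as it might look after parsing (hypothetical data).
chunk <- data.frame(
    chrom = c("chr2", "chr1", "chr1"),
    start = c(50L, 200L, 100L),
    Sample1 = c(0.5, 0.3, 0.2),
    Sample2 = c(0.4, 0.7, 0.8)
)

# Assumed chromosome ordering; mapping names to integers via a factor
# gives a natural sort (chr2 before chr10) instead of a lexical one.
chrom_levels <- paste0("chr", c(1:22, "X", "Y", "M"))
chrom_index <- as.integer(factor(chunk$chrom, levels = chrom_levels))

# Build BED6 columns followed by the sample columns.
bed6 <- data.frame(
    chrom = chunk$chrom,
    start = chunk$start,
    end = chunk$start + 1L,                           # assumption: single-base sites
    id = paste(chunk$chrom, chunk$start, sep = ":"),  # assumption: id scheme
    score = 0L,                                       # placeholder
    strand = ".",                                     # placeholder: strand unknown
    chunk[, c("Sample1", "Sample2")]
)

# Sort by chromosome index, then by start position, as tabix requires
# position-sorted input.
bed6 <- bed6[order(chrom_index, bed6$start), ]
rownames(bed6) <- NULL
bed6
```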

Requirements

This function requires the tabix and bgzip command-line tools to be installed and available in the system PATH. Both tools are distributed with HTSlib, the library that underlies samtools.
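
A quick way to check this requirement before calling the function is sketched below using base R's Sys.which(). The variable names are hypothetical, and the function's own validation may report missing tools differently.

```r
# Look up both tools on the PATH; Sys.which() returns "" for any not found.
required_tools <- c("tabix", "bgzip")
tool_paths <- Sys.which(required_tools)
missing_tools <- required_tools[!nzchar(tool_paths)]

if (length(missing_tools) > 0) {
    message("Missing from PATH: ", paste(missing_tools, collapse = ", "))
} else {
    message("tabix and bgzip are available")
}
```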

Memory Management

The function processes the BED file in chunks of chunk_size rows, so large files are never loaded into memory in full. Genomic locations are stored in a disk-backed Registry object, which allows the location index to grow beyond available RAM.
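
The chunked reading pattern can be sketched with base R connections. This is a minimal illustration of the idea, assuming a tab-separated file with a single header line; it does not reproduce the function's internals, and the tiny chunk_size is chosen only to force more than one chunk.

```r
# Write a small example file (hypothetical data).
bed_file <- tempfile(fileext = ".bed")
writeLines(c(
    "#chrom\tstart\tSample1\tSample2",
    "chr1\t100\t0.2\t0.8",
    "chr1\t200\t0.3\t0.7",
    "chr2\t50\t0.5\t0.4"
), bed_file)

chunk_size <- 2L
con <- file(bed_file, open = "r")

# Read the header once, then pull chunk_size lines at a time.
header <- strsplit(readLines(con, n = 1L), "\t", fixed = TRUE)[[1]]
rows_seen <- 0L
n_chunks <- 0L
repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    chunk <- read.table(text = lines, sep = "\t", col.names = header,
                        check.names = FALSE, stringsAsFactors = FALSE)
    # Per-chunk work (normalization, writing output) would go here.
    rows_seen <- rows_seen + nrow(chunk)
    n_chunks <- n_chunks + 1L
}
close(con)
```

Only one chunk is held in memory at a time, so peak memory scales with chunk_size rather than with the file size.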

Examples

# Create a simple phenotype data frame
pheno <- data.frame(
    sample_group = c("case", "control"),
    row.names = c("Sample1", "Sample2")
)

if (nzchar(Sys.which("tabix")) && nzchar(Sys.which("bgzip"))) {
    bed_file <- tempfile(fileext = ".bed")
    writeLines(c(
        "#chrom\tstart\tSample1\tSample2",
        "chr1\t100\t0.2\t0.8",
        "chr1\t200\t0.3\t0.7"
    ), bed_file)
    result <- readCustomMethylationBedData(bed_file, pheno)
    result$tabix_file
}