Reads methylation data from a custom BED file format, converts it to a tabix-indexed format for efficient random access, and creates genomic location indices. This function is designed to handle custom methylation array data or sequencing-based methylation data in BED format, making it compatible with the CMEnt workflow.

Usage

readCustomMethylationBedData(
  bed_file,
  pheno,
  genome = "hg38",
  chrom_col = "#chrom",
  start_col = "start",
  output_dir = NULL,
  chunk_size = 50000,
  output_prefix = NULL
)

Arguments

bed_file

Character. Path to the input BED file containing methylation data. The file should have chromosome and position columns, plus one column of methylation values per sample. The file can be gzipped. This argument is required and has no default.

pheno

Data frame. Phenotype data with sample IDs as rownames. Only samples present in both the pheno rownames and the BED file header are processed.

genome

Character. Genome version to use (e.g., "hg38", "hg19", "hs1") (default: "hg38")

chrom_col

Character. Name of the chromosome column in the BED file (default: "#chrom")

start_col

Character. Name of the start position column in the BED file (default: "start")

output_dir

Character. Directory for caching processed files. If NULL, uses a temporary working directory unless output_prefix is provided (default: NULL)

chunk_size

Integer. Number of rows to process in each chunk for memory efficiency (default: 50000)

output_prefix

Character. Optional prefix used to persist derived BED/tabix artifacts next to analysis outputs (default: NULL)
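
The sample matching described under pheno can be pictured with a minimal sketch. The helper match_samples is hypothetical and not part of the package; it assumes every header column other than chrom_col and start_col is a sample column, which may not match how the function treats additional columns.

```r
# Hypothetical helper (not part of the package): find the sample IDs shared
# by the pheno rownames and the BED header.
# Assumption: every column except chrom_col and start_col is a sample.
match_samples <- function(header_line, pheno,
                          chrom_col = "#chrom", start_col = "start") {
    columns <- strsplit(header_line, "\t", fixed = TRUE)[[1]]
    sample_cols <- setdiff(columns, c(chrom_col, start_col))
    # intersect() keeps the order of its first argument (the pheno rownames)
    intersect(rownames(pheno), sample_cols)
}

pheno <- data.frame(
    sample_group = c("case", "control", "case"),
    row.names = c("Sample1", "Sample2", "Sample9")
)

# Sample9 has no column in the header and Sample3 has no pheno row,
# so only Sample1 and Sample2 remain.
shared <- match_samples("#chrom\tstart\tSample1\tSample2\tSample3", pheno)
shared
```

Checking the overlap this way before calling readCustomMethylationBedData can make it easier to spot mismatched sample IDs early.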

Value

A list with two elements:

  • tabix_file: Character path to the created tabix-indexed BED file

  • locations: Disk-backed genomic location registry

Details

The function performs the following workflow:

  1. Validates that tabix and bgzip are available in the system PATH

  2. Checks the BED file header for required columns and sample IDs

  3. Processes the BED file in chunks to minimize memory usage

  4. Normalizes the BED format with standard BED6 columns (#chrom, start, end, id, score, strand)

  5. Converts chromosomes to integer factors for efficient sorting

  6. Creates a tabix-indexed compressed file for fast random access

  7. Persists derived artifacts under output_prefix when provided
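
Steps 4 and 5 can be illustrated with a small sketch. The helper normalize_chunk is hypothetical. The values used to fill end, id, score and strand, and the chromosome ordering, are assumptions made for illustration; the package may derive them differently.

```r
# Hypothetical helper (not part of the package): pad a chunk to BED6 columns
# and sort it by chromosome and position.
# Assumptions: end = start + 1, id = "chrom:start", score = 0, strand = ".".
normalize_chunk <- function(chunk, chrom_col = "#chrom", start_col = "start") {
    chrom_levels <- paste0("chr", c(1:22, "X", "Y", "M"))
    chrom <- as.character(chunk[[chrom_col]])
    start <- as.integer(chunk[[start_col]])
    sample_cols <- setdiff(names(chunk), c(chrom_col, start_col))
    samples <- chunk[, sample_cols, drop = FALSE]

    bed6 <- data.frame(
        chrom = chrom,
        start = start,
        end = start + 1L,
        id = paste(chrom, start, sep = ":"),
        score = 0L,
        strand = ".",
        stringsAsFactors = FALSE
    )
    names(bed6)[1] <- "#chrom"
    out <- cbind(bed6, samples)

    # Integer ranks sort chr2 before chr10, unlike a plain string sort
    chrom_rank <- as.integer(factor(chrom, levels = chrom_levels))
    out[order(chrom_rank, start), , drop = FALSE]
}

chunk <- data.frame(
    `#chrom` = c("chr10", "chr2", "chr1"),
    start = c(500L, 100L, 300L),
    Sample1 = c(0.2, 0.3, 0.9),
    check.names = FALSE,
    stringsAsFactors = FALSE
)
normalized <- normalize_chunk(chunk)
normalized
```

Sorting on an integer rank rather than on the chromosome string matters because tabix expects records that are grouped by sequence name and ordered by position.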

Requirements

This function requires the tabix and bgzip command-line tools to be installed and available in the system PATH. Both tools are distributed with HTSlib, the library that underlies samtools.
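
A quick way to confirm the requirement before running the function is to look the tools up with Sys.which. The helper missing_tools below is a hypothetical convenience wrapper, not part of the package.

```r
# Hypothetical helper (not part of the package): return the names of the
# required command-line tools that cannot be found on the PATH.
missing_tools <- function(tools = c("tabix", "bgzip")) {
    paths <- Sys.which(tools)      # named character vector, "" when not found
    names(paths)[!nzchar(paths)]
}

missing <- missing_tools()
if (length(missing) > 0) {
    message("Not found on PATH: ", paste(missing, collapse = ", "))
}
```

An empty result means both tools were found.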

Memory Management

The function processes the BED file in chunks of chunk_size rows, so large files are never loaded into memory in full. Genomic locations are stored in a disk-backed Registry object, which allows the location index to be larger than available RAM.
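
The chunk-based approach can be sketched in base R as follows. This illustrates the general technique under simple assumptions (a tab-separated file with a single header line) and is not the package's actual implementation; read_in_chunks and handle_chunk are hypothetical names.

```r
# Hypothetical helper (not part of the package): stream a tab-separated file
# through a callback, holding at most chunk_size data rows in memory.
read_in_chunks <- function(path, chunk_size, handle_chunk) {
    con <- if (grepl("\\.gz$", path)) gzfile(path, "r") else file(path, "r")
    on.exit(close(con))

    # The first line holds the column names
    header <- strsplit(readLines(con, n = 1), "\t", fixed = TRUE)[[1]]

    repeat {
        lines <- readLines(con, n = chunk_size)
        if (length(lines) == 0) break   # end of file
        chunk <- read.table(
            text = lines, sep = "\t", col.names = header,
            check.names = FALSE, comment.char = "",
            stringsAsFactors = FALSE
        )
        handle_chunk(chunk)
    }
    invisible(NULL)
}

bed_file <- tempfile(fileext = ".bed")
writeLines(c(
    "#chrom\tstart\tSample1",
    "chr1\t100\t0.2",
    "chr1\t200\t0.3",
    "chr2\t50\t0.9"
), bed_file)

# Record how many rows each chunk contained
rows_per_chunk <- integer(0)
read_in_chunks(bed_file, chunk_size = 2, function(chunk) {
    rows_per_chunk <<- c(rows_per_chunk, nrow(chunk))
})
rows_per_chunk
```

A smaller chunk_size lowers peak memory use at the cost of more read calls, which is the same trade-off the chunk_size argument controls.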

See also

convertBetaToTabix for converting standard beta files to tabix format

getBetaHandler for creating a BetaHandler object from processed files

Examples

# Create a simple phenotype data frame
pheno <- data.frame(
    sample_group = c("case", "control"),
    row.names = c("Sample1", "Sample2")
)

if (nzchar(Sys.which("tabix")) && nzchar(Sys.which("bgzip"))) {
    bed_file <- tempfile(fileext = ".bed")
    writeLines(c(
        "#chrom\tstart\tSample1\tSample2",
        "chr1\t100\t0.2\t0.8",
        "chr1\t200\t0.3\t0.7"
    ), bed_file)
    result <- readCustomMethylationBedData(bed_file, pheno)
    result$tabix_file
}