
Read and Process Custom Methylation BED Data
Source: R/read_custom_methylation_bed_data.R
Reads methylation data from a custom BED file format, converts it to a tabix-indexed format for efficient random access, and creates genomic location indices. This function is designed to handle custom methylation array data or sequencing-based methylation data in BED format, making it compatible with the CMEnt workflow.
Usage
readCustomMethylationBedData(
  bed_file,
  pheno,
  genome = "hg38",
  chrom_col = "#chrom",
  start_col = "start",
  output_dir = NULL,
  chunk_size = 50000,
  output_prefix = NULL
)
Arguments
- bed_file
Character. Path to the input BED file containing methylation data. The file should have chromosome and position columns, plus sample columns with methylation values. Can be gzipped. Required (the usage signature defines no default)
- pheno
Data frame. Phenotype data with sample IDs as rownames. Only samples present in both the pheno rownames and BED file header will be processed
- genome
Character. Genome version to use (e.g., "hg38", "hg19", "hs1") (default: "hg38")
- chrom_col
Character. Name of the chromosome column in the BED file (default: "#chrom")
- start_col
Character. Name of the start position column in the BED file (default: "start")
- output_dir
Character. Directory for caching processed files. If NULL, uses a temporary working directory unless output_prefix is provided (default: NULL)
- chunk_size
Integer. Number of rows to process in each chunk for memory efficiency (default: 50000)
- output_prefix
Character. Optional prefix used to persist derived BED/tabix artifacts next to analysis outputs (default: NULL)
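The sample-matching rule described for pheno can be illustrated with a short sketch. The helper name shared_samples is hypothetical and not part of CMEnt; it only assumes a tab-separated header line containing the chrom_col and start_col names, and it reads nothing beyond that first line.

```r
# Hypothetical helper (not a CMEnt function): read only the header line and
# return the sample IDs found in both rownames(pheno) and the BED header.
shared_samples <- function(bed_file, pheno,
                           chrom_col = "#chrom", start_col = "start") {
  con <- file(bed_file, open = "r")
  on.exit(close(con))
  header <- strsplit(readLines(con, n = 1), "\t", fixed = TRUE)[[1]]

  # Fail early when a required coordinate column is absent.
  missing_cols <- setdiff(c(chrom_col, start_col), header)
  if (length(missing_cols) > 0) {
    stop("Missing required column(s): ", paste(missing_cols, collapse = ", "))
  }
  intersect(rownames(pheno), header)
}

# Toy input: Sample3 is in the file but not in pheno, and Sample2 is in
# pheno but not in the file, so only Sample1 is shared.
bed_file <- tempfile(fileext = ".bed")
writeLines(c(
  "#chrom\tstart\tSample1\tSample3",
  "chr1\t100\t0.2\t0.8"
), bed_file)
pheno <- data.frame(
  sample_group = c("case", "control"),
  row.names = c("Sample1", "Sample2")
)
shared_samples(bed_file, pheno)
```

Running a check like this before calling readCustomMethylationBedData can help spot mismatched sample IDs in advance.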
Value
A list with two elements:
tabix_file: Character path to the created tabix-indexed BED file
locations: Disk-backed genomic location registry
Details
The function performs the following workflow:
Validates that tabix and bgzip are available in the system PATH
Checks the BED file header for required columns and sample IDs
Processes the BED file in chunks to minimize memory usage
Normalizes the BED format with standard BED6 columns (#chrom, start, end, id, score, strand)
Converts chromosomes to integer factors for efficient sorting
Creates a tabix-indexed compressed file for fast random access
Persists derived artifacts under output_prefix when provided
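To make the normalization step more concrete, here is a rough sketch of how one chunk might be reshaped into the BED6 column layout listed above. The helper normalize_chunk is hypothetical, and the values it fills in (end = start + 1, id built from chromosome and start, score = 0, strand = ".") are assumptions for illustration only; the source does not state how CMEnt populates these columns.

```r
# Illustrative only, not CMEnt's implementation. The fill-in values for
# end, id, score and strand are assumptions made for this sketch.
normalize_chunk <- function(chunk, samples,
                            chrom_col = "#chrom", start_col = "start") {
  chrom <- as.character(chunk[[chrom_col]])
  start <- as.integer(chunk[[start_col]])
  bed6 <- data.frame(
    `#chrom` = chrom,
    start = start,
    end = start + 1L,                      # assumed single-base interval
    id = paste(chrom, start, sep = ":"),   # assumed identifier scheme
    score = 0L,                            # assumed placeholder
    strand = ".",                          # assumed unstranded
    check.names = FALSE,
    stringsAsFactors = FALSE
  )
  # Append only the requested sample columns after the six BED columns.
  cbind(bed6, chunk[, samples, drop = FALSE])
}

chunk <- data.frame(
  `#chrom` = c("chr1", "chr1"),
  start = c(100L, 200L),
  Sample1 = c(0.2, 0.3),
  Sample2 = c(0.8, 0.7),
  Sample3 = c(0.5, 0.5),
  check.names = FALSE
)
normalized <- normalize_chunk(chunk, samples = c("Sample1", "Sample2"))
colnames(normalized)
```

Keeping the six coordinate columns first and the sample columns after them means each row stays self-describing once the file is compressed and indexed.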
Requirements
This function requires tabix and bgzip command-line tools to be installed and available in the system PATH. These tools are part of the HTSlib/samtools suite.
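A quick way to check this requirement from R is base R's Sys.which, which the example further down also uses as a guard. The helper name missing_tools below is hypothetical and only reports which tools were not found.

```r
# Hypothetical helper (not part of CMEnt): return the names of required
# command-line tools that cannot be found on the system PATH.
missing_tools <- function(tools = c("tabix", "bgzip")) {
  paths <- Sys.which(tools)        # named character vector; "" if not found
  names(paths)[!nzchar(paths)]
}

missing <- missing_tools()
if (length(missing) > 0) {
  message("Not found on PATH: ", paste(missing, collapse = ", "))
} else {
  message("tabix and bgzip are available")
}
```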
Memory Management
The function uses chunk-based processing to handle large BED files without loading the entire dataset into memory. The genomic locations are stored in a Registry object that can exceed available RAM by using disk-backed storage.
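The general idea of chunk-based processing can be sketched in base R as follows. This is a minimal illustration of the technique, not the package's code: the helper process_in_chunks and its fun callback are hypothetical, and the sketch assumes a tab-separated file with a single header line.

```r
# Hypothetical helper: apply `fun` to successive chunks of a tab-separated
# file so that at most `chunk_size` data rows are held in memory at a time.
process_in_chunks <- function(path, chunk_size = 50000, fun) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- strsplit(readLines(con, n = 1), "\t", fixed = TRUE)[[1]]
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break          # end of file reached
    chunk <- read.table(text = lines, sep = "\t", col.names = header,
                        check.names = FALSE, comment.char = "",
                        stringsAsFactors = FALSE)
    fun(chunk)
  }
  invisible(NULL)
}

# Toy file with three data rows, read two rows at a time.
bed_file <- tempfile(fileext = ".bed")
writeLines(c(
  "#chrom\tstart\tSample1\tSample2",
  "chr1\t100\t0.2\t0.8",
  "chr1\t200\t0.3\t0.7",
  "chr2\t50\t0.1\t0.9"
), bed_file)

rows_seen <- 0L
n_chunks <- 0L
process_in_chunks(bed_file, chunk_size = 2, fun = function(chunk) {
  rows_seen <<- rows_seen + nrow(chunk)
  n_chunks <<- n_chunks + 1L
})
```

Because the connection stays open between reads, each call to readLines continues where the previous one stopped, so peak memory depends on chunk_size rather than on the size of the file.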
See also
convertBetaToTabix for converting standard beta files to tabix format
getBetaHandler for creating a BetaHandler object from processed files
Examples
# Create a simple phenotype data frame
pheno <- data.frame(
  sample_group = c("case", "control"),
  row.names = c("Sample1", "Sample2")
)
# Run only when the required command-line tools are available
if (nzchar(Sys.which("tabix")) && nzchar(Sys.which("bgzip"))) {
  bed_file <- tempfile(fileext = ".bed")
  writeLines(c(
    "#chrom\tstart\tSample1\tSample2",
    "chr1\t100\t0.2\t0.8",
    "chr1\t200\t0.3\t0.7"
  ), bed_file)
  result <- readCustomMethylationBedData(bed_file, pheno)
  result$tabix_file
}