Metadata

Adding metadata to your directory is essential.
Right now you know what the data is about and what you did with it, but will you still remember this in three years?

As stated before, we strongly recommend adding metadata files to your directories. These documentation files record critical context, such as how and when your data was generated and processed.

The more detailed your metadata file is, the more you:

  • help your future self understand your own work months or even years later
  • enable colleagues or successors to build upon your research findings effectively
  • increase findability & reproducibility of your findings
  • fulfil the requirements for data sharing and publication (!)

Your data management plan (DMP), stored alongside your data, can serve as a foundational metadata document: it holds further important information about the project and is a handy reference point for anyone accessing the dataset.

Examples of information in a metadata file

  • Sequencing machine
  • Settings for data generation
  • Pipeline used to analyse the data
  • Parameters used for certain tools
  • Location of the source code
  • Versions of the tools used
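
Tool versions are easy to forget and tedious to type by hand. The sketch below shows one way they could be captured automatically. It assumes that each tool prints its version when called with `--version` (many do, but not all). `record_versions` is a hypothetical helper, not part of the script provided on this page, and the row layout simply mimics the table format of the metadata file shown further down.

```shell
#!/bin/bash

# Hypothetical helper: append one "Version: <tool>" row per tool to a metadata file.
# Assumption: every tool accepts --version; adjust for tools that use -v or -V.
record_versions() {
  local file=$1 tool version
  shift
  for tool in "$@"; do
    if command -v "$tool" > /dev/null 2>&1; then
      # Keep only the first line of the version output
      version=$("$tool" --version 2>&1 | head -n 1)
    else
      version="not found"
    fi
    printf '| %-23s | %s\n' "Version: ${tool}" "$version" >> "$file"
  done
}

# Example call (file name and tool list are placeholders):
# record_versions METADATA_myproject.txt bash awk
```

Tools that are not on the `PATH` are recorded as "not found" rather than skipped, so a missing entry is visible in the metadata file.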

To help you write a metadata file, we provide two options:

  • A script that generates the metadata file
  • An example metadata text file that you can copy

Content of the files

  • Script that you can run
      #!/bin/bash

      # Prompt for the target directory and create it if it does not exist
      read -r -p "Enter location: " LOCATION
      mkdir -p "$LOCATION" || exit 1

      # Collect the project name, which becomes part of the file name
      read -r -p "Fill in the Project Name (without spaces): " PROJECT_NAME
      METADATA_FILE="${LOCATION}/METADATA_${PROJECT_NAME}.txt"
      echo "Metadata file: ${METADATA_FILE}"

      # Stop rather than append a second table to an existing file
      if [ -e "$METADATA_FILE" ]; then
        echo "Error: ${METADATA_FILE} already exists." >&2
        exit 1
      fi

      # Append one table row: add_row "Field label" "value"
      add_row() {
        printf '| %-23s | %s\n' "$1" "$2" >> "$METADATA_FILE"
      }

      # Prompt for a value and append it as a row: ask "Field label" "Prompt text"
      ask() {
        local value
        read -r -p "$2" value
        add_row "$1" "$value"
      }

      # Table header
      {
        echo "| Field                   | Description                                                                          |"
        echo "|-------------------------|--------------------------------------------------------------------------------------|"
      } > "$METADATA_FILE"

      # Project Information
      add_row "Project Name" "$PROJECT_NAME"
      ask "Project Description" "Give a short project description: "
      ask "Start Project Date" "Start date of the project: "
      ask "Project Status" "Current status of the project (e.g., submitted, in progress, completed): "

      # User Information
      ask "User Name" "Name of the user requesting the service: "
      ask "User Email" "Email address of the user requesting the service: "
      ask "Principal Investigator" "Principal investigator: "
      ask "Collaborator" "Collaborator: "

      # Service Request Information
      ask "Service Type" "Type of bioinformatics service requested (e.g., sequencing, analysis, consultation): "

      # Sample Information
      ask "Sample Type" "Type of biological sample (e.g., DNA, RNA, whole genome, exome): "
      ask "Organism" "Organism from which the sample was obtained: "
      ask "Cell Line" "Cell line (if applicable): "

      # Sequencing Information
      ask "Library Prep" "Library prep (if applicable): "
      ask "Sequencing Platform" "Sequencing technology used (e.g., Illumina, PacBio, Oxford Nanopore): "
      ask "Sequencing Instrument" "Specific sequencing instrument used (e.g., HiSeq 2500, NovaSeq 6000): "
      ask "Read Length" "Length of the sequencing reads (e.g., 100 bp, 150 bp): "
      ask "Paired or Single-End" "Paired-end or single-end sequencing: "
      ask "Sequencing Depth" "Average sequencing depth or coverage: "

      # Data Information
      ask "Data Format" "Format of the input data (e.g., FASTQ, BAM, VCF): "
      ask "Data Location" "Location where the input data is stored: "

      # Analysis Information
      ask "Analysis Type" "Type of bioinformatics analysis requested (e.g., alignment, variant calling, RNA-seq): "
      ask "Analysis Parameters" "Specific parameters or settings used for the analysis: "
      ask "Reference Genome" "Reference genome used for the analysis (e.g., hg38, mm10): "
      ask "Output Format" "Format of the output data (e.g., BAM, VCF, CSV): "
      ask "Output Location" "Location where the output data will be stored: "

      # Billing Information
      ask "Funding" "Funding: "

      # Analyst and Publication Information
      ask "Analyst Name" "Name of the bioinformatician or analyst working on the service: "
      ask "Publication Link" "If published, link to publication: "
      ask "Public Repository Link" "If data in public repository, link to repository: "

      # Additional Information
      ask "Comments" "Additional comments or special instructions: "

      echo "Metadata collection complete. The details have been saved to ${METADATA_FILE}."
    
  • A text file that you can copy
      | Field                   | Description                                                                          |
      |-------------------------|--------------------------------------------------------------------------------------|
      | Project Name            | teste_metadata_script
      | Project Description     | testing the metadata script
      | Start Project Date      | 29/04/2026
      | Project Status          | in progress
      | User Name               | Marie Hannaert
      | User Email              | marie.hannaert@uantwerpen.be
      | Principal Investigator  | Arvid
      | Collaborator            | Lauren Moons
      | Service Type            | analysis
      | Sample Type             | DNA
      | Organism                | human
      | Cell Line               | /
      | Library Prep            | /
      | Sequencing Platform     | Oxford Nanopore
      | Sequencing Instrument   | /
      | Read Length             | /
      | Paired or Single-End    | PE
      | Sequencing Depth        | 10x
      | Data Format             | FASTQ
      | Data Location           | LTS hopefully
      | Analysis Type           | variant calling
      | Analysis Parameters     | /
      | Reference Genome        | hg38
      | Output Format           | BAM
      | Output Location         | here needed LTS
      | Funding                 | none
      | Analyst Name            | Lauren Moons
      | Publication Link        | /
      | Public Repository Link  | not EGA
      | Comments                | test test
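
Because every row follows the same `| Field | Value` layout, the file can also be read back by a script, which helps when you want to look up one detail across many project directories. The following is a minimal sketch, not part of the provided script: `get_field` is a hypothetical helper, and it assumes that values do not themselves contain a `|` character (anything after a second `|` would be cut off).

```shell
#!/bin/bash

# Hypothetical helper: print the value of one field from a metadata file.
# Usage: get_field <metadata file> "<field label>"
# Returns a non-zero exit status when the field is not present.
get_field() {
  awk -F'|' -v key="$2" '
    {
      field = $2
      gsub(/^[ \t]+|[ \t]+$/, "", field)    # trim spaces around the label
      if (field == key) {
        value = $3
        gsub(/^[ \t]+|[ \t]+$/, "", value)  # trim spaces around the value
        print value
        found = 1
        exit
      }
    }
    END { exit !found }
  ' "$1"
}

# Example call (file name is a placeholder):
# get_field METADATA_myproject.txt "Reference Genome"
```

The lookup matches the label exactly, so the spelling and capitalisation must be the same as in the file.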