RNA-Seq Analysis Data Collection Template
1. Raw Sequencing Data
File Information
- FASTQ Files: Location and format of raw sequencing reads (.fastq, .fastq.gz, etc.)
- File Naming Convention: How are your files named? (e.g., sample_condition_replicate_R1.fastq.gz)
- File Organization: Are files organized in folders by sample, condition, or other structure?
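A naming convention is easiest to document when it can be checked mechanically. The sketch below assumes the example pattern sample_condition_replicate_R1.fastq.gz, with fields that contain no underscores. The regular expression and the function name `parse_fastq_name` are illustrative, not part of the workflow.

```python
import re

# Hypothetical pattern for names such as sample_condition_replicate_R1.fastq.gz.
# Assumes the sample, condition, and replicate fields contain no underscores.
FASTQ_NAME = re.compile(
    r"^(?P<sample>[^_]+)_(?P<condition>[^_]+)_(?P<replicate>[^_]+)"
    r"_(?P<read>R[12])\.(?:fastq|fq)(?:\.gz)?$"
)


def parse_fastq_name(filename):
    """Return the name fields as a dict, or None if the name does not match."""
    match = FASTQ_NAME.match(filename)
    return match.groupdict() if match else None


# Names that return None do not follow the assumed convention.
print(parse_fastq_name("s1_control_1_R1.fastq.gz"))
print(parse_fastq_name("notes.txt"))
```

If your files follow a different scheme, adjusting the pattern and listing the names that fail to match is a quick way to find inconsistencies before submission.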
Sequencing Strategy
- Read Type: Single-end (SE) or paired-end (PE) sequencing
- Number of Files: Total count of FASTQ files
- File Sizes: Approximate sizes of your largest files (for storage planning)
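The read type and file count can often be inferred from the file names alone. This is a minimal sketch that assumes paired-end mates are tagged `_R1` and `_R2` directly before the extension. The function `summarize_fastq_files` is a hypothetical helper, and its guess should be confirmed against the sequencing provider's report.

```python
import re
from collections import defaultdict

# Assumed mate tag: "_R1" or "_R2" directly before the FASTQ extension.
READ_TAG = re.compile(r"_(R[12])(?=\.(?:fastq|fq)(?:\.gz)?$)")


def summarize_fastq_files(filenames):
    """Count files and guess whether the data are single-end or paired-end."""
    mates = defaultdict(set)
    for name in filenames:
        tag = READ_TAG.search(name)
        if tag:
            # Group mates by everything before the read tag.
            mates[name[:tag.start()]].add(tag.group(1))
        else:
            mates[name].add("untagged")
    paired = sorted(lib for lib, tags in mates.items() if tags == {"R1", "R2"})
    unpaired = sorted(lib for lib in mates if lib not in paired)
    if paired and not unpaired:
        read_type = "PE"
    elif unpaired and not paired:
        read_type = "SE"
    else:
        read_type = "mixed or unknown"
    return {
        "n_files": len(filenames),
        "read_type": read_type,
        "paired": paired,
        "unpaired": unpaired,
    }


print(summarize_fastq_files(["a_R1.fastq.gz", "a_R2.fastq.gz"]))
```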
2. Sequencing Method Details
Technical Specifications
- Sequencing Platform: Illumina, PacBio, Oxford Nanopore, or other
- Sequencing Instrument: Specific model (e.g., Illumina NovaSeq 6000, HiSeq 4000)
- Read Length: Average read length and length distribution
- Sequencing Depth: Expected number of reads per sample (e.g., 30M reads per sample)
- Library Preparation Method:
  - Stranded or unstranded RNA-seq
  - Library preparation kit used
  - Any modifications to standard protocols
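If the read length and depth are not known in advance, they can be measured from the FASTQ files themselves. The sketch below assumes standard four-line FASTQ records with no wrapped sequence lines. The function names are illustrative.

```python
import gzip
from collections import Counter


def read_length_counts(path):
    """Tally read lengths in a FASTQ file (plain text or gzip-compressed)."""
    opener = gzip.open if path.endswith(".gz") else open
    lengths = Counter()
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # In a four-line record, the sequence is the second line.
            if line_number % 4 == 1:
                lengths[len(line.rstrip("\r\n"))] += 1
    return lengths


def summarize_lengths(lengths):
    """Summarize a length tally as read count, mean, minimum, and maximum."""
    n_reads = sum(lengths.values())
    if n_reads == 0:
        return {"reads": 0, "mean_length": None, "min_length": None, "max_length": None}
    total_bases = sum(length * count for length, count in lengths.items())
    return {
        "reads": n_reads,
        "mean_length": total_bases / n_reads,
        "min_length": min(lengths),
        "max_length": max(lengths),
    }


# Usage with a hypothetical path:
# print(summarize_lengths(read_length_counts("sample_R1.fastq.gz")))
```

The read count returned here is also the sequencing depth for that file, so one pass covers both fields.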
Quality Information
- RNA Integrity Number (RIN): Available RIN scores for samples
- Quality Metrics: Any existing quality control data
- Known Issues: Any batch effects, contamination concerns, or technical problems
- Adapter Sequences: If known, specify adapter sequences for trimming
3. Species and Reference Genome Information
Organism Details
- Species Name: Scientific name and common name
- Taxonomic ID: NCBI taxonomy ID (if known)
- Strain/Subspecies: If applicable, specify strain information
Reference Genome Preferences
- Genome Assembly Version: Specific version you want to use
- Reference Database Preference:
  - RefSeq (NCBI’s reference sequence database)
  - Ensembl (EMBL-EBI’s genome database)
  - Other (specify)
- Assembly Accession: Specific genome assembly ID (e.g., GCF_000001405.39 for human GRCh38.p13)
- Annotation Version: GTF/GFF file version that matches the genome assembly
- Annotation Source: Same database as genome (e.g., both from Ensembl) or different
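An assembly accession can be sanity-checked before it is written into the template. NCBI assembly accessions use the prefix GCF for RefSeq or GCA for GenBank, followed by nine digits and a version number. The helper below is a hypothetical sketch that checks only the format, not whether the accession exists.

```python
import re

# GCF = RefSeq assembly, GCA = GenBank assembly; nine digits, then ".version".
ASSEMBLY_ACCESSION = re.compile(r"^(GC[AF])_(\d{9})\.(\d+)$")


def parse_assembly_accession(accession):
    """Split an accession into source, number, and version, or return None."""
    match = ASSEMBLY_ACCESSION.match(accession.strip())
    if match is None:
        return None
    prefix, number, version = match.groups()
    return {
        "source": "RefSeq" if prefix == "GCF" else "GenBank",
        "number": number,
        "version": int(version),
    }


print(parse_assembly_accession("GCF_000001405.39"))
```

Recording the version explicitly helps when matching the annotation file to the same assembly release.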
Alternative References
- Custom Genome: If you have a custom reference genome
- Transcriptome: If you prefer transcriptome-based quantification
- Organism Database: Any specific organism annotation database (e.g., org.Hs.eg.db for human)
4. Sample Information
Experimental Design
- Sample IDs: Unique identifiers for each sample
- Sample Descriptions: Brief descriptions of what each sample represents
- Experimental Conditions/Groups:
  - Treatment groups (e.g., control vs treated)
  - Time points (e.g., 0h, 6h, 24h)
  - Tissue types or cell lines
  - Any other experimental variables
Replicates and Statistics
- Replicate Information:
  - Number of biological replicates per condition
  - Number of technical replicates (if any)
  - Replicate type (biological vs technical)
- Sample Relationships: How samples are related (e.g., same individual, different time points)
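Replicate counts per condition can be tallied from a list of samples, which makes unbalanced or thin designs visible early. In this sketch the `(sample_id, condition)` input format is illustrative. The minimum of three biological replicates is a common rule of thumb used here as an assumption, not a requirement stated by this template.

```python
from collections import Counter


def replicates_per_condition(samples):
    """Count biological replicates per condition from (sample_id, condition) pairs."""
    return Counter(condition for _sample_id, condition in samples)


def conditions_below_minimum(counts, minimum=3):
    """List conditions with fewer replicates than the assumed minimum."""
    return sorted(cond for cond, n in counts.items() if n < minimum)


samples = [
    ("sample_1", "control"),
    ("sample_2", "control"),
    ("sample_3", "control"),
    ("sample_4", "treatment"),
    ("sample_5", "treatment"),
]
counts = replicates_per_condition(samples)
print(dict(counts))
print(conditions_below_minimum(counts))
```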
Metadata
- Batch Information: Any batch effects or technical variables
- Sample Collection Details: Date, location, processing method
- RNA Extraction Method: Kit used, storage conditions
- Additional Variables: Age, sex, genotype, or other relevant metadata
Sample Sheet Format
Please provide a tab-separated values (TSV) file with columns:
sample_id	condition	replicate	time_point	tissue	batch	other_metadata
sample_1	control	1	0h	liver	batch1	age_30
sample_2	treatment	1	6h	liver	batch1	age_30
5. Reference Files (if providing your own)
Genome Files
- Reference Genome FASTA: Complete genome sequence file
- Gene Annotation GTF/GFF: File containing gene models and transcript information
- Transcriptome FASTA: If using transcriptome-based quantification
- Index Files: Pre-built index files (if available)
Custom Annotations
- Custom Gene Sets: Any specific gene sets relevant to your study
- Pathway Databases: Custom pathway annotations
- Regulatory Elements: Promoter regions, enhancers, etc.
- Custom Annotation Databases: Any organism-specific databases
6. Analysis Parameters
Differential Expression Analysis
- Comparison Groups: Define the groups to be compared
  - Primary comparisons (e.g., treatment vs control)
  - Secondary comparisons (e.g., time course analysis)
- Statistical Thresholds:
  - False discovery rate (FDR) cutoff (default: 0.05)
  - P-value cutoff (default: 0.05)
  - Log2 fold change threshold (default: 1.0)
- Expression Thresholds:
  - Minimum read count per gene (default: 10)
  - Minimum expression level for inclusion
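To make the thresholds above concrete, the sketch below applies the listed defaults (FDR 0.05, absolute log2 fold change 1.0, minimum read count 10) to a small made-up result table. The field names `padj`, `log2fc`, and `total_count` are hypothetical. Applying the read-count minimum to the total across samples is an assumption, and the workflow may combine these filters differently.

```python
def is_significant(gene, fdr_cutoff=0.05, lfc_threshold=1.0, min_count=10):
    """Apply the expression, FDR, and fold-change thresholds to one gene record."""
    return (
        gene["total_count"] >= min_count          # assumed: count summed over samples
        and gene["padj"] < fdr_cutoff             # adjusted p-value (FDR)
        and abs(gene["log2fc"]) >= lfc_threshold  # up- or down-regulated
    )


# Made-up records for illustration only.
results = [
    {"gene": "A", "total_count": 250, "padj": 0.001, "log2fc": 2.3},
    {"gene": "B", "total_count": 4, "padj": 0.001, "log2fc": 3.0},
    {"gene": "C", "total_count": 900, "padj": 0.2, "log2fc": 1.5},
    {"gene": "D", "total_count": 300, "padj": 0.01, "log2fc": -1.2},
]
significant = [g["gene"] for g in results if is_significant(g)]
print(significant)
```

Because the fold-change test uses the absolute value, genes changing in either direction pass. State in the template whether you want both directions or only one.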
Quality Control Parameters
- Quality Score Thresholds: Minimum quality scores for base calling
- Adapter Trimming: Preferences for adapter removal (not currently supported natively by the workflow)
- Read Filtering: Minimum read length after trimming (not currently supported natively by the workflow)
- Mapping Quality: Minimum mapping quality scores
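Base quality thresholds are usually expressed on the Phred scale, where Q = -10 * log10(P) and P is the estimated error probability. Mapping quality in SAM/BAM files uses the same scale for the probability that the alignment position is wrong. The sketch below assumes the common Phred+33 ASCII encoding. The function names are illustrative.

```python
def phred_to_error_probability(quality):
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-quality / 10)


def decode_phred33(quality_string):
    """Decode a FASTQ quality string, assuming the Phred+33 ASCII offset."""
    return [ord(char) - 33 for char in quality_string]


def fraction_at_or_above(quality_string, threshold=20):
    """Fraction of bases with quality at or above the threshold."""
    scores = decode_phred33(quality_string)
    if not scores:
        return 0.0
    return sum(score >= threshold for score in scores) / len(scores)


print(decode_phred33("II55##"))
print(fraction_at_or_above("II55##", threshold=20))
```

A threshold of Q20 corresponds to a 1% error probability and Q30 to 0.1%, which can help in choosing the values to report here.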
Enrichment Analysis Preferences
- Gene Set Databases:
  - Gene Ontology (GO): Biological Process, Molecular Function, Cellular Component
  - KEGG pathways
  - Reactome pathways
  - Custom gene sets
- Enrichment Methods:
  - Over-Representation Analysis (ORA)
  - Gene Set Enrichment Analysis (GSEA)
  - Other methods (specify)
- Statistical Corrections: Multiple testing correction method (BH, Bonferroni, etc.)
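For readers choosing between these options, the sketch below shows what ORA and the Benjamini-Hochberg (BH) correction compute. ORA is written here as a one-sided hypergeometric test, and BH as the standard step-up adjustment. The function names are illustrative, and this is not a description of how the workflow implements either step.

```python
from math import comb


def ora_p_value(overlap, gene_set_size, n_significant, n_background):
    """Hypergeometric P(X >= overlap): chance of seeing at least this many
    gene-set members among the significant genes by random draw."""
    max_overlap = min(gene_set_size, n_significant)
    total = comb(n_background, n_significant)
    tail = sum(
        comb(gene_set_size, k) * comb(n_background - gene_set_size, n_significant - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Toy numbers: 10 background genes, a gene set of 4, 3 significant, 2 in the set.
print(ora_p_value(2, 4, 3, 10))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Bonferroni would instead multiply each p-value by the number of tests and cap the result at 1, which is stricter than BH.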
7. Quality Control and Visualization Preferences
Quality Control
- QC Thresholds: Acceptable quality scores and filtering criteria
- Coverage Requirements: Minimum coverage for gene detection
- Contamination Checks: Any specific contamination screening needed
- Batch Effect Correction: Methods for handling batch effects
Visualization and Reporting
- QC Plots: Types of quality control plots desired
- Expression Plots: Volcano plots, MA plots, heatmaps
- Enrichment Visualizations: Network plots, pathway diagrams
- Report Format: HTML reports, PDF summaries, interactive dashboards
- Data Export: RDS files, TSV tables, and HTML reports for further analysis
8. Analysis Goals and Expectations
Research Questions
- Primary Objectives: Main research questions to be addressed
- Secondary Analyses: Additional analyses of interest
- Validation Plans: How results will be validated (qPCR, Western blot, etc.)
Output Requirements
- File Formats: Preferred output formats (CSV, TSV, Excel, R objects)
- Data Sharing: Requirements for sharing results with collaborators
- Reproducibility: Level of detail needed for methods section
- Integration: Plans to integrate with other datasets or analyses
9. Additional Information
Previous Analyses
- Previous RNA-seq: Any previous RNA-seq analyses on similar samples
- Expected Results: Any known or expected gene expression changes
- Control Genes: Housekeeping genes for normalization validation
Special Requirements
- Custom Scripts: Any custom analysis scripts to be incorporated
- External Tools: Any specific tools or databases to be used
- Publication Standards: Any journal-specific requirements for analysis
- Data Privacy: Any restrictions on data handling or storage
Submission Checklist
Please ensure you have provided:
- [ ] Raw FASTQ files or access instructions
- [ ] Sample metadata table (TSV format)
- [ ] Species and genome assembly information
- [ ] Experimental design details
- [ ] Analysis parameters and thresholds
- [ ] Quality control preferences
- [ ] Output format requirements
- [ ] Timeline and deadline information
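Before ticking the sample metadata item, the TSV file can be checked with a short script. The sketch below treats sample_id, condition, and replicate as required columns, which is an assumption made for illustration. The template lists further columns, and the workflow's actual requirements may differ.

```python
import csv
import io

# Assumed minimum set of columns; adjust to the workflow's real requirements.
REQUIRED_COLUMNS = ("sample_id", "condition", "replicate")


def validate_sample_sheet(tsv_text):
    """Return a list of problems found in a tab-separated sample sheet."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = reader.fieldnames or []
    problems = [f"missing column: {col}" for col in REQUIRED_COLUMNS if col not in header]
    if problems:
        return problems
    seen = set()
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        sample_id = (row.get("sample_id") or "").strip()
        if not sample_id:
            problems.append(f"line {line_number}: empty sample_id")
        elif sample_id in seen:
            problems.append(f"line {line_number}: duplicate sample_id {sample_id}")
        seen.add(sample_id)
        if not (row.get("replicate") or "").strip().isdigit():
            problems.append(f"line {line_number}: replicate is not a whole number")
    return problems


sheet = (
    "sample_id\tcondition\treplicate\n"
    "sample_1\tcontrol\t1\n"
    "sample_1\ttreatment\tone\n"
)
for problem in validate_sample_sheet(sheet):
    print(problem)
```

An empty list means the sheet passed these basic checks. It does not confirm that the metadata values themselves are correct.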
Next Steps
Once you’ve completed this template:
- Review with collaborators to ensure all information is accurate and complete
- Proceed to Configuration to set up your analysis parameters
- Follow the Installation guide to set up the workflow
- Use the Running guide to execute your analysis
Linked external resources are independent of TUCCA and Tufts University and remain under their own licenses.