Getting started with synphage
synphage is a pipeline to create genome synteny graphics from genbank files.
If you are familiar with Python, you can install synphage via pip. If not, we recommend using the Docker image instead.
Installation
Requirements
synphage does not depend on your operating system. You can run synphage on:
- Linux
- MacOS
- Windows
The Python wheel is built universally, so as long as you have a Python interpreter, version 3.9 or later, you are all set.
Via pip
Users installing synphage with pip also need to have Blast+ installed (see Additional dependencies).
# Latest
pip install synphage
For more details, see the Installation guide.
Via docker
docker pull vestalisvirginis/synphage
The Docker image comes with all the dependencies pre-installed. For more details, see the Installation guide.
Additional dependencies
synphage relies on one non-Python dependency that needs to be installed manually when synphage is installed with pip:
- Blast+ >= 2.12.0
Command line or install ...
Install from ...
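The two requirements above (Python 3.9 or later, Blast+ 2.12.0 or later) can be checked before installing. The following is a minimal sketch, not part of synphage: the function names are hypothetical, and it assumes that blastn is on your PATH and prints its version number when called with -version.

```python
import re
import shutil
import subprocess
import sys

MIN_PYTHON = (3, 9)
MIN_BLAST = (2, 12, 0)


def parse_blast_version(text):
    """Return (major, minor, patch) found in `blastn -version` output, or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in match.groups()) if match else None


def check_environment():
    """Return a list of human-readable problems (empty if everything looks fine)."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python 3.9 or later is required")
    blastn = shutil.which("blastn")
    if blastn is None:
        problems.append("blastn was not found on PATH")
    else:
        # Assumption: `blastn -version` prints a line containing x.y.z
        result = subprocess.run([blastn, "-version"], capture_output=True, text=True)
        version = parse_blast_version(result.stdout)
        if version is None or version < MIN_BLAST:
            problems.append("Blast+ 2.12.0 or later is required")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

An empty list means both requirements appear to be met. Docker users do not need this check, since the image ships with all dependencies.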
Usage
Setup
synphage requires:
- the path to a data folder, which contains the genbank folder and where the generated data will be stored;
- a genbank folder populated with genbank files (.gb and .gbk extensions are accepted);
- a sequences.csv file containing the file name and orientation of the sequences to plot.
Warning
Genbank file names should not contain spaces.
Path setup
export DATA_DIR=<path_to_data_folder>
Note
For Docker users, this path defaults to /data.
CSV file
genome_1.gb,0
genome_2.gb,1
genome_3.gb,0
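The setup rules above (accepted extensions, no spaces in file names, one file name and one orientation per CSV row) can be verified before starting a run. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the path to sequences.csv is passed explicitly because its expected location is not stated here, and the orientation is assumed to be 0 or 1, as in the example rows.

```python
import csv
from pathlib import Path

ACCEPTED_EXTENSIONS = {".gb", ".gbk"}


def check_genbank_folder(data_dir):
    """Return a list of problems found in <data_dir>/genbank."""
    genbank_dir = Path(data_dir) / "genbank"
    if not genbank_dir.is_dir():
        return [f"missing folder: {genbank_dir}"]
    problems = []
    for path in sorted(genbank_dir.iterdir()):
        if not path.is_file():
            continue
        if path.suffix not in ACCEPTED_EXTENSIONS:
            problems.append(f"unexpected extension: {path.name}")
        if " " in path.name:
            problems.append(f"file name contains a space: {path.name}")
    return problems


def check_sequences_csv(csv_path, data_dir):
    """Return a list of problems found in the rows of sequences.csv."""
    genbank_dir = Path(data_dir) / "genbank"
    problems = []
    with open(csv_path, newline="") as handle:
        for number, row in enumerate(csv.reader(handle), start=1):
            if not row:
                continue  # ignore blank lines
            if len(row) != 2:
                problems.append(f"row {number}: expected 2 columns, got {len(row)}")
                continue
            name, orientation = (field.strip() for field in row)
            if not (genbank_dir / name).is_file():
                problems.append(f"row {number}: {name} not found in genbank folder")
            # Assumption: only 0 and 1 are valid, as in the example above.
            if orientation not in {"0", "1"}:
                problems.append(f"row {number}: unexpected orientation {orientation!r}")
    return problems
```

Both functions return an empty list when nothing suspicious is found, so they can be called with the value of DATA_DIR before triggering any job.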
Running Synphage
synphage uses Dagster. In order to run synphage jobs, you need to start Dagster first.
Starting Dagster
Set the environment variable DAGSTER_HOME in order to keep a record of your previous runs. For more information, see the Dagster documentation.
export DAGSTER_HOME=<dagster_home_directory>
dagster dev -h 0.0.0.0 -p 3000 -m synphage
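The same start-up can be scripted from Python. The sketch below is only an illustration: the function name is hypothetical, and it assumes, to the best of our understanding, that Dagster expects DAGSTER_HOME to be an absolute path to an existing directory, so the directory is created first. The command itself is the one shown above.

```python
import os
from pathlib import Path


def prepare_dagster(dagster_home, host="0.0.0.0", port=3000):
    """Create DAGSTER_HOME if needed and return (command, environment)."""
    home = Path(dagster_home).expanduser().resolve()
    home.mkdir(parents=True, exist_ok=True)
    # Copy the current environment and add DAGSTER_HOME as an absolute path.
    env = dict(os.environ, DAGSTER_HOME=str(home))
    command = ["dagster", "dev", "-h", host, "-p", str(port), "-m", "synphage"]
    return command, env
```

The returned pair can be passed to subprocess.run(command, env=env) to start Dagster with the chosen home directory.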
Running the jobs
The software is currently structured in four different jobs:
- blast: runs a blastn of each sequence against each of the sequences (results -> gene_identity folder)
- transform: creates three tables from the blastn results and genbank files (results -> tables)
- plot: creates the synteny graph (results -> synteny)
- download: downloads genomes to be analysed from the NCBI database
Note
Different synteny plots can be generated from the same set of genomes. In this case, the first two jobs only need to be run once, and the third job (plot) can be triggered separately for each graph.
Output
synphage's output consists of three main parquet files and the synteny graph. However, all the data generated by the synphage pipeline are made available in your working directory.
Generated data architecture
.
├── <path_to_data_folder>/
│   ├── download/
│   ├── genbank/
│   ├── fs/
│   ├── gene_identity/
│   │   ├── fasta/
│   │   ├── blastn_database/
│   │   └── blastn/
│   ├── tables/
│   │   ├── blastn.parquet
│   │   ├── locus_and_gene.parquet
│   │   └── uniqueness.parquet
│   └── synteny/
│       ├── colour_table.parquet
│       └── synteny_graph.svg
└── ...
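To see which of the files in the layout above have already been produced, a small helper can walk the data folder. This is a minimal sketch with a hypothetical function name; the relative paths are taken from the tree above, and the graph files are listed by extension rather than by name, since the file name depends on the plot title.

```python
from pathlib import Path

# Relative paths taken from the data architecture shown above.
EXPECTED_FILES = [
    "tables/blastn.parquet",
    "tables/locus_and_gene.parquet",
    "tables/uniqueness.parquet",
    "synteny/colour_table.parquet",
]


def summarise_outputs(data_dir):
    """Return ({relative path: exists?}, [names of .svg files in synteny/])."""
    root = Path(data_dir)
    status = {rel: (root / rel).is_file() for rel in EXPECTED_FILES}
    synteny_dir = root / "synteny"
    graphs = []
    if synteny_dir.is_dir():
        graphs = sorted(path.name for path in synteny_dir.glob("*.svg"))
    return status, graphs
```

A table that is reported as missing usually means that the job producing it (transform for the tables folder, plot for the synteny folder) has not been run yet.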
Tables
The tables folder contains the three main parquet files generated by the transform job of synphage.
1. blastn.parquet contains the collection of the best match for each locus tag/gene against each genome. The percentage of identity between two genes/loci is then used to calculate the plot cross-links between the sequences.
2. locus_and_gene.parquet contains the full list of locus tags and corresponding gene names, when available, for all the genomes in the genbank folder. If the genbank file only contains CDS, the locus tag and gene values are replaced by the protein identifier protein_id.
3. uniqueness.parquet combines both previous data tables into one, allowing the user to quickly see how many matches their gene(s) of interest retrieved. These data are then used to compute the colour code used for the synteny plot. The result of the computation is recorded in colour_table.parquet. This file is overwritten at each plot run.
Synteny plot
The synteny plot is generated as a .svg file and a .png file, and contains the sequences listed in the sequences.csv file. The genes are colour-coded according to their abundance (percentage) among the plotted sequences. The cross-links between consecutive sequences indicate the percentage of similarity between those two sequences.
Plotting config options
Field Name | Description | Default Value |
---|---|---|
title | Generated plot file title | synteny_plot |
colours | Gene identity colour bar | ["#fde725", "#90d743", "#35b779", "#21918c", "#31688e", "#443983", "#440154"] |
gradient | Nucleotide identity colour bar | #B22222 |
graph_shape | Linear or circular representation | linear |
graph_pagesize | Output document format | A4 |
graph_fragments | Number of fragments | 1 |
graph_start | Sequence start | 1 |
graph_end | Sequence end | length of the longest genome |
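When preparing several plots from the same genomes, it can help to keep the options above in one place and only override what changes. The sketch below is an illustration, not synphage's own API: the names PLOT_DEFAULTS and build_plot_config are hypothetical, the defaults are copied from the table, and graph_end is set to None because its default (length of the longest genome) is only known at run time. How the resulting values are passed to the plot job is not covered here.

```python
import re

# Defaults copied from the plotting options table.
PLOT_DEFAULTS = {
    "title": "synteny_plot",
    "colours": ["#fde725", "#90d743", "#35b779", "#21918c",
                "#31688e", "#443983", "#440154"],
    "gradient": "#B22222",
    "graph_shape": "linear",
    "graph_pagesize": "A4",
    "graph_fragments": 1,
    "graph_start": 1,
    "graph_end": None,  # placeholder: resolved from the longest genome
}

HEX_COLOUR = re.compile(r"#[0-9a-fA-F]{6}")


def build_plot_config(**overrides):
    """Merge overrides into the defaults, rejecting unknown fields."""
    unknown = sorted(set(overrides) - set(PLOT_DEFAULTS))
    if unknown:
        raise KeyError("unknown plotting option(s): " + ", ".join(unknown))
    config = {**PLOT_DEFAULTS, **overrides}
    config["colours"] = list(config["colours"])  # do not share the default list
    for colour in [*config["colours"], config["gradient"]]:
        if not HEX_COLOUR.fullmatch(colour):
            raise ValueError(f"not a hex colour: {colour!r}")
    return config
```

For example, build_plot_config(title="my_plot", graph_fragments=2) returns the defaults with only those two fields changed, while a misspelled field name raises an error instead of being silently ignored.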
Genbank file download
The download job downloads sequences of interest into the genbank folder, to be subsequently processed by the software.
Requirement
Connection to the NCBI databases requires the user's email and api_key.
export EMAIL=user.email@email.com
export API_KEY=UserApiKey
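Before triggering the download, it is easy to confirm that both variables are visible to the process. This is a minimal sketch with a hypothetical helper name; it only checks that EMAIL and API_KEY are set and non-empty, not that NCBI accepts them.

```python
import os

REQUIRED_VARIABLES = ("EMAIL", "API_KEY")


def missing_ncbi_variables(environ=None):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]
```

Calling missing_ncbi_variables() with no argument inspects the current environment; an empty list means both variables are set.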
Query config options
Field Name | Description | Default Value |
---|---|---|
search_key | Keyword(s) for NCBI query | Myoalterovirus |
database | Database identifier | nuccore |