Synphage output data

All the data generated during synphage runs are stored in the data directory. (see Data Output setup for pip, docker desktop or docker users)

synphage's output consists of four to six main parquet files (depending if blastn and blastp were both executed) and the synteny graphic. However all the data generated by the synphage pipeline are made available in your data directory.

synphage data architecture

The folders and files are organised as follow inside the data directory:

.
├── <path_to_synphage_folder>/
│   ├── download/
│   ├── fs/
│   ├── genbank/
│   ├── gene_identity/
│   │   ├── fasta_n/
│   │   ├── blastn_database/
│   │   └── blastn/
│   ├── protein_identity/
│   │   ├── fasta_p/
│   │   ├── blastp_database/
│   │   └── blastp/
│   ├── tables/
│   │   ├── genbank_db.parquet
│   │   ├── processed_genbank_df.parquet
│   │   ├── blastn_summary.parquet
│   │   ├── blastp_summary.parquet
│   │   ├── gene_uniqueness.parquet
│   │   └── protein_uniqueness.parquet
│   ├── sequences.csv
│   └── synteny/
│      ├── colour_table.parquet
│      ├── synteny_graph.png
│      └── synteny_graph.svg
└── ...

Main files

The most relevant files to the users are located in the ./table and ./synteny directory.

Tables

The tables folder contains the four to six main parquet files generated by the pipeline.
1. genbank_db.parquet : original data parsed from the GenBank files.
2. processed_genbank_df.parquet : data processed during the validation step. It contains two additional columns:
- gb_type : specifying what type of data is used as unique identifier of the coding elements
- key: unique identifier based on the columns: filename, id and locus_tag
3. blastn_summary.parquet : data parsed from the blastn output json files. It contains the collection of the best match for each sequence against each genomes. The percentage of identity between two sequences are then used for calculating the plot cross-links between the sequences.
4. blastp_summary.parquet : data parsed from the blastp output json files. It contains the collection of the best match for each sequence against each genomes. The percentage of identity between two sequences are then used for calculating the plot cross-links between the sequences.
5. gene_uniqueness.parquet : combines both processed_genbank_df.parquet and blastn_summary.parquet in a single parquet file, allowing the user to quickly know how many matches their sequence(s) of interest has/have retrieved. These data are then used to compute the colour code used for the synteny plot. The result of the computation is recorded in the colour_table.parquet. This file is over-written between each plot run.
6. protein_uniqueness.parquet : combines both processed_genbank_df.parquet and blastp_summary.parquet in a single parquet file, allowing the user to quickly know how many matches their sequence(s) of interest has/have retrieved. These data are then used to compute the colour code used for the synteny plot. The result of the computation is recorded in the colour_table.parquet. This file is over-written between each plot run.

How to read parquet files

parquet files can be read and manipulated with any DataFrame API of choice, such as Pandas, Apache Spark, Polars, DuckDB but also in a non-programmatic manner using softwares such as Tad.

Synteny

The synteny folder contains one parquet file and to graphical outputs.
1. colour_table.parquet: results of the colour-code computation used for the plot. This table is computed for each plot as it is based on the sequences used for plotting the graph.
2. synteny_graph: is generated as .svg file and .png file, and contains the sequences indicated in the sequences.csv file. The genes are colour-coded according to their abundance (percentage) among the plotted sequences. The cross-links between each consecutive sequence indicates the percentage of similarities between those two sequences.

Synteny diagram

blastn datablastp data

Synteny diagram based on blastn data

Synteny diagram based on blastp data