3. Preparing Data and Running an Analysis

The steps below describe how to run one analysis (also called one experiment). By this, we mean measuring the distances between all species for a fixed set of species, a fixed set of orthologous genes, and a fixed set of “ready-to-use” Hi-C datasets (that is, corrected for experimental biases and normalized). Within a single analysis, however, the pipeline can compute the distances for several Hi-C resolutions (bin sizes).

3.1. Prepare the data

  1. Prepare the gene location files for each species, in standard BED format.
  2. Prepare the orthologs file in the appropriate format.
  3. Prepare Hi-C data in the appropriate format.
  4. Prepare the configuration file, in YAML, using the commented sample as a template. This file is used by SnakeMake, so keep it in a safe place: running the pipeline again with the same configuration file and the same datasets reproduces exactly the same results.
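
Before launching the pipeline, it can help to check that the gene location files are well formed. The following is only a minimal sketch, not part of the pipeline: the function name check_bed_lines is hypothetical, and it assumes that “standard BED” here means tab-separated records with at least four fields (chromosome, start, end, gene name). Whether the pipeline actually requires the name column is an assumption to verify against the format description.

```python
def check_bed_lines(lines):
    """Return a list of (line_number, message) for malformed BED records.

    Assumption: each record needs at least chrom, start, end and name,
    separated by tabs. This is not taken from the pipeline itself.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        # Skip blank lines and the optional header lines that BED allows.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            problems.append((number, "expected at least 4 tab-separated fields"))
            continue
        chrom, start, end, name = fields[:4]
        if not (start.isdecimal() and end.isdecimal()):
            problems.append((number, "start and end must be non-negative integers"))
        elif int(start) >= int(end):
            problems.append((number, "start must be smaller than end"))
        if not name:
            problems.append((number, "empty gene name"))
    return problems


# Made-up records for illustration only: the second and third are invalid.
sample = [
    "chr1\t100\t500\tgeneA\n",
    "chr1\t900\t700\tgeneB\n",
    "chr2\t10\t20\n",
]
for number, message in check_bed_lines(sample):
    print(f"line {number}: {message}")
```

To check a real file, pass an open file handle instead of the sample list, since the function only iterates over lines.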

3.2. Running the pipeline

To run the pipeline, launch SnakeMake (here with a configuration file named config.yaml):

snakemake --configfile config.yaml


SnakeMake can run several independent steps in parallel when the workflow allows it. For example, to run up to 6 jobs at the same time:

snakemake --configfile config.yaml -j6


SnakeMake can also automatically distribute its jobs on a compute cluster. However, be aware that this functionality is system-dependent. Here is a basic example with an SGE scheduler:

snakemake --configfile config.yaml -j6 --cluster 'qsub -o outfile -e errfile'

3.3. Results

The pipeline outputs a distance matrix in PHYLIP format, called all_replicates.phylip. It also creates a number of intermediate files, which are kept in case further analyses need to be performed. These files are:

  • the pairs files which are described here,
  • the values files which are described here,
  • the stats file which is described here.
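
To inspect the final distance matrix outside of a phylogeny package, the PHYLIP file can be parsed with a few lines of Python. This is a sketch under stated assumptions rather than a description of the pipeline's exact output: it assumes a square matrix, a first line giving the number of species, and species names without spaces separated from the values by whitespace. The function name read_phylip_distances, the species names and the values below are all made up for illustration.

```python
def read_phylip_distances(text):
    """Parse a square PHYLIP distance matrix into (names, rows).

    Assumptions: the first non-empty line holds the number of taxa, and
    each following line is a name followed by whitespace-separated values.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    count = int(lines[0].split()[0])
    names, rows = [], []
    for line in lines[1:count + 1]:
        fields = line.split()
        names.append(fields[0])
        rows.append([float(value) for value in fields[1:]])
    if len(names) != count or any(len(row) != count for row in rows):
        raise ValueError("not a square matrix with the declared number of taxa")
    return names, rows


# Toy matrix for illustration only; these are not real results.
example = """3
human  0.00 0.12 0.30
mouse  0.12 0.00 0.28
dog    0.30 0.28 0.00
"""
names, matrix = read_phylip_distances(example)
print(names)
print(matrix[0][2])  # distance between the first and third species
```

For the real output, read the contents of all_replicates.phylip into a string and pass it to the function. If the file turns out to use a lower-triangular layout or fixed-width names, the parser would need to be adapted.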