1. Introduction and General Process

1.1. Main Idea

We want to compare the shape of DNA between homologous genomic regions.

The shape of DNA cannot be directly observed on a genome-wide basis, thus we use the only experiment that offers an approaching information: genome-wide chromosome conformation capture, also called Hi-C [Lieberman2009].

The determination of homologous genomic regions is a complex problem. In order to simplify it and be able to work more easily with genomes of different size (around 150Mpb for fly to compare with the multiple Gpb of Human!), we chose to use orthologous genes. Basically, orthologous genes (or orthologs) are homologous genes found in different species. For more information on how to get them, please refer to the Quest for Orthologs consortium website.

Let a1 and b1 be two genes in the genome 1; a2 and b2 be two genes in the genome 2. a1 and a2 are orthologs, so are b1 and b2. The idea is to look at the contacts between a1 and b1, and to compare them with the contacts between a2 and b2. For that we compute the ratio between their number of contacts.

To completely compare two genomes, we compute the ratios for all possible pairs of orthologs.

1.2. General Process

The basic process consists of the following steps:

  1. Prepare the data. Basically, this means selecting the orthologous genes that will be used, checking the orthology relationships (only 1-1), getting Hi-C contacts, taking care of the Hi-C data normalization.
  2. Look at the contacts for all pairs of genes. This is done inside a genome. Thus, this step has to be repeated for each analyzed genome.
  3. Join the contacts between two genomes. This joining is done by using the orthology relationships: the genes a1 and b1 in the genome 1 are joined to their orthologs counterparts a2 and b2 in the genome 2. The term joining here is just a placeholder for the performed mathematical operation ; as said earlier, in our case we just compute a ratio. The process is applied to all pairs of orthologs in both genome 1 and 2. This step has to be repeated for each pair of analyzed genome.
  4. Make the distances matrix. This is done by computing a distance for each pair of genome using the results of the previous step.
  5. Infer a phylogeny using the distances matrix computed at the step 4. For this step, regular distance-based methods and tools are used. Specifically, we used the BioNJ algorithm [Gascuel1997] implemented in the FastME tool [Lefort2015].

Disconnecting steps 2 and 3 enables parallel computation. Since step 2 is computed on a per-species basis, all species can be run in parallel. Likewise, all pairs of species can be run in parallel for step 3. in this pipeline, the parallel computations are handled automagically with Snakemake [Köster2012] and will be made explicit in the next sections.

1.3. Parameters

There are three main parameters that affect the results.

  1. Species and orthologs.
  2. Hi-C resolution, also called binsize.
  3. Hi-C threshold.

The effect of changing the species is obvious. Changing the orthologs means looking at different genomic regions. While this will obviously change the experiment results, we cannot predict in which way. Thus, the whole pipeline has not been designed to save computation time when changing species and/or orthologs.

Changing the Hi-C resolution enables more precise results. Indeed, when the binsize is smaller (and the Hi-C resolution is higher), one can better distinguish contact locations. This leads to a better map of which gene contacts which one. However, lowering binsize is a double-edge sword. Since binning is done in order to get more signal, lowering it dilutes that signal. This can potentially lead to a loss of information and eventually wrong results. Finding the good resolution depends on the data and a lot of tries. In this pipeline, the choice of a binsize is done at step 2. One can run in parallel multiple steps 2 for the same species, each one with a different binsize.

The comparison of the contacts for a pair of orthologs between two species is done using a ratio. We usually don’t want to be sensitive to the order of magnitude of the number of contacts. That is why we chose a ratio. However, if the contacts from both species are really small, this can mean that they are not informative (not enough signal). In order to not being affected by this flaw, a threshold is applied on the contact data during step 3. Testing for different threshold values can be done by running in parallel different steps 3 with different threshold values.

[Gascuel1997]Gascuel O., “BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data.”, Molecular Biology and Evolution. 1997 14:685-695.
[Lieberman2009]Lieberman-Aiden E., van Berkum N. L., Williams L., Imakaev M., Ragoczy T., Telling A., Amit I., Lajoie B. R., Sabo P. J., Dorschner M. O., Sandstrom R., Bernstein B., Bender M. A., Groudine M., Gnirke A., Stamatoyannopoulos J., Mirny L. A., Lander E. S., Dekker J. “Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome” Science 09 Oct 2009, Vol. 326, Issue 5950, pp. 289-293 doi: 10.1126/science.1181369
[Köster2012]Köster J. and Rahmann S., “Snakemake — a scalable bioinformatics workflow engine”, Bioinformatics, Volume 28, Issue 19, 1 October 2012, Pages 2520–2522, doi: 10.1093/bioinformatics/bts480
[Lefort2015]Lefort V., Desper R. and Gascuel O., “FastME 2.0: A Comprehensive, Accurate, and Fast Distance-Based Phylogeny Inference Program”, Molecular Biology and Evolution, Volume 32, Issue 10, 1 October 2015, Pages 2798–2800, doi: 10.1093/molbev/msv150