2. A few concepts to define

Before moving on to the next section, let’s agree on a few terms and the associated concepts we’ll use.

2.1. Gene Adjacency

Two genes are adjacent if and only if they are on the same chromosome (or DNA molecule) and next to each other on the DNA sequence. Obviously, this concept is only relevant for genes from the same genome.
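The definition above can be sketched in a few lines of Python. This is only an illustration: the `Gene` class, its `rank` field (the position of the gene in the gene order of its chromosome) and the `are_adjacent` function are hypothetical names, not taken from the project's code, and circular molecules are not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    # Hypothetical record: names and fields are assumptions for illustration.
    name: str
    chromosome: str  # chromosome (or DNA molecule) carrying the gene
    rank: int        # position of the gene in that chromosome's gene order


def are_adjacent(g1: Gene, g2: Gene) -> bool:
    """True if both genes are on the same chromosome and next to each other."""
    return g1.chromosome == g2.chromosome and abs(g1.rank - g2.rank) == 1


a = Gene("a", "chrI", 4)
b = Gene("b", "chrI", 5)
c = Gene("c", "chrII", 5)
print(are_adjacent(a, b))  # same chromosome, consecutive ranks
print(are_adjacent(a, c))  # different chromosomes
```

Comparing ranks rather than base-pair coordinates is a design choice of this sketch: it makes "next to each other" independent of gene length and intergenic distance.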

Now, let \(S1\) be a set of genes from species 1 and \(S2\) be a set of genes from species 2. We then define the following set of pairs of gene pairs:

\[S1 \times S2 = \{( (a1, b1), (a2, b2) )\}\]

with

\(a1 \in S1\),
\(b1 \in S1\),
\(a2 \in S2\),
\(b2 \in S2\),
\(a1\) and \(a2\) are orthologous,
\(b1\) and \(b2\) are orthologous.

This corresponds to step 3 of the General Process.
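As a minimal sketch of this construction, assume the orthologs are given as a list of one-to-one couples `(gene in S1, gene in S2)`. The function name `ortholog_pair_pairs` and the input layout are assumptions made for this example, and each unordered pair of couples is produced only once.

```python
from itertools import combinations


def ortholog_pair_pairs(orthologs):
    """Build every ((a1, b1), (a2, b2)) such that a1/a2 and b1/b2 are orthologous.

    `orthologs` is a list of (gene_in_S1, gene_in_S2) couples (assumed one-to-one).
    """
    return [
        ((a1, b1), (a2, b2))
        for (a1, a2), (b1, b2) in combinations(orthologs, 2)
    ]


orthologs = [("x1", "x2"), ("y1", "y2"), ("z1", "z2")]
for pair_in_s1, pair_in_s2 in ortholog_pair_pairs(orthologs):
    print(pair_in_s1, pair_in_s2)
```

With three orthologous couples, the sketch yields three pairs of pairs, the first one being `('x1', 'y1')` in \(S1\) matched with `('x2', 'y2')` in \(S2\).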

When collecting the pairs of orthologs from \(S1 \times S2\), we can filter the pairs of pairs of genes based on the adjacency in \(S1\) and \(S2\). The following situations are possible:

The genes are adjacent in both pairs (\(a1\) and \(b1\) are adjacent as are \(a2\) and \(b2\)).
The genes are adjacent in neither pair (\(a1\) and \(b1\) are not adjacent, nor are \(a2\) and \(b2\)).
The genes are adjacent in \(S1\) but not in \(S2\) (\(a1\) and \(b1\) are adjacent while \(a2\) and \(b2\) are not adjacent).
The genes are not adjacent in \(S1\) but are in \(S2\) (\(a1\) and \(b1\) are not adjacent while \(a2\) and \(b2\) are adjacent).

We define the following strategies for collecting such pairs of pairs of genes:

All pairs are collected without taking the adjacency into account; we call this case all.
The pairs are collected only when the genes are adjacent in neither gene pair; we call this case none.
The pairs are collected only when the genes of one gene pair are adjacent while the genes of the other gene pair are not; we call this case xor.
The pairs are collected only when at least one of the gene pairs has its genes adjacent; we call this case or.
The pairs are collected only when the genes are adjacent in both gene pairs; we call this case and.
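The five strategies boil down to a Boolean test on two flags: whether the genes are adjacent in \(S1\), and whether they are adjacent in \(S2\). The sketch below is an illustration under that reading; the function name `keep_pair` and its signature are hypothetical, only the mode names come from the text.

```python
def keep_pair(adjacent_in_s1: bool, adjacent_in_s2: bool, mode: str) -> bool:
    """Decide whether a pair of gene pairs is collected under a given strategy."""
    if mode == "all":
        return True
    if mode == "none":
        return not adjacent_in_s1 and not adjacent_in_s2
    if mode == "xor":
        return adjacent_in_s1 != adjacent_in_s2
    if mode == "or":
        return adjacent_in_s1 or adjacent_in_s2
    if mode == "and":
        return adjacent_in_s1 and adjacent_in_s2
    raise ValueError(f"unknown adjacency mode: {mode!r}")


# Print the decision of each strategy for the four possible situations.
for mode in ("all", "none", "xor", "or", "and"):
    row = [keep_pair(s1, s2, mode) for s1 in (True, False) for s2 in (True, False)]
    print(mode, row)
```

Note that none and or are complementary: together they cover every pair of pairs exactly once, as do xor and and within the or case.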

2.2. Value selection mode

The result of step 3 of the General Process is a list of values, one for each pair of pairs of genes, for each pair of genomes. The same pairs of orthologs may or may not appear in the lists of the different pairs of genomes. This depends on the orthologs chosen in the first place and on the availability of contact data along each genome.

As a consequence, different pairs of genomes can yield (partially or totally) different pairs of orthologs. Since we cannot know a priori whether this is an issue, we name the different situations so that each of them can be handled in step 4 of the General Process.

These situations are:

All values are kept, whether they are present in all or just a subset of the pairs of species; we call that situation union.
Only the values that are present in all pairs of species are kept; we call that situation intersection.
Only the values that are present in at least two pairs of species (so at least 3 species) are kept; we soberly call that situation atLeastTwo.
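The three situations can be expressed as a threshold on the number of pairs of species in which a given value is present. The sketch below assumes that each value is identified by a key that is comparable across pairs of species (for instance an identifier of the pair of orthologs); that data layout and the function name `select_keys` are assumptions made for this example.

```python
from collections import Counter


def select_keys(keys_by_species_pair, mode):
    """Return the keys kept under the 'union', 'intersection' or 'atLeastTwo' mode.

    `keys_by_species_pair` maps each pair of species to the collection of keys
    for which a value is available in that pair (hypothetical layout).
    """
    # Count in how many pairs of species each key is present.
    counts = Counter(
        key for keys in keys_by_species_pair.values() for key in set(keys)
    )
    if mode == "union":
        threshold = 1
    elif mode == "intersection":
        threshold = len(keys_by_species_pair)
    elif mode == "atLeastTwo":
        threshold = 2
    else:
        raise ValueError(f"unknown selection mode: {mode!r}")
    return {key for key, count in counts.items() if count >= threshold}


available = {
    ("sp1", "sp2"): {"p", "q"},
    ("sp1", "sp3"): {"p", "r"},
    ("sp2", "sp3"): {"p", "q"},
}
for mode in ("union", "intersection", "atLeastTwo"):
    print(mode, sorted(select_keys(available, mode)))
```

In this toy input, `p` is present everywhere, `q` in two pairs of species and `r` in only one, so the three modes keep three, one and two keys respectively.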

2.3. Randomization

An important part of this work is to test whether a phylogenetic signal is present in the contact data. To achieve that goal, we apply the method to the actual data, then to a randomized set of data, and finally compare the results.

Let’s talk about what we call a randomized set of data.

2.3.1. Scrambling matrices

The contact data are represented as matrices, each axis being the coordinates along a chromosome. Thus, a box in such a matrix corresponds to the contacts between the two chromosomes at the given coordinates.

Now, let’s take a particular box (that is, a pair of genes) in a matrix. If we scramble that matrix multiple times, the number of contacts in this box, averaged over the scramblings, tends to the mean number of contacts of the whole matrix.
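To see why, here is a short derivation under the assumption that one scrambling is a uniform random permutation \(\sigma\) of all the boxes of the matrix. The symbols \(N\) (number of boxes), \(m_1, \dots, m_N\) (contact counts of the boxes), \(\bar{m}\) (their mean) and \(R\) (number of scramblings) are introduced only for this explanation. After scrambling, a fixed box \(k\) holds the count \(m_{\sigma(k)}\), and each original box is equally likely to land there, so

\[\mathbb{E}\left[m_{\sigma(k)}\right] = \sum_{j=1}^{N} m_j \, P(\sigma(k) = j) = \frac{1}{N} \sum_{j=1}^{N} m_j = \bar{m}\]

By the law of large numbers, the average of the counts observed in box \(k\) over \(R\) independent scramblings converges to \(\bar{m}\) as \(R\) grows, whatever the position of the box.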

Consequently, by scrambling a matrix, we lose the structural information. Thus, comparing the results obtained using the actual data with the ones obtained from multiple scramblings allows us to look for a phylogenetic signal.

The scrambling is done at step 2 of the General Process. Technically, we use the Fisher–Yates shuffle [Durstenfeld1964] to scramble the matrices.
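For reference, here is a minimal sketch of the Fisher–Yates shuffle in the in-place form given by Durstenfeld, applied to a matrix stored as a list of rows. It assumes that scrambling permutes all the boxes of the matrix regardless of their row and column; the function names are hypothetical and this is not the project's actual code.

```python
import random


def fisher_yates_shuffle(items, rng=random):
    """Shuffle `items` in place (Durstenfeld's variant) and return it."""
    for i in range(len(items) - 1, 0, -1):
        j = rng.randint(0, i)  # uniform index in [0, i], both ends included
        items[i], items[j] = items[j], items[i]
    return items


def scramble_matrix(matrix, rng=random):
    """Return a new matrix of the same shape with all its boxes permuted."""
    n_cols = len(matrix[0])
    flat = [value for row in matrix for value in row]
    fisher_yates_shuffle(flat, rng)
    return [flat[r * n_cols:(r + 1) * n_cols] for r in range(len(matrix))]


contacts = [[9, 3, 0], [3, 8, 1]]
print(scramble_matrix(contacts, random.Random(42)))
```

Passing a seeded `random.Random` instance makes a scrambling reproducible, which helps when comparing runs. A scrambled matrix always contains the same values as the original, so its mean number of contacts is unchanged.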

2.3.2. Bootstrap

At some point during the project, we performed a bootstrap step. The bootstrap was introduced by Efron [Efron1979] and applied to phylogenies by Felsenstein [Felsenstein1985].

We used it on the pairs of genes after joining (step 3 of the General Process). The idea was to build multiple distance matrices by bootstrapping, then to compute the actual distance matrix. After that, we inferred all the phylogenetic trees and compared them.

However, after discussion, it appeared that we could not interpret our results. The code that produced those results is still in the repository, in the hope that the methodology can be fixed.
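For readers unfamiliar with the technique, the resampling step of a bootstrap can be sketched as follows: each replicate draws, with replacement, as many pairs of genes as there are in the original list. This only illustrates the generic resampling; the function name is hypothetical, and the computation of a distance matrix from each replicate is deliberately left out because it is not described here.

```python
import random


def bootstrap_replicates(gene_pairs, n_replicates, rng=random):
    """Return `n_replicates` resamplings, with replacement, of `gene_pairs`."""
    n = len(gene_pairs)
    return [
        [gene_pairs[rng.randrange(n)] for _ in range(n)]
        for _ in range(n_replicates)
    ]


pairs = [("a1", "b1"), ("a1", "c1"), ("b1", "c1"), ("c1", "d1")]
for replicate in bootstrap_replicates(pairs, 3, random.Random(7)):
    print(replicate)
```

Each replicate has the same size as the original list, but some pairs of genes appear several times while others are left out.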

[Durstenfeld1964]

Durstenfeld R., “Algorithm 235: Random permutation”, Communications of the ACM, Volume 7 Issue 7, July 1964, Page 420, doi: 10.1145/364520.364540

[Efron1979]

Efron B., “Bootstrap Methods: Another Look at the Jackknife”, The Annals of Statistics, Volume 7, Number 1 (1979), Pages 1-26, doi: 10.1214/aos/1176344552

[Felsenstein1985]

Felsenstein J., “Confidence Limits on Phylogenies: An Approach Using the Bootstrap”, Evolution, Volume 39 (1985), Pages 783-791, doi: 10.1111/j.1558-5646.1985.tb00420.x