A few concepts to define ======================== Before to jump in the next section, let's agree on a few term and their associated concepts we'll use. Genes Adjacency --------------- Two genes are `adjacent` if and only if they are on the same chromosome (or DNA molecule) and next to each other on the DNA sequence. Obviously, this concept is only relevant for genes from the same genome. Now, let :math:`S1` be a set of genes from a species 1 and :math:`S2` be a set of genes from a species 2. Then let .. math:: S1 \times S2 = \{( (a1, b1), (a2, b2) )\} with * :math:`a1 \in S1`, * :math:`b1 \in S1`, * :math:`a2 \in S2`, * :math:`b2 \in S2`, * :math:`a1` and :math:`a2` are orthologous, * :math:`b1` and :math:`b2` are orthologous. This corresponds to the step 3 of the :ref:`general_process`. When collecting the pairs of orthologous from :math:`S1 \times S2`, we can filter the pairs of pairs of genes based on the adjacency in :math:`S1` and :math:`S2`. The following situations are possible: * The genes are adjacent in both pairs (:math:`a1` and :math:`b1` are adjacent as are :math:`a2` and :math:`b2`). * The genes are *not* adjacent in both pairs (:math:`a1` and :math:`b1` are *not* adjacent as are :math:`a2` and :math:`b2`). * The genes are adjacent in :math:`S1` but not in :math:`S2` (:math:`a1` and :math:`b1` are adjacent while :math:`a2` and :math:`b2` are *not* adjacent). * The genes are *not* adjacent in :math:`S1` but are in :math:`S2` (:math:`a1` and :math:`b1` are *not* adjacent while :math:`a2` and :math:`b2` are adjacent). We define the following strategies of collecting such pairs of pairs of genes: * All pairs are collected without taking the adjacency into account; we call this case ``all``. * The pairs are collected only when both genes pairs are *not* adjacent; we call this case ``none``. * The pairs are collected only when the genes of one genes pair are adjacent while the genes of the other genes pair are not; we call this case ``xor``. * The pairs are collected only when at least one of the genes pairs has its genes adjacent; we call this case ``or``. * The pairs are collected only when the genes are adjacent in both genes pairs; we call this case ``and``. Values selection mode --------------------- The result of the step 3 of the :ref:`general_process` is a list of values for each pairs of pairs of genes between each pairs of genomes. This list can contain the same pairs of orthologs among the different pairs of genomes or not. This depends on the orthologs chosen in the first place and on the availability of contacts data along each genomes. The previous facts can have the following effect: different pairs of genomes raises (partially or totally) different pairs of orthologs. Since we cannot know *a priori* if this is an issue or not, we put names on the different situations in order to make it possible to work with each of them in step 4 of the :ref:`general_process`. These situations are: * All values are kept, whatever they are present in all or just a subset of the pairs of species; we call that situation ``union``. * Only the values that are present in all pairs of species are kept; we call that situation ``intersection``. * Only the values that are present in at least two pairs of species (so at least 3 species) are kept; we soberly call that situation ``atLeastTwo``. Randomization ------------- An important part of this work is to test whether or not a phylogenetic signal is present in the contact data. In order to achieve that goal, we need to apply the method to the actual data, then to a randomized set of data, and finally to compare them. Let's talk about what we call *a randomized set of data*. Scrambling matrices ^^^^^^^^^^^^^^^^^^^ The contact data are represented as matrices, each axis being the coordinates along a chromosome. Thus, a box in such a matrix corresponds to the contacts between the two chromosomes at the given coordinates. Now, let's take a particular box (that is a pair of genes) in a matrix. Scrambling that matrix multiple times will make the contacts in this box tend to the mean number of contacts of the whole matrix. Consequently, by scrambling a matrix, we lose the structural information. Thus, comparing the results obtained using actual with the ones obtained from multiple scrambling allows us to look for a phylogenetic signal. The scrambling is done at step 2 of the :ref:`general_process`. Technically, we use the Fisher--Yates shuffle [Durstenfeld1964]_ to scramble the matrices. Bootstrap ^^^^^^^^^ At some point during the scientific process, we performed a step of bootstrap. The bootstrap has been introduced by Efron [Efron1979]_ and applied to phylogenies by Felsenstein [Felsenstein1985]_. We used it on the pairs of genes after joining (step 3 of :ref:`general_process`). The idea was to make multiple distances matrices by bootstraping, then to compute the actual distance matrix. After that, we inferred all phylogenetic trees, and compared them. However, after discussions it appeared we couldn't interpret our results. The code that produced those results is still in the repo, in the hope that the methodology could be fixed. .. [Durstenfeld1964] Durstenfeld R., "Algorithm 235: Random permutation", Communications of the ACM, Volume 7 Issue 7, July 1964, Page 420, doi: 10.1145/364520.364540 .. [Efron1979] Efron B., "Bootstrap Methods: Another Look at the Jackknife", The Annals of Statistics, Volume 7, Number 1 (1979), Pages 1-26, doi: 10.1214/aos/1176344552 .. [Felsenstein1985] Felsenstein J., "Confidence Limits on Phylogenies: an Approach Using the Bootstrap", Evolution, 1985, 39, 783-791, doi: 10.1111/j.1558-5646.1985.tb00420.x