Identification of pseudogenes
Our present search for pseudogenes showed that the KLK2 gene, which is under positive selection in some primates, has also been lost in cattle, horse and mouse (examples of traces of pseudogenes in Fig. 1). Genes encoding BSPH1 and -2 have also been lost in human, chimpanzee and dog. Figure 2 recapitulates all the events of gene loss across species (right).
Fig. 2. Major/diagnostic proteins present in seminal fluids and phylogenetic results showing positive selection or gene loss. The list of major proteins in seminal fluids of different mammals comes from our recent work (Druart et al., 2013); see Material and Methods section. The positive selection for the KLK2 gene was previously shown (Marques et al., 2012), as well as the loss of the TGM4 gene in cattle, pig and dog (Tian, Pascal, Fouchécourt, et al., 2009), and of the SAL1 gene in human (Meslin et al., 2011). ND: not determined.
Table 1. List of the 20 proteins studied. The proteins were chosen for their predominance in the seminal plasma of at least one of nine species of domestic placental mammals, as judged by their relative abundance after SDS-PAGE and Coomassie staining (Druart et al., 2013). RNase10 and MFGE8 are specific markers of epididymal maturation in ungulates. For species whose genome is not fully sequenced (camel, alpaca, goat, sheep), the human or bovine sequences of the proteins were used for the analyses.
Inference of positive selection
We studied the evolution of the proteins identified by proteomic/Orbitrap analysis in the seminal fluid of domestic animals, as previously described in human (Batruch et al., 2011; Milardi et al., 2012). We chose the proteins that are also highly expressed in the seminal fluid of at least one of the sampled domestic animal species (bull, ram, billy goat, boar, stallion and rabbit); see Methods section and Druart et al. (2013). None of the studied genes exhibited positive selection in the site tests, but two showed significant positive selection in the branch-site tests (Table 2; Fig. 2): KLK1 in human, cattle, mouse and horse (KLK1E2 in horse is one of the three co-orthologs of human KLK1), and CRISP3 in rabbit. The two genes that are highly expressed in the epididymis (and are markers of it) are also under positive selection: RNase10 in the stem of Fereuungulata (which includes carnivorans, perissodactyls and artiodactyls) and in the stem of the sampled artiodactyls, and MFGE8 in a stem-hominid and in dog (Fig. 2). It was not possible to determine without ambiguity whether genes encoding proteins of the SPADH/AQN family and of the BSP family evolved under positive selection, because the particularly high divergence of the protein sequences prevents accurate and unambiguous alignment.
Position of amino acids under positive selection in the structure of the proteins
The 3D structures of the proteins (or protein domains) were modelled when their sequences could be significantly related to, and aligned with, those of proteins with known 3D structures. Sequence identities range between 29 and 94 %, giving rise to models with accuracy at least comparable to that of low-resolution experimental structures. This qualitative analysis showed that most of the positively selected positions (40 out of the 42 for which 3D structure information is available) are located in regions exposed to solvent, suggesting site-specific selective pressures that reflect the functional context rather than a structural constraint (Table 3).
Table 3. Position of the amino acids under positive selection in the 3D structure models. Positions on the 3D structures were assessed by computing the accessible solvent area (ASA) and through visual inspection.
For some genes (KLK1 and 2, MFGE8), our tests clearly reject the null hypothesis that the amino acids under positive selection are randomly located in the proteins; far more appear to be exposed to the solvent than expected under the null hypothesis (Table S8). For RNase10, the null hypothesis could not be rejected, although this probably reflects a lack of power linked to the small amount of data for this gene. For CRISP3, the probability could not be computed at all. Overall, our results clearly suggest that more amino acids under positive selection are exposed to the solvent than expected by chance alone, although a global probability could not be computed because one of the individual tests yielded a probability not distinguishable from zero (hence, this probability could not be multiplied with the others). Binomial and Fisher’s exact tests gave very similar results.
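The logic of the binomial test mentioned above can be illustrated with a minimal sketch. It is not the authors' per-gene analysis (reported in Table S8): it pools the 40 exposed positions out of 42 given in the text, and the fraction of residues exposed to solvent under the null hypothesis (`assumed_exposed_fraction`) is a hypothetical value chosen only for illustration. The function name is likewise an assumption.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Counts quoted in the text: 40 of the 42 positively selected positions
# with 3D structure information are exposed to solvent.
exposed_selected = 40
total_selected = 42

# HYPOTHETICAL null expectation: the fraction of all residues that are
# solvent-exposed. The real value depends on each protein model.
assumed_exposed_fraction = 0.5

p_value = binomial_upper_tail(exposed_selected, total_selected,
                              assumed_exposed_fraction)
print(f"one-tailed binomial p-value: {p_value:.3g}")
```

Under these assumed inputs the tail probability is very small, which matches the direction of the reported result; the actual per-gene probabilities depend on the exposed fraction measured for each protein.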
For illustration purposes, we present here the inferred 3D structures for MFGE8 and KLK, as they have multiple positions under positive selection that might be of functional importance for the binding properties of MFGE8 and the protease activity of KLK.
According to domain databases, MFGE8, also known as lactadherin, contains two EGF-like domains, followed by a tandem of discoidin/F5/8 type C domains (C1 and C2). The positively selected positions of the EGF-like domains are exposed to solvent, without clustering in a particular region of the solvent-exposed surface. The Q43 position, in the vicinity of the integrin-binding RGD motif, lies in a large loop within the second EGF-like domain. The two discoidin/F5/8 type C domains bind to anionic phospholipids of cellular membranes (Raymond et al., 2009). Several 3D structures are known for F5/8 type C domains of lactadherins (bovine: pdb 2pqs, 3bn6) or related proteins (bovine factor Va: pdb 1sdd; human factor VIII: pdb 3cdz, 2r7e; human neuropilin: pdb 2qqj, 2qqk, 2qqm, 2qqo, 2orx, …). We selected a template in which the two domains are present in tandem (rather than the isolated F5/8 type C domains of lactadherin), in order to get information on the domain interface. The chosen template, bovine factor Va (pdb 1sdd; Adams et al., 2004), has the best score according to the Phyre fold recognition program (E-value 1.6 × 10⁻¹⁶), and shares 29 % amino acid sequence identity with human MFGE8. The alignment was manually refined (Fig. 3A) and the positively selected sites were mapped onto the resulting 3D model of the human MFGE8 C1 and C2 domains (Fig. 3B). Interestingly, several sites under positive selection (92R, 94T, 149L, 152H, 214T, 259L, 279V, 281G, 285N, 312S) are located within or in the vicinity of the three β-hairpin loops (referred to as ‘spikes’) of the two F5/8 type C domains, which are aligned in an edge-to-edge configuration. These spikes form pockets that are thought to allow interaction with phospholipids and membranes (Shao et al., 2008).
Domain databases indicate that KLK has the typical fold of chymotrypsin-like proteases, consisting of two beta-barrels, which form a cleft in which the catalytic triad is located. We used the experimental 3D structure of horse KLK1E2 (pdb 1gvz; Carvalho et al., 2002) to map the positions of amino acids under positive selection in KLK1, KLK1E2 and KLK2 from different species (Fig. 4). Our results suggest that five of the six positively selected amino acids of KLK are located on the protein surface, one in the first beta-barrel domain (KLK2 H56, the amino acid equivalent to A56 in bovine KLK2) and four in the second one (KLK2 L159, T200 and N240, as well as Q172, the amino acid equivalent to Y172 in mouse KLK1). None of these are involved in the active site, or in the kallikrein loop (in orange in Fig. 4), which has a direct role in the control and selectivity of the enzyme activity. None of these residues (or equivalent ones) are likely to be involved in ligand binding, when considering the structure of mouse KLK1 in complex with NGF (Bax et al., 1997; data not shown). The remaining sixth amino acid under positive selection (KLK2 F243, the amino acid equivalent to A244 in KLK1) contributes to the hydrophobic core of the second beta-barrel, and is located within a strand forming a wall of the enzymatic pocket.
Relationship between abundance in seminal fluid and positive selection
Only three genes (CRISP3, KLK1, and MFGE8) exhibit variation both in the abundance of their protein product in the seminal fluid and in the presence of positive selection. Thus, when contrasting pairs of characters for which both vary within pairs (which is only possible on binarised data, in the Mesquite implementation), only one positive and two negative pairs were found, which is consistent with a random association (p = 0.5). When using the ‘most pairs’ selector in Mesquite (which draws the highest number of pairs, irrespective of character state), only one positive and two neutral pairs were found, which is also not significant (p = 1). Pagel’s test yielded lower probabilities (Pagel, 1994), with the lowest being for KLK1, but this is not significant (p = 0.14; S1, sheets ‘Pagel 1994 test’ and ‘Abundance, selection correl.’). The lack of correlation is also confirmed by a visual inspection of the evolutionary changes implied by the data, which is best done using mirror trees. For instance, for KLK1 (Fig. 5), positive selection is found (in increasing number of amino acids) in the mouse, cow, and horse, but of these three taxa, only the horse has increased abundance of the gene product in its semen (in all other taxa in our sample, it is absent or in low abundance). In MFGE8 (Fig. 6), the dog (one amino acid) and humans (22 amino acids) exhibit positive selection, but only the cow has an abundance of the gene product in its semen.
Fig. 5. Lack of correlation between abundance (left) and number of amino acids under positive selection in KLK1. Parsimony optimizations performed in Mesquite (Maddison and Maddison, 2015).
Fig. 6. Lack of correlation between abundance (left) and number of amino acids under positive selection in MFGE8. Parsimony optimizations performed in Mesquite (Maddison and Maddison, 2015).
Rate of pseudogenisation
The sampled phylogenetic biodiversity is 613 Ma. Our data imply at least six pseudogenisation events, which gives a global rate (for the 20 considered genes) of about 0.0098 events/lineage/Ma, or a rate of about 0.00048 pseudogenisations/lineage/gene/Ma.
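The arithmetic behind these rates can be sketched as follows. The sketch assumes that the 613 Ma figure is the summed branch length of the sampled tree, so that dividing the minimum number of events by it gives events per lineage per Ma; the variable names are illustrative only.

```python
# Values quoted in the text.
min_events = 6                   # minimum number of pseudogenisation events
total_branch_length_ma = 613.0   # sampled phylogenetic biodiversity, in Ma
n_genes = 20                     # number of genes considered

# ASSUMPTION: the rate is events divided by total branch length.
rate_all_genes = min_events / total_branch_length_ma   # events/lineage/Ma
rate_per_gene = rate_all_genes / n_genes               # events/lineage/gene/Ma

print(f"global rate:   {rate_all_genes:.4f} events/lineage/Ma")
print(f"per-gene rate: {rate_per_gene:.5f} events/lineage/gene/Ma")
```

With these inputs, the division gives a global rate close to 0.0098 and a per-gene rate just under 0.0005; small differences from the reported per-gene figure presumably reflect rounding of the inputs.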