Lipid recognition propensities of amino acids in membrane proteins from atomic resolution data

Background: Protein-lipid interactions play essential roles in the conformational stability and biological functions of membrane proteins. However, few of the previous computational studies have taken into account the atomic details of protein-lipid interactions explicitly.

Results: To gain an insight into the molecular mechanisms of the recognition of lipid molecules by membrane proteins, we investigated amino acid propensities in membrane proteins for interacting with the head and tail groups of lipid molecules. We observed a common pattern of lipid tail-amino acid interactions in two different data sources, crystal structures and molecular dynamics simulations. These interactions are largely explained by general lipophilicity, whereas the preferences for lipid head groups vary among individual proteins. We also found that membrane and water-soluble proteins utilize essentially an identical set of amino acids for interacting with lipid head and tail groups.

Conclusions: We showed that the lipophilicity of amino acid residues determines the amino acid preferences for lipid tail groups in both membrane and water-soluble proteins, suggesting that tightly-bound lipid molecules and lipids in the annular shell interact with membrane proteins in a similar manner. In contrast, interactions between lipid head groups and amino acids showed a more variable pattern, apparently constrained by each protein's specific molecular function.


Background
About 20-30% of all proteins encoded in a typical genome are estimated to be localized in membranes [1,2], where protein-lipid interactions play crucial roles in the conformational stability and biological functions of membrane proteins. Many experimental studies have suggested that the physico-chemical properties of the membrane lipid bilayer influence the stability and function of membrane proteins. The thermal [3,4] and chemical [5] stability of the potassium channel KcsA has been shown to vary according to the lipid composition of the membrane bilayer. It has also been shown that the lipid composition affects protein functions, including ion transport in KcsA [6,7] and the Ca2+-ATPase of sarcoplasmic reticulum [8,9], phosphorylation by diacylglycerol kinase [10] and chemical compound transport by the mechanosensitive channel of large conductance MscL [11]. To complement these experimental studies, statistical analyses have been carried out to reveal amino acid preferences and conservation patterns within the lipid bilayer environment [12-16] using available sequence and structural data. The patterns emerging from these statistical analyses should implicitly reflect the effects of lipid molecules on the structural formation and stability of membrane proteins. However, few of the previous computational studies have taken into account the atomic details of protein-lipid interactions explicitly. A notable exception is all-atom molecular dynamics (MD) simulation; it has become possible to apply the technique to membrane proteins in conditions mimicking biological membranes (reviewed recently by Khalili-Araghi and co-authors [17]). All-atom MD simulations enable us to inspect protein-lipid interactions in atomic detail [18,19] and can reveal the role of lipids in protein function [20], albeit for a small selection of specific lipid and protein molecules.
In this paper, we attempt to understand the nature of protein-lipid interactions using a computational approach. Given the limited number of crystal structures containing lipid molecules, we decided to combine all known biological phospholipids together and classify the atomic interactions into those involving the "head" and "tail" parts of the lipids. The head and tail groups can be found in most phospholipids constituting a biological membrane and define one of the most essential chemical features of these molecules. Thus, we ask more specifically: "How are the head and tail portions of lipid molecules recognized by amino acid residues in membrane proteins?" To answer this question, we utilized two available data sources, crystal structures and MD trajectories. Using the crystal structure data, we can include and examine various kinds of proteins and lipids, although the number of lipid molecules observed in each solved structure is limited. Using the MD data, we can obtain detailed information about all the lipid molecules surrounding a protein, although such an analysis is possible only for a small set of protein and lipid types. The combination of these two data sources allows us to assess the biases resulting from a limited variety of data in each data source. The results revealed a common pattern of lipid tail-amino acid interactions observed in both the crystal structures and MD trajectories. We show that the recognition of lipid tails can be explained largely by general lipophilicity and that this effect dominates in the two different situations represented by the crystal structure and MD datasets. In contrast, lipid head groups showed a more complicated and diverse pattern and we discuss how our observations can be related to known experimental data and previously proposed concepts concerning protein-lipid interactions.

Lipid definition and dataset
Lipids in this paper were defined as phosphoglycerides that consisted of one or two fatty acids linked through glycerol phosphate to zero or one polar group, and their mimetic compounds. First, an initial list of three-letter HET IDs of lipids in the Protein Data Bank (PDB) [21] was obtained by keyword searches against the Chemical Component Dictionary (CCD) through Ligand Expo [22] and PDBeChem [23] using all the MeSH terms below 'Glycerophosphates' in the MeSH hierarchy. Next, mimetic compounds were found by the "Similar Compound Search" function at PubChem [24] and RCSB PDB [25]. Finally, all the collected compounds were manually checked to determine whether they met the definition of lipids above. A total of 98 HET IDs were collected (Table 1) and used to search for proteins in contact with lipids in the PDB repository (see the next section).

Crystal structure data of protein-lipid complexes
Using the HET IDs listed in Table 1, a local repository of the PDB (updated on February 9, 2011) was scanned for the crystal structures of proteins that contained these lipid molecules. Retaining only those structures solved at 4.0 Å resolution or better (ignoring structures solved by NMR and other methods, for which resolution was unavailable), a total of 290 protein-lipid complexes were obtained initially, consisting of 1,657 chains. Protein chains that were smaller than 30 residues, that contained one or more non-standard amino acid residues (except for selenomethionine, which was treated as MET) or that had no lipid contacts (see below for the definition of contacts) were removed from this set, leaving 1,497 protein chains. These sequences were clustered using the BLASTClust program (available from the BLAST [26] distribution) at a 25% sequence identity cutoff, resulting in 148 clusters. Clusters in which all the members had fewer than five residues in contact with lipids were discarded. The remaining clusters were classified as transmembrane (TM) or non-transmembrane (non-TM) in the following manner. A cluster was initially annotated as TM if any of its members was found in the PDBTM [27] or OPM [28] databases (both downloaded on February 6, 2011), and as non-TM otherwise. To confirm the presence (or absence) of TM helices, PDB2TMD [29] was run, followed by manual inspection to ensure that all the proteins were correctly annotated as TM or non-TM. From each cluster, the protein chain with the highest number of lipid-contacting residues was selected as the representative, producing 45 TM and 27 non-TM protein chains (Table 2).
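The last two filtering steps described above (discarding clusters with too few lipid contacts and choosing one representative per cluster) can be illustrated with a minimal sketch. The input layout, the function name `pick_representatives` and the chain identifiers are hypothetical and are not taken from the original pipeline; ties are resolved here by taking the first chain encountered, which is an assumption.

```python
from typing import Dict, List, Tuple

# Hypothetical input layout: cluster id -> list of (chain id, number of
# lipid-contacting residues). The identifiers below are placeholders.
Clusters = Dict[str, List[Tuple[str, int]]]

MIN_CONTACTS = 5  # clusters whose members all have fewer contacts are discarded


def pick_representatives(clusters: Clusters,
                         min_contacts: int = MIN_CONTACTS) -> Dict[str, str]:
    """Return a mapping cluster id -> representative chain id."""
    representatives = {}
    for cluster_id, members in clusters.items():
        # Discard the cluster if every member has fewer than `min_contacts`
        # residues in contact with lipids.
        if all(n_contacts < min_contacts for _, n_contacts in members):
            continue
        # Representative: the chain with the most lipid-contacting residues
        # (first one wins on ties; the tie rule is an assumption).
        best_chain, _ = max(members, key=lambda member: member[1])
        representatives[cluster_id] = best_chain
    return representatives


example = {
    "c1": [("chainA", 12), ("chainB", 7)],
    "c2": [("chainC", 3), ("chainD", 4)],  # all below the threshold: dropped
    "c3": [("chainE", 5)],
}
print(pick_representatives(example))  # {'c1': 'chainA', 'c3': 'chainE'}
```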
Although the resolution cutoff for data collection was set to 4.0 Å, the worst resolution of any included structure was 3.7 Å. Also, only two protein chains in the TM dataset had worse than 3.5 Å resolution, and only four had worse than 3.0 Å resolution. All the non-TM structures had 3.0 Å or better resolution. Thus, the final list consisted mostly of proteins solved at reasonably high resolution. All the statistical analyses in this paper were based on these protein chains unless otherwise specified. Although no conscious selection was made, the protein chains in the TM dataset were mostly helical, with the sole exception of a beta barrel anion channel protein (PDB:3emn).

Amino acid-lipid contacts and propensity scores
Various types of amino acid-lipid contacts exist in protein-lipid complexes. They were broadly grouped into (1) hydrogen bonds, (2) van der Waals contacts and (3) salt bridges. These contacts were defined by using the HBPLUS program [34] with the standard atomic radii from the PDB het dictionary [35]. The default definitions of van der Waals interactions and hydrogen bonds were used to identify the amino acid-lipid contacts.
According to the algorithm used in HBPLUS, hydrogen atoms were first added to the protein structure and then a hydrogen bond was identified if (i) the donor-acceptor distance was less than 3.9 Å, (ii) the hydrogen-acceptor distance was less than 2.5 Å and (iii) all three angles D-H-A, D-A-AA and H-A-AA were greater than 90°. (D, A, H and AA stand for donor, acceptor, hydrogen and acceptor antecedent, respectively.) For aromatic interactions, the angles D-A-AX and H-A-AX (for amino-aromatic interactions) were also required to be less than 20°. (Further details and a list of acceptor and donor atoms can be found at [36].)
The amino acid residue-lipid contacts were further classified into lipid tail and head group contacts. Specifically, the tail group of a lipid was defined as the set of all the atoms from the aliphatic tail to the carbon atom next to the carbonyl group of the fatty acid (or the corresponding carbon atom in a mimetic lipid). The head group of a lipid was defined as all the other atoms. The tail groups are predominantly hydrophobic, while the head groups are hydrophilic.
All contact preferences were measured in terms of a propensity score. First, a propensity score for each of the 20 amino acid residues was computed for each protein. The propensity $P_i$ of residue type i (e.g., LYS; i = 1 ... 20) in a protein was defined as the relative number of residues of type i in contact with lipids, normalized by the overall relative number of residues in contact with lipids:
$$P_i = \frac{N_i^b / N_i}{N_b / N} \qquad (1)$$
where $N_i^b$ is the number of lipid binding amino acid residues of type i, $N_i$ is the total number of amino acids of type i, $N_b$ is the total number of lipid binding residues and $N$ is the total number of amino acid residues. All the counts were made within the given protein sequence. The propensity values range between 0 and ∞. An amino acid propensity value of 1 indicates a neutral preference for binding lipids, while propensity values of <1 and >1 indicate a low and a high preference, respectively.
If a residue type was not represented in a protein chain, its propensity was undefined and excluded from further statistics. If a particular amino acid type was present in the chain but was not binding to lipids, its propensity was 0. Finally, the propensity scores thus computed for each protein chain were averaged over a set of proteins to compare one set (e.g., TM) with another (e.g., non-TM). The standard error of the mean was estimated as $s/\sqrt{n}$, where s is the sample standard deviation and n is the sample size (i.e., the number of protein chains in the set considered, for which the propensity was defined).
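As an illustration of the propensity definition and the averaging over chains described above, the following minimal sketch computes P_i = (N_i^b / N_i) / (N_b / N) for one chain and then the mean and standard error over several chains. The function names and the example counts are hypothetical; undefined propensities are represented as `None`, and the sample standard deviation uses the n - 1 denominator, which is an assumption about the original calculation.

```python
import math
from typing import List, Optional, Tuple


def propensity(n_i_b: float, n_i: float, n_b: float, n: float) -> Optional[float]:
    """Propensity of residue type i in one chain.

    Returns None when residue type i is absent from the chain (undefined),
    and 0.0 when it is present but never in contact with lipids.
    """
    if n_i == 0:
        return None
    if n_b == 0 or n == 0:
        raise ValueError("the chain needs residues and at least one lipid contact")
    return (n_i_b / n_i) / (n_b / n)


def mean_and_sem(values: List[Optional[float]]) -> Tuple[float, float]:
    """Mean and standard error (s / sqrt(n)) over chains with a defined value."""
    defined = [v for v in values if v is not None]
    n = len(defined)
    if n == 0:
        return float("nan"), float("nan")
    mean = sum(defined) / n
    if n < 2:
        return mean, float("nan")
    s = math.sqrt(sum((v - mean) ** 2 for v in defined) / (n - 1))
    return mean, s / math.sqrt(n)


# Hypothetical chain: 200 residues, 40 in contact with lipids,
# 10 TRP residues of which 5 are in contact with lipids.
p_trp = propensity(n_i_b=5, n_i=10, n_b=40, n=200)
print(p_trp)

# Averaging over chains; None marks a chain that lacks the residue type.
print(mean_and_sem([p_trp, None, 1.5, 0.0]))
```

In this example the TRP contact fraction (0.5) is 2.5 times the overall contact fraction (0.2), so the residue type would count as preferred in that chain.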
We derived all the contact statistics from entire protein chains, including the residues in extra-membranous loops, both because lipid-contacting residues were found in the TM helices as well as in the loops and because this allows a natural comparison between the TM and non-TM proteins. Focusing only on the TM regions would not change the overall statistics, as most TM proteins considered had only short loops (with the exception of the MD trajectory data for Ca2+-ATPase, for which the large extra-membranous domain was excluded from the analysis).

Chi-square test and statistical significance
To determine whether a particular amino acid is statistically significantly over- or under-represented in contact with lipid head or tail atoms, we pooled all the contact counts in the TM or non-TM dataset (considering only those proteins with at least six residues forming a given type of contact). The expected number $E_i$ of lipid binding residues of type i in a given dataset was computed as
$$E_i = N_i \, \frac{N_b}{N} \qquad (2)$$
where $N_i$, $N_b$ and $N$ were as above but obtained for the entire dataset. It was then compared with the observed number $O_i$ of lipid binding residues of type i by using a Chi-square statistic:
$$\chi_i^2 = \frac{(O_i - E_i)^2}{E_i} \qquad (3)$$
The calculated $\chi_i^2$ values were converted to p-values using the standard Chi-square table with a single degree of freedom.
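The test described above can be sketched as follows. The expected count is taken as E_i = N_i N_b / N, which is the count that corresponds to a propensity of exactly 1. Instead of a printed Chi-square table, the p-value is computed from the identity that the upper tail of a Chi-square distribution with one degree of freedom equals erfc(sqrt(x / 2)). The function name and the pooled counts in the example are hypothetical.

```python
import math
from typing import Tuple


def chi_square_contact_test(o_i: float, n_i: float, n_b: float,
                            n: float) -> Tuple[float, float, float, str]:
    """Return (expected count, chi-square, p-value, direction) for residue type i.

    o_i : observed number of lipid binding residues of type i (pooled)
    n_i : total number of residues of type i in the dataset
    n_b : total number of lipid binding residues in the dataset
    n   : total number of residues in the dataset
    """
    expected = n_i * n_b / n
    if expected == 0:
        raise ValueError("expected count is zero; the test is undefined")
    chi2 = (o_i - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    if o_i > expected:
        direction = "over-represented"
    elif o_i < expected:
        direction = "under-represented"
    else:
        direction = "neutral"
    return expected, chi2, p_value, direction


# Hypothetical pooled counts: 10,000 residues, 2,000 in contact with lipids,
# 300 residues of type i, 90 of them in contact with lipids.
print(chi_square_contact_test(o_i=90, n_i=300, n_b=2000, n=10000))
```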

Propensity in MD trajectories
To calculate propensity scores from the MD data, a contact was defined using a non-integer value equal to the fraction of the snapshots in which the amino acid residue under consideration was in contact with any lipid molecule. More precisely, the total number $N_b^{(k)}$ of lipid binding counts for the kth amino acid residue in each MD trajectory was defined as
$$N_b^{(k)} = \frac{1}{T} \sum_{t=1}^{T} I_b^{(k)}(t) \qquad (4)$$
where $T$ is the number of snapshots in the trajectory and $I_b^{(k)}(t)$ is 1 if the kth amino acid residue was in contact with any lipid molecule in snapshot t, and 0 for no contact. For example, within a trajectory of 1,000 snapshots, if ARG90 is observed to be interacting with lipids in 300 snapshots, then $N_b^{(k)}$ for ARG90 is 0.3. The total number of lipid binding amino acid residues of type i (i.e., $N_i^b$ in Eq. 1) can then be obtained by summing up these quantities over all the residues of that type (in this example, all the ARG residues).
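A minimal sketch of this fractional counting is given below. Each residue contributes the fraction of snapshots in which it touches any lipid, and these fractions are summed per residue type. The data layout and function names are hypothetical, and the per-snapshot contact indicators are assumed to have been computed beforehand.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical layout: (residue type, residue number) -> list of 0/1 indicators,
# one per snapshot (1 = in contact with any lipid molecule in that snapshot).
ContactSeries = Dict[Tuple[str, int], List[int]]


def fractional_contact(indicators: List[int]) -> float:
    """Fraction of snapshots in which the residue is in contact with lipids."""
    return sum(indicators) / len(indicators)


def binding_counts_by_type(series: ContactSeries) -> Dict[str, float]:
    """Sum the fractional contacts over all residues of each type."""
    totals: Dict[str, float] = defaultdict(float)
    for (res_type, _res_number), indicators in series.items():
        totals[res_type] += fractional_contact(indicators)
    return dict(totals)


# The ARG90 example from the text: contact in 300 of 1,000 snapshots.
trajectory = {
    ("ARG", 90): [1] * 300 + [0] * 700,
    ("ARG", 120): [1] * 1000,  # hypothetical second ARG, always in contact
    ("LEU", 15): [0] * 1000,   # hypothetical residue, never in contact
}
print(fractional_contact(trajectory[("ARG", 90)]))
print(binding_counts_by_type(trajectory))
```

The per-type totals can then be used in place of the integer contact counts in the propensity calculation.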

Lipophilicity scales of amino acids
Comparisons were made between the lipid propensity scores of residues derived from the TM and MD datasets and the thermodynamic free energy of transferring amino acid residues from water to the interface of POPC bilayer and to octanol. The latter (called the lipophilicity scales in this paper) was taken from the data provided in White and Wimley's paper [37]. For the lipophilicity scales, we kept the protonation states of ARG and LYS positive, ASP and GLU negative and HIS neutral.

Correlation between propensity values of two datasets
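Comparisons between residue preferences were made using scatterplots and Pearson's correlation coefficient, defined as
$$C = \frac{\sum_{i} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i} (X_i - \bar{X})^2 \, \sum_{i} (Y_i - \bar{Y})^2}} \qquad (5)$$
where $X_i$ and $Y_i$ represent propensity (or lipophilicity) values of residue type i in the two datasets being compared, and $\bar{X}$ and $\bar{Y}$ are their means over all residue types.
The jackknife estimate of the standard error of the correlation coefficient was obtained from
$$\sigma_C^2 = \frac{N-1}{N} \sum_{i=1}^{N} \left( C_{(-i)} - \langle C \rangle \right)^2 \qquad (6)$$
where $C_{(-i)}$ is the correlation coefficient calculated from data with the ith amino acid type removed and $\langle C \rangle$ is the mean of N (= 20) such values. The square root of the quantity in Eq. 6 was shown as the estimated standard error.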
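The correlation and its leave-one-out standard error can be sketched as below. The jackknife variance is taken in its standard form, (N - 1)/N times the sum of squared deviations of the leave-one-out correlations from their mean; that this matches the original calculation exactly is an assumption. The function names and example values are hypothetical.

```python
import math
from typing import List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson's correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def jackknife_se(x: List[float], y: List[float]) -> float:
    """Jackknife standard error of the correlation, removing one item at a time."""
    n = len(x)
    # Correlation with the i-th residue type left out, for every i.
    leave_one_out = [
        pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) for i in range(n)
    ]
    mean_c = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((c - mean_c) ** 2 for c in leave_one_out)
    return math.sqrt(variance)


# Hypothetical values for five residue types in two datasets.
set_a = [1.0, 2.0, 3.0, 4.0, 5.0]
set_b = [1.0, 3.0, 2.0, 5.0, 4.0]
print(pearson(set_a, set_b), jackknife_se(set_a, set_b))
```

In the paper's setting the two lists would each hold 20 values, one per amino acid type.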

Amino acid propensities from the crystal structure and MD datasets
Amino acid propensities of membrane proteins for contacting lipid head and tail groups were derived from both crystal structures and MD simulations. Figure 1 shows scatterplots between the propensities from the crystal structure and MD datasets. The correlation coefficients between these two were 0.81 and 0.95 for the lipid head and tail group contacts, respectively (see also Tables 3 and 4). Although good agreement was observed for both the lipid head and tail group contacts, some points in the plot for the head group contacts do not lie close to a straight line (Figure 1a), especially when compared with the plot for the tail group contacts (Figure 1b). When the outliers in the head group plot (TRP, ARG, LYS) were removed, the correlation coefficient rose to 0.88, a value close to that of the tail group without TRP (0.90) (see also Additional file 1, Fig. S1).
The contact preferences for lipid head groups had larger variance among individual proteins than for tail groups (see Table 3 and the Discussion section below). Thus, two of the outliers, LYS and ARG, may be due to the small number of proteins in the MD dataset; ttSecYE had more ARG residues than the average in the crystal structure dataset [16], while mjSecYEβ had more LYS residues than the average. All these residues clustered in the membrane interfaces, especially on the cytoplasmic side. Such a bias would have resulted in the higher head propensities of LYS and ARG in the MD dataset, although further analysis is needed to confirm this notion. Particularly high propensities of TRP were observed in both scatterplots, suggesting that TRP residues are more frequently located in the regions that allow direct contacts with lipid molecules than in other regions (see Discussion below).

Specific observations for each amino acid residue
Here, we describe the lipid head and tail group preferences of each amino acid residue observed in both the crystal structure and MD datasets (Table 3).
Only TRP and TYR were favored by both the lipid head and tail groups. These residues, with their amphiphilic nature, play a special role in the membrane-water interfaces. The small residues (GLY, SER, THR, ALA, PRO) were excluded from both lipid head and tail groups. Our previous study showed that the propensities of the small residues are low on the protein surface in the TM region and around the membrane interfaces, and high in the buried positions [16]. These residues are thought to stabilize inter-helical contacts through non-conventional hydrogen bonds (Cα-H...O) [16,38]. The acidic residues (ASP, GLU), but not the basic ones (HIS, ARG, LYS), were also excluded from both lipid head and tail groups, consistent with the tendency of basic residues to occur favorably on the surface of the intracellular interface [16] (the positive-inside rule [39]).
For lipid head group contacts, hydrophilic residues, both basic (HIS, ARG, LYS) and uncharged polar (ASN, GLN), were favored, except for the small (SER, THR) and acidic (ASP, GLU) ones. TRP and TYR were the only hydrophobic residues favored by lipid head groups. For lipid tail group contacts, no hydrophilic residues were favored.
Table 4. All-against-all correlation coefficients between the properties presented in Table 3. Values in parentheses represent the standard error of the correlation (see Methods). a The protonation state of HIS was taken as neutral; all ARG and LYS were taken as positively charged and all ASP and GLU as negatively charged.

Comparison with the lipophilicity scales
We then compared the amino acid propensities with the experimentally determined lipophilicity scales, which were derived from transfer free energies of model peptides from water to POPC membrane interface and to bulk octanol [37]. (The correlation coefficients were calculated by using the raw values of the amino acid propensities and the lipophilicity scales, as described in Methods.) The amino acid propensities and the lipophilicity scales are summarized in Table 3, and a comprehensive list of correlation coefficients between the three sets of values is shown in Table 4. The propensities for the tail group atomic contacts, derived from both the crystal structure and MD datasets, were highly correlated with the lipophilicity scales (with the correlation ranging from 0.75 to 0.87, Figure 2). However, the propensities for the head group atomic contacts were poorly correlated with the lipophilicity scales (with the correlation ranging from 0.06 to 0.28). This observation suggests that the lipid tail group propensities can be largely described by the free energy of transfer of model peptides.

Comparison with non-TM data
Amino acid propensities for contacting lipids were also derived from a set of non-TM proteins and compared with those derived from the TM dataset. A summary of the Chi-square statistics for lipid contacts of all 20 amino acid residues in the TM and non-TM proteins is presented in Table 5.
Despite some small differences in the degree of preference (e.g., ASN contacts with lipid head groups being statistically significant only in the TM dataset), no amino acids were exclusively preferred in either dataset. Out of the 40 comparisons in Table 5 (for 20 amino acids in each type of contacts), only two occurrences were found such that the number of observed contacts was higher than expected in TM and lower than expected in non-TM or vice versa (GLY for the head group contacts and CYS for the tail group contacts).
To summarize, we found that an almost identical set of amino acids was used to form lipid contacts in the TM and non-TM proteins, with only small differences in the statistical significance of over- or under-representation.
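The comparison of directions between the two datasets (observed count above the expected count in one dataset and below it in the other) amounts to a simple sign check, sketched below. The function name and all counts are hypothetical placeholders and do not reproduce the values in Table 5.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: residue type -> (observed contacts, expected contacts).
Counts = Dict[str, Tuple[float, float]]


def discordant_residues(tm: Counts, non_tm: Counts) -> List[str]:
    """Residue types over-represented in one dataset and under-represented in the other."""

    def sign(observed: float, expected: float) -> int:
        # +1 if observed > expected, -1 if observed < expected, 0 if equal.
        return (observed > expected) - (observed < expected)

    return sorted(
        res
        for res in tm.keys() & non_tm.keys()
        if sign(*tm[res]) * sign(*non_tm[res]) < 0
    )


# Placeholder counts, for illustration only.
tm_counts = {"ALA": (30, 40.0), "TRP": (50, 20.0), "LEU": (100, 90.0)}
non_tm_counts = {"ALA": (12, 10.0), "TRP": (15, 6.0), "LEU": (40, 45.0)}
print(discordant_residues(tm_counts, non_tm_counts))  # ['ALA', 'LEU']
```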

Discussion
We showed that the patterns of membrane protein-lipid interactions obtained from both the crystal structures and MD trajectories were highly correlated with each other (Figure 1). We also showed that the recognition of lipid tail groups by amino acid residues can be described by the lipophilicity scales (Table 4) and showed the same tendency in non-TM proteins (Table 5), while lipid head groups demonstrated considerable variation among individual proteins. We discuss here how our observations can be associated with existing experimental data and previously proposed concepts concerning protein-lipid interactions. We also elaborate on the high propensities of TRP residues for the membrane protein-lipid interface.

Relation of amino acid propensities to lipid-membrane protein interaction
Since membrane proteins are generally crystallized with detergent molecules used for solubilization and purification, the lipid molecules that remain in the crystal are considered to be those that are tightly bound to the membrane proteins. On the other hand, the lipid molecules in the first shell, also known as the annular shell around a membrane protein, are in direct contact with the protein and form weak and non-specific interactions according to spin-label EPR and fluorescence quenching experiments [40,41]. Thus, intuitively, the amino acid propensities from the crystal structures should correspond to propensities for interacting with tightly-bound lipid molecules, while those from the MD trajectories should correspond to propensities for interacting weakly with lipid molecules in the annular shell (although some of these lipid molecules can be tightly bound). It is, therefore, non-trivial that we have observed such a high level of correlation between the propensities derived from these two datasets (Figure 1). Assuming that the tight binding of lipids is achieved by forming a special binding pocket on the surface of a protein, the amino acid composition of such binding pockets appears to be no different from that of other surface positions. This result implies that no special chemical interaction is required for achieving the tight binding of at least the tail portion of lipid molecules, although transmembrane helix packing may create a binding pocket for specific lipid types required for the protein's function. Experimental studies of the potassium channel KcsA [4,42] suggest that the tightly-bound lipids can be essential for its stability and function. The amino acid residues that interact with these tightly-bound lipids must have been selected during the course of evolution.
However, our results suggest that these amino acids have been selected not necessarily based on their ability to form special chemical interactions with lipid tails but rather, they are general lipid-binding surface amino acids and happened to have been utilized for offering a physical basis of strong interaction.
For the head group contacts, although the TM and non-TM datasets produced a similar trend (Table 5), a weaker correlation was observed between the propensities derived from the crystal structure and MD datasets than that for the tail group contacts (Figure 1). The difference between the head and tail contacts may be attributable to the larger standard error for the propensities for the head contacts (Table 3). The propensity values were computed for each protein and then averaged and thus, the larger standard error indicates a larger variance among the propensity values derived from different proteins. Indeed, a variety of modes of interaction have been observed between the protein and lipid head groups in our dataset. Head groups of lipids often show disorder in high-resolution X-ray structures even when their tail groups are observed [40,43]. In our dataset, the head groups of tightly-bound lipids were completely or mostly disordered in rhodopsin (1gzm_A), sensory rhodopsin (1xio_A), succinate:ubiquinone oxidoreductase SQR (2h89_C) and halorhodopsin (3a7k_A); and fully or partially observed but not forming any hydrogen bond in bacteriorhodopsin (1x0i_1), SQR (1zoy_D), V-Type Na+-ATPase (2bl2_I) and ligand-gated ion channel GLIC (3eam_C). In other cases, the head groups appeared and formed hydrogen bonds, while the tail groups were disordered in Ca2+-ATPase (2eau_A), rhomboid protease GlpG (2irv_B), potassium channel Kir (2wll_D) and nitrate reductase A NarGHI (3egw_C).
Experimental studies have shown that differences in the chemical composition of the lipid head group affect the stability and function of membrane proteins, including KcsA, MscL, Ca2+-ATPase and others. Considering all these observations, the role of lipid head-protein interactions is likely to vary among different types of membrane proteins and this notion is consistent with the head contact propensities obtained in this paper, which were diverse and more complex than the tail contact propensities.

Concentration of TRP at a lipid-water interface for anchoring the protein to the membrane
In both the crystal structure and MD datasets, we observed a conspicuously high propensity of TRP residues for contacting lipid molecules (Figure 1), indicating that TRP favors positions in a membrane protein that allow interaction with lipids.
Although TRP is generally not an abundant residue, either in membrane or soluble proteins [16], TRP has been reported to occur frequently near the membrane boundaries [44][45][46], as confirmed by our recent statistical analysis [16]. Systematic experimental studies using model peptides and proteins have also produced a similar picture [47][48][49][50]. (See Killian and von Heijne [51] for a review; examples of high-resolution structures can be found in Lee [40].) The amphiphilic nature of TRP (and also TYR) residues explains why TRP tends to be located at a water-lipid interface; these amphiphilic residues are thought to lock the membrane protein into the correct location and orientation, like anchors or floats at the membrane-water interface. Sansom and colleagues have observed the interfacial anchoring behavior of the amphiphilic residues in their MD simulations of both the outer membrane protein OmpA and the potassium channel KcsA [18].
All indications are that the significantly high propensities in Figure 1 were obtained as a consequence of the combined effect of the general low abundance and the amphiphilic nature of TRP.

Conclusions
We analyzed lipid preferences of membrane proteins at atomic resolution, which were divided into those for lipid head and tail groups, by using a combination of data from crystal structures and MD simulations. The results revealed a common pattern of lipid tail-amino acid interactions in both datasets, suggesting that tightly-bound lipid molecules and lipids in the annular shell interact with membrane proteins in a similar manner, largely explained by general lipophilicity. On the other hand, lipid head-amino acid interactions showed a more complicated and variable pattern and are likely to affect the specific function of individual proteins. We also showed that TM and non-TM proteins utilize essentially an identical set of amino acids for interacting with lipid head and tail groups.

Additional material
Additional file 1: This file includes Figure S1.