The Research Collaboratory for Structural Bioinformatics (RCSB) database holds over 30,000 protein structures, with thousands more added annually. Small-molecule databases such as ZINC and PubChem are growing in parallel, which raises the question of how best to exploit both resources in drug discovery.
Computational ligand docking programs are now fundamental tools in drug discovery. Their search methods fall into two categories: stochastic (e.g., AutoDock2, DockVision, GOLD) and systematic (e.g., FlexX, DOCK, FLOG). All of them depend on the accuracy of their scoring functions, which limits how well they can estimate relative binding affinities.
This review focuses on the eHiTS docking tool and presents results from a validation study involving 1626 protein-ligand complexes. It also examines the scoring function problem and discusses a novel approach introduced in the latest version of eHiTS.
The eHiTS Method
eHiTS adopts a distinct strategy, both in its docking algorithm and in its scoring function. It splits each ligand into rigid fragments and flexible chains, then systematically docks every rigid fragment independently. A hypergraph matching algorithm determines which fragment poses are compatible with one another, yielding multiple candidate binding poses. The best pose set is selected on the combined score of all its fragments rather than fragment by fragment. Flexible chains are then fitted to the chosen rigid fragment poses, forming an initial ligand conformation.
Geometric Shape and Chemical Feature Graph
eHiTS fragments each ligand to separate rigid fragments from flexible linkers. It identifies rotatable bonds, builds simplified geometric hulls around the resulting molecular fragments, and associates chemical properties with the hull vertices.
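The fragmentation step can be illustrated with a minimal Python sketch. It assumes the rotatable bonds are already known and treats each connected group of atoms left after cutting those bonds as one rigid fragment. The function name and the input format are hypothetical, and the sketch does not build geometric hulls or separate flexible chains the way eHiTS does.

```python
from collections import defaultdict

def split_rigid_fragments(num_atoms, bonds, rotatable):
    """Return rigid fragments as sorted lists of atom indices.

    bonds: iterable of (i, j) atom-index pairs.
    rotatable: set of frozenset({i, j}) bonds assumed to be rotatable.
    """
    adjacency = defaultdict(set)
    for i, j in bonds:
        if frozenset((i, j)) in rotatable:
            continue  # cutting a rotatable bond separates two rigid pieces
        adjacency[i].add(j)
        adjacency[j].add(i)

    seen = set()
    fragments = []
    for start in range(num_atoms):
        if start in seen:
            continue
        # Depth-first walk over the remaining (non-rotatable) bonds.
        stack, component = [start], []
        seen.add(start)
        while stack:
            atom = stack.pop()
            component.append(atom)
            for neighbour in adjacency[atom]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        fragments.append(sorted(component))
    return fragments

# Toy molecule: a three-atom ring (0, 1, 2) joined by a rotatable bond to atoms 3 and 4.
toy_bonds = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]
print(split_rigid_fragments(5, toy_bonds, {frozenset((2, 3))}))
# -> [[0, 1, 2], [3, 4]]
```

In a real pipeline the rotatable-bond set would come from chemical perception of the ligand; here it is simply passed in.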
Rigid Fragment Docking
Each rigid fragment is systematically docked within the receptor cavity. The interaction between rigid fragments and the receptor is scored based on a matrix of interaction pairs, computing a pose’s score by summing applicable scores of interacting surface points. This approach allows for information reuse when identical rigid fragments appear in different ligands.
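A pair-matrix score of this kind can be sketched as follows. The surface point types, the score values, and the distance cutoff are all illustrative assumptions, not eHiTS parameters; the sketch only shows the summation over interacting surface point pairs.

```python
import math

# Hypothetical pair scores: (ligand point type, receptor point type) -> score.
PAIR_SCORES = {
    ("donor", "acceptor"): -2.0,
    ("acceptor", "donor"): -2.0,
    ("hydrophobic", "hydrophobic"): -0.5,
    ("donor", "donor"): 1.0,
}

def pose_score(ligand_points, receptor_points, cutoff=4.0):
    """Sum pair scores over all ligand/receptor surface points within cutoff.

    Each point is a (type, (x, y, z)) tuple. Pairs with no matrix entry
    contribute nothing.
    """
    total = 0.0
    for l_type, l_xyz in ligand_points:
        for r_type, r_xyz in receptor_points:
            if math.dist(l_xyz, r_xyz) <= cutoff:
                total += PAIR_SCORES.get((l_type, r_type), 0.0)
    return total

ligand = [("donor", (0.0, 0.0, 0.0)), ("hydrophobic", (5.0, 0.0, 0.0))]
receptor = [
    ("acceptor", (2.0, 0.0, 0.0)),
    ("hydrophobic", (6.0, 0.0, 0.0)),
    ("donor", (20.0, 0.0, 0.0)),
]
print(pose_score(ligand, receptor))  # -> -2.5
```

Because the score depends only on the fragment pose and the receptor, the result for a given rigid fragment does not change when that fragment appears in another ligand.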
Pose Matching
Pose matching selects sets of fragment poses whose mutual distances are compatible. A graph is constructed in which each node is a fragment pose and an edge joins two nodes whose poses satisfy the compatibility conditions. The maximal cliques of this graph represent unique docking solutions.
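The clique search can be sketched in Python with the basic Bron-Kerbosch recursion. This is a generic illustration, not the eHiTS implementation: the compatibility test (two poses of different fragments whose anchor points lie within `max_gap` of each other) and all names are assumptions made for the example.

```python
import math
from itertools import combinations

def build_compatibility_graph(poses, max_gap):
    """poses: {name: (fragment_id, (x, y, z))}.

    Connect poses of different fragments whose anchor points are at most
    max_gap apart (an assumed, simplified compatibility condition).
    """
    adjacency = {name: set() for name in poses}
    for a, b in combinations(poses, 2):
        frag_a, xyz_a = poses[a]
        frag_b, xyz_b = poses[b]
        if frag_a != frag_b and math.dist(xyz_a, xyz_b) <= max_gap:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

def maximal_cliques(adjacency):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for node in list(candidates):
            expand(current | {node},
                   candidates & adjacency[node],
                   excluded & adjacency[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return sorted(cliques)

poses = {
    "A1": (0, (0.0, 0.0, 0.0)),
    "A2": (0, (10.0, 0.0, 0.0)),
    "B1": (1, (1.0, 0.0, 0.0)),
    "B2": (1, (11.0, 0.0, 0.0)),
    "C1": (2, (0.0, 1.0, 0.0)),
}
graph = build_compatibility_graph(poses, max_gap=2.0)
print(maximal_cliques(graph))
# -> [['A1', 'B1', 'C1'], ['A2', 'B2']]
```

Only the first clique contains a pose for every fragment, so only it would correspond to a complete ligand placement in this toy case.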
Flexible Chain Fitting
Flexible chains that connect rigid fragments are then fitted. Since two atom positions at the chain ends are fixed, the optimal chain configuration is chosen from a lookup table. A deterministic minimization approach fine-tunes the chain’s configuration to achieve precise endpoint alignment.
Scoring Functions Overview
Predicting binding affinities of ligands within a receptor is challenging but crucial for virtual ligand screening. Various scoring functions have been developed to balance computational time and result accuracy. These scoring functions fall into three categories: force-field based, empirical, and knowledge-based.
Knowledge-Based Scoring Functions
These scoring functions use statistics from experimentally determined protein-ligand complexes to extract rules about preferred and non-preferred atomic interactions. They are designed to reproduce binding poses rather than binding energies. Examples include PMF, DrugScore, and SMoG.
Empirical Scoring Functions
Empirical scoring functions sum a set of functions parameterized to fit experimental data, such as binding energies. They approximate binding energies by combining individual terms representing various interactions like Van der Waals forces, electrostatics, and hydrogen bonds. Examples include ChemScore, LUDI, F-Score, SCORE, X-Score, and Fresno.
Force-Field-Based Scoring Functions
Similar to empirical scoring functions, force-field-based scoring functions predict ligand binding energies by summing individual contributions from different types of interactions. However, they use interaction terms derived from physical-chemical phenomena rather than experimental affinities. Examples include D-Score, G-score, GOLD, AutoDock, and DOCK.
eHiTS Scoring Approach
The eHiTS program employs its own scoring function, eHiTS_Score: a statistically derived empirical function that also takes temperature factors from crystal structures into account. It scores surface point interactions between ligand fragments and the receptor binding site, describing each interaction with four geometric variables (distance, angles, and torsions).
Training the Scoring Function
The eHiTS_Score is trained using interaction statistics collected from high-resolution protein-ligand complex structures. These statistics are used to generate a 4D probability field that forms the basis of the scoring function. Interaction probabilities are converted into energies using scaling factors, and various terms such as steric clash, depth, and entropy loss are included in the final scoring function.
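The text does not give the conversion formula. One common way to turn an observed interaction probability into an energy-like term, used in many knowledge-based potentials, is an inverse-Boltzmann relation. The sketch below uses that relation purely as an assumption; the function name, the reference probability, the scaling factor, and the probability floor are all hypothetical.

```python
import math

def probability_to_energy(p_observed, p_reference, scale=1.0, floor=1e-6):
    """Inverse-Boltzmann style conversion (assumed form, not from eHiTS).

    Interactions seen more often than the reference get negative (favourable)
    energies; rarer ones get positive energies. The floor keeps unobserved
    interactions from producing an infinite penalty.
    """
    p = max(p_observed, floor)
    return -scale * math.log(p / p_reference)

favourable = probability_to_energy(0.8, 0.5)
unfavourable = probability_to_energy(0.1, 0.5)
print(favourable < 0 < unfavourable)  # -> True
```

In a full scoring function, terms such as steric clash, depth, and entropy loss would be added to the sum of these interaction energies, each with its own weight.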
Family-Based Scoring
The eHiTS method introduces family-based scoring functions. Receptors are clustered into families based on their active site properties and residues. Different weight sets are used for each family to improve accuracy. Training is automated, and family coverage scoring is incorporated.
Training Dataset
For training, 1315 receptor-ligand complexes are used, with 71 families and one default weight set. The training set includes both clustered and singleton complexes.
Reusing Docking Results via Database
eHiTS docks rigid fragments of ligands independently throughout a receptor site, allowing the docking information to be reused for subsequent ligands if the fragment is repeated. By storing docking information in a database, eHiTS can read this data instead of recalculating it, saving time. The database’s effectiveness was tested by screening 5000 ligands against an estrogen receptor (pdb code 1ERR). The use of the database significantly reduced the time taken to dock each ligand, and the speed-up levelled off as the database was populated with docking results for repeating rigid fragments.
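The reuse idea can be illustrated with a small in-memory cache keyed by receptor and fragment. This is only a sketch under assumed names: `FragmentPoseCache`, the fragment keys, and the stand-in docking function are hypothetical, and a real system would persist the results in a database rather than a dictionary.

```python
class FragmentPoseCache:
    """Store docking results per (receptor, fragment) so they are computed once."""

    def __init__(self, dock_fragment):
        self._dock_fragment = dock_fragment  # expensive function, supplied by the caller
        self._store = {}
        self.hits = 0
        self.misses = 0

    def poses_for(self, receptor_id, fragment_key):
        key = (receptor_id, fragment_key)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._dock_fragment(receptor_id, fragment_key)
        return self._store[key]

def fake_dock(receptor_id, fragment_key):
    # Stand-in for the expensive rigid-fragment docking step.
    return [f"{fragment_key}@{receptor_id}:pose{i}" for i in range(2)]

cache = FragmentPoseCache(fake_dock)
ligands = [["phenyl", "amide"], ["phenyl", "piperidine"], ["phenyl", "amide"]]
for fragments in ligands:
    for fragment in fragments:
        cache.poses_for("1ERR", fragment)
print(cache.misses, cache.hits)  # -> 3 3
```

As more ligands are screened, the share of lookups answered from the store grows until most common fragments are covered, which is consistent with the speed-up levelling off.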
Pocket Detection
Identifying binding sites in proteins is crucial for virtual ligand screening (VLS). The eHiTS program employs an automatic pocket detection algorithm based on pocket-depth values associated with 3D grid cells around the receptor structure. The grid is generated within a 3D bounding box, and the algorithm calculates depth values for surface points using a multi-phase flood-fill approach. A negative flood determines depth values for the pockets, allowing identification of the deepest pocket. The algorithm is fast and does not rely on chemical perception, enabling eHiTS to be used for new therapeutic targets whose active sites are unknown.
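A much simplified version of the depth calculation is sketched below. It works on a 2D grid instead of a 3D one and uses a single breadth-first flood from the edge of the bounding box, so it stands in for, but does not reproduce, the multi-phase flood-fill described above. The grid encoding and function names are assumptions for illustration.

```python
from collections import deque

def depth_map(grid):
    """grid: list of equal-length strings, '#' = protein, '.' = empty.

    Returns {(row, col): depth} for every empty cell reachable from the box
    edge, where depth is the number of steps from the edge.
    """
    rows, cols = len(grid), len(grid[0])
    depth = {}
    queue = deque()
    for r in range(rows):
        for c in range(cols):
            on_edge = r in (0, rows - 1) or c in (0, cols - 1)
            if on_edge and grid[r][c] == ".":
                depth[(r, c)] = 0
                queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == "." and (nr, nc) not in depth:
                depth[(nr, nc)] = depth[(r, c)] + 1
                queue.append((nr, nc))
    return depth

def deepest_cell(grid):
    """Return the empty cell with the largest depth value."""
    depth = depth_map(grid)
    return max(depth, key=depth.get)

protein = [
    ".......",
    ".##.##.",
    ".##.##.",
    ".##.##.",
    ".#####.",
    ".......",
]
print(deepest_cell(protein), depth_map(protein)[(3, 3)])  # -> (3, 3) 3
```

Cavities that are completely enclosed by protein cells are never reached by this flood and therefore receive no depth value in the sketch.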
Importance of Depth Values
The depth values computed by the algorithm for points within the pocket volume are essential for subsequent scoring. These values help determine the correct binding site and contribute to accurate ligand docking predictions.
Application in Drug Discovery
The pocket detection algorithm is valuable in drug discovery, especially for new targets with unknown active sites. With the rapid growth of determined protein structures but limited co-crystallized complex structures, automatic pocket detection enables the identification of potential ligands tailored to the protein’s cavity.
Automatic Protonation State Evaluation
Protonation states of ligands and receptors are crucial for accurate binding predictions, because different protonation states can lead to different binding poses. Many docking programs nevertheless overlook this issue and require users to define protonation states beforehand. eHiTS instead systematically evaluates all possible protonation states of the receptor and the ligand for each receptor-ligand pair. It uses ambiguous property flags to mark positions that could be either protonated or deprotonated, and it evaluates and scores each state during the docking process. This lets eHiTS determine the best protonation state for each interaction without enumerating every combination of states as a separate docking run.
Validation Study
For validation, a diverse set of 1626 complexes was selected, comprising a training set (884 complexes) and a test set (742 complexes). The training set consists of drug-like ligands split from their receptors and was used to train the eHiTS scoring function. The test set, collected from the Protein Data Bank (PDB), contains complexes whose ligands were not part of the training set. No manual preprocessing was performed on the ligands or receptors; eHiTS handled protonation states, cofactors, solvent molecules, and similar details automatically.
The ligands from the test set were docked into their protein binding sites, and the accuracy was measured using the root-mean-squared deviation (RMSD) between the predicted ligand pose and the crystal structure. Results were reported for the top-ranked pose and the best-found pose. The study showed that the family-based approach produced better results than a globally trained scoring function. The family-recognized complexes docked with higher accuracy, demonstrating the effectiveness of the approach.
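The RMSD measure and the two reported figures (top-ranked pose and best-found pose) can be sketched as follows. The sketch assumes that atoms are listed in the same order in both poses and that no superposition is applied, since docked poses and the crystal ligand share the receptor's coordinate frame. The function names are hypothetical.

```python
import math

def rmsd(predicted, reference):
    """Root-mean-squared deviation between two equally ordered atom lists."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("poses must be non-empty and have the same atom count")
    squared = sum(
        (p - q) ** 2
        for atom_p, atom_r in zip(predicted, reference)
        for p, q in zip(atom_p, atom_r)
    )
    return math.sqrt(squared / len(predicted))

def top_and_best_rmsd(ranked_poses, crystal):
    """Return (RMSD of the top-ranked pose, lowest RMSD among all poses)."""
    values = [rmsd(pose, crystal) for pose in ranked_poses]
    return values[0], min(values)

crystal = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ranked = [
    [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)],  # ranked first by the scoring function
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],  # ranked second, but closest to the crystal
]
print(top_and_best_rmsd(ranked, crystal))  # -> (2.0, 0.0)
```

A gap between the two numbers, as in this toy case, indicates that the search found a good pose but the scoring function did not rank it first.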
Virtual Ligand Screening Study
To demonstrate the impact of a family-based trained scoring function on virtual ligand screening, data from Cummings et al. were used. That study compared the screening performance of four docking tools on five target proteins. Because the data for two of the targets were unavailable, they were excluded, and the screening was conducted on the remaining three: human immunodeficiency virus protease (HIV-Pr), protein tyrosine phosphatase 1B (PTP1B), and thrombin.
Two screening runs were performed for each protein target. The first run used the globally trained eHiTS_Score scoring function (labelled “eHiTS unbiased” in the plots). The second run used the family-based scoring function specific to each identified receptor (labelled “eHiTS Family” in the plots). The family-trained scoring function outperformed the globally trained one, achieving higher enrichment on the targets tested. It better captured the characteristics of the receptor binding sites, which led to improved screening outcomes.
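The text does not state which enrichment measure was plotted. A widely used one is the enrichment factor at a given fraction of the ranked database, and the sketch below computes it under that assumption. The function name and the toy ranking are hypothetical.

```python
def enrichment_factor(ranked_is_active, fraction):
    """Enrichment factor for the top `fraction` of a ranked screening list.

    ranked_is_active: booleans in ranking order (best score first), True for
    known actives. A value of 1.0 corresponds to random selection.
    """
    n_total = len(ranked_is_active)
    n_actives = sum(ranked_is_active)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need a non-empty list with at least one active")
    n_top = max(1, round(n_total * fraction))
    hits_top = sum(ranked_is_active[:n_top])
    return (hits_top / n_top) / (n_actives / n_total)

# Ten compounds, three actives, two of them ranked in the top 20%.
ranking = [True, True, False, True, False, False, False, False, False, False]
print(round(enrichment_factor(ranking, 0.2), 2))  # -> 3.33
```

Comparing this value for the two scoring functions at the same fraction gives a single-number summary of what an enrichment plot shows.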
Discussion
The eHiTS docking program has been demonstrated to accurately reproduce bound ligand poses from X-ray structures across various receptor families. The innovative approach to the scoring function appears promising, with the potential for further improvement and customization to different receptor or ligand types.
The algorithm used by eHiTS, which docks rigid fragments independently within receptor sites, eliminates seed bias and allows for reusing docking information in subsequent runs, improving docking speed. The automatic handling of protonation states enables testing of all possible states for receptor-ligand pairs, leading to the reporting of the most appropriate form based on the scoring function. Overall, eHiTS offers a distinct docking experience with full automation, minimal preparation requirements for receptor and ligand structures, and accurate results thanks to a highly specific family-trained scoring function. Academic users can access eHiTS for free. Further exploration and analysis are necessary to fully understand the capabilities and limitations of the new scoring function.