Boundless Bio, San Diego, California, United States
Background: Poor solubility and low bioavailability are major hurdles in small-molecule drug discovery, so selecting an appropriate formulation is essential for successful development. Traditional trial-and-error approaches often yield suboptimal results. Here we introduce a machine learning model that predicts optimal formulation classes for small molecules and streamlines formulation selection. Methods: Preclinical formulations for intraperitoneal, subcutaneous, and oral routes were compiled for a total of 484 small molecules (molecular weight < 1000 Da) from internal sources and PharmaPendium (2024). The molecules were featurized using Morgan fingerprints converted to embeddings via a pretrained Mol2vec model, alongside physicochemical properties, including logP and pKa. A random forest model was trained to predict the optimal formulation class: cyclodextrin, cellulose, PEG, aqueous/buffer, or oils. Formulation appearance and suitability for dosing were experimentally validated using internal compounds and approved drugs. Results: Performance varied across formulation classes, with cyclodextrin and oil-based formulations showing the highest area under the receiver operating characteristic curve (AUROC) values of 0.87 and 0.74, respectively. The other classes exhibited AUROC values from 0.64 to 0.68, likely due to class imbalance. Overall, the model achieved a micro-averaged one-vs-rest AUROC of 0.82 and a top-3 accuracy of 90.2%, indicating reliable rank-ordering of appropriate formulations. Experimental validation using internal compounds and approved drugs showed that model-guided formulation selection consistently outperformed the common cellulose-based standard formulation (e.g., 0.5% methylcellulose/0.2% Tween) in terms of solubility. Conclusion: Machine learning-based models can streamline formulation selection in early-stage drug discovery by reducing trial-and-error, saving time and resources, and enhancing decision-making. Future work will focus on improving class balance through dataset expansion and developing a similar model to guide selection of clinical formulations.
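The two summary metrics reported in the Results (micro-averaged one-vs-rest AUROC and top-3 accuracy) can be illustrated with a minimal pure-Python sketch. The abstract does not state how the metrics were implemented, so the function names, the toy labels, and the toy class scores below are all hypothetical and serve only to show how each metric is defined; they are not the study's data or code.

```python
# Illustrative sketch only: toy data and helper names are assumptions,
# not the data or implementation used in the study.

# Formulation classes named in the abstract; the index order is arbitrary.
CLASSES = ["cyclodextrin", "cellulose", "PEG", "aqueous/buffer", "oils"]


def top_k_accuracy(y_true, scores, k=3):
    """Fraction of samples whose true class is among the k highest-scored classes."""
    hits = 0
    for label, row in zip(y_true, scores):
        # Class indices sorted from highest to lowest predicted score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(y_true)


def binary_auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random
    negative, with ties counted as one half (Mann-Whitney formulation)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def micro_ovr_auroc(y_true, scores):
    """Micro-averaged one-vs-rest AUROC: one-hot encode the labels, pool every
    (sample, class) pair into a single binary problem, and compute one AUROC."""
    flat_labels, flat_scores = [], []
    for label, row in zip(y_true, scores):
        for j, s in enumerate(row):
            flat_labels.append(1 if j == label else 0)
            flat_scores.append(s)
    return binary_auroc(flat_labels, flat_scores)


# Hypothetical predicted class probabilities for four molecules.
toy_y_true = [0, 1, 4, 3]
toy_scores = [
    [0.60, 0.10, 0.10, 0.10, 0.10],
    [0.30, 0.40, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.15, 0.10, 0.05],
    [0.10, 0.20, 0.30, 0.25, 0.15],
]

if __name__ == "__main__":
    print("top-3 accuracy:", top_k_accuracy(toy_y_true, toy_scores, k=3))
    print("micro OvR AUROC:", micro_ovr_auroc(toy_y_true, toy_scores))
```

Because micro-averaging pools all sample-class pairs, frequent classes dominate the result, which is one way an overall AUROC of 0.82 can coexist with per-class values as low as 0.64 for under-represented classes.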