Sitemap
A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.
Pages
Posts
portfolio
AutoPeptideML
Computational tool for building ML models for predicting peptide bioactivity automatically (https://github.com/IBM/AutoPeptideML).
BioBrigit
Hybrid machine learning and knowledge-based approach for the prediction of metal diffusion pathways through proteins (https://github.com/insilichem/BioBrigit).
Hestia-GOOD
Open source library for evaluating machine learning models in out-of-distribution generalization (https://github.com/IBM/Hestia-GOOD).
publications
Knowledge enhanced representation learning for drug discovery
Published in Proceedings of the AAAI Conference on Artificial Intelligence, 2024
AutoPeptideML: a study on how to build more trustworthy peptide bioactivity predictors
Published in Bioinformatics, 2024
This paper discusses the design of an AutoML tool for building peptide bioactivity predictors and how to ensure their robust evaluation through homology partitioning.
Enhancing foundation models for scientific discovery via multimodal knowledge graph representations
Published in Journal of Web Semantics, 2025
Molecular Modelling in Bioactive Peptide Discovery and Characterisation
Published in Biomolecules, 2025
A new framework for evaluating model out-of-distribution generalisation for the biochemical domain
Published in The Thirteenth International Conference on Learning Representations, 2025
This paper discusses a new framework for evaluating model performance in new data. It offers metrics for choosing the best similarity function for a given biochemical prediction task and for estimating model performance conditioned on a deployment distribution.
BioBrigit, a Hybrid Machine Learning and Knowledge-Based Approach to Model Metal Pathways in Proteins: Application to a Dicopper Tyrosinase
Published in ACS Omega, 2025
This paper presents a new prediction tool for identifying metal ion diffusion pathways within proteins, and demonstrates its capabilities by considering the use case of a dicopper tyrosinase.
How to build machine learning models able to extrapolate from standard to modified peptides
Published in Journal of Cheminformatics, 2025
This paper explores different design choices, including the learning algorithm and representation technique, for building machine learning models that are able to extrapolate from one data distribution (standard peptides) to another (modified peptides). This study opens the door to new drug discovery campaigns by allowing scientists to leverage data that is cheaper to acquire to make predictions for more expensive compounds.
talks
AutoPeptideML: automated machine learning for building peptide bioactivity predictors leveraging protein language models
Published:
Talk introducing the AutoPeptideML library.
Modelos que aprenden el lenguaje de las moléculas y cómo utilizarlos para predecir sus propiedades (Models that learn the language of molecules and how to use them to predict their properties)
Published:
Keynote talk discussing pre-trained language models and their application in modelling for biomedicine and drug discovery.
A new framework for evaluating model out-of-distribution generalization for the biochemical domain
Published:
AutoPeptideML2: An open-source library for democratising machine learning for peptide bioactivity prediction.
Published:
Talk discussing the latest update to the AutoPeptideML library. Presented at the Student Council Symposium (SCS) and the Bioinformatics Open Source Conference (BOSC) as part of ISMB 2025.
How to generalize machine learning models to both canonical and non-canonical peptides
Published:
Talk discussing our work automating Machine Learning models for natural and synthetic peptide property prediction, with the aim of accelerating peptide/peptidomimetic drug development.
Partitioning, representation, and automation in canonical and non-canonical peptide modelling
Published:
Talk discussing the importance of proper dataset partitioning and choice of negative peptides, the choice of representation, and how we can automate Machine Learning models for natural and synthetic peptide property prediction, with the aim of accelerating peptide/peptidomimetic drug development.
Evaluation of partitioning algorithms for trustworthy out-of-distribution evaluation of machine learning models in biochemistry.
Published:
Machine learning models in scientific discovery are expected to make predictions in new, unseen scenarios, i.e., out-of-distribution. Machine learning model evaluation is usually performed by dividing a dataset into two mutually exclusive subsets: training and testing. Model parameters are fitted to the training subset and the evaluation is performed against the testing subset. The process of creating these subsets is called partitioning. Traditionally, the machine learning literature relies on random partitioning. The problem with this approach is that it assumes that the prediction scenario will be in-distribution, since random sampling is in-distribution sampling. Recently, we introduced the concept of similarity partitioning as a method for correcting this assumption. Similarity partitioning algorithms ensure that the testing subset contains molecules different to those the model has been exposed to during training, and thus better simulate the real-world out-of-distribution scenario. However, it is not clear which algorithms are best suited for generating these testing subsets. Thus, we have conducted a systematic benchmark of different partitioning algorithms previously described in the literature and examined which ones can generate the most challenging test subsets. We also propose a new algorithm called CCPart. Our results show that the three best similarity partitioning algorithms are Butina, CCPart, and UMAP, where UMAP is limited to small drug-like organic molecules and both Butina and CCPart can be applied to any other entity (biosequences, 3D structures, small molecules, etc.). Further, they show that the choice of partitioning algorithm is dataset-dependent and that a prior analysis of both algorithms and similarity metrics needs to be performed. These results open the way for more trustworthy evaluation of machine learning models in the biochemical domain that better estimates their real-world performance.
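To make the contrast with random partitioning concrete, here is a minimal sketch of one generic way to do similarity partitioning: link every pair of items whose similarity reaches a threshold, group linked items into clusters, and assign whole clusters to either training or testing. It is not the CCPart, Butina, or UMAP procedure from the talk, and it does not use the Hestia-GOOD API. The k-mer Jaccard similarity, the function names, the threshold, and the toy sequences are all assumptions chosen for illustration.

```python
from itertools import combinations


def kmer_jaccard(a: str, b: str, k: int = 2) -> float:
    """Toy similarity (an assumption): Jaccard index over character k-mers."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def similarity_partition(items, threshold=0.5, test_fraction=0.2, sim=kmer_jaccard):
    """Return (train, test) index lists with no above-threshold pair across them."""
    parent = list(range(len(items)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair of items that is at least `threshold` similar.
    for i, j in combinations(range(len(items)), 2):
        if sim(items[i], items[j]) >= threshold:
            parent[find(i)] = find(j)

    # Collect the resulting clusters (connected components).
    clusters = {}
    for i in range(len(items)):
        clusters.setdefault(find(i), []).append(i)

    # Fill the test set with whole clusters, smallest first, up to the target size.
    target = int(round(test_fraction * len(items)))
    train, test = [], []
    for members in sorted(clusters.values(), key=len):
        if len(test) + len(members) <= target:
            test.extend(members)
        else:
            train.extend(members)
    return sorted(train), sorted(test)


# Hypothetical toy sequences, not real data.
peptides = [
    "AAAAAA", "AAAAAG", "KLKLKL", "KLKLKV", "WPWPWP",
    "WPWPWY", "GHGHGH", "DEDEDE", "RSTRST", "MNQMNQ",
]
train_idx, test_idx = similarity_partition(peptides, threshold=0.5, test_fraction=0.2)
print("train:", train_idx)
print("test:", test_idx)
```

Because clusters are never split, every test item is below the similarity threshold with respect to every training item, which is the property a random split cannot guarantee. The pairwise loop is quadratic in the number of items, so this sketch only suits small datasets.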
teaching
Deep learning in biomedicine - SECUAH VII
Workshop, University of Alcala de Henares, 2022
I’ve taught a cohort of biosciences students (undergraduate and graduate) about artificial intelligence, machine learning, and deep learning techniques and how to apply them to biomedical research, with a guided practical example where every student was able to build their own deep convolutional neural network for diagnosing skin lesions as either benign or cancerous. Materials can be found in this GitHub Repository.
Demonstrator Bioinformatics UCD (MEIN30240)
Undergraduate teaching, University College Dublin, School of Medicine, 2023
I’ve been a Demonstrator in the Bioinformatics UCD Module (MEIN30240) for two years (2023 - 2024).
Deep learning in biomedicine - SECUAH IX
Workshop, University of Alcala de Henares, 2024
Workshop titled “Modelos que aprenden el lenguaje de las moléculas” - Models that learn the language of molecules. The guided practical example allowed every student to fine-tune MolFormer-XL to build their own small molecule toxicity predictive model. Materials can be found in this GitHub Repository.
Models that learn biochemistry
Workshop, University of Oviedo, 2025
Workshop titled “Modelos que aprenden bioquímica” - Models that learn biochemistry. This workshop answers the question of what artificial intelligence is and how it can be used in biochemistry. The course spans a wide range of techniques and use cases, including the underlying models behind modern chatbots like ChatGPT and how these technologies can be applied to the modelling of biosequences and small drug-like organic molecules for drug discovery. It also included AI approaches for molecular docking (mainly DiffDock) as well as Molecular Dynamics through Machine-learnt Force Fields. The workshop provided students with a complete experience, including practical sessions with guided code examples where they could build their own toxicity predictors from Chemical Language Models, dock antipsychotic drugs to a GPCR, and run code for simulating the folding of a peptide with 15 alanines along with the corresponding analysis of the simulation. The workshop was attended by undergraduate students from Biology and Biotechnology majors as well as PhD students from different disciplines.