Sitemap

AutoPeptideML

Computational tool for automatically building ML models that predict peptide bioactivity (https://github.com/IBM/AutoPeptideML).

BioBrigit

Hybrid machine learning and knowledge-based approach for the prediction of metal diffusion pathways through proteins (https://github.com/insilichem/BioBrigit).

Hestia-GOOD

Open-source library for evaluating the out-of-distribution generalization of machine learning models (https://github.com/IBM/Hestia-GOOD).

Knowledge enhanced representation learning for drug discovery

Published in Proceedings of the AAAI Conference on Artificial Intelligence, 2024

AutoPeptideML: a study on how to build more trustworthy peptide bioactivity predictors

Published in Bioinformatics, 2024

This paper discusses the design of an AutoML tool for building peptide bioactivity predictors and how to ensure their robust evaluation through homology partitioning.

Enhancing foundation models for scientific discovery via multimodal knowledge graph representations

Published in Journal of Web Semantics, 2025

Molecular Modelling in Bioactive Peptide Discovery and Characterisation

Published in Biomolecules, 2025

A new framework for evaluating model out-of-distribution generalisation for the biochemical domain

Published in The Thirteenth International Conference on Learning Representations, 2025

This paper discusses a new framework for evaluating model performance on new data. It offers metrics for choosing the best similarity function for a given biochemical prediction task and for estimating model performance conditioned on a deployment distribution.

BioBrigit, a Hybrid Machine Learning and Knowledge-Based Approach to Model Metal Pathways in Proteins: Application to a Dicopper Tyrosinase

Published in ACS Omega, 2025

This paper presents a new prediction tool for identifying metal ion diffusion pathways within proteins, and demonstrates its capabilities by considering the use case of a dicopper tyrosinase.

How to build machine learning models able to extrapolate from standard to modified peptides

Published in Journal of Cheminformatics, 2025

This paper explores different design choices, including learning algorithm and representation technique, for building machine learning models that are able to extrapolate from one data distribution (standard peptides) to another (modified peptides). This study opens the door to new drug discovery campaigns by allowing scientists to leverage data that is cheaper to acquire to make predictions for more expensive compounds.

AutoPeptideML: automated machine learning for building peptide bioactivity predictors leveraging protein language models

Published: December 05, 2023

Talk introducing the AutoPeptideML library.

Modelos que aprenden el lenguaje de las moléculas y cómo utilizarlos para predecir sus propiedades (Models that learn the language of molecules and how to use them to predict their properties)

Published: March 19, 2024

Keynote talk discussing pre-trained language models and their application in modelling for biomedicine and drug discovery.

Effect of dataset partitioning strategies for evaluating out-of-distribution generalisation for predictive models in biochemistry

Published: August 18, 2024

A new framework for evaluating machine learning in biochemistry and its application for peptides and small molecules

Published: April 01, 2025

A new framework for evaluating model out-of-distribution generalization for the biochemical domain

Published: April 24, 2025

AutoPeptideML2: An open-source library for democratising machine learning for peptide bioactivity prediction.

Published: July 22, 2025

Talk discussing the latest update to the AutoPeptideML library. Presentation given at the Student Council Symposium (SCS) and the Bioinformatics Open Source Conference (BOSC) as part of ISMB 2025.

How to generalize machine learning models to both canonical and non-canonical peptides

Published: August 19, 2025

Talk discussing our work on automating machine learning models for natural and synthetic peptide property prediction, with the aim of accelerating peptide/peptidomimetic drug development.

Partitioning, representation, and automation in canonical and non-canonical peptide modelling

Published: September 01, 2025

Talk discussing the importance of proper dataset partitioning and choice of negative peptides, the choice of representation, and how we can automate machine learning models for natural and synthetic peptide property prediction, with the aim of accelerating peptide/peptidomimetic drug development.

Evaluation of partitioning algorithms for trustworthy out-of-distribution evaluation of machine learning models in biochemistry.

Published: December 10, 2025

Machine learning models in scientific discovery are expected to make predictions in new, unseen scenarios, i.e., out-of-distribution. Machine learning model evaluation is usually performed by dividing a dataset into two mutually exclusive subsets: training and testing. Model parameters are fitted to the training subset and the evaluation is performed against the testing subset. The process of creating these subsets is called partitioning. Traditionally, the machine learning literature relies on random partitioning. The problem with this approach is that it assumes that the prediction scenario will be in-distribution, since random sampling is an in-distribution sampling. Recently, we have introduced the concept of similarity partitioning as a method for correcting this assumption. Similarity partitioning algorithms ensure that the testing subset contains molecules different from those the model has been exposed to during training, and thus better simulate the real-world out-of-distribution scenario. However, it is not clear which algorithms are best suited for generating these testing subsets. Thus, we have conducted a systematic benchmark of different partitioning algorithms previously described in the literature and examined which ones can generate the most challenging test subsets. We also propose a new algorithm called CCPart. Our results show that the three best similarity partitioning algorithms are Butina, CCPart, and UMAP, where UMAP is limited to small drug-like organic molecules and both Butina and CCPart can be applied to any other entity (biosequences, 3D structures, small molecules, etc.). Further, they also show that the choice of partitioning algorithm is dataset-dependent and that a prior analysis of both algorithms and similarity metrics needs to be performed. These results open the way for more trustworthy evaluation of machine learning models in the biochemical domain that better estimates their real-world performance.
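
The difference between random and similarity partitioning can be illustrated with a minimal sketch. It is not the CCPart or Butina implementation, nor the Hestia-GOOD API; the function names, the k-mer Jaccard similarity, the threshold, and the toy peptides are all assumptions chosen for illustration. Items are linked when their similarity reaches a threshold, linked items are grouped into clusters, and whole clusters are assigned to the testing subset so that no test item is similar to a training item.

```python
from itertools import combinations


def kmer_set(seq, k=2):
    """Return the set of overlapping k-mers in a sequence (toy featurisation)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_partition(items, threshold=0.3, test_fraction=0.2, k=2):
    """Split item indices so that no train/test pair has similarity >= threshold.

    Hypothetical helper: clusters are the connected components of the graph
    whose edges join items with similarity >= threshold.
    """
    features = [kmer_set(item, k) for item in items]
    parent = list(range(len(items)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair of items that is at least `threshold` similar.
    for i, j in combinations(range(len(items)), 2):
        if jaccard(features[i], features[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(items)):
        clusters.setdefault(find(i), []).append(i)

    # Fill the test subset with whole clusters, smallest first,
    # without exceeding the requested test size.
    target = test_fraction * len(items)
    test = []
    for members in sorted(clusters.values(), key=len):
        if len(test) + len(members) <= target:
            test.extend(members)

    test_set = set(test)
    train = [i for i in range(len(items)) if i not in test_set]
    return train, test


# Toy peptides (made up for this example).
peptides = [
    "ACDEFGHIK", "ACDEFGHIR", "ACDEFGHIL",
    "LMNPQRSTV", "LMNPQRSTW",
    "WYWYWYWYW", "GGGGSGGGG", "KKLLKKLLK", "PPPAPPPAP", "DEDEDEDED",
]
train, test = similarity_partition(peptides)
print("train:", [peptides[i] for i in train])
print("test: ", [peptides[i] for i in test])
```

A random split of the same list could place "ACDEFGHIK" in training and "ACDEFGHIR" in testing, which would make the test look easier than a genuinely new peptide. Because the sketch only moves whole clusters, near-duplicates always stay on the same side of the split.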

Deep learning in biomedicine - SECUAH VII

Workshop, University of Alcala de Henares, 2022

I taught a cohort of biosciences students (undergraduate and graduate) about artificial intelligence, machine learning, and deep learning techniques and how to apply them to biomedical research, with a guided practical example in which every student was able to build their own deep convolutional neural network for diagnosing skin lesions as either benign or cancerous. Materials can be found in this GitHub Repository.

Demonstrator Bioinformatics UCD (MEIN30240)

Undergraduate teaching, University College Dublin, School of Medicine, 2023

I’ve been a Demonstrator in the Bioinformatics UCD Module (MEIN30240) for two years (2023 - 2024).

Deep learning in biomedicine - SECUAH IX

Workshop, University of Alcala de Henares, 2024

Workshop titled “Modelos que aprenden el lenguaje de las moléculas” - Models that learn the language of molecules. The guided practical example allowed every student to fine-tune MolFormer-XL to build their own predictive model of small-molecule toxicity. Materials can be found in this GitHub Repository.

Models that learn biochemistry

Workshop, University of Oviedo, 2025

Workshop titled “Modelos que aprenden bioquímica” - Models that learn biochemistry. This workshop answers the question of what artificial intelligence is and how it can be used in biochemistry. The course spans a wide range of techniques and use cases, including the underlying models behind modern chatbots like ChatGPT and how these technologies can be applied to the modelling of biosequences and small drug-like organic molecules for drug discovery. It also included AI approaches for molecular docking (mainly DiffDock) as well as molecular dynamics through machine-learnt force fields. The workshop provided students with a complete experience, including practical sessions with guided code examples where they could build their own toxicity predictors from chemical language models, dock antipsychotic drugs to a GPCR, and run code for simulating the folding of a peptide with 15 alanines, together with the corresponding analysis of the simulation. The workshop was attended by undergraduate students from Biology and Biotechnology majors as well as PhD students from different disciplines.

Raúl Fernández Díaz
