OccamzRazor presents two contributions to NeurIPS 2020

We proudly present two contributions to the 2020 NeurIPS machine learning conference. Our machine learning engineer Srivamshi Pittala will present our paper "Relation-weighted link prediction for disease gene identification", and our machine learning intern Jupinder Parmar will present our paper "Biomedical Information Extraction for Disease Gene Prioritization". Both will be presented at the workshop Knowledge Representation & Reasoning Meets Machine Learning (KR2ML).

Both papers represent major breakthroughs in our effort to optimize and combine various machine learning approaches to drug discovery.

The paper led by Srivamshi explores various ways in which relationships between genes and diseases can be represented in a knowledge graph. He was able to show that not all relationships in the graph contribute equally to predictive accuracy. By optimizing the weights for the various types of relationships, Srivamshi improved the accuracy of predicting new links between genes and diseases. The results show that by carefully evaluating the importance of every relationship type for predictive tasks, we can outperform the competition by 24.1%. In addition, a comparison to opentargets.org shows that using knowledge graphs rather than genetics-focused human target identification lets us predict more targets that have already shown preclinical and clinical success.
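To make the idea of relation weighting concrete, here is a minimal toy sketch in Python. It is not the model from the paper: the edges, the relation names, the weight values and the `score_candidates` helper are all hypothetical and chosen only for illustration. It scores candidate genes by summing a per-relation-type weight over their direct links to genes already associated with a disease, so that changing the weights changes the ranking.

```python
from collections import defaultdict

# Hypothetical toy edges (head, relation_type, tail); not data from the paper.
EDGES = [
    ("GENE_A", "associated_with", "DISEASE_X"),
    ("GENE_A", "interacts_with", "GENE_B"),
    ("GENE_C", "coexpressed_with", "GENE_A"),
    ("GENE_B", "coexpressed_with", "GENE_D"),
]

# Hypothetical per-relation weights; in practice these would be tuned
# against held-out gene-disease links rather than set by hand.
RELATION_WEIGHTS = {
    "associated_with": 1.0,
    "interacts_with": 0.6,
    "coexpressed_with": 0.2,
}


def score_candidates(edges, weights, disease):
    """Score genes by their weighted one-hop links to genes known for `disease`."""
    known = {head for head, _, tail in edges if tail == disease}
    scores = defaultdict(float)
    for head, relation, tail in edges:
        weight = weights.get(relation, 0.0)
        if head in known and tail != disease and tail not in known:
            scores[tail] += weight
        elif tail in known and head not in known:
            scores[head] += weight
    return dict(scores)


scores = score_candidates(EDGES, RELATION_WEIGHTS, "DISEASE_X")
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy setup, a gene linked through a highly weighted relation type ranks above one linked only through a weakly weighted type, which is the intuition behind treating relationship types unequally.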

Jupinder explored the importance of text-based knowledge for informing our knowledge graph and improving our prediction of new disease-related genes that can serve as potential drug targets. He focused on physical interactions between proteins in the human body. Many known protein-protein relationships are represented in the database string-db. We used that resource to create a protein-protein graph as part of our pipeline to predict new drug targets. Our experience as biologists suggested that structured databases often do not capture all published relationships, especially those from papers that characterize only one or two specific interactions, as opposed to large proteomics experiments. Our hypothesis was that using natural language processing to capture those relationships from papers would improve our predictive capabilities. Jupinder showed that this is indeed the case: he achieved a 20% lift in predictive performance by adding text-based data to the structured core dataset. This confirms our approach of combining text and structured sources to find new treatments.
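The combination of structured and text-derived interactions can be pictured with a small Python sketch. This is an illustration under stated assumptions, not the pipeline from the paper: the `merge_edges` helper, the `PROT_*` placeholder names and the source labels are hypothetical, and the (ARAP2, ARF6) pair is used only as an example of a text-derived candidate. Interactions are treated as undirected pairs, and each pair keeps a record of which source reported it.

```python
def normalize(pair):
    """Treat an interaction as undirected by sorting the two protein names."""
    first, second = pair
    return tuple(sorted((first, second)))


def merge_edges(structured, text_mined):
    """Union two edge lists, remembering which source(s) reported each pair."""
    merged = {}
    for source, pairs in (("structured", structured), ("text", text_mined)):
        for pair in pairs:
            merged.setdefault(normalize(pair), set()).add(source)
    return merged


# Hypothetical inputs: placeholder pairs standing in for a structured database
# export and for relations extracted from papers.
structured = [("PROT_A", "PROT_B"), ("PROT_B", "PROT_C")]
text_mined = [("ARAP2", "ARF6"), ("PROT_B", "PROT_A")]

merged = merge_edges(structured, text_mined)
text_only = [pair for pair, sources in merged.items() if sources == {"text"}]
print(text_only)
```

Pairs that appear only in the text-derived list are the additions a structured database alone would miss, while pairs reported by both sources are stored once with both labels.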

A high-level overview of our information extraction pipeline, from Parmar et al., 2020. For simplicity, we display only the single candidate relation (ARAP2, ARF6), although three candidate relations are present.

These findings represent a significant milestone for OccamzRazor, proving our concept and paving our path for clinical success.