Prediction Robustness and Data Redundancy in Machine Learning for Materials Science
Kangming Li a, Daniel Persaud a, Kamal Choudhary b, Brian DeCost b, Michael Greenwood c, Jason Hattrick-Simpers a
a Department of Materials Science and Engineering, University of Toronto, Canada
b Material Measurement Laboratory, National Institute of Standards and Technology, USA
c Canmet MATERIALS, Natural Resources Canada, Canada
Materials for Sustainable Development Conference (MATSUS)
Proceedings of MATSUS Spring 2024 Conference (MATSUS24)
#AI - Automation and Nanomaterials (machine learning, artificial intelligence, robotics, accelerated discovery)
Barcelona, Spain, 2024 March 4th - 8th
Organizers: Ivan Infante and Oleksandr Voznyy
Invited Speaker, Kangming Li, presentation 151
DOI: https://doi.org/10.29363/nanoge.matsus.2024.151
Publication date: 18th December 2023

The rapid growth of big data in materials science has led to significant advancements in materials property prediction by machine learning (ML) models. However, big data does not necessarily lead to robust prediction performance of ML models. In addition, the issue of information redundancy in materials data has been largely overlooked. This talk presents an examination of these two correlated challenges related to materials data: prediction robustness and data redundancy.

First, we will discuss the challenges in ensuring the prediction robustness of ML models, by showcasing the severe performance degradation that occurs when models are trained on the Materials Project 2018 dataset and tested on the Materials Project 2021 dataset. We will demonstrate the impact of distribution shifts and use tools such as UMAP and query-by-committee to foresee performance degradation and to improve prediction accuracy. Next, we will delve into the issue of data redundancy across large materials datasets, revealing that up to 95% of materials data can be safely removed with little impact on model performance. We will highlight the application of uncertainty-based active learning algorithms to create smaller but informative datasets, leading to more efficient data acquisition and ML training. By examining these challenges, this talk aims to provide insights into building more efficient and robust materials databases and ML models for accurate and reliable predictions in materials science.
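As a rough illustration of the query-by-committee idea mentioned above, the following minimal sketch trains a small committee on bootstrap resamples and uses the spread of the members' predictions as an uncertainty score. It is a toy under stated assumptions, not the setup used in the talk: the one-dimensional feature, the linear committee members, the synthetic data, and the helper names (`fit_line`, `train_committee`, `disagreement`, `select_most_uncertain`) are all hypothetical. The same score can serve both purposes described in the abstract: a large disagreement flags a test point that lies outside the training distribution, and ranking candidates by disagreement is one simple form of uncertainty-based active learning.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a*x + b on 1-D data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        # Degenerate resample (all x identical): fall back to a constant model.
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def train_committee(xs, ys, n_members=10, seed=0):
    """Train each committee member on a bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    committee = []
    for _ in range(n_members):
        idx = [rng.randrange(n) for _ in range(n)]
        committee.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return committee


def disagreement(committee, x):
    """Standard deviation of the members' predictions at x."""
    preds = [slope * x + intercept for slope, intercept in committee]
    return statistics.pstdev(preds)


def select_most_uncertain(committee, pool_xs, k=1):
    """Indices of the k pool points on which the committee disagrees most."""
    ranked = sorted(
        range(len(pool_xs)),
        key=lambda i: disagreement(committee, pool_xs[i]),
        reverse=True,
    )
    return ranked[:k]


# Synthetic training data (assumption): y = 2x + noise, with x in [0, 1].
data_rng = random.Random(42)
train_xs = [i / 19 for i in range(20)]
train_ys = [2.0 * x + data_rng.gauss(0.0, 0.1) for x in train_xs]

committee = train_committee(train_xs, train_ys, n_members=10, seed=0)

# Candidate pool: three in-distribution points and one far outside the
# training range (x = 5.0), standing in for a distribution shift.
pool_xs = [0.1, 0.5, 0.9, 5.0]
scores = [disagreement(committee, x) for x in pool_xs]
picked = select_most_uncertain(committee, pool_xs, k=1)
print("disagreement per pool point:", scores)
print("selected for labeling:", [pool_xs[i] for i in picked])
```

In this toy, the committee disagrees most on the point far from the training range, so that point is both flagged as unreliable and ranked first for labeling. Pruning a redundant dataset can be viewed as the same loop run in reverse order of priority: points on which the committee already agrees contribute little new information.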

The computations were performed on resources provided by the Calcul Quebec, Westgrid, and Compute Ontario consortia in the Digital Research Alliance of Canada (alliancecan.ca), and by the Acceleration Consortium (acceleration.utoronto.ca) at the University of Toronto. We acknowledge funding provided by Natural Resources Canada’s Office of Energy Research and Development (OERD).

© FUNDACIO DE LA COMUNITAT VALENCIANA SCITO