Proceedings of MATSUS Spring 2024 Conference (MATSUS24)
DOI: https://doi.org/10.29363/nanoge.matsus.2024.151
Publication date: 18th December 2023
The rapid growth of big data in materials science has led to significant advances in materials property prediction by machine learning (ML) models. However, big data does not necessarily lead to robust prediction performance. In addition, the issue of information redundancy in materials data has been largely overlooked. This talk examines these two related challenges in materials data: prediction robustness and data redundancy.
First, we will discuss the challenges in ensuring the prediction robustness of ML models by showing the severe performance degradation that occurs when models are trained on the Materials Project 2018 dataset and tested on the Materials Project 2021 dataset. We will demonstrate the impact of distribution shifts and use tools such as UMAP and query-by-committee to foresee performance degradation and to improve prediction accuracy. Next, we will turn to data redundancy across large materials datasets, revealing that up to 95% of materials data can be safely removed with little impact on model performance. We will highlight the application of uncertainty-based active learning algorithms to create smaller but informative datasets, leading to more efficient data acquisition and ML training. By examining these challenges, this talk aims to provide insights into building more efficient and robust materials databases and ML models for accurate and reliable predictions in materials science.
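As an illustration of the query-by-committee idea mentioned above, the following is a minimal sketch and not the implementation used in this work. It assumes a committee of ridge regressors trained on bootstrap resamples (a stand-in for whatever property-prediction models are actually used), and scores each candidate by the spread of the committee's predictions. The function names `fit_ridge`, `committee_disagreement`, and `select_most_uncertain` are hypothetical. High disagreement can flag samples that lie outside the training distribution, and the same score can drive an uncertainty-based active learning step that picks which candidates to add to the training set.

```python
import numpy as np


def fit_ridge(X, y, alpha=1e-3):
    """Closed-form ridge regression with an appended bias column (illustrative model)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + alpha * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)


def predict_ridge(w, X):
    """Predict with weights returned by fit_ridge."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w


def committee_disagreement(X_train, y_train, X_query, n_members=10, seed=0):
    """Standard deviation of predictions across a bootstrap committee.

    Each member is fit on a bootstrap resample of the training set; a larger
    spread on a query point means the committee disagrees more about it.
    """
    rng = np.random.default_rng(seed)
    n = X_train.shape[0]
    preds = []
    for _ in range(n_members):
        idx = rng.integers(0, n, size=n)  # sample with replacement
        w = fit_ridge(X_train[idx], y_train[idx])
        preds.append(predict_ridge(w, X_query))
    return np.std(np.stack(preds), axis=0)


def select_most_uncertain(X_train, y_train, X_pool, batch_size, **kwargs):
    """Indices of the pool points with the highest committee disagreement."""
    scores = committee_disagreement(X_train, y_train, X_pool, **kwargs)
    return np.argsort(scores)[::-1][:batch_size]
```

In an active learning loop under these assumptions, the selected pool points would be labeled, moved into the training set, and the committee refit, repeating until the test error stops improving; the points never selected are candidates for the redundant portion of the dataset.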
The computations were performed on resources provided by the Calcul Quebec, Westgrid, and Compute Ontario consortia in the Digital Research Alliance of Canada (alliancecan.ca), and by the Acceleration Consortium (acceleration.utoronto.ca) at the University of Toronto. We acknowledge funding provided by Natural Resources Canada’s Office of Energy Research and Development (OERD).