Prediction Robustness and Data Redundancy in Machine Learning for Materials Science

Kangming Li^a, Daniel Persaud^a, Kamal Choudhary^b, Brian DeCost^b, Michael Greenwood^c, Jason Hattrick-Simpers^a

^aDepartment of Materials Science and Engineering, University of Toronto, Canada

^bMaterial Measurement Laboratory, National Institute of Standards and Technology, USA

^cCanmet MATERIALS, Natural Resources Canada, Canada

Materials for Sustainable Development Conference (MATSUS)
Proceedings of MATSUS Spring 2024 Conference (MATSUS24)

#AI - Automation and Nanomaterials (machine learning, artificial intelligence, robotics, accelerated discovery)

Barcelona, Spain, 2024 March 4th - 8th

Organizers: Ivan Infante and Oleksandr Voznyy

Invited Speaker, Kangming Li, presentation 151
DOI: https://doi.org/10.29363/nanoge.matsus.2024.151
Publication date: 18th December 2023

The rapid growth of big data in materials science has led to significant advancements in materials property prediction by machine learning (ML) models. However, big data does not necessarily lead to robust prediction performance of ML models. In addition, the issue of information redundancy in materials data has been largely overlooked. This talk presents an examination of these two correlated challenges related to materials data: prediction robustness and data redundancy.

First, we will discuss the challenges in ensuring the prediction robustness of ML models, by showcasing the severe performance degradation when the models are trained on the Materials Project 2018 dataset and tested on the Materials Project 2021 dataset. We will demonstrate the impact of distribution shifts and use tools such as UMAP and query-by-committee to foresee performance degradation and to improve prediction accuracy. Next, we will delve into the issue of data redundancy across large materials datasets, revealing that up to 95% of materials data can be safely removed with little impact on the model performance. We will highlight the application of uncertainty-based active learning algorithms to create smaller but informative datasets, leading to more efficient data acquisition and ML training. By examining these challenges, this talk aims to provide insights into building more efficient and robust materials databases and ML models for accurate and reliable predictions in materials science.
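As a rough illustration of the query-by-committee idea mentioned above, the following minimal sketch trains a small committee of models on bootstrap resamples and uses the spread of their predictions as an uncertainty signal. It is not the implementation used in the referenced work: the one-dimensional toy property, the linear committee members, and all function names (fit_line, train_committee, select_most_uncertain, toy_property) are assumptions chosen only to keep the example self-contained. Under these assumptions, the committee disagrees more on inputs outside the training range, which is also where the prediction error grows, and the same disagreement score can rank candidate points for uncertainty-based acquisition.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x for 1-D inputs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def train_committee(xs, ys, n_members=10, seed=0):
    """Train each committee member on a bootstrap resample of the data."""
    rng = random.Random(seed)
    committee = []
    for _ in range(n_members):
        sample = rng.choices(range(len(xs)), k=len(xs))
        committee.append(
            fit_line([xs[i] for i in sample], [ys[i] for i in sample])
        )
    return committee


def predict(committee, x):
    """Return (mean prediction, disagreement) of the committee at x."""
    preds = [a + b * x for a, b in committee]
    return statistics.fmean(preds), statistics.pstdev(preds)


def select_most_uncertain(committee, pool, k):
    """One acquisition step: pick the k pool points with largest disagreement."""
    return sorted(pool, key=lambda x: predict(committee, x)[1], reverse=True)[:k]


def toy_property(x):
    """Hypothetical target property (an assumption, not real materials data)."""
    return x ** 2


rng = random.Random(42)
train_x = [rng.uniform(0.0, 1.0) for _ in range(50)]
train_y = [toy_property(x) + rng.gauss(0.0, 0.05) for x in train_x]
committee = train_committee(train_x, train_y)

# Test sets: one drawn from the training range, one shifted outside it.
in_dist = [rng.uniform(0.0, 1.0) for _ in range(20)]
shifted = [rng.uniform(2.0, 3.0) for _ in range(20)]


def mean_disagreement(xs):
    return statistics.fmean(predict(committee, x)[1] for x in xs)


def mean_abs_error(xs):
    return statistics.fmean(
        abs(predict(committee, x)[0] - toy_property(x)) for x in xs
    )


for name, xs in (("in-distribution", in_dist), ("shifted", shifted)):
    print(f"{name}: disagreement={mean_disagreement(xs):.3f}, "
          f"MAE={mean_abs_error(xs):.3f}")

# Candidates that an uncertainty-based active learner would label next.
print(select_most_uncertain(committee, in_dist + shifted, k=5))
```

In this toy setting the disagreement score needs no labels for the new points, which is what makes it usable for anticipating degradation before testing and for choosing which data to acquire or keep.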

References:

[1] Li, K., DeCost, B., Choudhary, K. et al. A critical examination of robustness and generalizability of machine learning prediction of materials properties. npj Comput Mater 9, 55 (2023).

[2] Li, K., Persaud, D., Choudhary, K. et al. Exploiting redundancy in large materials datasets for efficient machine learning with less data. Nat Commun 14, 7283 (2023).

Acknowledgements:

The computations were made on the resources provided by the Calcul Quebec, Westgrid, and Compute Ontario consortia in the Digital Research Alliance of Canada (alliancecan.ca), and the Acceleration Consortium (acceleration.utoronto.ca) at the University of Toronto. We acknowledge funding provided by Natural Resources Canada’s Office of Energy Research and Development (OERD).
