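In recent years, scientific discoveries that combine informatics with new materials have multiplied rapidly, and machine learning methods have played a key role in this research. In general, provided data quality is assured, the larger the training dataset, the more accurate the trained model. This is especially true of deep neural networks, which show excellent predictive performance when trained on large amounts of data. Methods that accelerate data accumulation, such as high-throughput computation and high-throughput experiments, have therefore been developed to build large databases. In many materials studies, however, and especially for new materials, researchers still lack enough high-quality data to train reliable machine learning models. The main obstacle is that collecting experimental data (real data) is difficult and time-consuming. Computed data may cost less than experimental data, but in many materials science applications a large gap remains between the two kinds of data.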
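Professor Xiaojuan Ban, researcher Haiyou Huang, and colleagues at the University of Science and Technology Beijing have developed a new data augmentation strategy based on transfer learning to address the problem of small or insufficient data in materials data mining. The strategy converts and fuses simulated data with experimental data (real data), which expands the training set so that a better-performing machine learning model can be built from only a small batch of experimental data. Simulation is an efficient way to collect data in materials science research. Because simulated and real data follow different distributions, however, simply mixing simulated data into real data may harm the machine learning model. The study proposes using a generative adversarial network to reduce this gap between domains. Taking semantic segmentation of pure-iron grains as an example, the authors obtain a large number of simulated images from a simulation model, build and train a style transfer network on a portion of the real images, and then convert the simulated images into realistic ones. The result is a large set of synthetic images that carry the grain structure of the simulated images and the texture of the real images. These synthetic images improved the predictive performance of the machine learning model and reduced its dependence on large amounts of real data.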
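The final step described above, combining a small share of real data with a larger pool of synthetic data, can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_training_set`, the `(image, mask)` pair layout, and the random sampling and shuffling are all hypothetical. The default `real_fraction=0.35` is borrowed from the 35% figure quoted in the paper's abstract. The grain simulation and the style transfer network are assumed to have already produced the synthetic pairs.

```python
import random


def build_training_set(real_pairs, synthetic_pairs, real_fraction=0.35, seed=0):
    """Mix a random subset of real samples with all synthetic samples.

    Hypothetical helper: each element of real_pairs / synthetic_pairs is
    assumed to be an (image, mask) pair, and the synthetic pairs are assumed
    to come from an already-trained style transfer model.
    """
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must lie in [0, 1]")

    rng = random.Random(seed)  # fixed seed so the split is reproducible

    # Keep only a fraction of the costly real data.
    n_real = round(len(real_pairs) * real_fraction)
    real_subset = rng.sample(list(real_pairs), n_real)

    # Add every synthetic sample, then shuffle so batches mix both sources.
    mixed = real_subset + list(synthetic_pairs)
    rng.shuffle(mixed)
    return mixed
```

With 100 real pairs and 300 synthetic pairs, for instance, the call would return 335 training pairs, of which 35 are real. How the study actually selected and ordered its samples is not stated here, so the sampling logic should be read as a placeholder.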
The paper was recently published in npj Computational Materials 6: 125 (2020). The English title and abstract follow; the PDF is freely available at https://www.nature.com/articles/s41524-020-00392-6.
Data augmentation in microscopic images for material data mining
Boyuan Ma, Xiaoyan Wei, Chuni Liu, Xiaojuan Ban, Haiyou Huang, Hao Wang, Weihua Xue, Stephen Wu, Mingfei Gao, Qing Shen, Michele Mukeshimana, Adnan Omer Abuassba, Haokai Shen & Yanjing Su
Recent progress in material data mining has been driven by high-capacity models trained on large datasets. However, collecting experimental data (real data) has been extremely costly owing to the amount of human effort and expertise required. Here, we develop a novel transfer learning strategy to address problems of small or insufficient data. This strategy realizes the fusion of real and simulated data and the augmentation of training data in a data mining procedure. For a specific task of grain instance image segmentation, this strategy aims to generate synthetic data by fusing the images obtained from simulating the physical mechanism of grain formation and the “image style” information in real images. The results show that the model trained with the acquired synthetic data and only 35% of the real data can already achieve competitive segmentation performance of a model trained on all of the real data. Because the time required to perform grain simulation and to generate synthetic data are almost negligible as compared to the effort for obtaining real data, our proposed strategy is able to exploit the strong prediction power of deep learning without significantly increasing the experimental burden of training data preparation.