Journal of Agricultural Meteorology
Online ISSN : 1881-0136
Print ISSN : 0021-8588
ISSN-L : 0021-8588
Full Paper
Toward improving global rice yield reference dataset compilation through machine learning: Insights from training data selection and random forest analysis
Yoshimitsu MASAKI, Toshichika IIZUMI, Toru SAKAI, Kei OYOSHI
Open access

2025 Volume 81 Issue 2 Pages 90-105

Abstract

Machine learning (ML) techniques are increasingly used to estimate crop yields at scales ranging from individual sites to the globe. Because ML techniques are data-driven, the performance of a given ML algorithm is empirically known to depend on how the training dataset is compiled. However, few studies have quantitatively evaluated this dependency. In this study, global rice yields were estimated using a random forest (RF) methodology. The dependency of RF performance on training data was examined by comparing yields estimated from different training datasets covering different yield ranges and geographical extents. First, 14 explanatory variables collected from different sources (satellite vegetation, meteorology, and geographical location data) were used to build RF regressors. The crop calendar was determined from a combination of satellite vegetation data and crop model simulation. Next, the RF regressors were trained on the 14 explanatory variables to reproduce census-based rice yields, which served as reference yields. The RF regressors were then applied to validation datasets, and the misfits between the estimated and reference yields were evaluated. RF reproduced rice yields, but the accuracy depended on the training data. Yields beyond the yield range of the training data could not be reproduced by RF, which indicates that the yield range of the training data determines the possible range of estimated yields. Among the 14 variables, the geographical coordinates (longitude and latitude) ranked highest in importance, i.e., they played a crucial role in estimating yields. The RF regressors built from all 14 variables were more accurate than those built only from the geographical coordinates, but the advantage was limited. We concluded that (1) choosing training data that cover the full yield range of the target rice-cropping areas is crucial for accurate yield estimation using RF and (2) incorporating satellite and simulation data is advantageous for building high-performance RF regressors.
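
The finding that RF cannot reproduce yields outside the training yield range follows from how regression trees predict: each leaf returns an average of training targets, so no prediction can exceed the largest training value. The following is a minimal sketch of that behavior, not the authors' implementation. It uses a toy bagged-tree ensemble written with the Python standard library only, and the two features (`veg`, `lat`), the synthetic yield formula, the 5 t/ha cutoff, and all function names are assumptions made for illustration.

```python
import math
import random
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (split quality)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(rows, depth, rng, min_leaf=3):
    """Grow one regression tree; rows is a list of (features, target)."""
    targets = [y for _, y in rows]
    if depth == 0 or len(rows) < 2 * min_leaf:
        return mean(targets)  # leaf: average of training targets
    best = None
    n_features = len(rows[0][0])
    # Consider a random subset of features at each split.
    for f in rng.sample(range(n_features), max(1, n_features // 2)):
        values = sorted({x[f] for x, _ in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [y for x, y in rows if x[f] <= thr]
            right = [y for x, y in rows if x[f] > thr]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, f, thr)
    if best is None:
        return mean(targets)
    _, f, thr = best
    left_rows = [r for r in rows if r[0][f] <= thr]
    right_rows = [r for r in rows if r[0][f] > thr]
    return (f, thr,
            build_tree(left_rows, depth - 1, rng, min_leaf),
            build_tree(right_rows, depth - 1, rng, min_leaf))


def predict_tree(tree, x):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(tree, tuple):
        f, thr, left, right = tree
        tree = left if x[f] <= thr else right
    return tree


def fit_forest(rows, n_trees=20, depth=4, seed=0):
    """Fit each tree on a bootstrap resample of the training rows."""
    rng = random.Random(seed)
    return [build_tree([rng.choice(rows) for _ in rows], depth, rng)
            for _ in range(n_trees)]


def predict_forest(forest, x):
    """Average the per-tree predictions."""
    return mean(predict_tree(t, x) for t in forest)


# Synthetic data (hypothetical): yield in t/ha rises with a vegetation index.
data_rng = random.Random(42)
all_rows = []
for _ in range(300):
    veg = data_rng.random()            # assumed vegetation index in [0, 1)
    lat = data_rng.uniform(-40, 50)    # assumed latitude in degrees
    all_rows.append(((veg, lat), 2.0 + 6.0 * veg + data_rng.gauss(0, 0.2)))

# Train only on low yields, validate on yields the forest has never seen.
train = [r for r in all_rows if r[1] < 5.0]
valid = [r for r in all_rows if r[1] >= 5.0]

forest = fit_forest(train)
train_max = max(y for _, y in train)
preds = [predict_forest(forest, x) for x, _ in valid]
rmse = math.sqrt(mean((p - y) ** 2 for p, (_, y) in zip(preds, valid)))

print("largest training yield:", round(train_max, 2))
print("largest predicted yield:", round(max(preds), 2))
print("validation RMSE:", round(rmse, 2))
```

In this sketch every validation yield is at least 5 t/ha while every prediction stays at or below the largest training yield, so the forest underestimates the whole validation set. Extending `train` to cover the full yield range removes that ceiling, which is the motivation for conclusion (1).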

© Author(s).

This article is licensed under a Creative Commons Attribution 4.0 International license.
http://creativecommons.org/licenses/by/4.0/