2025 Volume 81 Issue 2 Pages 90-105
Machine learning (ML) techniques are increasingly used to estimate crop yields at scales ranging from individual sites to the globe. Because ML techniques are data-driven, it is empirically known that the performance of a given ML algorithm depends on how the training dataset is compiled. However, few studies have quantitatively evaluated this dependency. In this study, global rice yields were estimated using a random forest (RF) methodology. The dependency of RF performance on the training data was examined by comparing yields estimated from training datasets covering different yield ranges and geographical extents. First, 14 explanatory variables collected from different sources (satellite vegetation, meteorology, and geographical location data) were used to build RF regressors. The crop calendar was determined from a combination of satellite vegetation data and crop model simulation. Next, RF regressors were trained to reproduce census-based rice yields (used as reference yields) from training datasets of the 14 explanatory variables. By applying the RF regressors to validation datasets, misfits between the estimated and reference yields were evaluated. RF reproduced rice yields, but the accuracy depended on the training data. Yields beyond the yield range of the training data could not be reproduced by RF, indicating that the yield range of the training data determines the possible range of estimated yields. Among the 14 variables, geographical coordinates (longitude and latitude) ranked highest in importance, i.e., they played a crucial role in estimating yields. The RF regressors built from all 14 variables were more accurate than those built only from the geographical coordinates, but the advantage was limited.
We concluded that (1) choosing training data to cover all possible yield ranges of the target rice-cropping areas was crucial for accurate yield estimation using RF and (2) incorporating satellite and simulation data was advantageous for building high-performance RF regressors.
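The finding that yields outside the training range could not be reproduced follows from how tree ensembles predict: each leaf returns the mean of the training targets that reach it, so the ensemble average cannot leave the interval spanned by the training yields. The following minimal sketch illustrates this with a toy bagged ensemble of one-dimensional regression trees, a simplified stand-in for a random forest and not the regressors built in this study. The data are synthetic, and all names (`build_tree`, `fit_forest`, `train_x`, `train_y`, the linear yield relation) are assumptions made for illustration only.

```python
import random


def sse(values):
    """Sum of squared deviations from the mean of `values`."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(points, min_leaf=5):
    """Fit a 1-D regression tree on (x, y) pairs already sorted by x.

    Each leaf stores the mean of the training targets that fall into it.
    """
    ys = [y for _, y in points]
    best = None
    for i in range(min_leaf, len(points) - min_leaf + 1):
        if points[i - 1][0] == points[i][0]:
            continue  # no threshold separates identical x values
        cost = sse(ys[:i]) + sse(ys[i:])
        if best is None or cost < best[0]:
            best = (cost, i)
    if best is None:  # too few points (or no valid split): make a leaf
        return ("leaf", sum(ys) / len(ys))
    i = best[1]
    threshold = (points[i - 1][0] + points[i][0]) / 2
    return ("node", threshold,
            build_tree(points[:i], min_leaf),
            build_tree(points[i:], min_leaf))


def predict_tree(tree, x):
    """Walk down the tree until a leaf is reached and return its mean."""
    while tree[0] == "node":
        _, threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree[1]


def fit_forest(xs, ys, n_trees=20, seed=0):
    """Train each tree on a bootstrap resample of the training data."""
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    return [build_tree(sorted(rng.choices(data, k=len(data))))
            for _ in range(n_trees)]


def predict_forest(forest, x):
    """Average the predictions of all trees."""
    return sum(predict_tree(t, x) for t in forest) / len(forest)


# Synthetic data (hypothetical): one predictor in [0, 5] and a "yield"
# that rises linearly with it, plus bounded noise. Not real yield data.
rng = random.Random(42)
train_x = [rng.uniform(0.0, 5.0) for _ in range(120)]
train_y = [2.0 + 0.8 * x + rng.uniform(-0.3, 0.3) for x in train_x]

forest = fit_forest(train_x, train_y)

x_out = 8.0                    # predictor value outside the training range
true_out = 2.0 + 0.8 * x_out   # yield implied by the synthetic relation
pred_out = predict_forest(forest, x_out)

print("training yield range:", min(train_y), "to", max(train_y))
print("synthetic true yield at x_out:", true_out)
print("ensemble estimate at x_out:", pred_out)
```

In this sketch the estimate at `x_out` stays at or below the largest training yield even though the synthetic relation implies a higher value, which is the same behavior that makes coverage of the full yield range in the training data important.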