2025 Volume 81 Issue 2 Pages 112-116
Field experiment data and crop observations reported for individual sites in the agronomic literature are of high quality and have been considered a potential source of information for developing a global gridded crop dataset. However, extracting data on a crop variable of interest from the text and tables of many papers is a time-consuming, painstaking task for dataset developers. Recent advances in large language models (LLMs) and the tools built on them are expected to offer a promising solution. This study presents a computational method for extracting data from research papers using an LLM-based online tool, ChatPDF. For demonstration purposes, the Python program we developed is applied to 164 papers to extract crop phenology data for maize, soybean, wheat and rice. The results show that the LLM-based data extraction method can dramatically reduce the burden of data extraction in human curation, but it needs improvement before it can become a reliable alternative to manual data extraction. In particular, innovations are needed to increase the capture rate by avoiding data omissions and to reduce errors by correctly inferring longitudes, latitudes and harvesting years. LLM-based data extraction is currently in its infancy and deserves further research toward large-scale implementation.
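As an illustration of the two quality concerns named above (data omissions and wrongly inferred coordinates or harvesting years), the following minimal Python sketch shows one way extracted records could be screened against a manually curated reference. It is not the program described in this study. The record fields, the function names `is_plausible` and `capture_rate`, the year bounds, and the definition of capture rate as the share of reference (paper, crop) pairs recovered by the extraction are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PhenologyRecord:
    """One hypothetical extracted record; field names are illustrative only."""
    paper_id: str
    crop: str
    latitude: Optional[float]
    longitude: Optional[float]
    harvest_year: Optional[int]


def is_plausible(rec: PhenologyRecord,
                 min_year: int = 1900,
                 max_year: int = 2025) -> bool:
    """Return True if coordinates and harvest year pass basic range checks.

    The year bounds are assumed defaults, not values taken from the study.
    """
    if rec.latitude is None or rec.longitude is None or rec.harvest_year is None:
        return False
    return (-90.0 <= rec.latitude <= 90.0
            and -180.0 <= rec.longitude <= 180.0
            and min_year <= rec.harvest_year <= max_year)


def capture_rate(extracted: Iterable[PhenologyRecord],
                 reference: Iterable[PhenologyRecord]) -> float:
    """Share of reference (paper_id, crop) pairs present in the extracted set.

    This is an assumed working definition of capture rate.
    """
    ref_keys = {(r.paper_id, r.crop) for r in reference}
    if not ref_keys:
        return 0.0
    got_keys = {(r.paper_id, r.crop) for r in extracted}
    return len(ref_keys & got_keys) / len(ref_keys)
```

A range check of this kind catches only gross errors, such as swapped latitude and longitude or a missing year. A value that is in range but wrong would still need comparison against the manually curated record.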