Introduction

The environmental impacts of conventional agrifood systems and markets that shape modern diets worldwide are widely acknowledged1,2,3, particularly in terms of biodiversity loss4. At a global scale, crop systems seem to rely on a low diversity of species3,5. According to the most recent Food and Agriculture Organization of the United Nations (FAO) assessment6, out of 6000 different plant species cultivated in 2014, only 9 accounted for 66% of total crop production, and only 3 of them (wheat, rice, and maize) represented more than 50% of plant-based human food. This low level of species diversity in cropping systems worldwide makes crops much more sensitive to pest infestation7,8, less resilient to climate change9,10,11,12, and more dependent on synthetic inputs such as fertilizers and pesticides13,14. This situation may also have detrimental consequences for the human diet3,15. An agroecological transition toward greater species diversity and a food transition toward more plant-based diets are both required12,16,17, but the related drivers are complex and multidimensional and concern technological, social, institutional, educational, political, and economic levers simultaneously18,19.

Concerning packaged food markets, additional data on food product composition are required to uncover the relationship between species diversity and agrifood systems2,16,20. However, this situation is hampered, on the one hand, by a lack of information in the scientific literature studying the types of ingredients used in food products21 and, on the other hand, by a lack of consensus on how to categorize ingredients, notably according to their levels of processing22, beyond the first step of identifying which species are used by the food industry. Our research addresses these challenging questions by using text-mining methods applied to the ingredient lists of packaged branded foods, regardless of the degree of processing.

Our goal is to question the diversity of species from a market point of view: what crop diversity is used by the food industry throughout the world? To address this question, we focus on food product launches that reflect the innovation dynamics of branded food markets. We assume that the food ingredients in these new products reflect food industry know-how on species usage. According to innovation theories23 and transition studies24,25, a transition toward more diverse cultivated crops can result from firms investing in new species for their products. In short, we consider the new packaged/processed food offered to be a proxy of the capacity of the food industry to support species diversity.

Given that the mention of the geographical origin of the ingredients used in food products is not mandatory and that international trade exchanges regarding agricultural and food products are becoming increasingly important, we first address this question at the global scale. A deeper analysis to assess this diversity at the regional or country scale by comparing the level of crop diversity observed in food products with that of the main cultivation systems is beyond the scope of this study. The challenge here is to analyze, first, the variety of species through the ingredients used. This diversity of species is one major component of biodiversity issues.

This level of diversity is assessed at the ingredient list level from the following three complementary perspectives: (a) the analysis of the variety of species used, (b) the degree of concentration in the market on certain dominant species, and (c) a preliminary approach to species’ contribution to the overall product formulation. These three approaches to determining species diversity in product launches are based on three assumptions. First, the term “diversity” suggests a wide variety of species. If markets feature only a few species, then it would be speculative to speak of diversity. Second, the predominance of a particular species can be a disabling factor for increased levels of cultivated diversity. To assess these effects, we need to look at how food processing markets concentrate on certain species. Third, we hypothesize that the ways in which a species is used by the food industry can either promote or limit species diversity in the market.

Therefore, this paper contributes to the need for better scientific knowledge on the composition of those food products26,27 thanks to the development of text-mining methods.

Given the important methodological issues associated with such assessments, we first focus on one group of species under greater societal debate: pulses. Pulses are increasingly promoted as a main lever for both agroecological and food transition28,29 from the “one health” perspective. Pulses have benefited from growing public support since the United Nations' 2016 International Year of Pulses, which created great momentum for these plant species30. While pulses are increasingly praised for their contribution to healthy and sustainable agrifood systems, they face a lock-in situation compared to the considerable degree of development of major crops such as wheat and soy31,32,33. In particular, debates exist on how pulses can develop compared with soy, which is not only already a dominant commodity for feed but is also becoming a dominant commodity for food. Indeed, bibliometric works have shown that more than half of the scientific works on legumes in the food sciences field are on soy34,35. Considering that the development of pulses is a major challenge for food28,33, we choose to compare the diversity of pulse species with that of soy in branded packaged foods at a global scale.

The remainder of this paper is organized as follows. Based on several statistical figures and tables, the Results section discusses the main insights. The Discussion section enhances the reflection by proposing future works for analyzing the type of processing of those ingredients and hence providing a comprehensive picture of the food technological development of pulses in food markets. We also question how to better organize data and information on food products to further assess the diversity of food species at the country scale and to link our analysis of the food industry supply with the analysis of consumer purchases. The last section (Methods) presents the data retrieved from the Mintel GNPD database and how we parsed the ingredients that were analyzed with food experts to identify the species.

Results

Species diversity in an unbalanced market

The tagging of legume-based ingredients from the dataset led to the identification of 32 different species, including soy (all presented in Table S1 in the Supplementary Materials section and in Table 1 for the first 10 species).
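As a rough illustration of this tagging step, the sketch below maps ingredient strings to species with a small keyword dictionary. The dictionary, the function names, and the comma-based splitting rule are assumptions made for this example only; the actual tagging relied on parsed ingredient lists reviewed with food experts (see the Methods section), and real ingredient lists contain nested parentheses and naming variants that this sketch ignores.

```python
import re

# Hypothetical keyword dictionary. The vernacular terms echo ingredient
# expressions quoted in this paper, but the mapping is illustrative only
# and is not the lexicon used in the study.
SPECIES_KEYWORDS = {
    "Glycine max L.": ["soy", "soya", "soybean"],
    "Pisum sativum L.": ["pea"],
    "Cicer arietinum L.": ["chickpea"],
    "Lupinus angustifolius L.": ["lupin"],
}


def split_ingredients(ingredient_list):
    """Split a raw ingredient list on commas and semicolons (simplified)."""
    parts = re.split(r"[;,]", ingredient_list.lower())
    return [p.strip(" .") for p in parts if p.strip(" .")]


def tag_species(ingredient):
    """Return the species whose keywords appear as whole words in the ingredient."""
    hits = []
    for species, keywords in SPECIES_KEYWORDS.items():
        # Word boundaries prevent "pea" from matching "chickpea" or "peanut".
        pattern = r"\b(?:" + "|".join(map(re.escape, keywords)) + r")s?\b"
        if re.search(pattern, ingredient):
            hits.append(species)
    return hits


def tag_product(ingredient_list):
    """Return (position, ingredient, species) triples; positions are 1-based."""
    tagged = []
    for position, ingredient in enumerate(split_ingredients(ingredient_list), start=1):
        for species in tag_species(ingredient):
            tagged.append((position, ingredient, species))
    return tagged


example = "Water, chickpea flour, sunflower oil, pea protein, salt, soy lecithin."
tags = tag_product(example)
for position, ingredient, species in tags:
    print(position, ingredient, species)
```

Keeping the 1-based position alongside each tag is a deliberate choice: the later analyses rely on where a species appears in the list, not only on whether it appears.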

Table 1 First 10 species frequencies in branded food product launches

The number of species identified is a rather unexpected observation and, at first glance, can be considered an encouraging result in terms of species diversity. However, we observe a highly asymmetrical distribution of the identified species. Products for which only soy is identified within the ingredient lists account for 73% (n = 249,425) of all products (n = 343,309). Conversely, products for which only one or more NSL (non-soy legume) ingredients are identified represent 19% (n = 65,653) of all products. The remaining 8% of products (n = 28,231) contain both soy and NSL ingredients. Therefore, soy-only product launches are nearly 4 times as numerous as NSL-only product launches.
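The shares quoted in this paragraph follow directly from the reported counts. The short check below recomputes them; the variable names are ours, and the two ratios are shown only to make explicit which comparison is being made (soy-only versus NSL-only products, or all products containing soy versus all products containing NSL ingredients).

```python
# Product counts as reported in the text.
soy_only = 249_425
nsl_only = 65_653
both = 28_231
total = soy_only + nsl_only + both

# Shares of the corpus, rounded to whole percentages.
shares = {
    "soy only": round(100 * soy_only / total),
    "NSL only": round(100 * nsl_only / total),
    "soy and NSL": round(100 * both / total),
}

# Ratio between exclusive categories (soy only vs. NSL only).
ratio_exclusive = soy_only / nsl_only
# Ratio when products containing both types are counted on each side.
ratio_inclusive = (soy_only + both) / (nsl_only + both)

print(total, shares)
print(round(ratio_exclusive, 1), round(ratio_inclusive, 1))
```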

The analysis of the frequencies of these species (Table 1 and Table S1) according to their position confirms this highly unbalanced market in favor of a few species, primarily soy and pea. More precisely, a quartet of species, namely, Pisum sativum L., Phaseolus vulgaris L., Cicer arietinum L., and Lens culinaris Medik., accounts for almost 79% of NSL ingredients, while approximately twenty NSL species each have a frequency of appearance below 1% among products containing NSL species. However, products with NSL ingredients mention the NSL species in the product description (on the packaging) more frequently than products with soy ingredients, for which this frequency is only 4%. For instance, 70% of products containing Lens culinaris Medik. mention this species within the product description.

This imbalance between soy and NSL ingredients can be linked to economic factors, such as the availability of each species for food companies. According to the FAOStat, soy remains the most cultivated legume in the world, with an annual production of more than 300 million tons over the last decade. This production level is 3 times greater than the total production of the above-mentioned most frequent NSL species. These observations first confirm the existence of a strong “technological lock-in” around one legume species, soy, which is also widely used for food worldwide, and the dominance of this species can be analyzed as a structural trend in agrifood markets29,32.

This dominance of soy is observed across all market segments (Fig. 1), except the “spreads” and “fruits and vegetables” segments, where the gap between products containing soy ingredients and products containing NSL ingredients is almost negligible or even reversed, respectively. However, the growth rates of products containing soy and NSL ingredients lead us to several observations. Although soy dominates in terms of volume, the cumulative growth rate of products containing NSL ingredients is much greater than that of products containing soy, regardless of the market segment considered (see Table S2 in Supplementary Materials). Some products containing NSL ingredients experience a very high cumulative growth rate, particularly in the “dairy” segment, where it is almost 12 times greater than that of soy-based products. In the “desserts” segment, it is almost 9 times greater, and in the “breakfast” segment, it is 7 times greater. More generally, these observations point to the growing interest among food companies in NSL ingredients31. Such growing interest, if confirmed over time, can favor a greater diversity of the legume species used.

Fig. 1: Shares of products containing soy-based ingredients or NSL-based ingredients in each market segment (%).

The percentage sum may exceed 100% because some products have both soy- and NSL-based ingredients (see Fig. 3). The color intensity reflects the cumulative growth of product launches in the market segment for soy and NSL ingredients over the decade. The market segment categories are those established in the Mintel-GNPD database and detailed in another work32.

Europe and southern Asia present a less unbalanced market between soy and NSL ingredients

The differences between products containing soy ingredients and those containing NSL ingredients are also more significant when we observe the share of these two categories in each main geographical area of our corpus (Fig. 2).

Fig. 2: Soy- and NSL-based products in the main geographic areas covered by the corpus.

The list of the countries covered in each geographic area is given in the Supplementary Materials (Table S4).

Soy ingredients are dominant in every geographical area compared to NSL ingredients. Nevertheless, some of these areas show a more balanced picture. This is particularly true for Europe and southern Asia, where approximately 40% of product launches contain NSL ingredients, showing a more balanced market between soy- and NSL-based products. Conversely, all other geographical areas show a much more asymmetrical market, largely in favor of soy-based products compared to NSL-based products. Thus, the share of products containing soy ingredients in the North and South American markets, as in Eastern and Southeastern Asian markets, is extremely dominant (approximately 80%). In the other geographical areas, which are smaller and less well-documented markets in the Mintel database (notably Africa), the share of packaged food products containing soy, although lower (over 60% and under 73%), remains dominant.

The interpretation of such differences across geographical areas is probably multifactorial in nature. The structuring of the different markets may reflect differences in terms of food culture. For instance, soy products dominate Asian markets, except the southern Asian market, which includes India, a country where pulse (and particularly lentil) production and consumption levels are among the highest in the world36,37. Moreover, such differences can also be interpreted as the consequences of different national or international public support schemes for pulse consumption, as is the case for Europe38. Nevertheless, regarding Europe, this quasi-balance between soy and NSL food products can indicate a shift in the technological lock-in that European countries have encountered until now31,32,33, which is beneficial for greater crop diversity in the processed food supply in the near future.

More generally, the overall structure of the corpus, whether in terms of market segments or geographical areas, reveals that a small number of species account for the bulk of legume-based packaged food innovations (i.e., new products or launches). This concentration stands in the way of greater diversity in processed food markets. The more we use a small number of species to produce a larger and growing variety of foods (soy is present in all market segments), the less room there is for the development of other species. This situation, partly resulting from historical and economic factors that lead to a lock-in situation, can undergo contemporary changes. However, to confirm an actual possible shift that would favor the use of NSL species in food offerings, we also need to look at the ways in which these species are used in product formulations.

Product-context use of legumes: entering the importance of the ingredient

To further assess the diversity of the legume species used in product launches, we examine the “product-context of use” of these species. In terms of diversity, we assume that it can be misleading to treat as equivalent food products that use these species for different, and unknown, reasons. Notably, from the point of view of product formulation, we may consider that the functional properties (technological, organoleptic, nutritional, etc.) derived from the parts of the species used matter more than the species itself. We therefore propose to approach what we call the “product-context of use” by jointly analyzing the different positions at which those species appear in ingredient lists.

A good starting point is the examination of where the identified species appear in ingredient lists. Regulations require ingredients to be listed in descending order of weight, with the first ingredient weighing the most and the last ingredient weighing the least. Hence, we assume that a species that is used only for a few of its functionalities (for instance, a treatment process aimed at extracting one or more of its parts, such as peptides, starches, and gelling compounds) is more likely to be found among the least important ingredients of an ingredient list (i.e., those weighing the least). This approach can be further refined by assessing whether or not the species identified in food products are part of the marketing pitch. We assume that the mention of the species on product packaging (in addition to its appearance in the ingredient list) gives higher specificity to the species used, as it is positively associated with the identity of the product. From this point of view, the differences between soy and NSL ingredients are quite striking.

Table 2 reports the mean position of soy- and NSL-based ingredients according to the ingredient list length, grouped in deciles. We observe that half of all the products (52%) containing NSL ingredients are concentrated within the first four deciles; the first decile accounts for almost 20% of the products containing NSL ingredients. For products containing soy ingredients, this threshold is reached from the 6th decile upward.
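The kind of aggregation behind Table 2 can be sketched as follows. This is a minimal illustration under our own assumptions: the record layout (ingredient type, list length, position), the rank-based decile rule, and the toy data are all hypothetical and do not reproduce the study's data or exact procedure.

```python
from collections import defaultdict
from statistics import mean


def mean_position_by_decile(records):
    """Group records into deciles of ingredient list length and average positions.

    records: iterable of (kind, list_length, position) tuples, one per tagged
    product, where kind is e.g. "soy" or "nsl".
    Returns {(decile, kind): mean position}, with deciles numbered 1 to 10.
    Deciles are rank-based over all records, so ties in list length may
    straddle a decile boundary (a simplification).
    """
    ordered = sorted(records, key=lambda r: r[1])
    n = len(ordered)
    groups = defaultdict(list)
    for rank, (kind, _length, position) in enumerate(ordered):
        decile = rank * 10 // n + 1
        groups[(decile, kind)].append(position)
    return {key: mean(values) for key, values in sorted(groups.items())}


# Synthetic toy data: in this invented corpus, soy sits about halfway down
# the list and NSL about a quarter of the way down.
toy_records = []
for length in range(2, 22):
    toy_records.append(("soy", length, max(1, length // 2)))
    toy_records.append(("nsl", length, max(1, length // 4)))

table = mean_position_by_decile(toy_records)
for (decile, kind), value in table.items():
    print(decile, kind, value)
```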

Table 2 Soy-based and NSL-based ingredients mean position according to ingredient list length

More generally, soy-based ingredients tend to appear more frequently in food products with complex formulations (i.e., longer ingredient lists) and almost systematically at a higher rank, i.e., further down the list (column “Soy ingr. mean position” in Table 2), than NSL-based ingredients. In all the deciles except for the last three, the mean position of NSL ingredients is lower than that of soy ingredients. This finding means that NSL ingredients tend to appear closer to the top of ingredient lists than soy ingredients, suggesting that the amount of the former used in product formulations is probably greater than that of the latter. This result can be explained by the fact that soy cracking has been much more widely studied than pulse cracking in the field of food sciences and technology, particularly during the last decade34. Research and development in this domain have led to a broader knowledge base for the various uses and functionalities of soy in comparison with pulses/NSL. In view of this, our results are consistent with soy use being associated with a larger array of functional ingredients than NSL use. Soy ingredients are therefore probably more likely than NSL ingredients to be used as additives in product formulations, whereas NSL ingredients are most often found among the first ingredients on ingredient lists (Fig. 3).

Based on these new criteria, we refine our analysis by classifying species according to their frequency of appearance among the first (i.e., top) five ingredients. Figure 4 presents the results of this classification, which is carried out using k-means clustering (see the Methods section) and interpreted by the food science experts involved in the analysis. The 5 resulting groups are identified by categorical colors, and species are displayed in a three-dimensional space showing their frequency of appearance (as a percentage) among the first 5 ingredients, the remaining ingredients, and on the product packaging. In addition, to help interpret the results of this clustering method, we also examine the most frequent ingredient expressions associated with the species in each group.
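To make the clustering step concrete, here is a minimal pure-Python sketch of k-means on three-dimensional points of the form (InFirst%, InRemList%, InDesc%). Everything specific in it is an assumption for illustration: the species labels and coordinates are invented, the number of clusters and the initial centroids are fixed by hand, and the study itself chose the number of clusters using silhouette coefficients (see the Methods section), which this sketch does not implement.

```python
from math import dist


def kmeans(points, initial_centroids, max_iter=100):
    """Plain k-means (Lloyd's algorithm) with user-supplied initial centroids.

    Returns (labels, centroids), where labels[i] is the index of the centroid
    closest to points[i] once the assignments stop changing.
    """
    centroids = [tuple(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Invented coordinates: (InFirst%, InRemList%, InDesc%) for hypothetical species.
toy_species = {
    "species_a": (85, 15, 70),
    "species_b": (80, 20, 65),
    "species_c": (50, 50, 20),
    "species_d": (45, 55, 25),
    "species_e": (10, 90, 5),
    "species_f": (15, 85, 3),
}
names = list(toy_species)
points = [toy_species[name] for name in names]

# Hand-picked starting centroids, one per expected group (an assumption).
labels, centroids = kmeans(points, [points[0], points[2], points[4]])
for name, label in zip(names, labels):
    print(name, label)
```

Fixing the initial centroids keeps the toy example deterministic; with random initialization, k-means can converge to different partitions from run to run.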

Fig. 3: Sankey diagram of the dataset according to the position of soy or NSL ingredients (n = 343,344 food products).

The dataset consists of the following 3 main subcorpora: products containing soy ingredients, products containing NSL ingredients, and products containing both types of ingredients. For each subcorpus, the position of the ingredients in the list is provided, either among the first 5 ingredients or among the remaining ingredients. For the small portion of products having both NSL and soy ingredients, their position at the top or in the remaining list can differ, resulting in four flows from this node.

Fig. 4: 3D scatterplot of the most frequent legume species.

Each species is plotted in a 3D graph according to its frequency of appearance in the first five ingredients (InFirst%), in the remaining ingredients (InRemList%), and in the product description (InDesc%). Each color represents a cluster resulting from k-means clustering (see the Methods section for the clusters requested according to results provided by the silhouette coefficients method44).

In Fig. 4, a central axis structures the cluster distribution. This axis distinguishes the species mostly found among the first 5 ingredients and frequently in product descriptions (green cluster) from those found more frequently among the remaining ingredients and rarely cited in product descriptions (blue cluster). More precisely, at one end of this axis, we find a set of 6 species highlighted in green, namely, Phaseolus coccineus L., Cajanus cajan L., Lens culinaris Medik., Phaseolus vulgaris L., Cicer arietinum L., and Vicia faba L., characterized by a high frequency of appearance among the first ingredients and a high rate of mentions in product descriptions. These features lead us to suggest that product identity is more closely associated with those NSL species, regardless of their functional use, than with other NSL species. In that sense, species from this group can have a more positive impact on market biodiversity, as they are of key interest for the food industry, in comparison to species used only in terms of functional interest, and therefore can be substituted by other species. The most frequent ingredients associated with species from this group do not seem to indicate their fractional use. For example, in the case of Lens culinaris Medik., the most common ingredients mentioning this species are listed directly by its vernacular name, without mentioning specific parts (“lentils”, n = 2522; “red lentils”, n = 1250; and “green lentils”, n = 818). When this ingredient is associated with a processing term, the most frequent term is milling (“lentil flour”, n = 1195). The same is true for Cicer arietinum L. (“chickpeas”, n = 7467, and “chickpea flour”, n = 3295). This cluster gathers ingredients that seldom undergo processing.

At the opposite end of this central axis, plotted in blue, we find a group of 3 NSL species (Canavalia gladiata Jacq., Pachyrhizus erosus L., Dolichos lablab L.) and soy (Glycine max L.). They present the opposite profile to that described above: a low frequency of appearance among the first ingredients and in product descriptions. This cluster can also include Ceratonia siliqua L. (plotted in purple), which has been identified as a cluster in its own right due to its extreme behavior: it is hardly ever mentioned either in product descriptions or among the first ingredients. In this group, for the two most frequent species, Glycine max L. and Ceratonia siliqua L., the most frequently associated ingredients correspond to fractional uses (“soy oil”, n = 58,584; “soy lecithin”, n = 50,490; “soy protein”, n = 18,569; “locust bean gum”, n = 4903; and “carob bean gum”, n = 2242).

This axis, which contrasts species according to their frequency of appearance (within the first 5 ingredients and in the product description), may also tend to oppose different product contexts of use of species, thus bringing us back to our initial hypothesis: the more frequently that a species is used in a fractional way, the more likely it is to be found among ingredients of lesser importance (in terms of volume and therefore rank and level) in ingredient lists and the less prominence it is given in product packaging.

Hence, the case of the median cluster (plotted in red in Fig. 4) is very interesting. Here, we find species characterized by a balanced score between their frequency of appearance among the first and remaining ingredients but not systematically mentioned in product descriptions (Pisum sativum L., Vigna unguiculata L., Vigna angularis L., Vigna radiata L., and Lupinus angustifolius L.). According to our main hypothesis, this median position can reveal various strategies of the food industry for these species, which can become either more closely tied to product identity (“identical”) or more “generic”, depending on their future uses. In other words, most of the time, these species can be used as effective key components for product formulation. The analysis of the most frequent ingredients citing the main species of this group seems to substantiate this observation. For Pisum sativum L., the two most common ingredients are “peas” (n = 10,132) and “pea protein” (n = 6423), and for Lupinus angustifolius L., the most common ingredient is “lupin flour” (n = 1095).

Finally, a group of three species (plotted in orange in Fig. 4), made up of Vigna aconitifolia Jacq., Vigna mungo L., and Phaseolus acutifolius L., seems to be opposed to the first group described (plotted in green) due to the weaker propensity of these species to be cited in product descriptions. These “discrete” species have a very low frequency of appearance, but the analysis of the most cited related ingredients brings them closer to the first group. For example, the most common ingredients referring to Vigna mungo L. mention the species by its vernacular name (“black gram lentils”, n = 1412, and “black lentils”, n = 114), and when a process is mentioned, in most cases, it concerns flour, the resulting product from grinding possibly coupled with sieving (“black lentil flour”, n = 56). We observe the same phenomenon for Phaseolus acutifolius L. (“tepary bean flour”, n = 128, and “tepary beans”, n = 25).

Discussion

Although there is abundant and growing scientific literature on the benefits of biodiversity for sustainable agrifood systems, this paper constitutes the first attempt to assess species diversity in the food market through the processed/packaged food products launched in the main regions of the world. We develop original text-mining analysis methods for tagging species and interpreting species diversity through the food ingredients used by the industry. We base this approach on the different positions of the species under study within ingredient lists and from additional descriptions on product packaging.

With a focus on legume species, a botanical family at the heart of sustainability issues in contemporary agrifood systems, this work compares the development of soy in processed food markets with the development of NSL. By analyzing the ingredient compositions of approximately 350,000 food products launched worldwide, we identify more than 30 different NSL species used in product formulation, which suggests consistent species diversity. Nevertheless, we qualify this finding by taking an in-depth look at the frequency of appearance and ways in which such species are used in product formulation and then promoted on packaging. Through this first analysis, we can assess to what extent the industrial food market is concentrating on certain dominant species, which can hamper the development of species diversity. Indeed, one of our main hypotheses regarding path dependency is that the more a market locks in a certain species, the more difficult it becomes for other species to emerge, and this lock-in situation becomes a major hindrance to increased species diversity. We thereby confirm, at the food market level, what other studies have shown at the crop field level: there is still a strong lock-in situation around soy-based products, despite the perceptible shift in favor of other pulses. In this regard, our results particularly highlight the specificity of the European market, which presents a more balanced use of soy and NSL ingredients.

Furthermore, we assume that the position in ingredient lists of ingredients related to soy or NSL is a proxy for the ways in which these species are used. Considering this position (in the ingredient list) is a way to distinguish species that are more valued in their entirety from more “flexible” species used for their ability to be decomposed into various functional ingredients. By combining this information with the ways in which species are promoted or not promoted on packaging, we suggest that the notion of species and, therefore, of biodiversity, tends to be replaced by that of functionality. The food industry’s lack of interest in highlighting the species simply as it is, before any processing, reflects an agrifood system where production is increasingly valued through compositional/processed foods, a paradigm where food is first considered a technological process of assembling elements to fulfill a desired goal such as increasing protein level, improving texture, or avoiding specific flavors16. If this work constitutes a first step toward a deeper analysis of what we call “the product-context use” of species in food products, then further research should undertake a deeper assessment of the processing profile of such ingredients.

This work also demonstrates the interest and feasibility of analyzing the agrifood market supply at the ingredient scale. While some other works studying the composition of food products have addressed food safety21 and climate issues39, here, we focus on species diversity issues. Identifying how processed food impacts agrifood systems remains a key challenge for informing public policymakers and consumers. We argue that this type of analysis, at the ingredient level, can serve as a tool for public policy to steer agrifood markets toward more sustainable goals. However, this approach depends on the availability of food databases and accessible controlled vocabularies and ontologies. Except for the USDA Branded Food Products Database and the crowd-sourced Open Food Facts database, the most extensive food databases required to conduct such an analysis are privately owned. Despite the intrinsic quality of these databases, whether for research purposes or to support public policy, private ownership does not guarantee the control and transparency of information from collection to data processing. In this context, benefiting from controlled vocabularies validated by the academic community on the ingredients linked to processed foods can not only ensure transparency but also constitute a powerful tool for public policy40. We also advocate for enriched public databases on the processed food supply and for future works to confront this food supply with consumer purchases to analyze the market failure of species diversity. Furthermore, this work reveals several additional perspectives that we briefly discuss hereafter.

Firstly, identifying the species used by the food industry is of particular importance because agricultural diversity is a main lever for sustainability, and food outlets remain the main driver of cultivation choices by farmers. Beyond this aim of species diversity assessment, text-mining methods for ingredient lists (and any other information from packaging) allow us to follow the market development of any crop species, particularly those that are expected to be developed, such as pulses. Future work can further analyze the processing approach applied to crops to identify the ingredients and determine the types of processed food products under development. What a firm offers on the market reveals its know-how and reflects the technological paradigm that prevails in a given sector. Hence, this information can also be useful for science and innovation policies in the agrifood sector to support preferred technological paradigms that meet societal expectations regarding, for instance, less processed or minimally processed food with “cleaner” formulations. This work also highlights the need to develop a common vocabulary or ontology that favors transparency and contributes to debates about food classifications. For instance, a clear classification can allow for a better distinction between plant- or chemical-derived ingredients. The outcome of this work constitutes a case study that can be integrated into ontology development, such as TransformON or FoodON, which are interconnected databases in the agrifood system40,41.

Secondly, a more complete picture requires further work. First, the Mintel database’s coverage of countries is not complete. Second, we need to look at all species and consider other food supplies, including the indigenous supply, which particularly concerns less developed countries compared with the processed/packaged food supplies provided in most modern countries. However, the availability of these data is challenging. Concerning processed/packaged food, we highlight several impediments, such as the absence of global regulation on how to name most ingredients, which complicates the analysis and calls for the development of specific dictionaries and ontologies. This is also true for indigenous supplies, but in addition, qualitative inquiries are required because no database exists for more traditional food procurement. In addition, the issue of the importance of the ingredients in the product and, more broadly, in the supply (each product also having various weights) remains open. The position of an ingredient among other ingredients in a list provides information on this “importance” but requires complex calculations to assess this biodiversity if we aim to consider all species. Such a complete picture requires the use of other metrics, such as the Herfindahl index or similar metrics.
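For reference, the Herfindahl index mentioned above is the sum of squared shares: it equals 1 when a single species accounts for everything and 1/N when N species are equally represented. The sketch below shows the calculation on invented counts; the function names and the toy figures are assumptions, and no such index was computed in this study.

```python
def herfindahl(counts):
    """Herfindahl index: sum of squared shares over all categories.

    counts: mapping from a category (e.g., a species) to a non-negative count.
    """
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return sum((n / total) ** 2 for n in counts.values())


def effective_number(counts):
    """Inverse of the index: the number of equally used species that would
    produce the same level of concentration."""
    return 1 / herfindahl(counts)


# Invented counts of product launches per species (illustration only).
toy_counts = {"soy": 70, "pea": 15, "common bean": 10, "lentil": 5}

hhi = herfindahl(toy_counts)
print(round(hhi, 3), round(effective_number(toy_counts), 2))
```

In this toy example, four species are present but the concentration is equivalent to fewer than two equally used species, which is the kind of contrast between variety and concentration discussed in this paper.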

Thirdly, one main hypothesis of our study is that the market incentivizes farmers to make certain cultivation choices. Therefore, rarely used species are rarely cultivated. This coevolution process creates a path dependence that favors the development of major crops. While a global assessment of the species diversity used in the industry provides first-order information to confirm this path dependence, looking at this issue within each country can inform policymakers on how to support new crops for more diversified agroecosystems around major species that may differ across countries. The issues related to the diversification of crop rotations differ across regions. To explore this issue further, we need to know the origin of ingredients (at least for the first, more important, ingredients) and compare it with the places where the products are retailed. However, mentioning the geographical origin of the main ingredients is not mandatory. Geographical information about products usually concerns the headquarters of the retailing firm and not the origin of the ingredients themselves.

Finally, another challenge concerns how to compare the food supply provided by the industry with the purchases of consumers and even with their different diets (including catering and restaurants, for instance). This issue also requires further methodological work, as there is no easy way to compare launch data with purchase data. This is likely why the literature is unclear regarding the failure rate of new products launched on the market42. When studies focus on consumer purchases in retail channels with, for instance, Kantar data, it is not known which products available in retail channels are not purchased, and thus, we do not know which new products from the food industry (notably, those that can meet societal challenges) are not chosen by consumers. Those failed products exit the market after several months and are thus not studied. However, it would be valuable to understand the profile of those products and to investigate why they do not meet consumer demand. Price can be one reason, but probably not the only one. To compare purchase and launch data, one solution is to merge Kantar and Mintel product data according to a shared identifier, for instance, product barcodes. Nevertheless, this approach is currently not possible because the Kantar data do not include barcodes and, above all, because barcode regulation allows the reuse of a barcode previously assigned to a product that has disappeared from the market, which increases the risk of error. Therefore, another solution is to use text-mining methods to match products at the product description level or to compare their ingredient lists. These text-mining methods also allow us to follow over time the changes in the ingredient list of a branded product that maintains a similar name or product description.
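To make the idea of matching products through their ingredient lists more concrete, the sketch below compares two lists with a Jaccard similarity on word tokens. This is only one possible approach and was not applied in this study: the function names, the 0.6 threshold, and the example strings are hypothetical, and the tokenizer ignores non-ASCII letters and word order.

```python
import re

def ingredient_tokens(ingredient_list):
    """Lowercase an ingredient string and return its set of word tokens."""
    return set(re.findall(r"[a-z]+", ingredient_list.lower()))

def jaccard_similarity(list_a, list_b):
    """Share of tokens common to both lists among all tokens they use."""
    tokens_a = ingredient_tokens(list_a)
    tokens_b = ingredient_tokens(list_b)
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def best_match(launch, purchases, threshold=0.6):
    """Return (purchase, score) for the most similar candidate, or None
    when no candidate reaches the (arbitrary) threshold."""
    scored = [(p, jaccard_similarity(launch, p)) for p in purchases]
    candidate = max(scored, key=lambda pair: pair[1], default=None)
    if candidate is None or candidate[1] < threshold:
        return None
    return candidate

# Hypothetical ingredient lists, for illustration only.
match = best_match(
    "Water, chickpeas, salt",
    ["sugar, cocoa butter", "chickpeas, water, salt, lemon"],
)
```

In practice, such a score would probably be combined with brand names, product descriptions, and launch dates before two records are treated as the same product.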

All these perspectives suggest that text-mining methods open a new research agenda for furthering the understanding of the evolution of food markets, particularly for following the evolution of crop species in foods.

Methods

Data sourcing

The data are retrieved from Mintel’s Global New Products Database (GNPD), which tracks food product launches in more than 80 countries. The Mintel-GNPD database features approximately 7 million registered branded products (on average, 40,000 product descriptions have been added each month in recent years). It is currently the only database that provides such global coverage of packaged/processed food offerings, with detailed information at the product level43. The map in the Supplementary Materials section provides an overview of the geographical coverage that the Mintel database achieves through its global shopper network (Fig. S1). This coverage has progressively expanded, mainly since 2010, and now includes emerging and developing countries in addition to Western countries.

By comparison, the Open Food Facts database lists only approximately 2,800,000 branded products, and the United States Department of Agriculture (USDA) Global Branded Food Products database lists more than 3,500,000 products. However, these two databases rely on voluntary declarations, mainly from consumers or industry firms, and cover only a few countries. The Mintel-GNPD database relies on a census carried out by shoppers hired in each country it covers to identify any new product launched on the market and to report the full information present on the product packaging. In this way, Mintel data provide a picture of the packaged food supply based on the products that the food industry provides in markets worldwide. Because our objective is to assess how the food industry is moving toward greater pulse diversity, we choose to study food launches rather than food purchases. In addition, comparing supply and demand data raises unsolved methodological problems, which we discuss in the last section.

Data are retrieved from the Mintel database via queries targeting legume-related products (legume, pulse, bean, pea, lentil, chickpea, etc.) in all food and drink product categories (except alcoholic beverages and mineral or source waters) and concerning launches from 2010 to 2021 (see screen capture of the query built in Mintel search engine provided in the Supplementary Materials, Fig. S2).

We keep approximately 350,000 products launched during the period 2010–2021 worldwide, with at least one pulse species or soy ingredient. The data are extracted in the form of a tabulated file gathering all the metadata available for each food product, such as the ingredient lists, production and marketing locations, types of product launches (new product, range extension, reformulation, repackaging or relaunch), targeted markets, and claims mentioned.

Data parsing

As ingredient lists are provided in the form of variable-length character strings, which are, by definition, difficult to manipulate and exploit by computational methods, we specifically design a parsing algorithm for this work. This algorithm, which is written in Python, is detailed in Salord et al.43 and transforms the ingredient lists extracted from the Mintel database into structured and standardized ingredient dictionaries (in JSON format). Furthermore, this algorithm relies on a set of standardization rules based on the empirical observation of variations in the syntactic notations of these ingredient lists.

These normalization rules are translated within the algorithm in the form of regular expressions, thereby making the procedure inexpensive in terms of computing resources. Furthermore, raw ingredient lists can be transformed into ingredient dictionaries without any loss of information. Hence, we keep key features such as the rank of the ingredient within the ingredient list (whether it is the first, second, etc., ingredient) and its depth of appearance (whether or not it appears in brackets as an ingredient of another ingredient). A depth of 1 indicates that the ingredient is not mentioned as being an ingredient of another ingredient. Fig. 5 illustrates these features.

Fig. 5: Example of a parsed list of ingredients.

Illustration of the output dictionaries obtained by the parser43. Each ingredient has an identifier indicating its order of appearance. Information such as the initial name of the ingredient, the depth to which it appears in the list of ingredients (the ingredient of an ingredient), the comment(s) associated with this ingredient, and any proportional information is retained.
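The notions of rank and depth described above can be illustrated with a deliberately simplified sketch. It is not the parser of Salord et al.43: it scans characters instead of applying the regular-expression rules of the original algorithm, it does not handle comments or proportions (a bracketed percentage would be read as a sub-ingredient), and the function name and output keys are hypothetical.

```python
def parse_ingredients(raw):
    """Split a raw ingredient string into dicts with id, name, and depth.

    Depth 1 means a top-level ingredient; depth 2 means an ingredient
    listed in brackets after another ingredient, and so on.
    """
    parsed, buffer, depth = [], [], 1

    def flush(current_depth):
        # Store the text accumulated so far as one ingredient, if any.
        name = "".join(buffer).strip(" .")
        buffer.clear()
        if name:
            parsed.append(
                {"id": len(parsed) + 1, "name": name, "depth": current_depth}
            )

    for char in raw:
        if char in "([":
            flush(depth)  # the parent ingredient ends where its bracket opens
            depth += 1
        elif char in ")]":
            flush(depth)  # last sub-ingredient inside the bracket
            depth = max(1, depth - 1)
        elif char in ",;":
            flush(depth)
        else:
            buffer.append(char)
    flush(depth)
    return parsed

# Hypothetical ingredient list, for illustration only.
items = parse_ingredients("Water, hummus (chickpeas, tahini [sesame]), salt.")
```

Here, the identifier records the order of appearance across all depths, so "salt" receives identifier 6 although it is the third top-level ingredient.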

For our analysis, we consider legume-based ingredients of level-1 depth, and we distinguish those among the first five ingredients (which we call the “In Top List”) from the remaining ingredients (which we call the “Remaining list”). This distinction is adopted considering both the average distribution of soy-based ingredients and that of ingredients related to other legume species within ingredient lists, which have variable lengths.
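A minimal sketch of this split is given below. It assumes that the rank is counted among depth-1 ingredients only, which the text does not state explicitly, and it uses hypothetical names and data.

```python
TOP_N = 5  # the first five ingredients form the "In Top List"

def split_by_rank(ingredients, top_n=TOP_N):
    """Split depth-1 ingredients into (in_top_list, remaining_list).

    Assumption: rank is counted among depth-1 ingredients only.
    """
    top_level = [item for item in ingredients if item["depth"] == 1]
    return top_level[:top_n], top_level[top_n:]

# Hypothetical parsed list, for illustration only.
parsed = [
    {"name": "water", "depth": 1},
    {"name": "wheat flour", "depth": 1},
    {"name": "gluten", "depth": 2},
    {"name": "sugar", "depth": 1},
    {"name": "palm oil", "depth": 1},
    {"name": "salt", "depth": 1},
    {"name": "pea protein", "depth": 1},
]
in_top_list, remaining_list = split_by_rank(parsed)
```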

Ingredient tagging

After structuring the textual data extracted from the ingredient lists, we can isolate the legume species present in the corpus. To perform this tagging stage, we use a mixed method relying both on text-mining techniques and on the scientific expertise of a group of six food scientists who contributed to designing a dictionary of tagging rules. One of the main difficulties in identifying mentions of plant species (in our particular case, soy and non-soy legume (NSL) species) in ingredient lists is that these species are almost systematically designated by their vernacular name or by part of it, which is sometimes misspelled. In addition, some of these names can be sources of error or confusion depending on the context in which they are cited. To overcome such issues and rigorously associate ingredients with the scientific name of the plant species to which they refer, the tagging method is based on the identification of the stems most often associated with a given species and on the context in which they are written. Thus, we first extract from the correctly parsed ingredient lists all ingredients containing stems that can be associated with a pulse species (“pea”, “bean”, “lens” or “lentil”, “gram”, etc.) as well as the words immediately preceding and following them. Such words are of utmost importance, as they tell us which species is mentioned. For example, for “black-eyed peas” and “grass peas”, it is erroneous to deduce the presence of Pisum sativum L. simply from the presence of the stem pea. Here, each of the terms preceding “peas” refers to a different pulse species: Vigna unguiculata L. for “black-eyed peas” and Lathyrus sativus L. for “grass peas”.
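The role of the context words can be sketched as follows. The rule table is a hypothetical excerpt built from the examples given in the text, not the dictionary of Table S3; the fallback from “pea” to Pisum sativum L. and the naive plural stripping are assumptions, and the sketch matches whole words rather than stems embedded in longer words.

```python
import re

# Hypothetical excerpt of tagging rules: (preceding word, stem) -> species.
# A preceding word of None is the fallback when no context rule matches.
RULES = {
    ("black-eyed", "pea"): "Vigna unguiculata",
    ("grass", "pea"): "Lathyrus sativus",
    (None, "pea"): "Pisum sativum",
    ("coffee", "bean"): None,  # false friend, not a legume
}

def tag_species(ingredient):
    """Return the species for the first stem found, or None if untagged."""
    words = re.findall(r"[a-z][a-z\-]*", ingredient.lower())
    for index, word in enumerate(words):
        stem = word[:-1] if word.endswith("s") else word  # naive plural stripping
        previous = words[index - 1] if index > 0 else None
        if (previous, stem) in RULES:
            return RULES[(previous, stem)]
        if (None, stem) in RULES:
            return RULES[(None, stem)]
    return None
```

With these rules, `tag_species("Black-eyed peas")` returns the cowpea species rather than Pisum sativum, whereas `tag_species("pea protein")` falls back to the default rule for the stem.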

This fine-grained analysis is conducted by pulse food science experts, who helped identify, for each combination of a given stem and its preceding and following words, the correct scientific name of the related species. When no consensus is reached among the experts or when it is impossible to clearly identify a plant species, we use an external gold standard database, such as the Catalog of Life, or we include outside experts. This is particularly the case when it comes to understanding and correctly identifying the rich diversity of pulse species involved in “dahl”, a traditional Indian meal. If, nevertheless, no consensus is reached or the identification of a species is not assured, then the ingredient is considered ambiguous.

Nevertheless, some other ambiguities remain between fresh and dried legumes. Firms sometimes do not distinguish between the use of fresh and dried legumes, although the term “pulses” is reserved for legumes harvested once their grains have dried. To avoid any ambiguity, because it is sometimes impossible to deduce from food product compositions whether the legumes used are fresh or dry, we label the pulse species identified in our corpus as NSL. This substantial annotation work results in a dictionary of expressions, which we provide as supplementary data (Table S3). This dictionary is then translated into a computer automaton that labels all the ingredients in our corpus whose form matches an expression referring to a legume species. Fig. 6 summarizes this tagging procedure.

Fig. 6: Flow chart of the tagging process.

Here, each stage of the tagging process is illustrated. The dictionary mentioned in the chart is given as supplementary material (Table S3).

For the purpose of this study, the same tagging process is applied to product descriptions to identify the legume species mentioned on packaging beyond the ingredient list. Such mentions tell us how a particular species is promoted and how it is used to add value to the product. All food product descriptions (as reported in the Mintel database) are tokenized and cleaned before applying the same set of labeling rules that we use for ingredient lists. For this specific stage, no expert supervision is used.

Final dataset

At the end of these processing stages, once ambiguous mentions of species are discarded, we obtain a dataset of 343,309 products, not considering the following specific cases (n = 4800): (i) ingredients that refer ambiguously to a legume species (as in the expression “pulse bean”), (ii) false friends (as in the expression “coffee bean” or “coco pea”), (iii) legume species that do not enter into the subfamily of pulses such as Arachis hypogaea L., and (iv) cases where the tagger returned an error, i.e., a “none” value.

Our final dataset is composed of three subcorpora: (i) products whose lists of ingredients contain soy-based ingredients and no NSL-based ingredients (n = 249,425), (ii) products whose lists of ingredients contain NSL-based ingredients and no soy-based ingredients (n = 65,653), and (iii) products containing both soy-based and NSL-based ingredients (n = 28,231).
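The assignment of a product to one of these three subcorpora can be sketched as below, together with a check that the reported subcorpus sizes sum to the 343,309 products of the final dataset. The function name and labels are hypothetical.

```python
def assign_subcorpus(has_soy, has_nsl):
    """Return the subcorpus label for a product, or None if it has no legume."""
    if has_soy and has_nsl:
        return "soy_and_nsl"
    if has_soy:
        return "soy_only"
    if has_nsl:
        return "nsl_only"
    return None

# Subcorpus sizes as reported in the text.
SUBCORPUS_SIZES = {"soy_only": 249_425, "nsl_only": 65_653, "soy_and_nsl": 28_231}
total_products = sum(SUBCORPUS_SIZES.values())
```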

Statistics

The descriptive statistics presented in the paper are computed considering these three subcorpora at the following different scales: (i) at the product level, identifying the main characteristics of products containing legumes (soy and NSL), and (ii) at the species level, characterizing the way in which they are used in packaged food product launches. This second level constitutes the first step toward a more systematic analysis of the uses of these plant species. The present study involves an initial effort to categorize these species according to their frequency of appearance in the food industry (whether they are frequent, infrequent, or rare in food product formulations) and their ranking (whether they appear among the first or last items on ingredient lists based on the assumption that ingredient lists are ordered according to the weight of the ingredient). This categorization work is carried out by performing k-means clustering on the frequencies of appearance of legume species among the first ingredients on ingredient lists and among the last ingredients on ingredient lists, provided that these species appear at least one hundred times in the corpus (40 randomly chosen centroids for 3000 iterations, with a maximum of 2 to 11 clusters requested). The selection of the number of clusters to be retained is made using the silhouette coefficient method (\(S(i)=\frac{b(i)-a(i)}{\max \left\{a(i),b(i)\right\}}\), where \(a(i)\) is the mean distance between observation \(i\) and the other members of its cluster and \(b(i)\) is the smallest mean distance between \(i\) and the members of any other cluster), the graphical output of which is provided in the Supplementary Materials (Fig. S3 and Table S5). The results of this work can be found in Fig. 4 in the “Results and Discussion” section.
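The silhouette coefficient used to select the number of clusters can be sketched in plain Python as follows. This is an illustration of the standard definition with Euclidean distances, not the code used in the study; the function name, the points, and the cluster labels are hypothetical, and the k-means step itself is not reproduced.

```python
from math import dist
from statistics import mean

def silhouette_scores(points, labels):
    """Return S(i) for every point, given one cluster label per point."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a(i): mean distance to the other members of the same cluster
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster
        b = min(
            mean([dist(point, other) for other in members])
            for other_label, members in clusters.items()
            if other_label != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator else 0.0)
    return scores

# Hypothetical 2-D points (e.g., two frequencies per species), illustration only.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = mean(silhouette_scores(points, [0, 0, 1, 1]))
bad = mean(silhouette_scores(points, [0, 1, 0, 1]))
```

Averaging S(i) over all observations for each candidate number of clusters and retaining the number with the highest average is the usual way to apply this criterion.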