Introduction

The introduction of input–output (I–O) analysis as a fundamental tool to analyze the inter-relationship between economic sectors of a country was pioneered by W. Leontief, who proposed the construction of the first I–O tables for the United States for the years 1919 and 19291,2. An I–O table summarizes how the products (outputs) of a given industry or economic sector are used as input to other industries or sectors within the same, or different, economies (for instance, in the case of import/export exchanges with other countries)3. Understanding the structure and relevance of industrial sectors and countries within the so-called global value chains (GVCs), encompassing the different stages of the production process across different countries, is of central importance4. To achieve this, a number of indicators and measures have been devised that characterize the relative positioning of industries and economic sectors in the economy. These rely on the calculation of the following technical object,

$$\begin{aligned} G(A)=(\mathbbm {1}_N-A)^{-1}\ , \end{aligned}$$
(1)

the so-called Leontief inverse (or resolvent) matrix. Here, \(\mathbbm {1}_N\) is the \(N\times N\) identity matrix (where N is the number of industrial sectors) and A is a (row) sub-stochastic matrix, which is related in a simple fashion to the original I–O table. A sub-stochastic matrix A is such that its entries are non-negative and \(\sum _{j}A_{ij}\le 1\) for each row i. Notably, the upstreamness and downstreamness metrics proposed by Antràs, Chor and collaborators (see Sect. 2 for mathematical definitions) have become widely used in recent years5,6,7. They are meant to represent the average distance of a sector from final demand, and from primary factors of production, respectively. One of the main practical challenges of I–O analysis lies in the accurate and reliable compilation of the inter-sectorial I–O tables from which the matrix A in formula (1) is derived. This issue is particularly acute at the firm level, where often only aggregate information is available8.
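As a minimal illustration of Eq. (1), the sketch below checks sub-stochasticity and builds \(G(A)\) from the series \(\sum _{k\ge 0}A^k\), which converges when the spectral radius of A is below one. The \(3\times 3\) matrix and the helper names (`is_substochastic`, `mat_mul`, `leontief_inverse`) are illustrative assumptions and are not taken from any I–O table; plain Python lists are used only to keep the example dependency-free.

```python
def is_substochastic(A, eps=1e-12):
    """True if all entries are non-negative and every row sums to at most 1."""
    return (all(a >= 0 for row in A for a in row)
            and all(sum(row) <= 1 + eps for row in A))

def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def leontief_inverse(A, tol=1e-14, max_terms=10_000):
    """Approximate G(A) = (1 - A)^{-1} by the truncated series sum_k A^k."""
    n = len(A)
    G = [[float(i == j) for j in range(n)] for i in range(n)]      # k = 0 term
    power = [row[:] for row in G]
    for _ in range(max_terms):
        power = mat_mul(power, A)                                  # A^k
        G = [[G[i][j] + power[i][j] for j in range(n)] for i in range(n)]
        if max(max(row) for row in power) < tol:                   # entries are >= 0
            return G
    raise RuntimeError("series did not converge; spectral radius may be >= 1")

# Toy sub-stochastic matrix (hypothetical values).
A = [[0.10, 0.20, 0.05],
     [0.15, 0.05, 0.30],
     [0.00, 0.25, 0.10]]
G = leontief_inverse(A)
```

For matrices of realistic size a numerical linear-algebra library would normally be preferred to the explicit series; the series form is shown here because it mirrors the layer-by-layer interpretation of the Leontief inverse.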

The main contribution of our paper is to show that up-/downstreamness measures and similar resolvent-like metrics can be approximated with high accuracy even when possessing only aggregate and local information about the inter-sectorial dependencies encoded within the I–O table. In this case, the required information only amounts to the row (or column) sums of the matrix A, representing the total intermediate demand per industry (or the total value of all inputs required by each industry).

More specifically, we propose an approach rooted in complexity science that reconstructs the most likely matrix A derived from I–O tables on the basis of limited/aggregated information and uses this surrogate information to compute the Leontief inverse and related indicators (e.g., upstreamness and downstreamness). These indicators can be derived from the available aggregate information both quickly (the procedure does not require a full matrix inversion) and accurately. Moreover, in this work we connect the accuracy of our approximate framework with the spectral properties of the I–O tables.

Related literature

There is a vast literature concerning I–O models and how inaccuracies and noise in I–O tables may affect the determination of the relative ranking of industrial sectors and countries within the economy. One strand focuses on the accuracy of the empirical I–O matrix denoted by \(A_{emp}\) with respect to the true matrix \(A_{true}\). The main question is about how errors occurring in the compilation of the I–O tables propagate and affect measurements and predictions based on nonlinear functions of \(A_{emp}=A_{true}+H\) (for instance, the Leontief Inverse \((\mathbbm {1}_N-A_{emp})^{-1}\)), where H encodes the stochastic sources of error. Compiling the entries of the matrix \(A_{emp}\) is subject to many issues, for instance the difficulty in sampling and surveying firms and flows of goods with great accuracy9,10. This has provided the motivation to study stochastic models for the I–O analysis.

Evans11 and Quandt12 are among the first to look at this problem by constructing random models. Evans11 assumed that the error matrix H had only one non-zero row and that the errors could be propagated on a row-by-row basis. Quandt12 assumed that the errors \(H_{ij}\) on the matrix elements are independent and normally distributed with mean zero, solved the error propagation problem for a small-size system (e.g. \(2\times 2\)), and determined the confidence intervals on the expected Leontief Inverse. Later, Simonovits13 deduced the fundamental inequality \(\langle (\mathbbm {1}_N-A_{emp})^{-1}\rangle _H\ge (\mathbbm {1}_N-\langle A_{emp}\rangle _H)^{-1}\), where the average is taken with respect to independent matrix elements of H. This inequality circumvents the problem of inverting the matrix \(\mathbbm {1}_N-A_{emp}\), where the non-linearity involved in the Leontief matrix inversion makes it challenging to study how modifications (or inaccurate determinations) of the entries of the matrix \(A_{emp}\) would propagate.

One of the first comprehensive theoretical studies of stochastic I–O models is due to West14. His starting point is a random matrix H, of which the expected value and the standard error of all the elements are known, with the aim to provide approximating formulas for the expected value and the standard errors of the Leontief Inverse in terms of these known quantities. Some of the assumptions (for instance, that the errors \(H_{ij}\) be independent and normally distributed) are however not realistic or plainly incompatible with the sub-stochasticity constraint, and only lead to a closed-form solution for the mean and variances of the deviations from the “true” matrix under very restrictive choices for the variances of the errors in H.

More recently, this approach has been re-evaluated by Kogelschatz15, who assumed that the \(a_{ij}\) are Beta-distributed and derived estimates for the elements of the Leontief Inverse, and by Kozicka16, who postulated a more realistic distribution for the matrix entries but provided explicit formulae only for small-size systems.

Within the empirical literature, a number of studies have also been undertaken to characterize the regional inter-sectorial dependence of industries and to discuss the challenges of reconstructing regional data from national accounts and surveys17.

Given the practical difficulties associated with compiling I–O tables, especially at the regional level, earlier scholars devised “shortcut” methods to estimate the Leontief inverse from incomplete or unreliable information, or even foregoing I–O tables altogether. Katz and Burford18,19 derived a formula under the assumption that the matrix A is uniformly drawn from the set of sub-stochastic matrices, and under the rather questionable technical condition that the covariance between the entries of the matrix and the output multipliers be null. Their work hinges on an earlier formula empirically derived by Drake20. The general approach based on finding “shortcuts” and foregoing a painstaking compilation of I–O tables was criticized on both technical and conceptual grounds21,22,23,24 before this line of investigation was dropped and even ignored altogether in the subsequent related literature.

The Leontief inverse and the associated indicators have also been looked at through the prism of complexity and network science. Cerina et al.25 analyzed the properties of the (global and regional) network of industries in different economies reconstructing the monetary goods flows (edges) using the I–O matrix. McNerney et al.26 used average national output multipliers to predict future economic growth and price changes. In27, a model for the propagation and amplification of idiosyncratic shocks along the I–O network is provided. In28, a network analysis of the World I–O Data set is undertaken to analyze the temporal interdependence between countries and industrial sectors.

In recent years the interest in I–O models has grown steadily29, also in view of a rather compelling connection to models of complexity and networks28,30. Moreover, many of these ideas can in principle be extended to more general sector-product spaces, which saw many uses for the study of the connection between complexity measures, productivity and economic growth31,32,33,34 (see however35,36 for mathematical issues surrounding the Economic Complexity Index and resolutions thereof).

Another strand of the literature looks at entropic measures of inter-sectorial complexity. Jacquemin and Berry37 introduce an entropy-based measure of corporate diversification, highlighting its additivity across different levels of product or industry aggregation. This metric is shown to better capture nuanced diversification patterns compared to alternatives like the Herfindahl index, particularly when assessing contributions of diversification within and across industry sectors. Their empirical analysis of 460 large U.S. manufacturing corporations demonstrates that diversification into closely related industries, as well as more distant sectors, correlates positively with corporate growth, emphasizing the utility of entropy measures for understanding diversification’s role in economic dynamics. The study38 explores the dynamics of economic growth through a model of export evolution derived from global trade network data. It links economic complexity to the diversity and specialization of national export baskets by employing stochastic differential equations to simulate resource transfer between exports. The authors introduce a novel complexity measure based on Shannon entropy, integrated with specialization metrics, and demonstrate its alignment with GDP per capita and growth trajectories across 223 countries over 21 years. This framework unveils the interplay of cooperative and competitive forces in trade, offering insights into growth potentials via counterfactual analyses. The subsequent work39 expands upon this by refining economic complexity measures using an iterative, entropy-based methodology. Their approach captures the diversity and ubiquity of exports within a bipartite network of countries and products, employing Shannon entropy to estimate the bare diversity of products and sectors. The study introduces intra- and inter-sectorial decomposition, providing nuanced assessments of economic efficiency and specialization. 
The results highlight the advantages of retaining full trade data granularity and demonstrate the utility of these measures in distinguishing national economic structures and developmental pathways. In the following section, we will focus on the works by Antràs and Chor4, Fally et al.6 and Miller et al.7, where different incarnations of the so-called upstreamness and downstreamness measures were first proposed. An early example of a direct application of those measures to the analysis of empirical data on global value chains can be found in40; the measures are now used in multiple contexts41,42.

Definition of upstreamness and downstreamness

Antràs et al.4 considered a closed economy of N industries. For each industrial sector \(i= 1, \dots , N\) we indicate the value of gross output with \(Y_i\) and the final demand (i.e., the use of the output of an industry as a final good) with \(F_i\). Then the following equality holds in I–O tables:

$$\begin{aligned} Y_i= & F_i +Z_i = F_i+\sum _{j=1}^N a_{ij}= \end{aligned}$$
(2)
$$\begin{aligned}= & F_i +\sum _{j=1}^N d_{ij}Y_j \, \end{aligned}$$
(3)

with \(Z_i = \sum _{j=1}^N d_{ij}Y_j\) corresponding to the output of industry i used as intermediate input to other industries (intermediate demand) as shown in the scheme in Fig. 1. In Eq. (2), \(a_{ij}\) is the total value in monetary units (e.g. US dollars) of i’s output used to produce j’s output, while \(\{d_{ij}\}\) in Eq. (3) corresponds to the monetary amount of sector i’s output used to produce one monetary unit’s worth of sector j’s output, and it is related to the matrix A via the relationship \(d_{ij}Y_j = a_{ij}\). The final demand, as detailed in Sect. 4, comprises contributions from different factors including, among others, the final consumption expenditure by households and government, and exports.

Fig. 1

Scheme of the structure of a single-country I–O table3,43,44.

Iterating the identity Eq. (2) within Eq. (3), one obtains an infinite sequence of contributions, each representing the use of sector i’s output at different levels within the value chain3

$$\begin{aligned} Y_i = F_i + \sum _{j=1}^N d_{ij}F_j + \sum _{j=1}^N \sum _{k=1}^N d_{ik}d_{kj}F_j +\ldots \ . \end{aligned}$$
(4)

We can finally rewrite Eq. (4) as follows

$$\begin{aligned} {\varvec{Y}} = [\mathbbm {1}_N-D]^{-1}{\varvec{F}} \end{aligned}$$
(5)

using \(\sum _{k\ge 0} D^k=[\mathbbm {1}_N-D]^{-1}\). In this case, \(\mathbbm {1}_N\) is the \(N\times N\) identity matrix, \(D=(d_{ij})\) is the matrix of the coefficients defined above (the dollar value of sector i’s output needed to produce one dollar’s worth of sector j’s output), and \(\varvec{F}\) is the vector of final demands. Antràs et al.4 hence proposed the following measure of upstreamness of the i-th industrial sector

$$\begin{aligned} U_{1i}= 1 \cdot \frac{F_i}{Y_i} + 2 \cdot \frac{\sum _{j=1}^N d_{ij}F_j}{Y_i} + 3 \cdot \frac{\sum _{j,k=1}^N d_{ik}d_{kj}F_j}{Y_i} + \ldots = \frac{([\mathbbm {1}_N - D]^{-2}{\varvec{F}})_i}{Y_i} \ , \end{aligned}$$
(6)

where each term contributing to Eq. (4) is weighted by its distance from final use and divided by the output of the sector \(Y_i\). The notation \((\cdot )_i\) is used to indicate the i-th component of the vector. By construction, the terms of the sum that are further upstream in the value chain carry larger weight in the calculation of the upstreamness. Inserting Eq. (4) in Eq. (6), we can rewrite the upstreamness as

$$\begin{aligned} {\varvec{U}_1} = [\mathbbm {1}_N-A_U]^{-1}{\varvec{1}}_N \ , \end{aligned}$$
(7)

where

$$\begin{aligned} A_U= Y^{-1}A = \begin{pmatrix} \frac{a_{11}}{Y_1} & \cdots & \frac{a_{1N}}{Y_1} \\ \vdots & \ddots & \vdots \\ \frac{a_{N1}}{Y_N} & \cdots & \frac{a_{NN}}{Y_N} \end{pmatrix}\ \end{aligned}$$
(8)

and \(Y =\textrm{diag}(Y_1,\dots ,Y_N)\). The vector \({\varvec{1}}_N\) is a column vector of N ones. The matrix \(A_U\) has non-negative elements, and in this convention it is row-substochastic, i.e., \(\sum _{j}(A_U)_{ij}\le 1 \ \forall i\). By construction \(U_{1i}\ge 1\), and it is precisely equal to 1 if no output of industry i is used as input to other industries, i.e., if all of its output is used to satisfy the final demand.
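The equivalence between the weighted series in Eq. (6) and the resolvent form in Eq. (7) can be checked numerically on a small example. The sketch below is only illustrative: the two-sector flows and final demands are hypothetical numbers, the helper names are assumptions of this example, and the resolvent is evaluated by iterating \(\varvec{x}\leftarrow {\varvec{1}}_N+A_U\varvec{x}\) rather than by an explicit inversion.

```python
def mat_vec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def resolvent_times_ones(A, tol=1e-13, max_iter=10_000):
    """Return (1 - A)^{-1} 1 by iterating x <- 1 + A x (spectral radius < 1)."""
    x = [1.0] * len(A)
    for _ in range(max_iter):
        x_new = [1.0 + s for s in mat_vec(A, x)]
        if max(abs(a - b) for a, b in zip(x, x_new)) < tol:
            return x_new
        x = x_new
    raise RuntimeError("iteration did not converge")

# Hypothetical two-sector economy: flows[i][j] = a_ij, final demand F.
flows = [[10.0, 20.0],
         [30.0, 5.0]]
F = [70.0, 165.0]
n = len(flows)
Y = [F[i] + sum(flows[i]) for i in range(n)]                       # Eq. (2)
D = [[flows[i][j] / Y[j] for j in range(n)] for i in range(n)]     # d_ij = a_ij / Y_j
A_U = [[flows[i][j] / Y[i] for j in range(n)] for i in range(n)]   # Eq. (8)

# Eq. (7): upstreamness from the resolvent of A_U.
U_resolvent = resolvent_times_ones(A_U)

# Eq. (6): weighted series, truncated after a fixed number of terms.
term, weighted = F[:], [0.0] * n
for k in range(200):
    weighted = [w + (k + 1) * t for w, t in zip(weighted, term)]   # weight k + 1
    term = mat_vec(D, term)                                        # D^{k+1} F
U_series = [w / y for w, y in zip(weighted, Y)]
```

The two vectors agree up to truncation error, and both entries exceed one because each sector sells part of its output to other sectors.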

Later, Antràs et al.5 also established an equivalence between their upstreamness measure and a measure—defined in a recursive fashion—of the “distance” of an industry from the final demand proposed independently by Fally et al.6. Fally’s upstreamness \(U_2\) is defined as follows:

$$\begin{aligned} U_{2i} = 1 + \sum _{j=1}^N\frac{d_{ij}Y_j}{Y_i}U_{2j} \ . \end{aligned}$$
(9)

The idea is that \(\varvec{U}_2\) aggregates information on the extent to which a sector in a given country produces goods that are sold directly to final consumers, or that are sold to other sectors that themselves mainly sell to final consumers. Sectors selling a large share of their output to relatively upstream industries should be therefore considered to be more upstream themselves. Using the fact that \(d_{ij}Y_j = a_{ij}\) we obtain

$$\begin{aligned} {\varvec{U}_2}= [\mathbbm {1}_N-A_U]^{-1}{\varvec{1}}_N\ , \end{aligned}$$
(10)

where \(A_U\) is defined in Eq. (8) as presented in5.

On the input side, there exists an analogous accounting identity stating that sector i’s total input \(Y_i\) is equal to the value of its primary inputs (the so-called value added) \(V_i\) plus its intermediate input purchased from all other sectors, namely

$$\begin{aligned} Y_i= V_i +Z_i = V_i +\sum _{j=1}^N a_{ji}= V_i +\sum _{j=1}^N d_{ji}Y_j \ , \end{aligned}$$
(11)

and

$$\begin{aligned} {\varvec{Y}}= [\mathbbm {1}_N-D^T]^{-1}{\varvec{V}}\ . \end{aligned}$$
(12)

Similarly to Antràs et al. (cf. Eq. (6)), Miller and Temurshoev7 introduced the so-called downstreamness, measuring the “average distance between suppliers of primary inputs and sectors as input purchaser along the input demand supply chain” as follows:

$$\begin{aligned} D_{1i} = 1 \cdot \frac{V_i}{Y_i} + 2\cdot \frac{\sum _{j=1}^N V_j d_{ji}}{Y_i} + 3\cdot \frac{\sum _{j,k=1}^N V_j d_{jk}d_{ki} }{Y_i} + \ldots = \frac{([\mathbbm {1}_N - D^T]^{-2}{\varvec{V}})_i}{Y_i} \ . \end{aligned}$$
(13)

As before, using Eq. (12), we obtain

$$\begin{aligned} {\varvec{D}_1}= [\mathbbm {1}_N-A_D]^{-1}{\varvec{1}}_N \ , \end{aligned}$$
(14)

with

$$\begin{aligned} A_D= (A Y^{-1})^T = \begin{pmatrix} \frac{a_{11}}{Y_1} & \cdots & \frac{a_{N1}}{Y_1} \\ \vdots & \ddots & \vdots \\ \frac{a_{1N}}{Y_N} & \cdots & \frac{a_{NN}}{Y_N} \end{pmatrix}\ . \end{aligned}$$
(15)

The matrix \(A_D\) has non-negative elements, and it is row-substochastic, i.e., \(\sum _{j}(A_D)_{ij}\le 1 \ \forall i\). Finally, as in the upstreamness case, Fally6 introduced an analogous iterative definition of the downstreamness, of the form

$$\begin{aligned} D_{2i} = 1 + \sum _{j=1}^N d_{ji}D_{2j} \ , \end{aligned}$$
(16)

which can be again mapped with simple manipulations onto Eq. (14) using \(Y_i d_{ji}=a_{ji}\).
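A small numerical sketch of Eqs. (14) to (16) is given below. The three-sector flows and gross outputs are hypothetical, and the third sector is deliberately given no intermediate inputs so that its downstreamness is exactly one. The helper names are assumptions of this example, and the coefficients in the recursion are taken as \(d_{ji}=a_{ji}/Y_i\), following the relation quoted above.

```python
def mat_vec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def resolvent_times_ones(A, tol=1e-13, max_iter=10_000):
    """Return (1 - A)^{-1} 1 by iterating x <- 1 + A x (spectral radius < 1)."""
    x = [1.0] * len(A)
    for _ in range(max_iter):
        x_new = [1.0 + s for s in mat_vec(A, x)]
        if max(abs(a - b) for a, b in zip(x, x_new)) < tol:
            return x_new
        x = x_new
    raise RuntimeError("iteration did not converge")

# Hypothetical flows a_ij (row i sells to column j) and gross outputs Y.
flows = [[10.0, 20.0, 0.0],
         [30.0, 5.0, 0.0],
         [15.0, 25.0, 0.0]]
Y = [100.0, 200.0, 80.0]
n = len(flows)

col_sums = [sum(flows[j][i] for j in range(n)) for i in range(n)]  # inputs bought by i
V = [Y[i] - col_sums[i] for i in range(n)]                         # value added, Eq. (11)

# Eq. (15): (A_D)_ij = a_ji / Y_i, i.e. the transposed flows scaled row by row.
A_D = [[flows[j][i] / Y[i] for j in range(n)] for i in range(n)]

# Eq. (14), which is also the fixed point of the recursion in Eq. (16).
D1 = resolvent_times_ones(A_D)
```

Each row sum of `A_D` equals \(1-V_i/Y_i\), so sectors whose inputs are mostly primary factors have a downstreamness close to one.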

Rank-1 approximation with local and aggregate information

In this section, we discuss how to derive an approximation for the upstreamness and downstreamness metrics introduced in Sect. 2. Let us consider the resolvent \(G(A)=(\mathbbm {1}_N - A)^{-1}\), where the matrix A stands for \(A_U\) or \(A_D\) as defined in the previous section. Therefore, A has non-negative entries and is sub-stochastic. Recall that the vectors of upstreamness and downstreamness are defined as \({\varvec{U}}_1 = G(A_U){\varvec{1}}_N\) and \({\varvec{D}}_1= G(A_D){\varvec{1}}_N\), respectively [cf. Eqs. (10) and (14)]. We now assume that a detailed and accurate knowledge of all the entries of A is not available. The only available aggregate information is given by the 2N constants \(\varvec{r}=(r_1,\ldots ,r_N)\) and \(\varvec{c}=(c_1,\ldots ,c_N)\), namely the sums of the N rows and columns of A. This corresponds to knowing only the total intermediate demand per industry and the total value of all inputs required by each industry, respectively. In the following we will analyse the single (row-sum only) and double (row- and column-sum) constraint cases. For the single constraint case, the knowledge of the row sums of the I–O matrix (total intermediate demand of the associated sector) and of the vector of final demands is sufficient to infer the row sums of the matrix \(A_U\). Similarly, the knowledge of the column sums of the I–O matrix (total inputs of the associated sector) and of the vector of value added is sufficient to infer the row sums of the matrix \(A_D\). For the double constraints case, the knowledge of the row and column sums of the I–O matrix and of the vector of final demands/values added is not sufficient to infer the row and column sums of either matrix \(A_U\) or \(A_D\); however, this level of knowledge can be approximately achieved by positing that \(Y_i\approx \bar{Y}\), where \(\bar{Y}\) is the average of the \(Y_i\). 
In the following, we will assume that the row/column sums (single constraint) or row and column sums (double constraints) of the matrices \(A_U\) and \(A_D\) are known or retrievable from the corresponding row/column sums of the original I–O matrix. This lack of detailed information is actually quite common in supply chain and intrafirm network analysis8, which in turn leads to the need for inference and reconstruction methods to fill the gaps.
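For the single-constraint case, the inference of the row sums from aggregates can be sketched in a few lines. Using Eqs. (2) and (8), the i-th row sum of \(A_U\) is \(Z_i/Y_i\) with \(Y_i=F_i+Z_i\); analogously, from Eqs. (11) and (15), the i-th row sum of \(A_D\) is the total intermediate input of sector i divided by \(Y_i\). The function name and the numbers below are hypothetical and serve only as an illustration.

```python
def row_sums_from_aggregates(intermediate_totals, residual):
    """Return r_i = Z_i / (Z_i + R_i) for each sector.

    Z_i is the aggregate intermediate total of sector i; R_i is the final
    demand (for A_U) or the value added (for A_D), so Z_i + R_i = Y_i.
    """
    return [z / (z + res) for z, res in zip(intermediate_totals, residual)]

# Hypothetical aggregates for three sectors (no individual a_ij is needed).
Z_out = [30.0, 35.0, 40.0]   # total intermediate demand: row sums of the I-O table
F = [70.0, 165.0, 40.0]      # final demand
Z_in = [55.0, 50.0, 0.0]     # total intermediate inputs: column sums of the I-O table
V = [45.0, 150.0, 80.0]      # value added

r_U = row_sums_from_aggregates(Z_out, F)   # row sums of A_U
r_D = row_sums_from_aggregates(Z_in, V)    # row sums of A_D
```

Only 2N aggregate numbers per matrix enter this computation, which is the level of information assumed in the remainder of this section.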

A simple rank-1 approximation \({\hat{A}}\) for the matrix A is

$$\begin{aligned} {\hat{A}}=\frac{1}{N}\varvec{g}\varvec{q}^T= \begin{pmatrix} \frac{g_1 q_1}{N} & \cdots & \frac{g_1 q_N}{N}\\ \vdots & \ddots & \vdots \\ \frac{g_Nq_1}{N} & \cdots & \frac{g_Nq_N}{N} \end{pmatrix}\ , \end{aligned}$$
(17)

where the entries of the column vectors \(\varvec{g} = (g_1,\ldots ,g_N)\) and \(\varvec{q}=(q_1,\ldots ,q_N)\) are determined imposing the constraint that A and \({\hat{A}}\) share the same row and column sums

$$\begin{aligned} r_i=&\sum _j A_{ij}\equiv \frac{\sum _{k} q_k}{N} g_i={\bar{q}}\ g_i\ , \end{aligned}$$
(18)
$$\begin{aligned} c_j=&\sum _i A_{ij}\equiv \frac{\sum _{k} g_k}{N} q_j={\bar{g}}\ q_j\ . \end{aligned}$$
(19)

This yields eventually the unique matrix

$$\begin{aligned} {\hat{A}} =\frac{1}{mN}\varvec{r}\varvec{c}^T \end{aligned}$$
(20)

with \(m=\frac{1}{N} \sum _{ij} A_{ij}=\frac{1}{N}\sum _j c_j=\frac{1}{N}\sum _i r_i\). The rank-1 matrix \({\hat{A}}\) in (20) is the so-called Maximum Entropy reconstructed matrix (see e.g.45,46) subject to the row and column constraints in (18) and (19) (see also47,48,49,50,51 for related works).
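The construction in Eq. (20) can be illustrated as follows. The sketch builds \({\hat{A}}\) from the row and column sums of a hypothetical \(3\times 3\) matrix; the function name `maxent_rank1` and the numerical values are assumptions of this example. In practice only \(\varvec{r}\) and \(\varvec{c}\) would be available, and the full matrix is used here only to generate them.

```python
def maxent_rank1(A):
    """Return (A_hat, r, c) with A_hat = r c^T / (m N), as in Eq. (20)."""
    n = len(A)
    r = [sum(row) for row in A]                                # row sums
    c = [sum(A[i][j] for i in range(n)) for j in range(n)]     # column sums
    total = sum(r)                                             # equals m * N
    A_hat = [[r[i] * c[j] / total for j in range(n)] for i in range(n)]
    return A_hat, r, c

# Toy sub-stochastic matrix (hypothetical values).
A = [[0.10, 0.20, 0.05],
     [0.15, 0.05, 0.30],
     [0.00, 0.25, 0.10]]
A_hat, r, c = maxent_rank1(A)

# Row and column sums of the reconstructed matrix, to compare with r and c.
n = len(A)
r_hat = [sum(row) for row in A_hat]
c_hat = [sum(A_hat[i][j] for i in range(n)) for j in range(n)]
```

The reconstructed matrix reproduces both sets of marginals while discarding all finer structure, for instance the zero entry of the toy matrix is replaced by a positive value.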

If the only information we have is about row sums, then the corresponding rank-1 approximation is even simpler

$$\begin{aligned} \hat{A} = \begin{pmatrix} \frac{r_1}{N} & \cdots & \frac{r_1}{N}\\ \vdots & \ddots & \vdots \\ \frac{r_N}{N} & \cdots & \frac{r_N}{N} \end{pmatrix} \ . \end{aligned}$$
(21)

Clearly, \({\hat{A}}\) has a single non-zero eigenvalue \(\lambda _1=\frac{1}{mN}\sum _j r_j c_j\) (or \(\lambda _1=\frac{1}{N}\sum _j r_j\) in the case of only-row constraints), which is real and positive as guaranteed by the Perron-Frobenius theorem, and \(N-1\) zero eigenvalues. We may therefore expect that this approximation will work better the larger the spectral gap (or equivalently the smaller the spectral radius in the bulk) of the original matrix A is52,53. The spectral gap is defined as \(\Gamma =\lambda _1-\Xi\), where \(\lambda _1\), real and \(<1\), is the Perron-Frobenius eigenvalue of A and \(\Xi =\max \{|\lambda _2|,\ldots ,|\lambda _{N-1}|\}\) is the spectral radius of the bulk. The empirical I–O matrices \(A_U,A_D\) typically show a large spectral gap, suggesting that the rank-1 approximation described in this section should be very effective.
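The spectrum of \({\hat{A}}\) can be verified directly from Eq. (20), as a short check that uses only the definitions of \(\varvec{r}\), \(\varvec{c}\) and m:

$$\begin{aligned} {\hat{A}}\varvec{r}=\frac{1}{mN}\varvec{r}\,(\varvec{c}^T\varvec{r})=\left( \frac{1}{mN}\sum _j r_j c_j\right) \varvec{r}\ , \qquad {\hat{A}}\varvec{v}=\frac{1}{mN}\varvec{r}\,(\varvec{c}^T\varvec{v})=\varvec{0} \quad \text {whenever}\quad \varvec{c}^T\varvec{v}=0\ . \end{aligned}$$

Hence \(\varvec{r}\) is an eigenvector with eigenvalue \(\frac{1}{mN}\sum _j r_j c_j\), while the \((N-1)\)-dimensional subspace orthogonal to \(\varvec{c}\) is annihilated by \({\hat{A}}\). The row-only case in Eq. (21) corresponds to setting \(c_j=m\) for all j, which gives \(\frac{1}{N}\sum _j r_j\) for the non-zero eigenvalue.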

As the empirical I–O matrices \(A_U,A_D\) are rather small (\(N=35\)), it is more informative to look at their spectral radius. In Sect. 5, we perform a thorough analysis of the spectra of the I–O matrices at the country level, and we study how the accuracy of our rank-1 formula is related to the spectral radius. We indeed find that there is a clear negative correlation between the two, i.e. the error made using our approximation increases with \(\Xi\). This said, even in the worst cases, the relative errors remain fairly negligible, and the formulae work very well across the entire dataset.

Employing this rank-1 approximation, we can now evaluate the approximate resolvent

$$\begin{aligned} G({\hat{A}})=(\mathbbm {1}_N - {\hat{A}})^{-1}= \mathbbm {1}_N+\frac{{\hat{A}}}{1-\frac{1}{m N}\sum _j r_j c_j}\ , \end{aligned}$$
(22)

using the Sherman-Morrison formula54 for the inverse of a rank-1 matrix, from which it follows that the upstreamness and downstreamness of the i-th industry are respectively approximated by

$$\begin{aligned} U_{1i}&\approx 1+\frac{r_i}{1-\frac{1}{m N}\sum _j r_j c_j} \end{aligned}$$
(23)
$$\begin{aligned} D_{1i}&\approx 1+\frac{{\tilde{r}}_i}{1-\frac{1}{{\tilde{m}} N}\sum _j {\tilde{r}}_j {\tilde{c}}_j}\ , \end{aligned}$$
(24)

where \(r_i,c_i\) and \({\tilde{r}}_i,{\tilde{c}}_i\) represent respectively the sum of rows and columns of \(A_U\) and \(A_D\). If only the constraints on rows are imposed, the formulae above reduce to

$$\begin{aligned} U_{1i}&\approx 1+\frac{r_i}{1-\frac{1}{N}\sum _j r_j } \end{aligned}$$
(25)
$$\begin{aligned} D_{1i}&\approx 1+\frac{{\tilde{r}}_i}{1-\frac{1}{ N}\sum _j {\tilde{r}}_j }\ . \end{aligned}$$
(26)
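For completeness, Eqs. (22) and (23) can also be recovered without the Sherman-Morrison formula, by summing the geometric series directly (a short consistency check, valid for \(\lambda _1<1\)). Writing \(\lambda _1=\frac{1}{mN}\sum _j r_j c_j\), Eq. (20) gives

$$\begin{aligned} {\hat{A}}^2=\frac{1}{(mN)^2}\varvec{r}\,(\varvec{c}^T\varvec{r})\,\varvec{c}^T=\lambda _1{\hat{A}} \quad \Rightarrow \quad \sum _{k\ge 0}{\hat{A}}^k=\mathbbm {1}_N+{\hat{A}}\sum _{k\ge 1}\lambda _1^{k-1}=\mathbbm {1}_N+\frac{{\hat{A}}}{1-\lambda _1}\ . \end{aligned}$$

Since \(\sum _j c_j=mN\), one has \({\hat{A}}{\varvec{1}}_N=\varvec{r}\), so that applying the expression above to \({\varvec{1}}_N\) yields \(1+r_i/(1-\lambda _1)\) for the i-th component, in agreement with Eq. (23).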

The approximate formulae above show that, within our rank-1 approximation, the upstreamness (downstreamness) of sector i is fully determined by the interplay of (i) local and aggregate information, namely the total intermediate demand per sector (and/or the total value of all inputs required by each sector), and (ii) a suitable average of the total intermediate demand (and/or the total value of all inputs) across all sectors in the economy.

In spite of the seemingly drastic approximation, which neglects a significant amount of finer intersectorial details, we will show that the aggregate information featuring in our rank-1 formulae is sufficient to determine with high accuracy the relative positioning of countries and sectors within the global value chains.

In the next sections, we will then calculate upstreamness and downstreamness measures on I–O tables from the NIOT Dataset (see Sect. 4), comparing the results obtained via our approximation with the full calculation using the original formulae, namely Eq. (10) and (14).
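The kind of comparison described here can be sketched on a toy matrix before turning to real data. The code below evaluates the exact resolvent and the approximations of Eqs. (23) and (25) for a hypothetical \(3\times 3\) sub-stochastic matrix; all function names and numbers are illustrative assumptions, and the exact value is obtained by fixed-point iteration rather than by a library inversion.

```python
def mat_vec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def resolvent_times_ones(A, tol=1e-13, max_iter=10_000):
    """Exact (1 - A)^{-1} 1 via the iteration x <- 1 + A x."""
    x = [1.0] * len(A)
    for _ in range(max_iter):
        x_new = [1.0 + s for s in mat_vec(A, x)]
        if max(abs(a - b) for a, b in zip(x, x_new)) < tol:
            return x_new
        x = x_new
    raise RuntimeError("iteration did not converge")

def approx_double(A):
    """Eq. (23): uses the row sums r and column sums c of A only."""
    n = len(A)
    r = [sum(row) for row in A]
    c = [sum(A[i][j] for i in range(n)) for j in range(n)]
    lam = sum(ri * ci for ri, ci in zip(r, c)) / sum(r)     # sum(r) = m * N
    return [1.0 + ri / (1.0 - lam) for ri in r]

def approx_single(A):
    """Eq. (25): uses the row sums r of A only."""
    r = [sum(row) for row in A]
    r_bar = sum(r) / len(r)
    return [1.0 + ri / (1.0 - r_bar) for ri in r]

def mean_relative_deviation(exact, approx):
    """Average of |exact / approx - 1| over sectors."""
    return sum(abs(e / a - 1.0) for e, a in zip(exact, approx)) / len(exact)

# Toy sub-stochastic matrix (hypothetical values).
A = [[0.10, 0.20, 0.05],
     [0.15, 0.05, 0.30],
     [0.00, 0.25, 0.10]]
exact = resolvent_times_ones(A)
err_double = mean_relative_deviation(exact, approx_double(A))
err_single = mean_relative_deviation(exact, approx_single(A))
```

When the input matrix is itself of the form in Eq. (20), `approx_double` reproduces the exact result, which is a convenient sanity check for any implementation.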

Dataset

Table 1 Countries and their codes in the NIOT database by WIOD44. Luxembourg is not included in our analysis as data present inconsistencies across the years.

The empirical I–O matrices used for the experiments have been constructed using the 2013 release of the National I–O tables by the World I–O Database (WIOD)44. The NIOT dataset comprises 39 countries (representing a large fraction of the major world economies) over the years 1995–2011. The list of countries and their codes considered in our empirical analysis is presented in Table 1. The structure of the I–O table of each country is schematically shown in Fig. 1. The intermediate demand for each country is reported for \(N=35\) economic sectors in terms of the flow (in US million dollars) between sectors. The full list of economic sectors and their codes included in our analysis is summarized in Table 2. The final demand is characterized in terms of (i) final consumption expenditure by households, (ii) final consumption expenditure by non-profit organizations serving households (NPISH), (iii) final consumption expenditure by government, (iv) gross fixed capital formation, (v) changes in inventories and valuables and (vi) exports. In the dataset, the changes in inventories and valuables are sometimes negative; such entries were assumed to contribute to imports. The entries \(a_{ij}\) of each row i of the full I–O table are then normalized by the corresponding output \(Y_i\) [cf. Eq. (8)]. The normalized intermediate demand sub-matrix is sub-stochastic and represents the matrix \(A_U\). The \(r_i\) used in the model are simply the sums over the rows of the matrix \(A_U\) [or, if the table is normalized by columns, of the matrix \(A_D\); see Eqs. (8) and (15), respectively].

Table 2 Sectors of the NIOT dataset by WIOD (2013 release) and their sector codes44.

Results

In this section, we compare our approximate formulae for upstreamness and downstreamness with single [Eqs. (25) and (26), respectively] and double constraints [Eqs. (23) and (24), respectively] with the measures obtained via direct inversion of the empirical I–O matrix [Eqs. (10) and (14), respectively].

Fig. 2

Temporal dependence of country-level empirical upstreamness (Left) and downstreamness (Right) across the years 1995–2011.

Given the very weak temporal dependence of the empirical upstreamness and downstreamness measures shown in Fig. 2 (consistent with previous analyses in40), in the following we can robustly aggregate the analyses across all years.

In Fig. 3 we plot the empirical average over all sectors (cyan squares) of the upstreamness for 39 countries (listed in Table 1) for all years (1995–2011) versus the approximate value with single (top panel) and double constraints (bottom panel), respectively obtained in Eqs. (25) and (23). We see that the empirical data (663 data points—39 countries \(\times\) 17 years) nicely collapse on top of the theoretical benchmark (blue dashed line). In the single constraint case, this implies that the average upstreamness coefficient for a country is determined with high accuracy by the knowledge of a single quantity \(\bar{z} = 1 - \frac{1}{N}\sum _j{r_j}\), corresponding to one minus the average total intermediate demand. We also show the upstreamness values for each sector in each country across the entire period (red full circles) constituting in total \(\sim 23k\) data points—35 sectors \(\times\) 39 countries \(\times\) 17 years. At the sector level, we observe a similar good agreement of the empirical exact upstreamness with the approximate values.
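The role of \(\bar{z}\) follows from averaging Eq. (25) over the N sectors (a one-line check, with \(\bar{r}=\frac{1}{N}\sum _j r_j\)):

$$\begin{aligned} \frac{1}{N}\sum _{i=1}^N U_{1i}\approx 1+\frac{\bar{r}}{1-\bar{r}}=\frac{1}{1-\bar{r}}=\frac{1}{\bar{z}}\ . \end{aligned}$$

In the single-constraint approximation, the country-level average upstreamness is therefore simply the reciprocal of \(\bar{z}\).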

There are occasional deviations (including a systematic upward deviation for large values of the empirical downstreamness), whose origin can be traced back to a higher degree of heterogeneity in the A matrix with respect to the “flat” rank-1 model introduced in Eq. (21).

To identify the sectors that are typically less accurately captured by our approximation, we computed a simple indicator, \(\langle |\Delta _U^{\textrm{sect}}|\rangle\). This metric represents the average absolute difference between the empirical and approximated upstreamness values for each sector, aggregated across all years and all countries (see Fig. 4). The mining and agricultural sectors, among others, appear to exhibit greater heterogeneity in their input–output relationships with other sectors, as suggested by the higher values of this indicator. This indicates that the structural differences in these sectors across countries may pose challenges for the accuracy of our approximation. Consequently, our method may perform less effectively for countries with economies that rely heavily on these sectors, as their heterogeneity is less well captured in the A-matrix approximation. In contrast, sectors such as housing, public administration, and education display lower values, suggesting more consistent and predictable input–output relationships, making them better suited for our approximation approach.

Fig. 3

Empirical upstreamness versus approximated upstreamness. Cyan squares represent the upstreamness per country (39 countries) per year (17 years) averaged over 35 industrial sectors from the WIOD dataset (Release 2013). Red full circles represent the upstreamness for all industry sectors in all countries/all years. Top panel: Empirical upstreamness compared with single-constraint approximation in Eq. (25). Bottom panel: Empirical upstreamness compared with double-constraints approximation in Eq. (23).

Fig. 4

Mean absolute differences between empirical upstreamness and approximation at sectorial level (aggregating over all countries and all years).

We have calculated a similar metric for the upstreamness at country level (see Fig. 5), \(\langle |\Delta _U^{\textrm{country}}|\rangle\), averaging absolute differences over the period 1995–2011. The countries that diverge most consistently from our approximation are Spain, Korea, Russia and China.

Fig. 5

Mean absolute differences between empirical upstreamness and approximation at country level (aggregating over all years).

Fig. 6

Empirical downstreamness versus approximated downstreamness. Light green squares represent the downstreamness per country (39 countries) per year (17 years) averaged over 35 industrial sectors from the WIOD dataset (Release 2013). Blue full circles represent the downstreamness of all industry sectors in all countries/all years. Top panel: Empirical downstreamness compared with single-constraint approximation in Eq. (26). Bottom panel: Empirical downstreamness compared with double-constraints approximation in Eq. (24).

In the following we will also analyze more closely the relation between the error—discrepancy between the actual values of upstreamness (and downstreamness) calculated via direct inversion and those obtained via our approximate formula—and the spectral properties of the empirical I–O matrix A.

In Fig. 6, we repeat a similar analysis for the downstreamness, comparing the values obtained via direct inversion (Eq. (14)) with the approximate values obtained by imposing the single (row sums) or double (row and column sums) constraint. Also for this measure, we observe a good agreement between exact and approximate values, both at the sector level (blue full circles) and at the aggregate country level (light green squares).

To assess the accuracy of the approximations, we quantify the correlation between the empirical and approximate measures using Pearson and Spearman correlation coefficients as summarised in Table 3. The results show that the double-constraints approximation provides a visible improvement for countries, with correlations nearly perfect in both upstreamness and downstreamness measures. However, for sectors, the improvement is marginal, as the single-constraint approximation already achieves high correlations.
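As a rough illustration of this kind of comparison, the sketch below computes Pearson and Spearman coefficients between exact and approximate upstreamness values. It does not use WIOD data: the matrix `A` is a hypothetical random sub-stochastic stand-in, the helper names `pearson` and `spearman` are ours, and the approximation used is the single-constraint formula of Eq. (25), assumed here to take the row-sum-only form \(1+r_i/(1-\bar{r})\).

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Spearman rank correlation, assuming there are no ties in x or y."""
    rank = lambda v: np.argsort(np.argsort(v))  # ranks 0..n-1
    return pearson(rank(x), rank(y))

# Hypothetical stand-in for one national table: random entries, rescaled so
# that every row sum lies strictly between 0 and 1 (row sub-stochastic).
rng = np.random.default_rng(0)
N = 35
A = rng.random((N, N))
A *= rng.uniform(0.2, 0.8, size=N)[:, None] / A.sum(axis=1, keepdims=True)

r = A.sum(axis=1)
# "Empirical" values: solve (1 - A) U = 1 instead of forming the inverse.
U_exact = np.linalg.solve(np.eye(N) - A, np.ones(N))
# Approximate values: only the row sums r enter.
U_approx = 1.0 + r / (1.0 - r.mean())

print("Pearson :", pearson(U_exact, U_approx))
print("Spearman:", spearman(U_exact, U_approx))
```

The rank-based helper avoids a dependency on SciPy; with tied values one would normally switch to a tie-aware implementation.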

Table 3 Comparison of Spearman and Pearson Correlation Coefficients between empirical and approximated upstreamness and downstreamness measure (1) at country or sector level and (2) considering the single or double-constraint approximation.

In the following, we analyze more closely the error made in the estimation of the upstreamness/downstreamness coefficients via our approximate formulae and link it to spectral properties of the underlying I–O matrix A. In particular, we define the following metric for assessing the error52

$$\begin{aligned} \sigma =\left\langle \left| \frac{\mathcal {R}_i^{(\textrm{emp})}}{\mathcal {R}_i^{(\textrm{approx})}}-1\right| \right\rangle \ , \end{aligned}$$
(27)

where \(\mathcal {R}_i\) represents either the upstreamness or the downstreamness values computed via direct inversion (\(\mathcal {R}_i^{(\textrm{emp})}\)) and via our approximate formula (\(\mathcal {R}_i^{(\textrm{approx})}\)) respectively. The average \(\langle \cdots \rangle\) is calculated over all sectors of a given country. Concerning the spectral properties, as shown in52,53 the accuracy of the approximation is related to the spectral gap of the matrix A. The matrix A has non-negative entries, therefore it has one real eigenvalue of largest magnitude \(\lambda _1\) (the Perron-Frobenius eigenvalue), and its spectral gap is defined as \(\Gamma =\lambda _1-\max \{|\lambda _2|,\ldots ,|\lambda _{N}|\}\). As the empirical I–O matrices are rather small (\(N=35\)), it is more informative to look at the spectral radius. We then introduce the spectral radius excluding the Perron-Frobenius eigenvalue \(\lambda _1\) as

$$\begin{aligned} \Xi =\max \{|\lambda _2|,\ldots ,|\lambda _{N}|\}\ . \end{aligned}$$
(28)

This definition is consistent with the approach used in the case of Gaussian matrices perturbed with a rank-1 matrix that may force an outlier to split off from the circular bulk53,55. In Fig. 7, we display the error \(\sigma\) made on the approximation for all countries in all years as a function of the spectral radius \(\Xi\) of the \(A_U\) matrix characterizing each country in each year. As expected, the error grows with the spectral radius, as the rank-1 approximation becomes less accurate in reproducing the underlying intersectorial interactions. In Fig. 8, we show the same relationship labelling the countries for a single year (2011). In the bottom panel of Fig. 8, we show the eigenvalue spectra of two selected countries, China and Mexico, whose errors are among the largest and the smallest in the sample, respectively, to highlight differences in how the eigenvalues are distributed in the bulk.
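The two quantities plotted against each other can be computed along the following lines. This is a minimal sketch on synthetic matrices, not the WIOD pipeline: the function names are hypothetical, the upstreamness is taken as \([\mathbbm {1}_N-A]^{-1}{\varvec{1}}_N\), and the approximation is assumed to be the row-sum-only formula \(1+r_i/(1-\bar{r})\).

```python
import numpy as np

def upstreamness_exact(A):
    """Solve (1 - A) U = 1, i.e. upstreamness via direct inversion."""
    N = A.shape[0]
    return np.linalg.solve(np.eye(N) - A, np.ones(N))

def upstreamness_rank1(A):
    """Row-sum-only approximation: 1 + r_i / (1 - mean(r))."""
    r = A.sum(axis=1)
    return 1.0 + r / (1.0 - r.mean())

def relative_error(A):
    """sigma of Eq. (27): sector average of |R_emp / R_approx - 1|."""
    ratio = upstreamness_exact(A) / upstreamness_rank1(A)
    return float(np.mean(np.abs(ratio - 1.0)))

def subleading_radius(A):
    """Xi of Eq. (28): largest eigenvalue modulus once one eigenvalue of
    maximal modulus (the Perron-Frobenius one) has been discarded."""
    moduli = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
    return float(moduli[1])

# Two hypothetical matrices sharing the same row sums r:
rng = np.random.default_rng(1)
N = 35
r = rng.uniform(0.2, 0.8, size=N)
A_flat = np.outer(r, np.ones(N)) / N                # homogeneous, exactly rank 1
W = rng.random((N, N))
A_het = r[:, None] * W / W.sum(axis=1, keepdims=True)  # heterogeneous entries

for name, A in [("flat", A_flat), ("heterogeneous", A_het)]:
    print(name, "sigma =", relative_error(A), "Xi =", subleading_radius(A))
```

For the flat matrix both \(\sigma\) and \(\Xi\) vanish up to rounding, since the rank-1 formula is then exact; the heterogeneous matrix has the same row sums but non-zero values of both.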

Fig. 7

Error \(\sigma\) on approximated vs. exact upstreamness, averaged over the sectors, as a function of the spectral radius \(\Xi\), calculated for 39 countries over the years 1995–2011.

In this analysis, we find a clear negative correlation between the accuracy of the estimation and the spectral radius, i.e., the error made using our approximation increases (equivalently the accuracy of the approximation decreases) with \(\Xi\). In general though, even in the worst cases, the relative errors remain fairly small (\(\sim 5\)–\(6\%\)), and the approximation works very well across the entire sample.

Fig. 8

Top panel: Error \(\sigma\) (between approximated vs. exact upstreamness) averaged over the sectors as a function of the spectral radius \(\Xi\) of the matrix \(A_U\) for all 39 countries in 2011. Bottom panel: Eigenvalue spectrum of the \(A_U\) matrix of China (CHN) and Mexico (MEX) in 2011.

Upstreamness under aggregation

In this section, we briefly consider how our approximation performs after the I–O data matrix has been subject to aggregation (consolidation) of different industrial sectors. The effects of aggregation (i.e. the procedure by which the data are lumped together and examined at different levels of “granularity”) have been considered in many works (see56 for a comprehensive review). Here we consider the axiomatic formulation of aggregation provided in57, which is summarized below. Furthermore, our treatment will be confined to the upstreamness, and the row-only rank-1 approximation, as generalizations to the other cases are straightforward.

Consider the definition of upstreamness given in Eq. (7)

$$\begin{aligned} {\varvec{U}_1}= [\mathbbm {1}_N-A_U]^{-1}{\varvec{1}}_N\ . \end{aligned}$$
(29)

To make contact with Ref.57, we rewrite (29) as

$$\begin{aligned} {[}{\varvec{U}_1}^T]_N= {\varvec{1}}_N^T[\mathbbm {1}_N-A_U^T]^{-1}\ , \end{aligned}$$
(30)

in terms of row vectors \({\varvec{U}_1}^T\) and \({\varvec{1}}_N^T\), and a column-substochastic \(N\times N\) matrix \(A_U^T\). The notation \([\ldots ]_N\) indicates that the vector has length N.

Let us assume that we wish to aggregate the N “micro” industrial sectors or commodities into a set of \(M<N\) “macro” sectors or commodities. Formally, we can define two matrices, S and T, of size \(M\times N\) and \(N\times M\) respectively. The \(\{0,1\}\) matrix S indicates which micro-sectors should be combined together: \(S_{ij}=1\) if micro-sector j is to be included in macro-sector i, and \(S_{ij}=0\) otherwise. Thus, S is a column-stochastic matrix with exactly one 1 in every column, and at least one 1 in every row. The matrix T indicates the proportional weights of each micro-sector within its macro-aggregate. The element \(T_{ji}\in [0,1]\) represents the weight \(w_{ji}\) that micro-sector j carries within macro-sector i (it vanishes if j does not belong to i), and therefore is such that \(\sum _j T_{ji}=1\). It follows that T is also column-stochastic.

Forming the aggregate \(M\times M\) matrix \(A_U^\prime =SA_U^T T\) is the most common way used in the literature to create a smaller sub-stochastic matrix from the original matrix \(A_U\), which retains (at a coarser level of detail) some of the information about industrial sectors and commodities provided by \(A_U\). Although other choices of aggregation are possible, it was proven in57 that the aggregator \(A_U^\prime\) is the only one that satisfies three natural axioms of linearity, value added neutrality, and partitioning. In the following we therefore confine ourselves to this case (the so-called standard aggregator). It follows from the definition of S and T that \(ST=\mathbbm {1}_M\) and that TS is a column-stochastic, idempotent matrix of rank M (see57 for a proof).

Although in principle any non-negative column-stochastic matrix could play the role of T, in practice it makes most sense to define it as

$$\begin{aligned} T=\textrm{diag}(\varvec{w})S^T [\textrm{diag}(S\varvec{w})]^{-1}\ , \end{aligned}$$
(31)

where \(\varvec{w}\) is a vector of N non-negative numbers, and \(\textrm{diag}(\varvec{w})\) is the diagonal matrix having the vector entries on the diagonal (in their natural order). According to Charnes and Cooper, “The main justification for this mode of consolidation is that it conforms to the way data would be synthesized ab initio if SAT rather than A were the objective”58. To better understand how standard aggregation works, consider as an example a \(6\times 6\) matrix \(A_U^T\) (whose elements we denote \(\alpha _{ij}\) for simplicity, so \(\alpha _{ij} = a_{ji}/Y_j\)). Let

$$\begin{aligned} S = \begin{pmatrix} 0 & 0 & 1 & 1 & 0 & 0\\ 1 & 1 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 1 & 1\\ \end{pmatrix} \ , \end{aligned}$$
(32)

and \(\varvec{w} = (w_1,w_2,w_3,w_4,w_5,w_6)\). Then

$$\begin{aligned} T = \textrm{diag}(\varvec{w})S^T [\textrm{diag}(S\varvec{w})]^{-1}= \begin{pmatrix} 0 & \frac{w_1}{w_1+w_2} & 0 \\ 0 & \frac{w_2}{w_1+w_2} & 0 \\ \frac{w_3}{w_3+w_4} & 0 & 0 \\ \frac{w_4}{w_3+w_4} & 0 & 0 \\ 0 & 0 & \frac{w_5}{w_5+w_6} \\ 0 & 0 & \frac{w_6}{w_5+w_6} \\ \end{pmatrix} \ , \end{aligned}$$
(33)

and the aggregator becomes

$$\begin{aligned} A_U^\prime =S A_U^T T = \begin{pmatrix} \frac{w_3 (\alpha _{33}+\alpha _{43})+w_4 (\alpha _{34}+\alpha _{44})}{w_3+w_4} & \frac{w_1 (\alpha _{31}+\alpha _{41})+w_2 (\alpha _{32}+\alpha _{42})}{w_1+w_2} & \frac{w_5 (\alpha _{35}+\alpha _{45})+w_6 (\alpha _{36}+\alpha _{46})}{w_5+w_6} \\ \frac{w_3 (\alpha _{13}+\alpha _{23})+w_4 (\alpha _{14}+\alpha _{24})}{w_3+w_4} & \frac{w_1 (\alpha _{11}+\alpha _{21})+w_2 (\alpha _{12}+\alpha _{22})}{w_1+w_2} & \frac{w_5 (\alpha _{15}+\alpha _{25})+w_6 (\alpha _{16}+\alpha _{26})}{w_5+w_6} \\ \frac{w_3 (\alpha _{53}+\alpha _{63})+w_4 (\alpha _{54}+\alpha _{64})}{w_3+w_4} & \frac{w_1 (\alpha _{51}+\alpha _{61})+w_2 (\alpha _{52}+\alpha _{62})}{w_1+w_2} & \frac{w_5 (\alpha _{55}+\alpha _{65})+w_6 (\alpha _{56}+\alpha _{66})}{w_5+w_6} \\ \end{pmatrix}\ . \end{aligned}$$
(34)
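The worked example can be checked numerically. The sketch below builds S from Eq. (32) and T from Eq. (31) for a hypothetical weight vector, and verifies the stated properties of ST and TS together with the top-left entry of Eq. (34). The weights `w`, the random matrix `alpha` (standing in for \(A_U^T\)) and the function name `standard_T` are illustrative assumptions.

```python
import numpy as np

def standard_T(S, w):
    """T = diag(w) S^T [diag(S w)]^{-1}, as in Eq. (31)."""
    return np.diag(w) @ S.T @ np.diag(1.0 / (S @ w))

# Grouping of Eq. (32): macro 1 = {3, 4}, macro 2 = {1, 2}, macro 3 = {5, 6}.
S = np.array([[0, 0, 1, 1, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 1]], dtype=float)
w = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # hypothetical weights
T = standard_T(S, w)
TS = T @ S

# Hypothetical column sub-stochastic matrix (all column sums equal to 0.5).
rng = np.random.default_rng(2)
alpha = rng.random((6, 6))
alpha = 0.5 * alpha / alpha.sum(axis=0, keepdims=True)

A_agg = S @ alpha @ T                           # standard aggregator, 3 x 3

# Top-left entry of Eq. (34), written with 0-based indices.
entry_11 = (w[2] * (alpha[2, 2] + alpha[3, 2])
            + w[3] * (alpha[2, 3] + alpha[3, 3])) / (w[2] + w[3])

print("S T = identity :", np.allclose(S @ T, np.eye(3)))
print("T S idempotent :", np.allclose(TS @ TS, TS))
print("rank of T S    :", np.linalg.matrix_rank(TS))
print("entry (1,1) ok :", np.isclose(A_agg[0, 0], entry_11))
```

Because T is column-stochastic, the aggregate matrix in this example keeps column sums of 0.5, so it remains column sub-stochastic.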

Now, let us assume that the vector of N upstreamness values in Eq. (30) can be faithfully approximated by our formula in Eq. (25), which can be written as

$$\begin{aligned} {[}\hat{\varvec{U}_1}^T]_N= {\varvec{1}}_N^T+\frac{1}{1-\bar{r}_N}\varvec{r}^T\ , \end{aligned}$$
(35)

where \(\varvec{r}\) is the (column) vector of row sums of the matrix \(A_U\) (or the column sums of \(A_U^T\), \(r_j = \sum _{i=1}^N \alpha _{ij}\)), and \({\bar{r}}_N\) is their average. Let us further assume that the original data matrix \(A_U\) is not known in its entirety (only its row sums are known), but the sectors/commodities in \(A_U\) have been aggregated using a known pair of matrices S and T. In other words, we are aware of which sectors/commodities have been lumped together (and with which relative weights) and what their aggregate outputs are, but we do not have more detailed information. We ask whether the knowledge of \(\varvec{r}, S\) and T is sufficient to determine \([\hat{\varvec{U}_1}^T]_M\), namely a faithful approximation for the M upstreamness values of the aggregate model. The answer is affirmative.

First, define

$$\begin{aligned} {[}{\varvec{U}_1}^T]_M= {\varvec{1}}_M^T[\mathbbm {1}_M-A_U^\prime ]^{-1}={\varvec{1}}_M^T[\mathbbm {1}_M-SA_U^T T]^{-1}\ , \end{aligned}$$
(36)

the vector of M upstreamness values, obtained using the aggregate matrix \(A_U^\prime\) as a source. The Leontief matrix on the r.h.s. of (36) is equal to the aggregate of the Leontief matrix of the so-called companion matrix \({\bar{A}}_U= A_U^T TS\)57, namely

$$\begin{aligned} {[}\mathbbm {1}_M-SA_U^T T]^{-1} = S[\mathbbm {1}_N-{\bar{A}}_U]^{-1}T\ . \end{aligned}$$
(37)

The proof follows by expanding \([\mathbbm {1}_M-SA_U^T T]^{-1}=\mathbbm {1}_M +SA_U^T T+(SA_U^T T)^2+\ldots\), and using \((SA_U^T T)^n=S(A_U^T TS)^nT\) and \(TST=T\).
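Spelled out at second order (a sketch that uses only \(ST=\mathbbm {1}_M\), hence \(TST=T(ST)=T\)), the key identity reads

$$\begin{aligned} (SA_U^T T)^2 = SA_U^T\,(TS)\,A_U^T T = SA_U^T\,(TS)\,A_U^T\,(TST) = S(A_U^T TS)^2T\ . \end{aligned}$$

The same step applies at every order n, and the \(n=0\) term is \(S\mathbbm {1}_N T=ST=\mathbbm {1}_M\). Summing over n on both sides then gives (37), provided both geometric series converge.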

Imagine now that the true matrix \(A_U^T\) appearing on the l.h.s. of (37) is replaced by its best rank-1 approximation, given by \({\hat{A}}^T\) (see Eq. (21)). From the fact that the rank of the product of two matrices (\({\hat{A}}^T\) and TS) is smaller than or equal to the smaller of the ranks of the two factors, and that TS is rank-M (and of course none of the matrices involved is a null matrix), it is easy to deduce that in this case the companion matrix will also be rank-1. Applying the Sherman-Morrison formula on the r.h.s. of (37), we get

$$\begin{aligned} S[\mathbbm {1}_N-{\hat{A}}^T TS]^{-1}T= \mathbbm {1}_M+\frac{1}{1-\phi (\varvec{r},S,T)}S({\hat{A}}^T TS)T=\mathbbm {1}_M+\frac{1}{1-\phi (\varvec{r},S,T)}S {\hat{A}}^T T\ , \end{aligned}$$
(38)

where we used \(S\mathbbm {1}_N T=ST=\mathbbm {1}_M\), and

$$\begin{aligned} \phi (\varvec{r},S,T)=\frac{1}{N}\sum _{i,k=1}^N r_i (TS)_{ik}\ . \end{aligned}$$
(39)

Eq. (38) shows how to construct a faithful rank-1 approximation for the upstreamness of the aggregate model starting from the knowledge of row sums of the original model, as well as of the matrices T and S implementing the aggregation.
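A compact numerical sketch of this construction follows. It assumes that the flat rank-1 matrix of Eq. (21) has the form \({\hat{A}}=\frac{1}{N}\varvec{r}{\varvec{1}}_N^T\) (same row sums as \(A_U\), homogeneous entries); under that assumption, multiplying Eq. (38) from the left by \({\varvec{1}}_M^T\) gives the closed form \({\varvec{1}}_M^T+\varvec{r}^T T/(1-\phi )\), which is our own simplification rather than a formula quoted from the text. The function name, the weights `w` and the row sums `r` are hypothetical.

```python
import numpy as np

def aggregate_upstreamness_rank1(r, S, T):
    """Approximate upstreamness of the M macro-sectors from r, S and T only.

    Assumes the rank-1 model A_hat = r 1^T / N; phi is Eq. (39).
    """
    N = r.size
    phi = float(r @ (T @ S) @ np.ones(N)) / N
    return 1.0 + (r @ T) / (1.0 - phi)

# Aggregation matrices of the 6 -> 3 example, with hypothetical weights.
S = np.array([[0, 0, 1, 1, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 1]], dtype=float)
w = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
T = np.diag(w) @ S.T @ np.diag(1.0 / (S @ w))

rng = np.random.default_rng(3)
N, M = 6, 3
r = rng.uniform(0.2, 0.8, size=N)               # hypothetical row sums of A_U

U_agg = aggregate_upstreamness_rank1(r, S, T)

# Cross-check: aggregate the flat matrix explicitly and invert the M x M system.
A_hat_T = np.outer(np.ones(N), r) / N           # transpose of r 1^T / N
U_direct = np.ones(M) @ np.linalg.inv(np.eye(M) - S @ A_hat_T @ T)

print(U_agg)
print(np.allclose(U_agg, U_direct))
```

The check only confirms that the closed form reproduces the aggregated flat model; how close it is to the upstreamness of a real aggregated table depends on how heterogeneous that table is.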

Summary and outlook

In this paper, we have shown that the upstreamness and downstreamness measures introduced in the context of I–O analysis at both the inter-sectorial and country level can be faithfully recovered from the knowledge of aggregate and local information about the I–O table. In other words, the precise determination of the elements of the I–O matrix does not matter much, as long as their distribution does not deviate significantly from the “homogeneous” (flat) model (described in Eq. (21)), and the total intermediate demand per sector is ordinarily sufficient to provide an accurate estimate of the sector’s multipliers.

Our rank-1 approximation has been successfully tested on national I–O tables obtained from WIOD, where an excellent correlation is obtained between the empirical multipliers and the theoretical formulae (see Figs. 3 and 6). Small deviations from this remarkably robust regularity are readily attributed to stronger heterogeneity in the empirical sectorial data, which would require refinements to the (single or doubly constrained) rank-1 approximation presented here.

Indeed, sparser or more heterogeneous I–O matrices tend to have a larger spectral radius (or equivalently a smaller spectral gap), as demonstrated in Figs. 7 and 8. The quality of our rank-1 approximation is very high across the sectors and countries considered, but may be inferior for empirical matrices with larger spectral radii, as more eigenvalues besides the largest (Perron-Frobenius) start to play an important role.

In Section 6, we have also shown how our rank-1 approximation is well-behaved with respect to aggregation of sectorial data: knowing what sectors/commodities are lumped together, and what their aggregate outputs are, is sufficient to determine a faithful approximation for the upstreamness values of the aggregate model, as the rank-1 nature of the approximation is preserved upon aggregation.

In a recent paper59, we further employ the rank-1 approximation as a proxy to investigate the “puzzling” correlations observed between upstreamness and downstreamness at aggregate level40. More generally, our approach based on a rank-1 approximation demonstrates that local and aggregate information about I–O tables is ordinarily sufficient to determine the upstreamness and downstreamness at sectorial and country level with high accuracy, while at the same time providing analytically tractable formulae (Eqs. (23)–(26)) that avoid matrix inversions altogether. The rank-1 formulae also prove useful to approximate centrality values of nodes in complex networks52,60. As an outlook for future research, it will be interesting to test the accuracy of our formulae on firm-level data, where data availability and sparsity are greater concerns. In spite of the sparser nature of the data, we would expect our approximation to work well, as recently shown in experiments conducted on synthetic data52.