Abstract
Improving many-body computational efficiency is crucial for exploring condensed matter systems. However, existing acceleration methods are limited and mostly based on von Neumann-like architectures. Here we leverage the capabilities of Field Programmable Gate Arrays for conducting quantum many-body calculations and realize a tenfold speedup over Central Processing Unit-based computation for a Monte Carlo algorithm. By using a supercell structure and simulating the hardware architecture with High-Level Synthesis, we achieve \(O(1)\) scaling for the time of one sweep, regardless of the overall system size. We also demonstrate the utilization of programmable hardware to accelerate a typical tensor network algorithm for ground-state calculations. Additionally, we show that the current hardware computing acceleration is on par with that of multi-threaded Graphics Processing Unit parallel processing. Our findings highlight the advantages of hardware implementation and pave the way for efficient many-body computations.
Introduction
Many-body computations are a pivotal research focus within the fields of condensed matter and statistical physics. These problems involve understanding systems composed of a large number of interacting particles, where the collective behavior cannot be easily deduced from the behavior of individual components. The complexity of such systems arises from the exponential growth of the Hilbert space with the number of particles, making exact solutions intractable, even for moderately sized systems. To tackle this challenge, researchers employ a variety of numerical techniques, including sampling methods such as classical and quantum Monte Carlo (MC)1, as well as blocking methods like Density Matrix Renormalization Group2 and tensor networks3,4. Among them, tensor networks offer a powerful way to approximate many-body wavefunctions by exploiting the inherent low entanglement in many physical systems, giving them an advantage in handling systems with a sign problem or frustration.
Traditionally, many-body calculations are implemented on the central processing unit (CPU) of a general-purpose computer based on the von Neumann architecture, on the graphics processing unit (GPU) for parallel processing, or on an integrated combination of the two5. However, certain bottleneck issues persist, such as the critical slowing down observed in calculations in the vicinity of a phase transition point. Therefore, exploring additional parallel approaches or strategies for many-body computations is imperative to overcome these challenges.
Besides improvements from the algorithmic point of view, an alternative approach is to optimize the computations from a hardware perspective. Taking the von Neumann architecture as the first reference example, a typical CPU follows a five-stage instruction execution process consisting of Instruction Fetch (IF), Instruction Decode (ID), Execution (EX), Memory Access (MEM), and Write Back (WB)6 [Fig. 1a]. Moreover, instructions and data are stored in the same memory, which creates a performance bottleneck7 due to the limited data transfer rate between the processor and memory. Meanwhile, the CPU frequency, already in the GHz range, is difficult to raise substantially due to physical limitations8. These bottlenecks limit the parallel performance of the CPU. For instance, although supercomputers feature multiple parallel processing units, increasing parallelism does not always lead to linear performance growth.
a In the case of CPU, initial data and instructions are stored in the same memory, and after passing through stages of Instruction Fetch (IF), Instruction Decode (ID), Execution (EX), Memory Access (MEM), and Write Back (WB), the expected data is computed. b In GPU, stages similar to those in CPU are kept, and the decoded instructions are concurrently executed in multiple threads to compute the expected data. c In FPGA, instructions are implemented by constructing logic gates (LGs), and initial data flows through several LGs to generate the expected data.
In general, one can improve computational efficiency by modifying or abandoning the von Neumann architecture, but this may come at the cost of losing flexibility. For example, GPUs introduce Single Instruction, Multiple Threads or Single Instruction, Multiple Data execution modes to improve parallelism [Fig. 1b]. However, the execution process still requires the IF, ID, and EX stages. Therefore, we categorize GPUs as hardware approximating the von Neumann architecture. In contrast, an application-specific integrated circuit (ASIC) is designed to perform specific tasks, sacrificing versatility at the hardware level but achieving high efficiency. For instance, specialized Tensor Processing Units (TPUs) can accelerate the simulation of quantum many-body dynamics problems9. Between the extreme cases of CPU/GPU and ASIC, a Field-Programmable Gate Array (FPGA) can also execute computational tasks in parallel at the hardware level by arranging the computational instructions into logic gates [Fig. 1c]. Take a loop operation as an example: an FPGA can use a pipelined architecture to accelerate it by breaking the computational tasks of each iteration into multiple stages. These stages are executed concurrently, allowing computation and data transfer to overlap and thereby reducing the overall latency. Meanwhile, the structure maintains considerable versatility, as its logic gates are reconfigurable. It is worth mentioning that pipelining and other parallel optimizations can also be applied to the CPU. However, unlike the direct manipulation of logic gates in an FPGA, this process still adheres to the von Neumann architecture, and thus it is not discussed in detail in this paper.
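To make the pipelining idea concrete, a minimal High-Level Synthesis (HLS)-style sketch of a pipelined loop is shown below. The pragma names follow Xilinx Vivado HLS, which is also used later in this work; the loop body itself is only illustrative and is not part of the designs described below.

```cpp
// Minimal Vivado HLS-style sketch of loop pipelining (illustrative only).
// With an initiation interval of II=1, the read, compute, and write stages of
// successive iterations overlap in hardware, so a new iteration starts every
// clock cycle once the pipeline is filled.
void axpy(const double x[1024], const double y[1024], double out[1024], double a) {
loop_axpy:
    for (int i = 0; i < 1024; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = a * x[i] + y[i];
    }
}
```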
Possessing valuable properties such as high throughput, high energy efficiency, and the capacity to maximize parallelism, FPGAs have emerged as versatile platforms in numerous fields, including quantum chemistry10,11, neural networks12, and high-energy physics13. For example, in a high-energy physics experiment, FPGAs process vast amounts of data from particle detectors, aiding in the collection and analysis of particle collisions13. The structure of the FPGA has even been implemented in a biological system14. In many-body computations, FPGAs have a wide range of applications in MC simulations, for example, to determine the configurations of particle clusters15,16,17,18,19 or to simulate integer-variable spin models like the various Ising models20,21,22,23,24,25,26,27,28,29,30,31. Although FPGAs have been proposed for handling tensor structures involved in neural networks32 and quantum circuits33, their substantial parallel advantages have not been fully exploited in the realm of many-body systems with more degrees of freedom, particularly in the area of tensor network algorithms.
Here, we implement two typical many-body algorithms on an FPGA, using the two-dimensional (2D) XY model and the one-dimensional (1D) Heisenberg chain as examples. Using only a single FPGA chip, we showcase its parallel acceleration capabilities (details of the chip are given in the Methods section). First, we construct a parallel scheme for the Metropolis MC updating procedure of the 2D classical XY model, achieving a tenfold speedup compared to the CPU. Second, we restructure the Infinite Time-Evolving Block Decimation (iTEBD)34, a typical tensor network algorithm, making it suitable for hardware execution and again outperforming the CPU. Additionally, we demonstrate that the current acceleration achieved by FPGA computing is comparable to that of multi-threaded GPU parallel computation. Our work achieves hardware FPGA acceleration for many-body computations, offering an alternative approach to this field in general.
Results
Parallel design of MC simulation for the XY model
We first focus on the 2D classical XY model, which is renowned for the Kosterlitz-Thouless (KT) phase transition35 and is described by the following Hamiltonian:
$$H=-\sum_{{{{\bf{x}}}},\delta }\cos \left({\theta }_{{{{\bf{x}}}}}-{\theta }_{{{{\bf{x}}}}+\delta }\right),$$
where \({{{\bf{x}}}}\) denotes a site on the square lattice, δ is a unit vector connecting this site with one of its nearest neighbors, and the variable \({\theta }_{{{{\bf{x}}}}}\in \left[0,2\pi \right)\) represents the direction of the spin at site x.
We employ the straightforward Metropolis algorithm to evolve the configurations of the system at a temperature T = 0.85, which is close to the KT phase transition. Because the square lattice is bipartite (in other words, each unit cell contains two lattice sites), half of the spins can be evolved independently and in parallel. Therefore, we divide the lattice system into two sets of subsystems and update them alternately (as illustrated in Fig. 2), following the approach used in ref. 20. In each subsystem, the MC procedure is further divided into three steps. First, random parameters are generated for each spin as the initialization of the Metropolis process. Second, the local internal energy difference caused by spin flipping is calculated, and the probability of flipping is determined. Third, a probabilistic decision to finish the Metropolis process is made, and the spin states are updated. More details of the Metropolis process are provided in Supplementary Note 1.
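A minimal C++ sketch of one such single-spin-flip trial, organized along these three steps, is given below. The data layout, the angle-proposal rule, and the function name ssft are illustrative assumptions rather than the exact FPGA implementation (see Supplementary Note 1 for the latter).

```cpp
#include <cmath>
#include <random>
#include <vector>

// One Metropolis single-spin-flip trial (SSFT) for the 2D XY model on an L x L
// periodic square lattice with H = -sum cos(theta_x - theta_{x+delta}).
// theta stores the spin angles in row-major order. Illustrative sketch only.
void ssft(std::vector<double>& theta, int L, int x, int y, double T,
          std::mt19937_64& rng) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);

    // Step 1: propose a new random angle for the selected spin.
    const double theta_new = 2.0 * std::acos(-1.0) * uni(rng);
    const double theta_old = theta[x * L + y];

    // Step 2: local energy difference from the four nearest neighbors.
    const int xp = (x + 1) % L, xm = (x + L - 1) % L;
    const int yp = (y + 1) % L, ym = (y + L - 1) % L;
    const double nb[4] = { theta[xp * L + y], theta[xm * L + y],
                           theta[x * L + yp], theta[x * L + ym] };
    double dE = 0.0;
    for (double t : nb)
        dE += std::cos(theta_old - t) - std::cos(theta_new - t);

    // Step 3: Metropolis acceptance decision and spin update.
    if (dE <= 0.0 || uni(rng) < std::exp(-dE / T))
        theta[x * L + y] = theta_new;
}
```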
a The XY model is decomposed into two sublattices A and B. b Flowchart of the MC simulation in which sublattices A and B are alternately evolved. c All the spins in a sublattice are updated by single-spin-flip trials (SSFTs), where each trial is split into three small stages with differently colored backgrounds. d Pipelined time-series sketch of one MC step. The colored rectangles represent the three stages of an SSFT, and the numbered circles and squares indicate which spin is updated. The SSFTs are mapped onto the FPGA in the form of pipelined small stages.
During the evolution process for each subsystem, the iterative steps for each spin are independent of one another, which avoids serially iterating over independent steps and allows the available hardware resources to be fully utilized for acceleration. The computation rate can be further improved by increasing the number of lattice sites in the unit cell. After implementing the entire MC algorithm in the FPGA hardware, we can readily employ the pipeline architecture to accelerate the computation. We calculate the energy of the system and obtain results consistent with those from CPUs (details shown in Supplementary Note 1). The computational speed advantage of the FPGA over the CPU and GPU is demonstrated in the Speedup results part of this section.
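The alternating checkerboard sweep can then be written as two pipelined passes over the lattice; the skeleton below reuses the ssft sketch above and is again only illustrative (on the FPGA it is the loop body over sites that is pipelined, as in Fig. 2d).

```cpp
// One MC step over the two sublattices (Fig. 2b): sublattice A (even parity) is
// swept first, then sublattice B (odd parity). Trials within a sublattice are
// independent, so on the FPGA the loop over sites is pipelined. Illustrative only.
void mc_step(std::vector<double>& theta, int L, double T, std::mt19937_64& rng) {
    for (int parity = 0; parity < 2; ++parity) {
    sweep:
        for (int x = 0; x < L; ++x)
            for (int y = 0; y < L; ++y) {
#pragma HLS PIPELINE
                if ((x + y) % 2 == parity)        // checkerboard decomposition
                    ssft(theta, L, x, y, T, rng); // SSFT from the sketch above
            }
    }
}
```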
Parallel design of iTEBD for the Heisenberg chain
The ground state \(\left|\Psi \right\rangle\) of a given system with the Hamiltonian H can be effectively determined by evolving an appropriate initial state \(\left|{\Psi }_{0}\right\rangle\) along the imaginary-time path, expressed as \(\left|\Psi \right\rangle ={{{\mathrm{lim}}}}_{\tau \to \infty }\exp \left(-\tau H\right)\left|{\Psi }_{0}\right\rangle\). In a tensor network calculation, discretizing the evolution operator \(\exp \left(-\tau H\right)\) into tensors greatly enhances the ability to capture entanglement in quantum systems, with the iTEBD algorithm serving as a prominent example of this approach. We then demonstrate the implementation of the iTEBD algorithm on FPGA for the 1D antiferromagnetic Heisenberg model. The Hamiltonian for this model is as follows:
$$H=\sum_{\left\langle {ij}\right\rangle }{H}_{{ij}}=\sum_{\left\langle {ij}\right\rangle }\sum_{\mu }{S}_{i}^{\mu }{S}_{j}^{\mu },$$
where the interaction \({H}_{{ij}}\) is applied locally between the nearest-neighbor sites i and j, and \({S}^{\mu }={\sigma }^{\mu }/2\), with σμ being the three Pauli matrices (μ = x, y, z).
In the usual iTEBD setup, the initial wavefunction \(\left|\Psi \right\rangle\) of the system is constructed as \(\left|\Psi \right\rangle =\cdots {\lambda }_{2,{ij}}{A}_{{jk}}^{\alpha }{\lambda }_{1,{kl}}{B}_{{lm}}^{\beta }{\lambda }_{2,{mp}}\cdots \left|\cdots \alpha \beta \cdots \right\rangle\), where A, B, and λ1,2 are local tensors, and all the virtual indices i, j, etc. are contracted. These ingredients are updated by the local imaginary-time evolution \(|\widetilde{\Psi }\rangle ={\prod }_{\left\langle {ij}\right\rangle }{U}_{T,{ij}}\left|\Psi \right\rangle ={\prod }_{\left\langle {ij}\right\rangle }{{{{\rm{e}}}}}^{-\tau {H}_{{ij}}}\left|\Psi \right\rangle\) until convergence, where the local gate UT has the matrix representation
$${U}_{T}=\left(\begin{array}{cccc}{e}_{0}&0&0&0\\ 0&{e}_{1}&{e}_{2}&0\\ 0&{e}_{2}&{e}_{1}&0\\ 0&0&0&{e}_{0}\end{array}\right)$$
with the bases \(\left\{\left|\uparrow \uparrow \right\rangle ,\left|\uparrow \downarrow \right\rangle ,\left|\downarrow \uparrow \right\rangle ,\left|\downarrow \downarrow \right\rangle \right\}\), and the nonzero elements of UT are \({e}_{0}=\left\langle \uparrow \uparrow \right|{{{{\rm{e}}}}}^{-\tau {H}_{{ij}}}\left|\uparrow \uparrow \right\rangle =\left\langle \downarrow \downarrow \right|{{{{\rm{e}}}}}^{-\tau {H}_{{ij}}}\left|\downarrow \downarrow \right\rangle\), \({e}_{1}=\left\langle \uparrow \downarrow \right|{{{{\rm{e}}}}}^{-\tau {H}_{{ij}}}\left|\uparrow \downarrow \right\rangle =\left\langle \downarrow \uparrow \right|{{{{\rm{e}}}}}^{-\tau {H}_{{ij}}}\left|\downarrow \uparrow \right\rangle\), and \({e}_{2}=\left\langle \uparrow \downarrow \right|{{{{\rm{e}}}}}^{-\tau {H}_{{ij}}}\left|\downarrow \uparrow \right\rangle =\left\langle \downarrow \uparrow \right|{{{{\rm{e}}}}}^{-\tau {H}_{{ij}}}\left|\uparrow \downarrow \right\rangle\). In our calculation, the length of one time step is set as τ = 0.01. Moreover, in the evolution process, it is necessary to utilize a singular value decomposition (SVD) approximation to ensure that the virtual bond dimension is fixed at a finite value Db.
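Assuming the conventional form of the local interaction, \({H}_{{ij}}={\sum }_{\mu }{S}_{i}^{\mu }{S}_{j}^{\mu }\) (triplet eigenvalue +1/4, singlet eigenvalue −3/4), the three gate elements have a simple closed form, sketched below for reference; this is an illustrative derivation rather than a listing of the FPGA code.

```cpp
#include <cmath>
#include <cstdio>

// Closed-form gate elements of U_T = exp(-tau * H_ij), assuming H_ij = S_i . S_j.
// In the two-spin basis, the triplet sector has energy +1/4 and the singlet -3/4,
// so e0, e1, e2 follow by projecting exp(-tau * H_ij) back onto the product basis.
int main() {
    const double tau = 0.01;                  // time step used in the main text
    const double wt = std::exp(-tau * 0.25);  // triplet Boltzmann weight
    const double ws = std::exp( tau * 0.75);  // singlet Boltzmann weight
    const double e0 = wt;                     // <uu|U_T|uu> = <dd|U_T|dd>
    const double e1 = 0.5 * (wt + ws);        // <ud|U_T|ud> = <du|U_T|du>
    const double e2 = 0.5 * (wt - ws);        // <ud|U_T|du> = <du|U_T|ud>
    std::printf("e0 = %.10f  e1 = %.10f  e2 = %.10f\n", e0, e1, e2);
    return 0;
}
```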
When performing tensor network computations on CPUs, tensors or matrices are often defined as classes. This allows for intuitive and crucial operations like permutation and reshaping of the classes themselves when contracting tensors with different indices. These operations enhance the intuitiveness and efficiency of tensor operations. However, in the existing FPGA architecture, the input and output data ultimately manifest in the tensor elements themselves. Operations like permutation and reshaping of tensors, which are suitable for CPUs, instead slow down computation speed and increase unnecessary resource consumption on FPGAs. Therefore, in FPGA implementation, one needs to modify the iTEBD algorithm from the perspective of tensor elements rather than classes. Following this principle, we redesign the iTEBD algorithm suitable for parallel execution on FPGAs (Fig. 3), avoiding excessive tensor operations. (Note that this improvement is also applicable to higher-order tensors that appear in high-dimensional tensor network algorithms.) For instance, instead of relying on the third-order tensor \({A}_{{ij}}^{\alpha }\), we employ two matrices \({A}_{\uparrow }\) and \({A}_{\downarrow }\), simplifying the process and enabling parallelism. A detailed discussion on the equivalence of this form with the original iTEBD can be found in Supplementary Note 2.
The left side depicts the variable flowchart, while the right side shows the corresponding matrix representations. The iTEBD process consists of three steps: First, after multiplying λ1 and λ2 into \({A}_{\uparrow }\), \({A}_{\downarrow }\), \({B}_{\uparrow }\), and \({B}_{\downarrow }\), \(A\) and \(B\) are directly multiplied to obtain the matrix \({AB}\). Second, \({AB}\) undergoes evolution through gate \({U}_{T}\) to yield \(M\), and singular value decomposition is applied to \(M\) to obtain matrices \(U\), \(\varLambda\), and \(V\), which are truncated to dimension \({D}_{{{{\rm{b}}}}}\) to generate \(\widetilde{A}\), \(\widetilde{B}\), and \({\lambda }_{2}\). Third, \({\lambda }_{1}^{-1}\) is multiplied into \(\widetilde{A}\), \(\widetilde{B}\) to obtain the new \(A\) and \(B\) for the next iteration of the iTEBD process.
In the iTEBD process, separated by the SVD step, the entire procedure is structured into three stages (Fig. 3): pre-SVD, SVD, and post-SVD. In the pre-SVD stage, the λs are multiplied into the matrices A↑, A↓, B↑, and B↓, followed by the multiplication of the As and Bs to derive a new matrix AB via \(A{B}_{\alpha \beta }={A}_{\alpha }{B}_{\beta }^{{{{\rm{T}}}}}\) (α, β can be ↑ or ↓). Subsequently, the non-zero matrix elements of UT are used to reassemble AB. In the SVD stage, the reassembled AB undergoes SVD using a parallelizable two-sided Jacobi algorithm (details in Supplementary Note 2). In the post-SVD stage, the obtained results are truncated to preserve the dimension Db, resulting in new matrices A, B, and λs. These newly derived matrices are fed back into the initial step, and the process is iterated until convergence. Using the wavefunction from these processes, we obtain a convergent ground-state energy consistent with the exact solution for the Heisenberg chain (details shown in Supplementary Note 2). The computational speed advantage of the FPGA over the CPU and GPU is shown in the next section.
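For concreteness, the pre-SVD stage can be sketched at the level of plain matrix elements as follows. The row-major storage, the block layout of the reassembled matrix, and the assumption that the λs have already been absorbed into A↑, A↓, B↑, and B↓ are illustrative choices, not a transcription of the HLS code.

```cpp
#include <vector>

// Pre-SVD stage of one iTEBD step (illustrative sketch, not the authors' HLS code).
// A_up/A_dn and B_up/B_dn are Db x Db matrices (row-major) into which the Schmidt
// weights have already been absorbed (first step of Fig. 3). The routine returns
// the 2Db x 2Db matrix M = U_T * AB that is handed to the Jacobi SVD stage.
using Mat = std::vector<double>;

Mat pre_svd(const Mat& A_up, const Mat& A_dn, const Mat& B_up, const Mat& B_dn,
            double e0, double e1, double e2, int Db) {
    // AB_{alpha beta} = A_alpha * B_beta^T : four Db x Db blocks
    auto matmulT = [Db](const Mat& X, const Mat& Y) {
        Mat Z(Db * Db, 0.0);
        for (int i = 0; i < Db; ++i)
            for (int j = 0; j < Db; ++j)
                for (int k = 0; k < Db; ++k)
                    Z[i * Db + j] += X[i * Db + k] * Y[j * Db + k];  // B transposed
        return Z;
    };
    Mat AB_uu = matmulT(A_up, B_up), AB_ud = matmulT(A_up, B_dn);
    Mat AB_du = matmulT(A_dn, B_up), AB_dd = matmulT(A_dn, B_dn);

    // Reassemble with the nonzero gate elements e0, e1, e2 of U_T; rows of M are
    // indexed by (alpha, i), columns by (beta, j) (this layout is our own choice).
    const int D2 = 2 * Db;
    Mat M(D2 * D2, 0.0);
    for (int i = 0; i < Db; ++i)
        for (int j = 0; j < Db; ++j) {
            M[i * D2 + j]                 = e0 * AB_uu[i * Db + j];
            M[i * D2 + (j + Db)]          = e1 * AB_ud[i * Db + j] + e2 * AB_du[i * Db + j];
            M[(i + Db) * D2 + j]          = e1 * AB_du[i * Db + j] + e2 * AB_ud[i * Db + j];
            M[(i + Db) * D2 + (j + Db)]   = e0 * AB_dd[i * Db + j];
        }
    return M;  // next: two-sided Jacobi SVD, truncation to Db, post-SVD rescaling
}
```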
Speedup results
We compare the computational time on the CPU, GPU, and FPGA for both the MC simulation of the XY model and the iTEBD calculation of the Heisenberg model, showcasing the significant acceleration achieved by the FPGA. The same set of modified MC and iTEBD algorithms is deployed across four different platforms: the CPU, the FPGA in sequential style, the FPGA in pipelined parallel style, and the GPU with 32 threads per warp. For the first two platforms, the simulations were designed in a completely sequential manner. We then measured the computation time of one MC step and one iTEBD step for all cases.
In the MC simulations for the XY model, we executed 10,000 MC steps for five different sizes L = 8, 16, 32, 64, and 128 on each platform to measure the average computation time. The results are displayed in Fig. 4a. While the computational speed of the FPGA in sequential style is slower than that of the CPU, a significant speed improvement is achieved with the FPGA in pipelined parallel style. For example, at the system size L = 128, the computation speed of the FPGA in parallel is 16.5 times that of the CPU and 27.3 times that of the FPGA in sequential style. For these three platforms, the computation time is approximately proportional to \({L}^{2}\), with slight variations. We also optimize the GPU performance with 32 threads per warp. The results show that FPGA parallelization achieves better acceleration for L ≤ 32. However, since only pipelined parallelism is used on the FPGA, its performance becomes slightly less efficient than the GPU for larger L, where multi-threading on the GPU is effectively equivalent to unrolling. Additionally, the multi-threaded GPU acceleration generally outperforms the multi-core CPU. We confirmed this fact using OpenMP results with 2 to 32 CPU cores and demonstrated that the acceleration on the CPU is far from linear with respect to the number of cores (details in Supplementary Note 3). Compared to the higher acceleration achieved with the proposed probabilistic bit (p-bit) computer on the Ising model36, the relatively lower acceleration in the XY model is mainly due to the use of higher-precision variables (64-bit floating-point) instead of the integer variables of the Ising model (details in Supplementary Note 6). This consumes more hardware resources, limiting our ability to further optimize using the unroll approach.
The computational time per step for a, the MC simulation of the XY model and b, the iTEBD calculation of the Heisenberg chain. The red squares, black triangles, orange diamonds, and blue circles represent the computational time for the FPGA in pipelined parallel style, the FPGA in sequential style (unpiped), the GPU, and the CPU, respectively. The error bars indicate a 95% confidence interval. The gray dashed lines represent the fitted results of the computational time with the fitting function \({X}^{x}\), where \(x\) is the fitting parameter and \(X=L\) or \({D}_{{{{\rm{b}}}}}\) for a or b, respectively. In a, the fitted values of \(x\) for the FPGA in parallel style, the FPGA in sequential style, the GPU, and the CPU are 1.99, 2, 1.46, and 2.2, respectively. In b, the corresponding values of \(x\) are 2.88, 2.82, 1.09, and 3.02. The insets illustrate the corresponding data on a log-log scale.
Figure 4b illustrates the computation time of one iTEBD step for the Heisenberg chain system. Starting from several fixed initial states, we executed 10,000 to 100,000 iteration steps for 11 different bond dimensions \({D}_{{{{\rm{b}}}}}\le\) 30, aiming to achieve convergent final states with the same level of precision. Similar to the MC simulation, the FPGA in sequential style takes the longest time among the four platforms, while the computational speed of the FPGA in parallel surpasses that of the CPU. However, the acceleration effect is not as pronounced as in the MC calculations. For instance, at \({D}_{{{{\rm{b}}}}}=\) 30, the computation speed of the FPGA in parallel is 1.7 times that of the CPU and 6.7 times that of the FPGA in sequential style, and it also achieves better acceleration than the GPU with 32 threads per warp for \({D}_{{{{\rm{b}}}}} < \) 30. Nevertheless, this result remains encouraging because only the pipeline parallelism of the FPGA is used so far. This suggests the potential for achieving more pronounced acceleration at larger \({D}_{{{{\rm{b}}}}}\).
From Fig. 4, it can be observed that the error in the runtime of the FPGA in parallel is smaller than that of the CPU for both the MC and the iTEBD calculation. This highlights another advantage of the FPGA: deterministic latency. This is attributed to the architectural differences between the two platforms. When executing complex tasks on the CPU, rescheduling, instruction fetching, decoding, and other operations introduce variability in the runtime of each execution. In contrast, owing to its hardware programmability, each operation in an FPGA is fixed within a clock cycle, resulting in smaller runtime errors than on the CPU.
Although in both the MC and iTEBD cases the CPU outperforms the FPGA in sequential style, owing to the CPU's higher frequency and the relative simplicity of the algorithms currently considered, parallelization on an FPGA generally results in higher computational efficiency than on the CPU. Compared to the parallelism of CPUs, which mainly relies on multi-core processors and faces issues such as communication and synchronization overheads as well as shared-memory access conflicts, parallelization on an FPGA is much more efficient. FPGAs enable deterministic memory access by allowing direct data transfer between logic blocks without intermediate caching. This reduces overheads and simplifies processes. Additionally, parallel access to multiple memory blocks boosts bandwidth and avoids conflicts. These observations suggest that appropriately increasing the complexity of the existing algorithm to align with the FPGA architecture can help improve the computation rate. Here, we take the MC calculations as an example and increase the number of sites in the unit cell during the computation to compare with the previous results of the bipartite-lattice calculations.
We first set the number of lattice sites in the unit cell to four (see Supplementary Note 4) and calculate the XY model on a square lattice with only nearest-neighbor (NN) and with both NN and next-nearest-neighbor (NNN) interactions, as well as the XY model and Ising model on a square-octagonal lattice with NN interactions, as shown in Fig. 5. (Note that, except for the number of lattice sites N = 16 and 64 in Fig. 5d, which were emulated on our chip, all other data were simulated on a CPU following the FPGA architecture.) Since the time per sweep on the FPGA scales linearly with N, we use \({aN}+b\) to fit the relationship between time and N. For the sequential results, \(b\,\ll\, a\), whereas for the pipelined results, \(b/a\) is larger, which can be used as a metric to characterize the speedup effect. The value of \(b/a\) shown in Fig. 5a is larger than that from Fig. 4a, where \(b/a=\) 35.0, indicating that increasing the number of lattice sites in a unit cell for the same system indeed contributes to improving the computation speed. The pipelined results in Fig. 5b, c, which consider NNN interactions and a different lattice system, exhibit similar acceleration effects. In Fig. 5d, the results for the Ising model also show a speedup compared to the CPU results36, suggesting that our system can also be extended to use p-bits36 for simulating quantum models. Furthermore, we consider a supercell by setting the number of lattice sites in the unit cell to \(N/4\); with pipelining, this achieves \(O\left(1\right)\) scaling of the time per sweep with the number of lattice sites, as sketched below. In fact, for periodic many-body models with interactions beyond NN or NNN, a similar speedup can be achieved by increasing the number of lattice sites in the unit cell. For irregular graphs, more sophisticated algorithms are needed to sparsify the graph37. These optimizations require converting the existing variables into more hardware components. Although this makes the program appear lengthy, it can be easily achieved by appropriately encoding and sorting each variable that describes the hardware to generate the code needed for execution.
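The mapping from "more sites per unit cell" to hardware resources can be sketched as follows: the loop over unit cells is pipelined while the sites inside one cell are fully unrolled. The template parameters, the stand-in update_site body, and the pragma placement are illustrative assumptions rather than the actual supercell design.

```cpp
// Stand-in for one single-spin-flip trial; the real body is as in the MC sketch
// above and is replaced here only so that the skeleton compiles on its own.
inline void update_site(double* cell, const double* rnd, int s) { cell[s] += rnd[s]; }

// Sweep with S sites per unit cell: the cell loop is pipelined and the S trials
// inside a cell are unrolled into parallel hardware, so the time per sweep scales
// as ~N/S plus the pipeline fill; choosing S = N/4 (a supercell) gives O(1) scaling.
template <int NCELLS, int S>
void sweep_supercell(double theta[NCELLS][S], const double rnd[NCELLS][S]) {
cells:
    for (int c = 0; c < NCELLS; ++c) {
#pragma HLS PIPELINE II=1
    sites:
        for (int s = 0; s < S; ++s) {
#pragma HLS UNROLL
            update_site(theta[c], rnd[c], s);  // independent trials within one cell
        }
    }
}
```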
The computational time (simulated on a CPU for the FPGA) per MC sweep for a, the XY model on the square lattice, b, the XY model with NNN interactions on the square lattice, c, the XY model on the square-octagonal lattice, and d, the Ising model on the square-octagonal lattice, with the system lattice sizes \(N=\) 16, 64, 144, 256, 1024, and 4096. The black triangles, red squares, and orange pentagons represent the computational time for the FPGA in sequential style, in pipelined parallel style with 4 sites in one unit cell, and in pipelined parallel style with a supercell, respectively. In d, the cases \(N=\) 16 and 64 were verified on our FPGA chip and are labeled as solid pentagons. The gray dashed lines represent the fitted results with \({aN}+b\). The ratios \(b/a\), extracted from the results of the pipelined parallel style with 4 sites in one unit cell, are shown at the center of each panel. The insets in the bottom-right corner illustrate the lattice configurations.
Discussion
While the ultimate solution to the exponential wall problem in many-body systems will eventually rely on quantum computers38, the full utilization of classical algorithms and hardware remains meaningful at present. Generally, the time complexity for computing many-body systems follows a growth pattern of \(h{X}^{a}\), where X represents the controllable degrees of freedom for the particular problem, such as system size \(L\) in Monte Carlo simulations or bond dimension \({D}_{{{{\rm{b}}}}}\) in tensor network calculations. Improvements in algorithms are reflected in the reduction of a, while hardware parallel optimization alters h. Therefore, comprehensive improvements in both algorithms and hardware will contribute to the efficiency of many-body computations.
It is worth noting that utilizing hardware like an FPGA for computations does not diminish the importance of algorithms. On the contrary, algorithms remain crucial, and it is essential to modify them to be suitable for parallel execution on the FPGA in order to fully leverage its parallel processing capabilities. For example, higher-order p-bit Ising machines have been used to achieve \(O\left(1\right)\) scaling for one MC sweep in dense graph problems37, independent of the total system size. From an algorithmic perspective, whether for MC or iTEBD, our method is universal and can be naturally extended to solve other many-body models. Additionally, for MC, we can further improve the computational efficiency by applying cluster algorithms like the Swendsen-Wang algorithm, and extend the system to general and irregular lattices by applying a graph coloring strategy39. Returning to hardware optimization, we can further accelerate the MC simulation by utilizing a random number generator better suited to the FPGA40 and by optimizing the operand bit width with strategies like mixed precision41.
Through comparison with a 32-thread GPU, we found that for the current relatively small \(L\) and \({D}_{{{{\rm{b}}}}}\), parallel FPGA acceleration has a slight advantage. However, the GPU demonstrates better scaling behavior with increasing \(L\) or \({D}_{{{{\rm{b}}}}}\). This indicates that for larger \(L\) or \({D}_{{{{\rm{b}}}}}\), the parallel FPGA acceleration scheme needs improvement. The GPU's advantage comes from having more memory, allowing more threads to achieve better acceleration (details in Supplementary Note 3). Although the GPU's multi-threading approach is similar to unrolling and its current computational performance scales well with the system size, the acceleration achieved by the GPU becomes less pronounced as the number of threads increases. In contrast, the FPGA demonstrates a more substantial acceleration effect when transitioning from serial to pipelined execution. Moreover, increasing the number of lattice sites in the unit cell further amplifies the acceleration effect, highlighting the significant potential of FPGA for parallel acceleration when more hardware resources are available. Additionally, the implementation of many-body computations across different hardware platforms is not a matter of competition. On the contrary, their synergy and complementarity can better drive breakthroughs in overcoming computational bottlenecks in the future.
For the case of iTEBD calculations in this paper, our primary goal is to implement a basic tensor network parallel algorithm on FPGA, thereby providing a reference for the hardware implementation of more complex tensor network parallel algorithms42. Therefore, we have focused on avoiding tensor reshaping and permutation operations at the algorithmic level and have only employed pipeline parallelism at the hardware level. However, the most time-consuming aspect of tensor calculations is the SVD process. With improvements in hardware programming for the SVD process, such as utilizing modified systolic arrays43, coordinate rotation digital computer (CORDIC)-like technology44, and p-bit implementations36,45,46, the computational speed of tensor network algorithms on FPGA will be enhanced.
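For reference, the sketch below shows a compact one-sided Jacobi SVD kernel, a simpler cousin of the two-sided scheme used in our iTEBD step, whose independent column-pair rotations are exactly the kind of operation that maps well onto systolic or pipelined hardware. It is a minimal sketch for illustration, not the implementation used in this work.

```cpp
#include <cmath>
#include <vector>

// One-sided (Hestenes) Jacobi SVD sketch for an n x n row-major matrix M: columns
// are repeatedly orthogonalized in pairs; on convergence the column norms are the
// singular values, the normalized columns of M form U, and V accumulates the
// rotations so that the original matrix equals U * diag(sigma) * V^T.
// Disjoint (p,q) pairs can be processed in parallel, which is what makes
// Jacobi-type SVD attractive for FPGA pipelines or systolic arrays.
void jacobi_svd(std::vector<double>& M, std::vector<double>& V,
                std::vector<double>& sigma, int n, int sweeps = 30) {
    V.assign(n * n, 0.0);
    for (int i = 0; i < n; ++i) V[i * n + i] = 1.0;   // V starts as the identity
    for (int s = 0; s < sweeps; ++s)
        for (int p = 0; p < n - 1; ++p)
            for (int q = p + 1; q < n; ++q) {
                double a = 0, b = 0, c = 0;           // 2x2 Gram-matrix entries
                for (int i = 0; i < n; ++i) {
                    a += M[i * n + p] * M[i * n + p];
                    b += M[i * n + q] * M[i * n + q];
                    c += M[i * n + p] * M[i * n + q];
                }
                if (std::fabs(c) < 1e-15) continue;   // columns already orthogonal
                double zeta = (b - a) / (2.0 * c);    // Jacobi rotation parameter
                double t = (zeta >= 0 ? 1.0 : -1.0) /
                           (std::fabs(zeta) + std::sqrt(1.0 + zeta * zeta));
                double cs = 1.0 / std::sqrt(1.0 + t * t), sn = cs * t;
                for (int i = 0; i < n; ++i) {         // rotate columns p and q
                    double mp = M[i * n + p], mq = M[i * n + q];
                    M[i * n + p] = cs * mp - sn * mq;
                    M[i * n + q] = sn * mp + cs * mq;
                    double vp = V[i * n + p], vq = V[i * n + q];
                    V[i * n + p] = cs * vp - sn * vq;
                    V[i * n + q] = sn * vp + cs * vq;
                }
            }
    sigma.assign(n, 0.0);                             // singular values = column norms
    for (int j = 0; j < n; ++j) {
        double norm = 0.0;
        for (int i = 0; i < n; ++i) norm += M[i * n + j] * M[i * n + j];
        sigma[j] = std::sqrt(norm);
    }
}
```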
Another aspect worth discussing is the current limitation of resources on FPGA chips. For example, although the Block Random Access Memory (BRAM) of the chip we are using is only 15.6 Mb, it can handle many-body problems of moderate scale in \(L\) or \({D}_{{{{\rm{b}}}}}\) (details in Supplementary Note 5). However, in tensor network computations, models involving sign problems47 and the competition of various phases48,49 require larger \({D}_{{{{\rm{b}}}}}\), and solving these problems requires FPGAs with larger memory. The success of TPUs9,50 in tensor network calculations demonstrates that this is not a technically insurmountable problem. Therefore, optimizing many-body computations on FPGAs can stimulate the development of chip hardware.
Conclusions
In summary, we have employed an FPGA to achieve hardware acceleration for two distinct many-body computation methods. In the MC evolution of the 2D classical XY model, subsystem parallelization boosts the evolution speed, which is crucial for mitigating issues related to critical slowing down. In the iTEBD tensor network calculation of the 1D Heisenberg chain, we have devised a structure optimized for hardware dataflow, streamlining computational steps based on tensor elements. This structural optimization, combined with parallelization, results in speeds surpassing those attainable on a CPU. To the best of our knowledge, this represents the first FPGA implementation of a tensor network algorithm. Our current FPGA acceleration matches the performance of multi-threaded GPU parallel processing. Our findings offer an alternative perspective on many-body computations, fostering interdisciplinary collaboration between the fields of many-body algorithms and hardware.
Methods
FPGA hardware
The FPGA chip model used in this work is XC7K325TFFG900-2. The driving pulse period is 10 ns, corresponding to a frequency of 0.1 GHz. This chip has 890 Block RAM (BRAM) units, each with a capacity of 18 kilobits (Kb), abbreviated as BRAM_18K. In other words, the entire chip has only 15.6 Mb of memory available for storing data, which severely limits the size of the computational system. Additionally, the chip features 840 Digital Signal Processing slices of the 48E version (DSP48E), 407,600 flip-flops (FF), and 203,800 look-up tables (LUT). The detailed power usage of the chip is in Supplementary Note 7.
In the MC simulation for the XY model with \(L=\) 128, the utilization percentages of BRAM_18K, DSP48E, FF, and LUT are 42%, 39%, 5%, and 18%, respectively. In the iTEBD calculation for the Heisenberg chain with \({D}_{{{{\rm{b}}}}}=\) 30, the utilization percentages of BRAM_18K, DSP48E, FF, and LUT are 42%, 63%, 24%, and 69%, respectively. For Fig. 5, the CPU simulation following the FPGA architecture is performed using the High-Level Synthesis (HLS) strategy with the Xilinx Vivado HLS software.
CPU hardware
The CPU used in this work is the Intel Xeon Gold 6230, with a clock frequency of 2.10 GHz. The CPU computations are performed on one core of a Dell PowerEdge R740 server. In a single-core environment, the peak memory for the MC simulation of the \(L=\) 128 XY system can reach 80 Mb, with an average memory of 24 Mb. For the iTEBD calculation of a \({D}_{{{{\rm{b}}}}}=\) 30 Heisenberg system, both the peak and average memory are 16 Mb.
GPU hardware
The GPU used in this work is the Quadro K620, with a clock frequency of 1.058 GHz. For both of the \(L=\) 128 XY system and \({D}_{{{{\rm{b}}}}}=\) 30 Heisenberg system, the memory usage, including the data memory and CUDA library cache, is 168 Mb.
Data availability
All source data from the main manuscript can be accessed in Supplementary Data 1, and other data associated with this work are available from the corresponding authors upon request.
Code availability
The codes that support the findings of this study are available from https://zenodo.org/records/11079723.
References
Sandvik, A. W. Computational Studies of Quantum Spin Systems. In: AIP Conference Proceedings. AIP (2010).
White, S. R. Density matrix formulation for quantum renormalization groups. Phys. Rev. Lett. 69, 2863–2866 (1992).
Cirac, J. I., Pérez-García, D., Schuch, N. & Verstraete, F. Matrix product states and projected entangled pair states: Concepts, symmetries, theorems. Rev. Mod. Phys. 93, 045003 (2021).
Meurice, Y., Sakai, R. & Unmuth-Yockey, J. Tensor lattice field theory for renormalization and quantum computing. Rev. Mod. Phys. 94, 025005 (2022).
DePrince, E. & Hammond, J. R. Quantum Chemical Many-Body Theory on Heterogeneous Nodes. In: 2011 Symposium on Application Accelerators in High-Performance Computing) (2011).
Patterson, D. A. & Hennessy, J. L. Computer Organization and Design: The Hardware/Software Interface, 6 edn. Morgan Kaufmann Publishers (2020).
Backus, J. Can programming be liberated from the von Neumann style?: a functional style and its algebra of programs. Commun. ACM 21, 29 (1978).
Markov, I. L. Limits on fundamental limits to computation. Nature 512, 147–154 (2014).
Morningstar, A. et al. Simulation of quantum many-body dynamics with tensor processing units: Floquet Prethermalization. PRX Quantum 3, 020331 (2022).
Rodríguez-Borbón, J. M. et al. Field programmable gate arrays for enhancing the speed and energy efficiency of quantum dynamics simulations. J. Chem. Theory Comput. 16, 2085–2098 (2020).
Gordon, M. S. et al. Novel Computer Architectures and Quantum Chemistry. J. Phys. Chem. A 124, 4557–4582 (2020).
Mittal, S. A survey of FPGA-based accelerators for convolutional neural networks. Neural Comput. Appl. 32, 1109–1139 (2018).
PandaX-Collaboration. Limits on the luminance of dark matter from xenon recoil data. Nature 618, 47–50 (2023).
Lv, H. et al. DNA-based programmable gate arrays for general-purpose DNA computing. Nature 622, 292–300 (2023).
Gothandaraman, A., Peterson, G. D., Warren, G. L., Hinde, R. J. & Harrison, R. J. FPGA acceleration of a quantum Monte Carlo application. Parallel Comput. 34, 278–291 (2008).
Gothandaraman, A., Peterson, G. D., Warren, G. L., Hinde, R. J. & Harrison, R. J. A Hardware-Accelerated Quantum Monte Carlo framework (HAQMC) for N-body systems. Comput. Phys. Commun. 180, 2563–2573 (2009).
Gothandaraman, A., Peterson, G. D., Warren, G. L., Hinde, R. J. & Harrison, R. J. A pipelined and parallel architecture for quantum Monte Carlo Simulations on FPGAs. VLSI Des. 2010, 1–8 (2010).
Weber, R., Gothandaraman, A., Hinde, R. J. & Peterson, G. D. Comparing hardware accelerators in scientific applications: a case study. IEEE Trans. Parallel Distrib. Syst. 22, 58–68 (2011).
Cardamone, S. et al. Field‐programmable gate arrays and quantum Monte Carlo: Power efficient coprocessing for scalable high‐performance computing. Int. J. Quant. Chem. 119, (2019).
Ortega-Zamorano, F., Montemurro, M. A., Cannas, S. A., Jerez, J. M. & Franco, L. FPGA Hardware Acceleration of Monte Carlo Simulations for the Ising Model. IEEE Trans. Parallel Distrib. Syst. 27, 2618–2627 (2016).
Yoshimura, C., Hayashi, M., Okuyama, T. & Yamaoka, M. FPGA-based Annealing Processor for Ising Model. In: 2016 Fourth International Symposium on Computing and Networking (CANDAR). IEEE (2016).
Baity-Jesi, M. et al. The Janus project: boosting spin-glass simulations using FPGAs. IFAC Proc. Vol. 46, 227–232 (2013).
Zhai, Q. et al. Scaling law describes the spin-glass response in theory, experiments, and simulations. Phys. Rev. Lett. 125, 237202 (2020).
Okuyama, T., Hayashi, M. & Yamaoka, M. An Ising computer based on simulated quantum annealing by path integral Monte Carlo method. IEEE International Conference on Rebooting Computing, 124-129 (2017).
Waidyasooriya, H. M., Araki, Y. & Hariyama, M. Accelerator Architecture for Simulated Quantum Annealing Based on Resource-Utilization-Aware Scheduling and its Implementation Using OpenCL. International Symposium on Intelligent Signal Processing and Communication Systems, 335–340 (2018).
Liu, C. -Y., Waidyasooriya, H. M. & Hariyama, M. Data-Transfer-Bottleneck-Less Architecture for FPGA-Based Quantum Annealing Simulation. In: 2019 Seventh International Symposium on Computing and Networking (CANDAR) (2019).
Waidyasooriya, H. M., Hariyama, M., Miyama, M. J. & Ohzeki, M. OpenCL-based design of an FPGA accelerator for quantum annealing simulation. J. Supercomput. 75, 5019–5039 (2019).
Liu, C.-Y., Waidyasooriya, H. M. & Hariyama, M. Design space exploration for an FPGA-based quantum annealing simulator with interaction-coefficient-generators. J. Supercomput. 78, 1–17 (2021).
Waidyasooriya, H. M. & Hariyama, M. Highly-Parallel FPGA Accelerator for Simulated Quantum Annealing. IEEE Trans. Emerg. Top. Comput. 9, 2019–2029 (2021).
Chung, Y.-H., Shih, C.-J. & Hung, S.-H. Accelerating Simulated Quantum Annealing with GPU and Tensor Cores. In: High Performance Computing) (2022).
Kubuta, T., Shin, D., Onizawa, N. & Hanyu, T. Stochastic Implementation of Simulated Quantum Annealing on PYNQ. In: 2023 International Conference on Field Programmable Technology (ICFPT) (2023).
Zhang, K., Hawkins, C., Zhang, X. Hao, C. & Zhang, Z. On-FPGA training with ultra memory reduction: A low-precision tensor method. arXiv:2104.03420, (2021).
Levental, M. Tensor networks for simulating quantum circuits on FPGAs. arXiv:2108.06831, (2021).
Vidal, G. Classical simulation of infinite-size quantum lattice systems in one spatial dimension. Phys. Rev. Lett. 98, 070201 (2007).
Kosterlitz, J. M. & Thouless, D. J. Ordering, metastability and phase transitions in two-dimensional systems. J. Phys. C: Solid State Phys. 6, 1181–1203 (1973).
Chowdhury, S., Camsari, K. Y. & Datta, S. Accelerated quantum Monte Carlo with probabilistic computers. Commun. Phys. 6, (2023).
Nikhar, S., Kannan, S., Aadit, N. A., Chowdhury, S. & Camsari, K. Y. All-to-all reconfigurability with sparse and higher-order Ising machines. Nat. Commun. 15, 8977 (2024).
Kim, Y. et al. Evidence for the utility of quantum computing before fault tolerance. Nature 618, 500–505 (2023).
Aadit, N. A. et al. Massively parallel probabilistic computing with sparse Ising machines. Nat. Electron. 5, 460–468 (2022).
Lin, Y., Wang, F. & Liu, B. Random number generators for large-scale parallel Monte Carlo simulations on FPGA. J. Comput. Phys. 360, 93–103 (2018).
Chow, G. C. T. et al. A Mixed Precision Monte Carlo Methodology for Reconfigurable Accelerator Systems. FPGA 12: Proceedings of the 2012 ACM-SIGDA International Symposium on Field Programmable Gate Arrays, 57–66 (2012).
Secular, P., Gourianov, N., Lubasch, M., Dolgov, S., Clark, S. R. & Jaksch, D. Parallel time-dependent variational principle algorithm for matrix product states. Phys. Rev. B 101, 235123 (2020).
Luk, F. T. A parallel method for computing the generalized singular value decomposition. J. Parallel Distrib. Comput. 2, 250–260 (1985).
Zhang, S., Tian, X., Xiong, C., Tian, J. & Ming, D. Fast Implementation for the Singular Value and Eigenvalue Decomposition Based on FPGA. Chin. J. Electron. 26, 132–136 (2017).
Niazi, S. et al. Training deep Boltzmann networks with sparse Ising machines. Nat. Electron. 7, 610–619 (2024).
Singh, N. S. et al. CMOS plus stochastic nanomagnets enabling heterogeneous computers for probabilistic inference and learning. Nat. Commun. 15, 2685 (2024).
Liu, Y., Lv, S., Yang, Y. & Zou, H. Signatures of Quantum Criticality in the Complex Inverse Temperature Plane. Chin. Phys. Lett. 40, 050502 (2023).
Liu, W.-Y. et al. Gapless quantum spin liquid and global phase diagram of the spin-1/2 J1-J2 square antiferromagnetic Heisenberg model. Sci. Bull. 67, 1034–1041 (2022).
Zou, H., Yang, F. & Ku, W. Nearly degenerate ground states of a checkerboard antiferromagnet and their bosonic interpretation. Sci. China Phys. Mech. Astron 67, 217211 (2023).
Ganahl, M. et al. Density matrix renormalization group with tensor processing units. PRX Quantum 4, 010317 (2023).
Acknowledgements
We thank Jianguo Ma, Deyan Sun for helpful discussions. This work is supported by National Natural Science Foundation of China Grant (No. 12274126, 12105177) and National Key R&D Program of China (No. 2023YFF0719200).
Author information
Authors and Affiliations
Contributions
Haiyuan Zou and Qibin Zheng supervised the project. Songtai Lv, Yuchen Meng, and Yang Liu carried out the algorithmic modification, calculations and data analysis. Yang Liang, Xiaochen Yao, and Jincheng Xu carried out the FPGA hardware implementation. All the authors contributed in the manuscript writing and participated in the discussion.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Communications Physics thanks the anonymous reviewers for their contribution to the peer review of this work. [A peer review file is available].
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Lv, S., Liang, Y., Meng, Y. et al. Many-body computing on Field Programmable Gate Arrays. Commun Phys 8, 117 (2025). https://doi.org/10.1038/s42005-025-02050-z