Why is FPGA faster than CPU and GPU?
2024-08-12
Both the CPU and the GPU belong to the von Neumann architecture: they fetch and decode instructions before executing them, and they communicate through shared memory. The reason an FPGA can be faster than a CPU or GPU comes down to its architecture: no instructions and no shared memory.
In the von Neumann architecture, because an execution unit may execute any instruction, it needs instruction memory, a decoder, arithmetic units for the various instructions, and branch/jump handling logic. In an FPGA, the function of each logic unit is fixed when the device is programmed, so no instructions are needed.
Memory in the von Neumann architecture serves two purposes: saving state, and communication between execution units.
1) Saving state: the registers and on-chip memory (BRAM) in an FPGA belong to their own control logic, so no extra arbitration or caching is needed.
2) Communication: the connections between each logic unit of an FPGA and its neighboring logic units are fixed when the device is programmed, so there is no need to communicate through shared memory.
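The contrast between the two architectures can be sketched as a toy software model (this is an illustration only, not real hardware; both functions compute the same expression `(a + b) * c`):

```python
# Toy contrast between instruction execution and fixed-function dataflow.

def von_neumann_style(program, a, b, c):
    # CPU model: fetch each instruction, decode it, then execute it;
    # operands pass through a shared register file (shared state).
    regs = {"r0": a, "r1": b, "r2": c, "acc": 0}
    for op, dst, src1, src2 in program:          # fetch
        if op == "add":                          # decode
            regs[dst] = regs[src1] + regs[src2]  # execute
        elif op == "mul":
            regs[dst] = regs[src1] * regs[src2]
    return regs["acc"]

def fpga_style(a, b, c):
    # FPGA model: the "circuit" is fixed in advance; data flows directly
    # from the adder to the multiplier with no instruction fetch, decode,
    # or shared register file in between.
    return (a + b) * c

program = [("add", "acc", "r0", "r1"), ("mul", "acc", "acc", "r2")]
print(von_neumann_style(program, 2, 3, 4))  # 20
print(fpga_style(2, 3, 4))                  # 20
```

Both produce the same result; the difference is that the von Neumann version pays per-operation fetch/decode overhead and routes every value through shared state, while the dataflow version is the wiring itself.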
For computationally intensive tasks:
In data centers, the core advantage of FPGA over GPU is latency.
Why is FPGA latency so much lower than GPU latency? Essentially, it is an architectural difference: an FPGA has both pipeline parallelism and data parallelism, whereas a GPU has almost only data parallelism (its pipeline depth is limited).
Suppose processing a data packet takes 10 steps. An FPGA can build a 10-stage pipeline in which different stages work on different packets simultaneously; each packet is finished after passing through all 10 stages and can be output immediately. The GPU's data-parallel approach instead uses 10 compute units, each processing a different packet, but all units must do the same thing in lockstep (SIMD), which requires 10 packets to enter and exit together. When tasks arrive one by one rather than in batches, pipeline parallelism achieves lower latency than data parallelism. Therefore, for streaming computation, the FPGA has an inherent latency advantage over the GPU.
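The latency gap above can be made concrete with a toy timing model (all numbers are illustrative assumptions: each of the 10 stages takes 1 time unit, and packets arrive 1 time unit apart):

```python
# Toy latency model for a 10-step packet-processing task.
STAGES = 10

def pipeline_latency(arrival):
    # FPGA-style pipeline: a packet enters as soon as it arrives and
    # exits STAGES time units later, independent of other packets.
    return arrival + STAGES

def simd_batch_latency(arrival, batch_size, inter_arrival=1):
    # GPU-style SIMD batch: processing cannot start until the last
    # packet of the batch has arrived, so early packets wait.
    last_arrival = arrival + (batch_size - 1) * inter_arrival
    return last_arrival + STAGES

# Completion time of the first packet when packets trickle in one by one:
print(pipeline_latency(0))        # pipeline: done at t=10
print(simd_batch_latency(0, 10))  # SIMD batch of 10: done at t=19
```

In this model the first packet waits 9 extra time units in the SIMD case just to fill the batch; the pipeline never makes a packet wait for its neighbors, which is exactly the streaming-latency advantage described above.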
An ASIC is the best in terms of throughput, latency, and power consumption, but its development cost is high and its design cycle is long. The reprogrammability of an FPGA protects the investment. A data center is rented out to different tenants: if some machines had neural-network accelerator cards, some Bing search accelerator cards, and some network-virtualization accelerator cards, task scheduling and operations would become very troublesome. Using FPGAs keeps the data center homogeneous, since the same board can be reconfigured for whatever acceleration a tenant needs.
For communication-intensive tasks, the FPGA's advantage over the GPU and CPU is even greater.
① Throughput: an FPGA can connect directly to 40 Gbps or 100 Gbps Ethernet and process packets of any size at line rate. A CPU needs a network card to receive the packets first. A GPU can also process packets at high throughput, but it has no network port and still needs a network card, so its throughput is limited by the network card and/or the CPU.
② Latency: the network card transfers data to the CPU, which processes it and hands it back to the network card. On top of that, clock interrupts and task scheduling in the operating system make the latency unstable.
In summary, the main advantage of the FPGA in the data center is its stable and extremely low latency, which makes it suitable for streaming computationally intensive and communication-intensive tasks.
The biggest difference between the FPGA and the GPU lies in their architectures: the FPGA is better suited to low-latency streaming processing, while the GPU is better suited to processing large batches of homogeneous data.
As the Chinese saying goes, "success is due to Xiao He, and failure is also due to Xiao He": the absence of instructions is both the FPGA's strength and its weakness. Every distinct operation it performs consumes some amount of FPGA logic resources. If the task is complex and not highly repetitive, it occupies a large amount of logic, most of which sits idle. In that case, a von Neumann processor is the better choice.
The FPGA and CPU should work together: tasks with strong locality and repetitiveness go to the FPGA, and complex tasks go to the CPU.