
Open Source Software Optimization - Using the CPU SIMD Acceleration Technology to Optimize Software Performance

Aug 25, 2021

The following analyzes the Go open-source community's optimization of generic string comparison on the ARM64 platform and describes how the single instruction, multiple data (SIMD) parallel technology can be used to speed up software execution.

SIMD is a hardware acceleration technology in which one instruction performs computation on multiple pieces of data at the same time. It can effectively increase CPU throughput and is mainly used in compute-intensive fields with low data dependency, such as multimedia processing, mathematical computing, and AI.

Go is a fast, statically typed, open-source language for building simple, reliable, and efficient software. Its standard library covers many areas, such as math, encryption and decryption, imaging, and encoding/decoding. In hot paths where performance matters and the compiler cannot generate optimal code, the Go standard library drops down to hand-written assembly, and the most important technique there is SIMD.

1. Performance Problem in Character String Comparison

In business systems across industries, string comparison is everywhere: verifying a user name or ID at login, matching product IDs when placing an order, checking a ticket number when traveling. A string is essentially a byte array (in Go, it can be expressed as []byte), so comparing two strings amounts to comparing the corresponding bytes of two arrays. See the following comparison function:

func EqualBytes(a, b []byte) bool {
    if len(a) != len(b) {       // If the lengths are different, false is returned.
        return false
    }
    for i := range a {
        if a[i] != b[i] {       // Compare each byte of a and b in sequence.
            return false
        }
    }
    return true
}

The logic looks correct, but is that enough? Can it meet the performance requirements of every application scenario? This article answers that question with a performance test. The following data was obtained with the Go benchmark framework:

| Case Name | Case Description | Iterations | Time per Operation | Throughput |
|---|---|---|---|---|
| BenchmarkEqual/0-8 | Compares 0-byte strings on the 8-core system. | 330669548 | 3.64 ns/op | - |
| BenchmarkEqual/1-8 | Compares 1-byte strings on the 8-core system. | 227632882 | 5.27 ns/op | 189.74 MB/s |
| BenchmarkEqual/6-8 | Compares 6-byte strings on the 8-core system. | 132229749 | 9.09 ns/op | 660.35 MB/s |
| BenchmarkEqual/9-8 | Compares 9-byte strings on the 8-core system. | 100000000 | 10.1 ns/op | 893.80 MB/s |
| BenchmarkEqual/15-8 | Compares 15-byte strings on the 8-core system. | 83173801 | 14.4 ns/op | 1041.32 MB/s |
| BenchmarkEqual/16-8 | Compares 16-byte strings on the 8-core system. | 79955283 | 15.0 ns/op | 1069.79 MB/s |
| BenchmarkEqual/20-8 | Compares 20-byte strings on the 8-core system. | 67353938 | 17.8 ns/op | 1124.26 MB/s |
| BenchmarkEqual/32-8 | Compares 32-byte strings on the 8-core system. | 45706566 | 26.2 ns/op | 1219.49 MB/s |
| BenchmarkEqual/4K-8 | Compares 4 KB strings on the 8-core system. | 421956 | 2844 ns/op | 1440.18 MB/s |
| BenchmarkEqual/4M-8 | Compares 4 MB strings on the 8-core system. | 334 | 3496666 ns/op | 1199.52 MB/s |
| BenchmarkEqual/64M-8 | Compares 64 MB strings on the 8-core system. | 18 | 66481026 ns/op | 1009.44 MB/s |

Note: ns/op indicates the time (in nanoseconds) spent in executing a function each time. MB/s indicates the number of megabytes processed per second.
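Numbers like these come from Go's benchmark framework. As a minimal, self-contained sketch (the function name equalBytes and the 4 KB input size are illustrative; testing.Benchmark is used so the benchmark runs outside a _test.go file):

```go
package main

import (
	"bytes"
	"fmt"
	"testing"
)

// equalBytes mirrors the byte-wise comparison function shown earlier.
func equalBytes(a, b []byte) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	// testing.Benchmark runs a benchmark programmatically and returns its result.
	res := testing.Benchmark(func(b *testing.B) {
		x := bytes.Repeat([]byte{'x'}, 4096)
		y := bytes.Repeat([]byte{'x'}, 4096)
		b.SetBytes(int64(len(x))) // enables the MB/s throughput column
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			equalBytes(x, y)
		}
	})
	fmt.Println(res) // prints iterations, ns/op, and MB/s for this machine
}
```

In the standard workflow this would instead be a `BenchmarkXxx` function in a `_test.go` file, run with `go test -bench`.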

As the table shows, the time consumed grows significantly with the data volume; at 4 MB, a single comparison takes about 3.5 ms (3496666 ns/op). For such a fundamental operation, that is poor performance.

2. SIMD Optimization Solution for Character String Comparison in Go Language Community

Is there a way to optimize the performance of string comparison? The following demystifies SIMD optimization through the string-comparison optimization case from the Go community.

The optimization patch is available on the Go community's official code review site.

[Figure: the optimization patch on the Go community code review site]

Assembly code is used both before and after the optimization. To make the code easier to follow, the following figure compares the Go code example with the pre-optimization Equal assembly code.

[Figure: Go code compared with the Equal assembly code before optimization]

As shown in the figure, the implementation logic of the two functions corresponds line by line. The single line of Go code a[i] != b[i] requires four assembly instructions:

  1. Two load instructions (MOVBU) save one byte each from slices a and b into registers.
  2. A compare (CMP) instruction compares the two registers and updates the status register according to the result.
  3. A Branch if EQual (BEQ) instruction branches back to loop if the status register indicates equality, that is, the comparison continues while the bytes are equal.

Now we can analyze the SIMD-optimized Equal. The core idea is to process multiple bytes of data with a single instruction, which greatly reduces the number of load and compare instructions, makes full use of the SIMD computing units, and improves performance.
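SIMD registers cannot be used from portable Go, but the same core idea, comparing many bytes per operation, can be illustrated in pure Go with 8-byte word loads. This is an illustrative sketch, not the community's actual implementation; the function name equalWords is ours:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// equalWords compares slices 8 bytes at a time, falling back to byte-wise
// comparison for the tail. The SIMD version applies the same blocking idea,
// only with 16- to 64-byte vector registers instead of 8-byte words.
func equalWords(a, b []byte) bool {
	if len(a) != len(b) {
		return false
	}
	i := 0
	for ; i+8 <= len(a); i += 8 {
		// One 64-bit comparison replaces eight byte comparisons.
		if binary.LittleEndian.Uint64(a[i:]) != binary.LittleEndian.Uint64(b[i:]) {
			return false
		}
	}
	for ; i < len(a); i++ {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(equalWords([]byte("hello, simd!"), []byte("hello, simd!"))) // true
	fmt.Println(equalWords([]byte("hello, simd!"), []byte("hello, simD!"))) // false
}
```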

The following figure compares the assembly code before and after SIMD optimization. The pre-optimization code is very simple; once the SIMD instructions are introduced, the code becomes complex. At this stage it is enough to understand the overall principle (learning the details is not yet necessary); see Code Description Before Optimization. The main reason the code grows complex is blocking: the array is processed in 64-byte blocks in a loop, and when fewer than 64 bytes remain at the end, the remainder is processed in 16-byte blocks and then in ever smaller blocks until the remaining length is 1. The following figure shows the comparison before and after optimization and the block processing rules after optimization.

[Figure: assembly code before and after SIMD optimization, and the block processing rules after optimization]

3. Code Description Before Optimization

Before optimization, the code cyclically obtains one byte from each of the two arrays and compares them. Each byte requires two load operations, one compare operation, and one end-of-array check, as shown below.

// func Equal(a, b []byte) bool
TEXT bytes·Equal(SB),NOSPLIT,$0-49
//----------Data loading----------
    // Obtain the data from the stack to the register.
    // Compare array lengths. If the array lengths are different, 0 is returned.
    MOVD a_len+8(FP), R1        // Obtain the length of array a.
    MOVD b_len+32(FP), R3      // Obtain the length of array b.
    CMP R1, R3                         // Compare the array lengths.
    BNE notequal                      // If the array lengths are different, branch to notequal.
    MOVD a+0(FP), R0              // Load the address of array a to the general register R0.
    MOVD b+24(FP), R2            // Load the address of array b to the general register R2.
    ADD R0, R1                         // Save the end address of array a to the general register R1.
//-----------------------------
//------Array cyclic comparison operation------
loop:
    CMP R0, R1                         // Check whether R0 has reached the end of array a.
    BEQ equal                           // If the end is reached, all bytes so far were equal; branch to the equal label.
    MOVBU.P 1(R0), R4             // Load a byte from the array a to the general register R4.
    MOVBU.P 1(R2), R5             // Load a byte from the array b to the general register R5.
    CMP R4, R5                         // Compare the values of R4 and R5.
    BEQ loop                             // If the values are the same, continue the next round of loop operations.
//-----------------------------
//----------Not equal----------
notequal:
    MOVB ZR, ret+48(FP)          // If the arrays are different, 0 is returned.
    RET
//-----------------------------
//-------------Equal----------
equal:
    MOVD $1, R0                       // If the arrays are the same, 1 is returned.
    MOVB R0, ret+48(FP)
    RET
//-----------------------------

4. Code Description After Optimization

The optimized code is loop-unrolled, so it looks complex, but the logic is clear: the array is divided into blocks of 64, 16, 8, 4, 2, or 1 bytes, and multiple vector registers are used so that a single SIMD instruction can process up to 16 bytes at a time, which also reduces the number of boundary checks. The assembly code is interpreted as follows (comments are added for the key instructions):

// The function parameters are passed through registers.
// The caller of memeqbody places the parameters in the following registers:
// R0: stores the address of array a.
// R1: stores the length of the arrays.
// R2: stores the address of array b.
// R8: stores the address where the comparison result is written.
TEXT runtime·memeqbody<>(SB),NOSPLIT,$0
//----------Determine the array length.--------------------------
// Determine the block to be processed based on the array length.
    CMP    $1, R1
    BEQ    one
    CMP    $16, R1
    BLO    tail
    BIC    $0x3f, R1, R3
    CBZ    R3, chunk16
    ADD    R3, R0, R6

//----------Process 64-byte blocks.----------
// Process 64-byte blocks cyclically.
chunk64_loop:
// Load the data blocks pointed to by R0 and R2 into the SIMD vector registers, post-incrementing the R0 and R2 pointers by 64 bytes.
    VLD1.P (R0), [V0.D2, V1.D2, V2.D2, V3.D2]
    VLD1.P (R2), [V4.D2, V5.D2, V6.D2, V7.D2]
// Use SIMD compare instructions. Each instruction compares 128 bits, that is, 16 bytes. The comparison results are stored in registers V8-V11.
    VCMEQ  V0.D2, V4.D2, V8.D2
    VCMEQ  V1.D2, V5.D2, V9.D2
    VCMEQ  V2.D2, V6.D2, V10.D2
    VCMEQ  V3.D2, V7.D2, V11.D2
// AND the comparison results together with SIMD VAND instructions, accumulating the combined result in the V8 register.
    VAND   V8.B16, V9.B16, V8.B16
    VAND   V8.B16, V10.B16, V8.B16
    VAND   V8.B16, V11.B16, V8.B16
// Check whether another 64-byte block remains; if so, continue the loop.
// Also check whether the blocks are equal; if not, branch to not_equal and return.
    CMP    R0, R6                             // Compare R0 with R6 and set the flag bits; this pairs with the BNE below.
    VMOV   V8.D[0], R4
    VMOV   V8.D[1], R5                   // Move the combined result from the V8 register into R4 and R5.
    CBZ    R4, not_equal
    CBZ    R5, not_equal                   // VCMEQ produces all-ones lanes only for equal data, so a zero in R4 or R5 means not equal; branch to not_equal.
    BNE    chunk64_loop                  // The flags show R0 != R6, so another 64-byte block remains; loop back to chunk64_loop.
    AND    $0x3f, R1, R1                   // Keep only the low six bits of R1, i.e. the tail of fewer than 64 bytes.
    CBZ    R1, equal                         // If R1 is 0, branch to equal; otherwise fall through.

...............................................
...............................................

//----------Process the block whose length is 16 bytes cyclically.----------
chunk16_loop:
    VLD1.P (R0), [V0.D2]
    VLD1.P (R2), [V1.D2]
    VCMEQ    V0.D2, V1.D2, V2.D2
    CMP R0, R6
    VMOV V2.D[0], R4
    VMOV V2.D[1], R5
    CBZ R4, not_equal
    CBZ R5, not_equal
    BNE chunk16_loop
    AND $0xf, R1, R1
    CBZ R1, equal
//-----Process blocks whose end length is less than 16, 8, 4, or 2 bytes.-----
tail:
    TBZ $3, R1, lt_8
    MOVD.P 8(R0), R4
    MOVD.P 8(R2), R5
    CMP R4, R5
    BNE not_equal

lt_8:
    TBZ $2, R1, lt_4
    MOVWU.P 4(R0), R4
    MOVWU.P 4(R2), R5
    CMP R4, R5
    BNE not_equal

lt_4:
    TBZ $1, R1, lt_2
    MOVHU.P 2(R0), R4
    MOVHU.P 2(R2), R5
    CMP R4, R5
    BNE not_equal

lt_2:
    TBZ     $0, R1, equal

one:
    MOVBU (R0), R4
    MOVBU (R2), R5
    CMP R4, R5
    BNE not_equal
//------------------The value 1 is returned if the two values are the same.----------------
equal:
    MOVD $1, R0
    MOVB R0, (R8)
    RET
//----------------If the two values are different, 0 is returned.----------------
not_equal:
    MOVB ZR, (R8)
    RET

In the optimized code above, the load instruction VLD1 loads 64 bytes into SIMD registers at a time, and the equality-compare instruction VCMEQ then compares the contents of the SIMD registers to produce the result. Compared with the traditional byte-by-byte approach, this greatly improves comparison performance for 64-byte blocks. For data blocks of 16 to 63 bytes, one SIMD register processes 16 bytes at a time. For blocks smaller than 16 bytes, general-purpose registers hold the data, and blocks of 8, 4, 2, and finally 1 byte are compared at a time.
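The blocking scheme above can be mirrored in pure Go. The sketch below (with a hypothetical function equalBlocks, using 8-byte words in place of vector registers) shows the structure: after the large-block loop, the bits of the remaining length select the 8-, 4-, 2-, and 1-byte steps, which is exactly the role the TBZ instructions play in the assembly:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// equalBlocks mirrors the blocking structure of the optimized assembly.
func equalBlocks(a, b []byte) bool {
	if len(a) != len(b) {
		return false
	}
	// 16-byte blocks, like chunk16_loop (two 64-bit loads per iteration).
	for len(a) >= 16 {
		if binary.LittleEndian.Uint64(a) != binary.LittleEndian.Uint64(b) ||
			binary.LittleEndian.Uint64(a[8:]) != binary.LittleEndian.Uint64(b[8:]) {
			return false
		}
		a, b = a[16:], b[16:]
	}
	// Tail: test each bit of the remaining length, as TBZ does.
	if len(a)&8 != 0 {
		if binary.LittleEndian.Uint64(a) != binary.LittleEndian.Uint64(b) {
			return false
		}
		a, b = a[8:], b[8:]
	}
	if len(a)&4 != 0 {
		if binary.LittleEndian.Uint32(a) != binary.LittleEndian.Uint32(b) {
			return false
		}
		a, b = a[4:], b[4:]
	}
	if len(a)&2 != 0 {
		if binary.LittleEndian.Uint16(a) != binary.LittleEndian.Uint16(b) {
			return false
		}
		a, b = a[2:], b[2:]
	}
	if len(a)&1 != 0 && a[0] != b[0] {
		return false
	}
	return true
}

func main() {
	// 19 bytes: one 16-byte block, then a 2-byte and a 1-byte tail step.
	fmt.Println(equalBlocks([]byte("0123456789abcdefXYZ"), []byte("0123456789abcdefXYZ"))) // true
	fmt.Println(equalBlocks([]byte("0123456789abcdefXYZ"), []byte("0123456789abcdefXY!"))) // false
}
```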

5. Hands-on Labs

Developers who are interested can run the following commands to experience the power of SIMD for themselves.

  • Environment preparation
  1. Hardware configuration: Kunpeng (ARM64) Linux ECS, general computing-plus KC1, kc1.2xlarge.2 (8-core | 16 GB)
  2. Go release >= 1.12.1. For details about preparing the development environment, see Configuring Go in the ARM64 Development Environment.
  3. The Go source code repository on GitHub. Git must be installed for version control.
  • Procedure

The following operations walk through the entire compile-and-test process on a Kunpeng server. The two commits before and after the optimization have been identified: the commit ID before the optimization is 0c68b79, and the one after is 78ddf27. Perform the following operations:

# Find a directory for storing the Go source code repository, for example, /usr/local/src/exp.
mkdir /usr/local/src/exp
cd /usr/local/src/exp
# Use the git tool to pull the code repository of Go on the GitHub code hosting platform.
git clone https://github.com/golang/go
# Access the source code directory.
cd /usr/local/src/exp/go/src
# Create a new branch before-simd based on the submission record 0c68b79 before optimization. This branch contains the version before optimization.
git checkout -b before-simd 0c68b79
# Switch to the before-simd branch. At this time, the code file in the directory has been changed to the version before optimization.
git checkout before-simd
# Compile the Go source code to generate the Go development environment.
bash make.bash
# Set the current directory to the GOROOT directory.
export GOROOT=/usr/local/src/exp/go
# Run the Go benchmark instruction to test the performance and record the test result in the before-simd-bench.txt file.
go test bytes -v -bench ^BenchmarkEqual$ -count=5 -run  ^$ > before-simd-bench.txt
# Create a new branch after-simd based on the optimized submission record 78ddf27. This branch contains the optimized version.
git checkout -b after-simd 78ddf27
# Switch to the after-simd branch. The code file in the directory has been optimized.
git checkout after-simd
# Compile the Go source code again to generate the Go development environment.
bash make.bash
# Run the Go benchmark instruction to test the performance and record the test result in the after-simd-bench.txt file.
go test bytes -v -bench ^BenchmarkEqual$ -count=5 -run  ^$ > after-simd-bench.txt
# benchstat comparison result
benchstat before-simd-bench.txt after-simd-bench.txt
# Finally, you can view the code change based on the submission record 78ddf27.
git show 78ddf27

6. Optimization Result

Record the data before and after the optimization with the Go benchmark tool, as described in the "Hands-on Labs" section, then compare the two results with Go benchstat. The result is shown in the following table:

| Case Name | Case Definition | Time/Op Before | Time/Op After | Time Delta | Throughput Before | Throughput After | Throughput Delta |
|---|---|---|---|---|---|---|---|
| Equal/1-8 | Compares 1-byte strings on the 8-core system. | 5.43ns ±0% | 5.41ns ±0% | -0.26% (p=0.048 n=5+5) | 184MB/s ±0% | 185MB/s ±0% | ~ (p=0.056 n=5+5) |
| Equal/6-8 | Compares 6-byte strings on the 8-core system. | 10.0ns ±0% | 6.2ns ±0% | -38.30% (p=0.008 n=5+5) | 598MB/s ±0% | 972MB/s ±0% | +62.56% (p=0.008 n=5+5) |
| Equal/9-8 | Compares 9-byte strings on the 8-core system. | 12.3ns ±0% | 6.6ns ±0% | -46.67% (p=0.029 n=4+4) | 729MB/s ±0% | 1373MB/s ±0% | +88.35% (p=0.008 n=5+5) |
| Equal/15-8 | Compares 15-byte strings on the 8-core system. | 17.0ns ±0% | 7.4ns ±2% | -56.20% (p=0.008 n=5+5) | 881MB/s ±0% | 2015MB/s ±2% | +128.78% (p=0.008 n=5+5) |
| Equal/16-8 | Compares 16-byte strings on the 8-core system. | 17.8ns ±0% | 5.8ns ±0% | -67.36% (p=0.008 n=5+5) | 901MB/s ±0% | 2755MB/s ±0% | +205.94% (p=0.016 n=4+5) |
| Equal/20-8 | Compares 20-byte strings on the 8-core system. | 20.9ns ±0% | 8.0ns ±0% | -61.75% (p=0.000 n=5+4) | 956MB/s ±0% | 2502MB/s ±0% | +161.84% (p=0.016 n=5+4) |
| Equal/32-8 | Compares 32-byte strings on the 8-core system. | 30.4ns ±0% | 7.6ns ±0% | -74.85% (p=0.008 n=5+5) | 1.05GB/s ±0% | 4.19GB/s ±0% | +297.49% (p=0.008 n=5+5) |
| Equal/4K-8 | Compares 4 KB strings on the 8-core system. | 3.18µs ±0% | 0.21µs ±0% | -93.49% (p=0.000 n=5+4) | 1.29GB/s ±0% | 19.70GB/s ±0% | +1428.63% (p=0.008 n=5+5) |
| Equal/4M-8 | Compares 4 MB strings on the 8-core system. | 3.69ms ±1% | 0.65ms ±4% | -82.36% (p=0.008 n=5+5) | 1.14GB/s ±1% | 6.45GB/s ±4% | +466.96% (p=0.008 n=5+5) |
| Equal/64M-8 | Compares 64 MB strings on the 8-core system. | 63.3ms ±0% | 17.4ms ±5% | -72.54% (p=0.008 n=5+5) | 1.06GB/s ±0% | 3.86GB/s ±5% | +264.45% (p=0.008 n=5+5) |

Note: ns/op indicates the time (in nanoseconds) spent executing the function once; ms/op indicates the time in milliseconds. MB/s indicates the number of megabytes processed per second; GB/s indicates gigabytes per second.

The preceding table shows that every test case improves after SIMD optimization. The 4 KB case improves the most: the time per operation drops by 93.49%, and the data processed per second increases by about 14.3 times.