This site uses cookies. By continuing to browse the site you are agreeing to our use of cookies. Read our privacy policy

SIMD Optimization of ARM from the Perspective of Boolean Extremum Solution Algorithm

Aug 25, 2021

Abstract: A real case (Boolean extremum solution algorithm of NumPy) is used to describe the process of optimizing the algorithm using NEON and some key techniques. Compared with the C code optimization using a compiler, the performance is improved by about 80%. The following content is very helpful for developers who need to use NEON to implement high-performance computing.

Hardware configuration: Kunpeng (ARM64) server

  1. Solving the Extremum of Boolean Array

For a one-dimensional vector, NumPy provides an argmax function to solve the index corresponding to the maximum value of elements in the array.

import numpy as np

a = np.array([3, 1, 2, 4, 6, 1])

# Obtain the index corresponding to the maximum value of the element in a. In this case, the maximum value is 6, and the corresponding position index value is 4. (The index value starts from 0 by default.)

b=np.argmax(a)

print(b)

If the vectors are all composed of 0s and 1s, then the value returned by argmax is the index of the first 1 appearing, which is the Boolean vector we are going to introduce.

  1. C Language Standard Implementation of Bool_argmax in NumPy
static int
BOOL_argmax(npy_bool *ip, npy_intp n, npy_intp *max_ind,
            `PyArrayObject *NPY_UNUSED(aip))

{
    npy_intp i = 0;
    for (; i < n; i++) {
        if (ip[i]) {
            *max_ind = i;
            return 0;
        }
    }
    *max_ind = 0;
    return 0;
}

The basic idea of the algorithm is simple. The one-dimensional array IP is scanned linearly. If the value corresponding to the position is 1, the maximum index is saved to max_ind. For an array with a small data volume, it is efficient. However, in big data processing, a large number of 0 values exist, requiring a high solution performance of this function. How should the performance be optimized? Using the unique Neon parallel instruction set on the ARM platform seems to improve the computing efficiency (officially, four times). It does not hurt to try.

  1. NEON Intrinsics Implementation of Bool_argmax
int32_t sign_mask(uint8x16_t input)
{
    const int8_t __attribute__ ((aligned (16))) xr[8] = {-7,-6,-5,-4,-3,-2,-1,0};
    uint8x8_t mask_and = vdup_n_u8(0x80);
    int8x8_t mask_shift = vld1_s8(xr);

    uint8x8_t lo = vget_low_u8(input);
    uint8x8_t hi = vget_high_u8(input);

    lo = vand_u8(lo, mask_and);
    lo = vshl_u8(lo, mask_shift);

    hi = vand_u8(hi, mask_and);
    hi = vshl_u8(hi, mask_shift);

    lo = vpadd_u8(lo,lo);
    lo = vpadd_u8(lo,lo);
    lo = vpadd_u8(lo,lo);

    hi = vpadd_u8(hi,hi);
    hi = vpadd_u8(hi,hi);
    hi = vpadd_u8(hi,hi);

    return ((hi[0] << 8) | (lo[0] & 0xFF));
}

static int
BOOL_argmax(npy_bool *ip, npy_intp n, npy_intp *max_ind,
            PyArrayObject *NPY_UNUSED(aip))

{
    npy_intp i = 0;
    #if defined(__ARM_NEON__) || defined (__ARM_NEON)
        uint8x16_t zero = vdupq_n_u8(0);
        for(; i < n - (n % 32); i+=32) {
            uint8x16_t d1 = vld1q_u8((char *)&ip[i]);
            uint8x16_t d2 = vld1q_u8((char *)&ip[i + 16]);
            d1 = vceqq_u8(d1, zero);
            d2 = vceqq_u8(d2, zero);
            if(sign_mask(vminq_u8(d1, d2)) != 0xFFFF) {
                break;
            }
        }
 #endif
    for (; i < n; i++) {
        if (ip[i]) {
            *max_ind = i;
            return 0;
        }
    }
    *max_ind = 0;
    return 0;
}

There are two key points for the implementation:

  • The SIMD instruction can process at most 128-bit data, that is, 16 pieces of char(short) data. Therefore, more than 32 elements need to be processed by using C language. In this case, the SIMD and C implementation coexist.
  • The data assembling principle is as follows: Two groups of data are obtained from the array each time. Each group of data contains 16 pieces of char data. The mask operation is performed on the two groups of data. If one group of data contains 1, the operation result is not 0xFFFF, and the loop directly breaks and exits.

After the implementation, check the effect. The benchmark test result of NumPy is as follows:

       before           after         ratio
     [3f11db40]       [00b21d1b]
     <master>         <neon-argmax>
-       161±0.3μs       47.7±0.5μs     0.30  bench_reduce.ArgMax.time_argmax(<class 'bool'>)

The optimization effect is obvious with performance improved by 70%. Is it over here? After the code is submitted to the community, some old hands who are familiar with OpenCV put forward that the performance can still be further optimized. Did you notice that too many instructions are used in the sign_mask function? Actually, the mask operation does not require so many low-order and high-order byte accumulate and shift operations, which can be simplified into the following code:

int32_t _mm_movemask_epi8_neon(uint8x16_t input)
{
    int8x8_t m0 = vcreate_s8(0x0706050403020100ULL);
    uint8x16_t v0 = vshlq_u8(vshrq_n_u8(input, 7), vcombine_s8(m0, m0));
    uint64x2_t v1 = vpaddlq_u32(vpaddlq_u16(vpaddlq_u8(v0)));
    return (int)vgetq_lane_u64(v1, 0) + ((int)vgetq_lane_u64(v1, 1) << 8);
}

After the optimization, the performance is improved by 8%. So far, the community finally accepts the submission.