C++ - Performance of AVX/SSE assembly vs. intrinsics
I'm trying to find the optimum approach for optimizing some basic routines. In this case I tried a very simple example of multiplying two float vectors together:
void mul(float *src1, float *src2, float *dst)
{
    for (int i = 0; i < cnt; i++) dst[i] = src1[i] * src2[i];
}
The plain C implementation is slow. I did an external ASM version using AVX and also tried using intrinsics. These are the test results (time, smaller is better):
asm:         0.110
ipp:         0.125
intrinsics:  0.18
plain C++:   4.0
(Compiled using MSVC 2013 with SSE2; I also tried the Intel compiler, and the results were pretty much the same.)
As you can see, my ASM code even beat the Intel Performance Primitives (probably because I did lots of branches to make sure I can use the AVX aligned instructions). But I'd personally like to use the intrinsic approach; it's easier to manage, and I was thinking the compiler should do the best job of optimizing all the branches and such (my ASM code sucks in that matter imho, yet it is faster). So here's the code using intrinsics:
int i;

// Scalar prologue: advance until dst is 32-byte aligned
for (i = 0; (minteger)(dst + i) % 32 != 0 && i < cnt; i++)
    dst[i] = src1[i] * src2[i];

if ((minteger)(src1 + i) % 32 == 0) {
    if ((minteger)(src2 + i) % 32 == 0) {
        // both sources aligned
        for (; i < cnt - 8; i += 8) {
            __m256 x = _mm256_load_ps(src1 + i);
            __m256 y = _mm256_load_ps(src2 + i);
            __m256 z = _mm256_mul_ps(x, y);
            _mm256_store_ps(dst + i, z);
        }
    } else {
        // only src1 aligned
        for (; i < cnt - 8; i += 8) {
            __m256 x = _mm256_load_ps(src1 + i);
            __m256 y = _mm256_loadu_ps(src2 + i);
            __m256 z = _mm256_mul_ps(x, y);
            _mm256_store_ps(dst + i, z);
        }
    }
} else {
    // neither source aligned
    for (; i < cnt - 8; i += 8) {
        __m256 x = _mm256_loadu_ps(src1 + i);
        __m256 y = _mm256_loadu_ps(src2 + i);
        __m256 z = _mm256_mul_ps(x, y);
        _mm256_store_ps(dst + i, z);
    }
}

// Scalar epilogue: remaining elements
for (; i < cnt; i++)
    dst[i] = src1[i] * src2[i];
It's simple: first I advance until the dst address is aligned to 32 bytes, then I branch on whether the sources are aligned as well.
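For reference, minteger above looks like a pointer-sized integer typedef; a portable way to write the same alignment test (an assumption, not the definition actually used here) would be:

#include <cstdint>

// Hypothetical portable equivalent of the (minteger)(ptr) % 32 test
inline bool is_aligned_32(const void *p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}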
One problem is that the C++ parts at the beginning and at the end do not use AVX unless I enable AVX in the compiler, which I do NOT want, because this should only be an AVX specialization; the software should still work on platforms where AVX is not available. And sadly there seem to be no intrinsics for instructions such as vmovss, so there's probably a penalty for mixing AVX code with the SSE code the compiler emits. Even when I did enable AVX in the compiler, I still didn't get below 0.14.
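For the mixing penalty specifically, a minimal sketch (only the fully aligned case, names and signature are illustrative) of clearing the upper YMM state with _mm256_zeroupper() before falling back to the scalar SSE tail:

#include <immintrin.h>

// Sketch only: aligned-everything inner loop, with _mm256_zeroupper()
// issued before the scalar tail so the legacy-SSE code that follows
// does not pay the AVX->SSE state-transition penalty.
void mul_avx_aligned(const float *src1, const float *src2, float *dst, int cnt)
{
    int i;
    for (i = 0; i + 8 <= cnt; i += 8) {
        __m256 x = _mm256_load_ps(src1 + i);
        __m256 y = _mm256_load_ps(src2 + i);
        _mm256_store_ps(dst + i, _mm256_mul_ps(x, y));
    }
    _mm256_zeroupper();      // clear upper halves of the YMM registers
    for (; i < cnt; i++)     // scalar tail compiles as plain SSE
        dst[i] = src1[i] * src2[i];
}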
Any ideas how to optimize this so the intrinsics reach the speed of the ASM code?
Your implementation with intrinsics is not the same function as your implementation in straight C: e.g. if your function were called with the arguments mul(p, p, p+1), you'd get different results. The pure C version is slow because the compiler is ensuring that the code does exactly what you said.
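To make the aliasing point concrete (a hypothetical call, assuming cnt == 3): with overlapping arguments the scalar loop must finish each store before the next load, so a vectorized version would produce different values.

float p[4] = {1, 2, 3, 4};
mul(p, p, p + 1);               // dst overlaps both sources
// Scalar semantics: p[1] = p[0]*p[0] = 1, then p[2] = p[1]*p[1] = 1,
// then p[3] = p[2]*p[2] = 1  ->  p == {1, 1, 1, 1}.
// A vectorized version that loads all of p before storing anything
// would give p == {1, 1, 4, 9} instead.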
If you want the compiler to make optimizations based on the assumption that the three arrays do not overlap, you need to make that explicit:
void mul(float *src1, float *src2, float *__restrict__ dst)
or better
void mul(const float *src1, const float *src2, float *__restrict__ dst)
(I think it's enough to have __restrict__ on the output pointer, although it wouldn't hurt to add it to the input pointers too.)
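As a sketch of what that looks like (taking cnt as a parameter here, which differs from the question's code where it is an outer variable), the no-overlap promise lets an optimizing compiler vectorize the plain loop itself:

// Sketch: with __restrict__ the compiler may auto-vectorize this loop.
// (MSVC spells the keyword __restrict; GCC/Clang accept __restrict__.)
void mul(const float *src1, const float *src2, float *__restrict__ dst, int cnt)
{
    for (int i = 0; i < cnt; i++)
        dst[i] = src1[i] * src2[i];
}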