
How To Write A Maths Library In 2016
February 19th, 2016

Historical wisdom says that if you're writing a 3D vector maths library, you generally shouldn't base it around a SIMD data type. You'd be very tempted to, because obviously a SIMD type can do 4 adds for the price of one. In practice this has always been a mixed blessing. While you can indeed do 4 adds, the new type doesn't usually behave nicely when it comes to mixing with the rest of the codebase.

In order to get data in and out of the SIMD registers, or to pass the registers around to different functions, you'll usually find that you'll spend more instructions actually doing loads and stores than you saved by switching to SIMD.

Intel's SSE implementation is a particular troublemaker. On 32-bit x86, regular floating point maths traditionally uses the x87 FP registers, which are separate from the SSE registers. This means that if you want to move a float into a __m128, the compiler has to store it out to memory and then reload it into an SSE register. This completely nerfs any chance you had at getting good maths performance.

And so the advice usually given is to forget it: just stick with a simple struct with 3 floats in. SIMD optimization is best saved for a few special cases where you can write tight loops over specially organized data, such as CPU vertex skinning, or a cloth simulation perhaps. But don't use it for your regular 3D math library, because it'll probably end up slower.


Yet facts are not written in stone. What was true 10 years ago may not be true today. Advances in compiler technology mean we have to re-evaluate our beliefs every few years. I'm going to mention 3 particular advances here that change this:

  1. Scalar SSE by default
    When I said earlier that Intel's FP registers are separate from the SSE registers, that's missing part of the picture. You see, it turns out that the SSE registers aren't just for SIMD: you can do regular floating point arithmetic with them too, at no cost. This means you don't ever need to transfer floats back between FP and SSE registers - you just keep them in SSE registers the whole time! Most compilers have been able to take advantage of this for many years now, but it may require turning on an extra option which some people don't realize is there.

    • On Visual Studio, enable /arch:SSE2 by default. Do it on both release and debug builds. (not required on 64-bit as it's already the default)
  2. SSA form
    In the past 10 years most of the major compiler vendors have moved to SSA (Static Single Assignment) representations for their intermediate code, which means that data flow is tracked far more precisely throughout the whole compiler. Compilers now take any chance they can get to avoid redundant data transfer.

  3. vectorcall
    This is the key. The default calling convention for a function is "cdecl" (or "fastcall" on 64-bit). These conventions don't really care much for SSE registers; if you call a function that takes SSE values as arguments, here's what happens:

    • The SSE register gets written out onto the stack
    • The address of the SSE register gets written out onto the stack
    • The function gets called
    • The address of the SSE register gets read off the stack
    • The SSE register gets read off the stack

    Yeerugh. That's horrible. That's 4 extra steps per function argument. Surely there's something better we can do?

    In 2013 Microsoft introduced the Vector Calling Convention to address this. The idea of vectorcall is that SIMD values stay in SSE registers the whole time - they never need to go back and forth to memory (except of course when you actually finish with them).

    You can go through your codebase and tag every function with the new __vectorcall keyword if you want, but that's kinda stupid. Enable /Gv in the compiler options and it'll magically take hold everywhere.

Note

If you're using C++, /Gv does not apply by default to member functions. You'll have to annotate them manually. This doesn't apply to the ones in the float3 type itself because they're force-inlined.
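
For example (a hypothetical member function, not part of the sample code), the keyword goes between the return type and the name:

struct Transform
{
    // /Gv won't touch member functions, so annotate by hand:
    float3 __vectorcall transformPoint(float3 p) const;
};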


So, how do we put all this together and write our new maths library with it in mind? I've written a small sample application which you can download at the end. First I'll walk you through each section (that-there literate programming I've heard so much about), and then we'll inspect the generated assembly to see how well it does.

You'll note that we aim to follow HLSL syntax whenever possible. HLSL is the one true syntax. There's no place in this world for things like QVector3D::dotProduct() any more. Unfortunately the __m128 type prevents us from using the HLSL swizzle/access syntax, which sucks. It's not the end of the world though, we'll live.


Compiler options

I'm only going to cover Visual Studio here, but the same principles apply to gcc/clang. Note that these options should be applied to both release and debug builds.

And of course, for release builds you'll want to enable optimization (!). I only mention that because occasionally I see new programmers who don't know about it.

There's a bunch more stuff you can tweak, but that's your business. This is the minimum.
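
To recap, the Visual Studio flags discussed throughout this article are:

  • /arch:SSE2 : generate scalar SSE code by default (not needed on 64-bit, where it's already the default)
  • /Gv : make __vectorcall the default calling convention
  • /Ob1 : only inline functions explicitly marked for inlining (pair it with __forceinline)
  • /O2 (or your optimization level of choice) : release builds only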

Headers

The most important header is <xmmintrin.h>, which provides the __m128 SSE type and the _mm_blah() functions which operate on it. You'll also need <math.h> for the regular stuff like sqrt(), and <stdint.h>. When designing libraries, it's important not to pull in headers that you don't need. Forward-declare types when you can.
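
For example, a header that only needs to declare functions taking or returning a float3 can get away with a forward declaration rather than including the maths header (a made-up header, purely to illustrate):

// some_engine_header.h (hypothetical)
struct float3;   // forward declaration is enough for a function declaration
float3 computeSurfaceNormal(float3 a, float3 b, float3 c);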

Many game programmers are used to defining their own integer types, as C++ never used to do it for you. Well, we have <stdint.h> now, so use it. There's no place any more for your own Uint32 types. It's time to move on.

Also note the use of #pragma once. All C++ compilers you care about now support this (and even most of the ones you don't), so you can skip the include guards.

I realise I've strayed off the topic of maths here, but sometimes when you explain things it's good to color over the edges a little.

#pragma once
#include <stdint.h>
#include <math.h>
#include <xmmintrin.h>

Setup

Inlining is critical for a maths library. Not just for performance, but... OK, this is going to be hard to explain. It's about the essence of what's being generated. Many people think that the disassembly of some code is just an artifact - a byproduct of the compilation that you might occasionally take a peep at, if the fancy takes you. Yet the quality of the disassembly is important in its own right. The ability to step and inspect at that low level is critical.

Arbitrary inlining in general can make the disassembly hard to read, so don't rely on the compiler's own heuristics (i.e. guesses) as to whether to inline something. If you want it inlined, mark it as such, otherwise don't. On Visual Studio, set the /Ob1 option to achieve this, and then use __forceinline for anything you want inlined. Do this even in debug builds.

For tiny functions, inlining actually provides a clearer disassembly, and cleaner code flow. If your function is more than 4 lines however, you probably shouldn't have inlined it. Calls are cheap, especially now that we're using our lovely vectorcall to avoid all the vector-passing-baggage we used to have.

We'll also define a few helper macros here. We could include <float.h> and <limits.h> to get INT_MAX/FLT_MAX, but seeing as it's just a few lines we may as well define them ourselves.

#define VM_INLINE   __forceinline
#define M_PI        3.14159265358979323846f
#define DEG2RAD(_a) ((_a)*M_PI/180.0f)
#define RAD2DEG(_a) ((_a)*180.0f/M_PI)
#define INT_MIN     (-2147483647 - 1)
#define INT_MAX     2147483647
#define FLT_MAX     3.402823466e+38F

Our Vector Type

First we'll just define a little helper for rearranging values. Ideally we'd just use HLSL swizzles, but that's not an option.

// Shuffle helpers.
// Examples: SHUFFLE3(v, 0,1,2) leaves the vector unchanged.
//           SHUFFLE3(v, 0,0,0) splats the X coord out.
#define SHUFFLE3(V, X,Y,Z) float3(_mm_shuffle_ps((V).m, (V).m, _MM_SHUFFLE(Z,Z,Y,X)))

Now let's define our vector type. We'll just cover float3 here, but in a production library you'd also want float2, float4, and possibly int2-4 types too (although integer types probably don't need to be SIMD).

Always make constructors explicit (except for the default constructor). It'll save you many headaches down the line. Try to be generous in what you provide - remember the goal is to make it easy for the users. I haven't listed it, but you'll probably also want one that takes a float2+Z value and joins them.
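
For reference, that extra constructor might look something like this (just a sketch - it assumes a float2 wrapper in the same style, with x()/y() accessors, which isn't defined in this article):

VM_INLINE explicit float3(float2 xy, float z) { m = _mm_set_ps(z, z, xy.y(), xy.x()); }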

struct float3
{
    // Constructors.
    VM_INLINE float3() {}
    VM_INLINE explicit float3(const float *p) { m = _mm_set_ps(p[2], p[2], p[1], p[0]); }
    VM_INLINE explicit float3(float x, float y, float z) { m = _mm_set_ps(z, z, y, x); }
    VM_INLINE explicit float3(__m128 v) { m = v; }

As I mentioned, the __m128 type doesn't provide any decent direct access to the X/Y/Z components, so we'll need wrappers. I really hate this, but what can you do. It's a small evil so we'll have to just live with it. You might be tempted to try defining the float3 struct as a union instead, but that'll just screw with the compiler's data flow analysis.

    VM_INLINE float x() const { return _mm_cvtss_f32(m); }
    VM_INLINE float y() const { return _mm_cvtss_f32(_mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 1, 1, 1))); }
    VM_INLINE float z() const { return _mm_cvtss_f32(_mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 2, 2, 2))); }

Let's make up for the lack of HLSL swizzles by defining some of the common ones manually:

    VM_INLINE float3 yzx() const { return SHUFFLE3(*this, 1, 2, 0); }
    VM_INLINE float3 zxy() const { return SHUFFLE3(*this, 2, 0, 1); }

Occasionally we might want to store data out to 3 unaligned floats, so let's add something for that.

    VM_INLINE void store(float *p) const { p[0] = x(); p[1] = y(); p[2] = z(); }

Some maths libraries provide writeable accessors using references (e.g. x() = 4;). This is a really stupid idea. References are just going to cause more loads/stores, and if your vector type has that kind of direct access anyway then you should have just exposed the X/Y/Z variables as a simple C struct.

But we do still occasionally need to write to the components. It's actually not as common as you might think - you should almost always prefer to use a float3() constructor to build a new vector, rather than taking an existing one and poking it. Still, let's define some write access functions.

Note that _mm_set_ss here is effectively just a cast - the value is already sitting in an SSE register, so it doesn't generate any actual code. That means setX() compiles down to a single instruction.

    VM_INLINE void setX(float x)
    {
        m = _mm_move_ss(m, _mm_set_ss(x));
    }
    VM_INLINE void setY(float y)
    {
        __m128 t = _mm_move_ss(m, _mm_set_ss(y));
        t = _mm_shuffle_ps(t, t, _MM_SHUFFLE(3, 2, 0, 0));
        m = _mm_move_ss(t, m);
    }
    VM_INLINE void setZ(float z)
    {
        __m128 t = _mm_move_ss(m, _mm_set_ss(z));
        t = _mm_shuffle_ps(t, t, _MM_SHUFFLE(3, 0, 1, 0));
        m = _mm_move_ss(t, m);
    }

Some maths libraries use index notation for all accesses (e.g. float f = v[1];). Don't do this. Use .x(), that's what it's for. The compiler will have a very hard time trying to treat the SIMD type as an array. However there are occasions when you actually do need indexing access, where the index isn't known at compile time. So let's add some access functions. They'll be slow, as they may cause a memory spill, but there we are.

    VM_INLINE float operator[] (size_t i) const { return m.m128_f32[i]; };
    VM_INLINE float& operator[] (size_t i) { return m.m128_f32[i]; };

And finally, our actual storage. We just embed the SSE type directly.

    __m128 m;
};

Important

There's a little subtlety here. In order for __vectorcall to work properly, we have to design our float3 type around this single __m128 type. If you try and mess with it, try and wrap it in a union, or try and stick anything else in this struct, you'll break the whole thing.

We'll also want one extra constructor here - one for constructing a float3 from 3 integers. It's not essential but it can save a lot of manually casting things about. It needs a different name so it doesn't get confused with the regular float3 constructor.

VM_INLINE float3 float3i(int x, int y, int z) { return float3((float)x, (float)y, (float)z); }

Phew! OK that's the first part done. In a sane world that would have just been a built-in primitive in the language, like HLSL has.

Static Construction

Occasionally you need to declare data globally. For float3's, you can just use a global constructor (e.g. float3 g_value(1,2,3);). However some advanced usages might need a little more, so we declare a static construction helper here:

#define VCONST extern const __declspec(selectany)
struct vconstu
{
    union { uint32_t u[4]; __m128 v; };
    VM_INLINE operator __m128() const { return v; }
};

VCONST vconstu vsignbits = { 0x80000000, 0x80000000, 0x80000000, 0x80000000 };

This construct allows us to globally define SIMD data types that we can read as either integers or floating point. This is great for when you need to manually fiddle with the IEEE representation of floats. You can ignore this whole part for now though.

Operators

One of the nicest things about HLSL is that it separates the types from the functions that operate on them. Let's move on to our operators.

typedef float3 bool3;

Normally when you compare two values, you get a bool. SIMD comparison operators don't return a single value however, so we need a bool3 type. For today's lesson I'm just going to make that a float3 too, but if you go nuts with this then you might want to actually implement it as its own type.

Let's define our binary operators:

VM_INLINE float3 operator+ (float3 a, float3 b) { a.m = _mm_add_ps(a.m, b.m); return a; }
VM_INLINE float3 operator- (float3 a, float3 b) { a.m = _mm_sub_ps(a.m, b.m); return a; }
VM_INLINE float3 operator* (float3 a, float3 b) { a.m = _mm_mul_ps(a.m, b.m); return a; }
VM_INLINE float3 operator/ (float3 a, float3 b) { a.m = _mm_div_ps(a.m, b.m); return a; }
VM_INLINE float3 operator* (float3 a, float b) { a.m = _mm_mul_ps(a.m, _mm_set1_ps(b)); return a; }
VM_INLINE float3 operator/ (float3 a, float b) { a.m = _mm_div_ps(a.m, _mm_set1_ps(b)); return a; }
VM_INLINE float3 operator* (float a, float3 b) { b.m = _mm_mul_ps(_mm_set1_ps(a), b.m); return b; }
VM_INLINE float3 operator/ (float a, float3 b) { b.m = _mm_div_ps(_mm_set1_ps(a), b.m); return b; }
VM_INLINE float3& operator+= (float3 &a, float3 b) { a = a + b; return a; }
VM_INLINE float3& operator-= (float3 &a, float3 b) { a = a - b; return a; }
VM_INLINE float3& operator*= (float3 &a, float3 b) { a = a * b; return a; }
VM_INLINE float3& operator/= (float3 &a, float3 b) { a = a / b; return a; }
VM_INLINE float3& operator*= (float3 &a, float b) { a = a * b; return a; }
VM_INLINE float3& operator/= (float3 &a, float b) { a = a / b; return a; }
VM_INLINE bool3 operator==(float3 a, float3 b) { a.m = _mm_cmpeq_ps(a.m, b.m); return a; }
VM_INLINE bool3 operator!=(float3 a, float3 b) { a.m = _mm_cmpneq_ps(a.m, b.m); return a; }
VM_INLINE bool3 operator< (float3 a, float3 b) { a.m = _mm_cmplt_ps(a.m, b.m); return a; }
VM_INLINE bool3 operator> (float3 a, float3 b) { a.m = _mm_cmpgt_ps(a.m, b.m); return a; }
VM_INLINE bool3 operator<=(float3 a, float3 b) { a.m = _mm_cmple_ps(a.m, b.m); return a; }
VM_INLINE bool3 operator>=(float3 a, float3 b) { a.m = _mm_cmpge_ps(a.m, b.m); return a; }
VM_INLINE float3 min(float3 a, float3 b) { a.m = _mm_min_ps(a.m, b.m); return a; }
VM_INLINE float3 max(float3 a, float3 b) { a.m = _mm_max_ps(a.m, b.m); return a; }

Notice that they're all inline. You never, ever, want to actually step into any of these functions.

There's also the unary operators:

VM_INLINE float3 operator- (float3 a) { return float3(_mm_setzero_ps()) - a; }
VM_INLINE float3 abs(float3 v) { v.m = _mm_andnot_ps(vsignbits, v.m); return v; }

Note the clever trick for abs(): SSE has no abs instruction, so instead we just mask off the sign bits ourselves. Isn't bit-twiddling great?
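
While we're bit-twiddling: the same vsignbits constant could also drive a sign-flipping version of unary minus, instead of the subtract-from-zero form above. Just a sketch, not what the sample code uses:

// Alternative negation: XOR the sign bits.
VM_INLINE float3 neg(float3 v) { v.m = _mm_xor_ps(v.m, vsignbits); return v; }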

Helpers

That's the core of the work done. We need a few nice helper functions too.

Horizontal min/max is very useful:

VM_INLINE float hmin(float3 v) {
    v = min(v, SHUFFLE3(v, 1, 0, 2));
    return min(v, SHUFFLE3(v, 2, 0, 1)).x();
}

VM_INLINE float hmax(float3 v) {
    v = max(v, SHUFFLE3(v, 1, 0, 2));
    return max(v, SHUFFLE3(v, 2, 0, 1)).x();
}

No 3D maths library would be complete without the vector cross product. This is specific to float3; float2/4 don't have this.

VM_INLINE float3 cross(float3 a, float3 b) {
    // x  <-  a.y*b.z - a.z*b.y
    // y  <-  a.z*b.x - a.x*b.z
    // z  <-  a.x*b.y - a.y*b.x
    // We can save a shuffle by grouping it in this wacky order:
    return (a.zxy()*b - a*b.zxy()).zxy();
}

I mentioned earlier that comparisons don't return a single value, they return a vector of booleans. How do we get this into a form we can actually work with?

The first step is to be able to construct an integer mask from the sign bits of each result:

// Returns a 3-bit code where bit0..bit2 is X..Z
VM_INLINE unsigned mask(float3 v) { return _mm_movemask_ps(v.m) & 7; }

This mask can sometimes be useful on its own, but we can also use it to write some simple comparison helpers:

// Once we have a comparison, we can branch based on its results:
VM_INLINE bool any(bool3 v) { return mask(v) != 0; }
VM_INLINE bool all(bool3 v) { return mask(v) == 7; }
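
For instance (with hypothetical variables), a containment test reads quite naturally:

// Is point p inside the axis-aligned box [boxMin, boxMax]?
bool inside = all(p >= boxMin) && all(p <= boxMax);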

And finally we can fill out the rest of the HLSL standard library. I haven't listed all of them, there's a few more exotic ones you may or may not want, but this should get you started:

VM_INLINE float3 clamp(float3 t, float3 a, float3 b) { return min(max(t, a), b); }
VM_INLINE float sum(float3 v) { return v.x() + v.y() + v.z(); }
VM_INLINE float dot(float3 a, float3 b) { return sum(a*b); }
VM_INLINE float length(float3 v) { return sqrtf(dot(v, v)); }
VM_INLINE float lengthSq(float3 v) { return dot(v, v); }
VM_INLINE float3 normalize(float3 v) { return v * (1.0f / length(v)); }
VM_INLINE float3 lerp(float3 a, float3 b, float t) { return a + (b-a)*t; }
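
As a taster of the more exotic ones, here are two more HLSL-style functions you might add - these are my own sketches rather than part of the sample code:

VM_INLINE float3 saturate(float3 v) { return clamp(v, float3(0, 0, 0), float3(1, 1, 1)); }
VM_INLINE float3 reflect(float3 i, float3 n) { return i - 2.0f * dot(n, i) * n; }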

Well that's that. A simple SIMD float3 library for you. It might seem like a lot, but that's really only ~150 lines of code. I've left a matrix library as an exercise for the reader, but once you have float3/float4 in place, the rest can be built on top of them.

Note that the matrix library only needs to be written in terms of float3/float4 - you don't need to write any SSE intrinsics there. Also remember to only define your matrix struct as containing a simple array of float3's, otherwise you'll break vectorcall.
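
In other words, something along these lines (a sketch only - a 3x3 is shown because float3 is the only vector type defined in this article):

struct float3x3
{
    float3 rows[3];   // just the array of vectors, nothing else (see the vectorcall note above)
};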


The Test Case

For our test, we'll try a simple non-toy example. Here's the code to intersect a ray and an AABB:

bool intersectRayBox(float3 rayOrg, float3 invDir, float3 bbmin, float3 bbmax, float &hitT)
{
    float3 d0 = (bbmin - rayOrg) * invDir;
    float3 d1 = (bbmax - rayOrg) * invDir;

    float3 v0 = min(d0, d1);
    float3 v1 = max(d0, d1);

    float tmin = hmax(v0);
    float tmax = hmin(v1);

    bool hit = (tmax >= 0) && (tmax >= tmin) && (tmin <= hitT);
    if (hit)
        hitT = tmin;
    return hit;
}

And here's the code which calls it:

float hitT = FLT_MAX;
bool hit = intersectRayBox(testorg, float3(1, 1, 1) / testdir, testbbmin, testbbmax, hitT);
printf("hit %i at t=%f\n", hit, hitT);

Notice how although intrinsics are used under the hood, our code remains very clean and readable. Unlike writing directly with SSE intrinsics, we haven't compromised at all on our design.

So what assembly does this generate? In the demo I've provided two different implementations. The first is a simple non-SSE version, using a simple struct vec3 { float x, y, z; }; type. Let's see how it holds up. Note that I've passed the parameters using references in this case, to avoid as many redundant copies as possible.

Traditional non-SSE implementation (ray-box code)

intersectRayBoxFP:
    sub         esp,18h  
    mov         eax,dword ptr [ecx+8]  
    movq        xmm0,mmword ptr [ecx]  
    mov         dword ptr [esp+14h],eax  
    mov         eax,dword ptr [esp+1Ch]  
    movq        mmword ptr [esp+0Ch],xmm0  
    movq        xmm0,mmword ptr [eax]  
    mov         eax,dword ptr [eax+8]  
    mov         dword ptr [esp+8],eax  
    movss       xmm3,dword ptr [esp+8]  
    subss       xmm3,dword ptr [esp+14h]  
    mov         eax,dword ptr [edx+8]  
    movq        mmword ptr [esp],xmm0  
    movss       xmm1,dword ptr [esp]  
    subss       xmm1,dword ptr [esp+0Ch]  
    movss       xmm2,dword ptr [esp+4]  
    subss       xmm2,dword ptr [esp+10h]  
    movq        xmm0,mmword ptr [edx]  
    mov         dword ptr [esp+14h],eax  
    mov         eax,dword ptr [ecx+8]  
    movss       xmm5,dword ptr [esp+14h]  
    movq        mmword ptr [esp+0Ch],xmm0  
    movq        xmm0,mmword ptr [ecx]  
    movss       xmm4,dword ptr [esp+0Ch]  
    movss       xmm6,dword ptr [esp+10h]  
    mov         dword ptr [esp+8],eax  
    mov         eax,dword ptr [esp+20h]  
    movq        mmword ptr [esp],xmm0  
    mulss       xmm4,xmm1  
    movq        xmm0,mmword ptr [eax]  
    mov         eax,dword ptr [eax+8]  
    movq        mmword ptr [esp+0Ch],xmm0  
    movss       xmm1,dword ptr [esp+0Ch]  
    subss       xmm1,dword ptr [esp]  
    movq        xmm0,mmword ptr [edx]  
    mulss       xmm6,xmm2  
    mov         dword ptr [esp+14h],eax  
    movss       xmm2,dword ptr [esp+10h]  
    subss       xmm2,dword ptr [esp+4]  
    mov         eax,dword ptr [edx+8]  
    movq        mmword ptr [esp+0Ch],xmm0  
    movss       xmm7,dword ptr [esp+0Ch]  
    movss       xmm0,dword ptr [esp+10h]  
    mulss       xmm5,xmm3  
    movss       xmm3,dword ptr [esp+14h]  
    subss       xmm3,dword ptr [esp+8]  
    mulss       xmm7,xmm1  
    mov         dword ptr [esp+14h],eax  
    mulss       xmm0,xmm2  
    comiss      xmm7,xmm4  
    movss       xmm2,dword ptr [esp+14h]  
    movss       dword ptr [esp+1Ch],xmm7  
    mulss       xmm2,xmm3  
    jbe         intersectRayBoxFP+101h (0AA11A1h)  
    movaps      xmm1,xmm4  
    jmp         intersectRayBoxFP+104h (0AA11A4h)  
    movaps      xmm1,xmm7  
    comiss      xmm0,xmm6  
    jbe         intersectRayBoxFP+10Eh (0AA11AEh)  
    movaps      xmm7,xmm6  
    jmp         intersectRayBoxFP+111h (0AA11B1h)  
    movaps      xmm7,xmm0  
    comiss      xmm2,xmm5  
    jbe         intersectRayBoxFP+11Bh (0AA11BBh)  
    movaps      xmm3,xmm5  
    jmp         intersectRayBoxFP+11Eh (0AA11BEh)  
    movaps      xmm3,xmm2  
    comiss      xmm4,dword ptr [esp+1Ch]  
    ja          intersectRayBoxFP+12Bh (0AA11CBh)  
    movss       xmm4,dword ptr [esp+1Ch]  
    comiss      xmm6,xmm0  
    ja          intersectRayBoxFP+133h (0AA11D3h)  
    movaps      xmm6,xmm0  
    comiss      xmm5,xmm2  
    ja          intersectRayBoxFP+13Bh (0AA11DBh)  
    movaps      xmm5,xmm2  
    comiss      xmm1,xmm7  
    ja          intersectRayBoxFP+143h (0AA11E3h)  
    movaps      xmm1,xmm7  
    comiss      xmm1,xmm3  
    ja          intersectRayBoxFP+14Bh (0AA11EBh)  
    movaps      xmm1,xmm3  
    comiss      xmm6,xmm4  
    ja          intersectRayBoxFP+153h (0AA11F3h)  
    movaps      xmm4,xmm6  
    comiss      xmm5,xmm4  
    ja          intersectRayBoxFP+15Bh (0AA11FBh)  
    movaps      xmm4,xmm5  
    comiss      xmm4,dword ptr ds:[0AAD198h]  
    jb          intersectRayBoxFP+182h (0AA1222h)  
    comiss      xmm4,xmm1  
    jb          intersectRayBoxFP+182h (0AA1222h)  
    mov         ecx,dword ptr [esp+24h]  
    movss       xmm0,dword ptr [ecx]  
    comiss      xmm0,xmm1  
    jb          intersectRayBoxFP+182h (0AA1222h)  
    mov         al,1  
    movss       dword ptr [ecx],xmm1  
    add         esp,18h  
    ret         0Ch  
    xor         al,al  
    add         esp,18h  
    ret         0Ch

Bleurgh. That's over 100 instructions. There's a ridiculous number of comparison instructions towards the end, which is mostly due to our naive implementation of min/max. I've included a variation which uses the SSE scalar min/max intrinsics to help, but honestly it's not much better.

Let's also take a look at the code which calls it:

Traditional non-SSE implementation (usage)

testFP:
    sub         esp,1Ch  
    movq        xmm0,mmword ptr ds:[1A200Ch]  
    lea         edx,[esp+4]  
    movss       xmm1,dword ptr ds:[19D19Ch]  
    mov         ecx,1A2000h  
    mov         eax,dword ptr ds:[001A2014h]  
    movaps      xmm2,xmm1  
    movq        mmword ptr [esp+4],xmm0  
    movaps      xmm0,xmm1  
    divss       xmm2,dword ptr [esp+4]  
    mov         dword ptr [esp+0Ch],eax  
    mov         dword ptr [esp],7F7FFFFFh  
    divss       xmm1,dword ptr [esp+0Ch]  
    divss       xmm0,dword ptr [esp+8]  
    movss       dword ptr [esp+18h],xmm1  
    mov         eax,dword ptr [esp+18h]  
    mov         dword ptr [esp+0Ch],eax  
    lea         eax,[esp]  
    push        eax  
    push        1A2024h  
    unpcklps    xmm2,xmm0  
    push        1A2018h  
    movq        mmword ptr [esp+10h],xmm2  
    call        intersectRayBoxFP (01910A0h)  
    movss       xmm0,dword ptr [esp]  
    sub         esp,8  
    cvtps2pd    xmm0,xmm0  
    movzx       eax,al  
    movsd       mmword ptr [esp],xmm0  
    push        eax  
    push        19D188h  
    call        printf (0191352h)  
    add         esp,2Ch  
    ret

That's about 30 instructions to do the setup, then about 5 or so to do the printf. Not great.


Our New Results

Now let's try the same test with the shiny new SSE type. We're using exactly the same test code, the only difference is we pass our vectors in by value, not by reference. This is important for vectorcall to do its magic.

Vectorcall SSE implementation (ray-box code)

intersectRayBox:
    subps       xmm2,xmm0  
    subps       xmm3,xmm0  
    mulps       xmm2,xmm1  
    mulps       xmm3,xmm1  
    movaps      xmm1,xmm2  
    minps       xmm1,xmm3  
    maxps       xmm2,xmm3  
    movaps      xmm0,xmm1  
    shufps      xmm0,xmm1,0A1h  
    maxps       xmm1,xmm0  
    movaps      xmm0,xmm1  
    shufps      xmm0,xmm1,52h  
    maxps       xmm1,xmm0  
    movaps      xmm0,xmm2  
    shufps      xmm0,xmm2,0A1h  
    minps       xmm2,xmm0  
    movaps      xmm0,xmm2  
    shufps      xmm0,xmm2,52h  
    minps       xmm2,xmm0  
    comiss      xmm2,dword ptr ds:[3CD198h]  
    jb          intersectRayBox+5Bh (03C109Bh)  
    comiss      xmm2,xmm1  
    jb          intersectRayBox+5Bh (03C109Bh)  
    movss       xmm0,dword ptr [ecx]  
    comiss      xmm0,xmm1  
    jb          intersectRayBox+5Bh (03C109Bh)  
    mov         al,1  
    movss       dword ptr [ecx],xmm1  
    ret  
    xor         al,al  
    ret

Wowzers! Is that it? That's 32 instructions!

It's... it's literally perfect. There's not a single memory load/store anywhere in there, not even in the function prologue. Well OK, there's one to store the final 't' value out, and there's one to load a constant zero value. But that's pretty darn good. I'd say this easily rivals any hand-written assembler.

Let's take a look at the code that calls it:

Vectorcall SSE implementation (usage)

testSSE:
    push        ecx  
    movaps      xmm1,xmmword ptr ds:[3CD1C0h]  
    lea         ecx,[esp]  
    divps       xmm1,xmmword ptr ds:[3D30F0h]  
    mov         dword ptr [esp],7F7FFFFFh  
    movaps      xmm0,xmmword ptr ds:[3D3100h]  
    movaps      xmm2,xmmword ptr ds:[3D30E0h]  
    movaps      xmm3,xmmword ptr ds:[3D3110h]  
    call        intersectRayBox (03C1040h)  
    movss       xmm0,dword ptr [esp]  
    sub         esp,8  
    cvtps2pd    xmm0,xmm0  
    movzx       eax,al  
    movsd       mmword ptr [esp],xmm0  
    push        eax  
    push        3CD188h  
    call        printf (03C1352h)  
    add         esp,14h  
    ret

It's like night and day. See how much shorter it is? Again there's no fat here, no unneeded instructions that should have been trimmed off. It's about 15 instructions for the body, and 5 to call the printf.


I'll tease you with one more little example. Let's say you have a function which returns a matrix, and a second function which accepts a matrix as a parameter.

Normally, you might choose to not actually return the matrix, instead passing it by reference/pointer and having the first function fill that in. Then you pass the reference into the second function. Matrices are pretty big, so you really want to avoid the copies.

void calcMatrix(float4x4 &output);
void updateThing(const float4x4 &mtx);

With vectorcall, you can actually just return the matrix as a return value, matching the HLSL style:

float4x4 calcMatrix();
void updateThing(float4x4 mtx);
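
The call site is then presumably just a single expression, something like:

updateThing(calcMatrix());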

And here's the kind of assembly you'd get from doing that:

call        calcMatrix (5614AC10h)  
call        updateThing (5614AA50h)

There's no copies at all! It all just vanishes! The results from calcMatrix stay in the XMM0-XMM3 registers, and as we call updateThing they're already in the right place. We don't have to do any work at all, not even passing the address of the matrix. It literally does not get any better than that.


Caveats

There is one annoying thing I have to mention. When using a non-standard calling convention enabled globally, any DLLs you link against need to be aware of it. All the built-in Windows DLLs are, but many third party libraries are not. So if you want to use something like, for example, Lua, then you may need to just go through the library's header file and stick a "__cdecl" in front of everything.
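
That is, each declaration in such a header gets pinned back to its original convention (a made-up function here, purely for illustration):

// Hypothetical third-party header, after patching:
int __cdecl thirdparty_do_thing(int arg);   // explicit __cdecl, so /Gv can't change its convention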

The alternative is to manually add __vectorcall to every function in your codebase, but that's silly. Don't clutter your code, let the compiler clutter things for you.

If you're linking libraries statically, none of this affects you.

Wrap Up

Well there you have it ladies and gentlemen. If you're after a modern, fast, beautiful vector maths library, here is the way to do it.

In the past, trying to use SSE to accelerate a vector maths library was usually a waste of both your time and the CPU's. But time does not stand still, and yesterday's rules may be different from today's.

So my advice today is to give vectorcall a try. But remember, while the advice from 10 years ago may not apply today, please also bear in mind that in 10 years time this article will be wrong too.

Written by Richard Mitton,

software engineer and travelling wizard.

Follow me on twitter: http://twitter.com/grumpygiant