Worth mentioning that MeshOptimizer (https://github.com/zeux/meshoptimizer) has become one of a handful 'hidden champion' pillar libraries that probably carries half of the gaming industry.
While we could utilize zigzag encoding (i>>31) ^ (i<<1) to convert SLEB128-encoded type/addend to use ULEB128 instead, the generate code is inferior to or on par with SLEB128 for one-byte encodings on x86, AArch64, and RISC-V. Haven't tried wider values - but zigzag encoding is likely slower as well
// One-byte case for SLEB128
int64_t from_signext(uint64_t v) {
return v < 64 ? v - 128 : v;
}
// One-byte case for ULEB128 with zig-zag encoding
int64_t from_zigzag(uint64_t z) {
return (z >> 1) ^ -(z & 1);
}
Zigzag encodings are a common compression scheme used in the Parquet format. It is fun to speculate that these kind of tricks could be applied there in something so commonly under the hood of a lot of data processing and analytics
Is the matrix for bit shifting upside down or am I momentarily making a really dumb mistake here? Edit: nvm I missed the footnote which clarifies that for whatever reason the instruction populates the matrix from bottom to top.
Now why can't compilers do this sort of thing automatically?
Almost any problem seems to be possible to speed up 1000x in AVX512+days of thought compared to the naive version written in a python loop. If we could automate that whole process for big codebases the performance gains could be huge.
Compilers can’t really, in a meaningful way, change the layout of your data in memory. And you do need to think about your memory layout to get any benefit from SIMD. You’ll notice a lot of compiler auto vectorization insert many instructions just to shuffle data around to get to a usable layout, which negates much of the benefit.
Even when the layout is friendly to simd, auto vectorization can be finicky. As a programmer, it’s really annoying to be constantly inspecting compiler output to see if the code was properly vectorized. Even if it was, slight changes or compiler updates can throw the whole thing off. Auto vectorization is nice when you get performance improvements for “free”, but I find it fragile for the really critical parts where you absolutely need it to be vectorized.
I often wonder about a macro-like thing where we could write a function using a subset of the language that’s simd aware. A bit higher level than using intrinsics or those simd libs
This is actually a very nice question and the answer is that interpreted languages with a JIT benefit from this.
One example is Java, which will happily vectorize your code into AVX or SSE where possible.
Python just got a JIT compiler and we’ll start seeing the same thing soon.
But as someone else said here, some constructs don’t translate well and adding transformations to show vectorization would negate the perfomance gains.
Sad that the compiler (even Java) can’t explain you this and warn about it, but now with LLM, maybe they’ll start doing things like that soon.
Are there any programming languages which change the data layout beyond naively sorting struct members by alignment? (which at best helps with reducing padding bytes but can be either good or bad for performance, depending on the code which accesses the data).
One simple optimization is to change arrays of struts into struts of arrays. To my knowledge, nothing even makes those changes, despite them being safe and having a huge potential performance benefit.
Zig also sorts struct members by size/alignment, but has two escape hatches ('extern struct' which is for C compatibility, and 'packed struct' which offers an explicit bit-by-bit memory layout).
AFAIK Odin and Jai offer the SoA transform as specialized language features, e.g. in Odin:
FORTRAN is used for a lot of numerical algorithms - today! installed on your computer right now in some library! - because it optimizes better than C because it doesn't have pointers.
> 1000x in AVX512+days of thought compared to the naive version written in a python loop
Out of this 1000x speedup you get 100x by just not using python though ;)
Also IIRC the main problem specifically with AVX512 was that mainstream CPUs simply didn't have it, so a smart compiler won't be of much use when the output code only runs on a handful devices.
a pragmatic approach: write in a high level interpreted language that rhymes with modern CPUs, vector extensions, memory bandwidth
e.g. apl [0], bqn [1], k [2], kiwi [3]
- vectors are dense (not boxed)
- optimized internal representation (e.g. bitpacked bool vectors)
- primitives act on vectors + use avx, neon if possible
Intents matter. Compilers can't see through your skull to infer your intents and thus behave very conservatively unless you override that behavior somehow. This inference, alas, also takes (much) time, so compilers have to balance the compilation time with quality of intents guessed as well. (This is why we can't exactly use LLMs in mainstream compilers, by the way.) So go and make a programming language that preserves your intents by every means; but making it practical would be very difficult.
21 comments:
Worth mentioning that MeshOptimizer (https://github.com/zeux/meshoptimizer) has become one of a handful 'hidden champion' pillar libraries that probably carries half of the gaming industry.
Basically the curl of asset pipelines ;)
https://github.com/zeux/meshoptimizer/discussions/986
While we could utilize zigzag encoding (i>>31) ^ (i<<1) to convert SLEB128-encoded type/addend to use ULEB128 instead, the generate code is inferior to or on par with SLEB128 for one-byte encodings on x86, AArch64, and RISC-V. Haven't tried wider values - but zigzag encoding is likely slower as well
// One-byte case for SLEB128 int64_t from_signext(uint64_t v) { return v < 64 ? v - 128 : v; }
// One-byte case for ULEB128 with zig-zag encoding int64_t from_zigzag(uint64_t z) { return (z >> 1) ^ -(z & 1); }
Zigzag encodings are a common compression scheme used in the Parquet format. It is fun to speculate that these kind of tricks could be applied there in something so commonly under the hood of a lot of data processing and analytics
Is the matrix for bit shifting upside down or am I momentarily making a really dumb mistake here? Edit: nvm I missed the footnote which clarifies that for whatever reason the instruction populates the matrix from bottom to top.
This sort of analysis is great.
Now why can't compilers do this sort of thing automatically?
Almost any problem seems to be possible to speed up 1000x in AVX512+days of thought compared to the naive version written in a python loop. If we could automate that whole process for big codebases the performance gains could be huge.
Compilers can’t really, in a meaningful way, change the layout of your data in memory. And you do need to think about your memory layout to get any benefit from SIMD. You’ll notice a lot of compiler auto vectorization insert many instructions just to shuffle data around to get to a usable layout, which negates much of the benefit.
Even when the layout is friendly to simd, auto vectorization can be finicky. As a programmer, it’s really annoying to be constantly inspecting compiler output to see if the code was properly vectorized. Even if it was, slight changes or compiler updates can throw the whole thing off. Auto vectorization is nice when you get performance improvements for “free”, but I find it fragile for the really critical parts where you absolutely need it to be vectorized.
I often wonder about a macro-like thing where we could write a function using a subset of the language that’s simd aware. A bit higher level than using intrinsics or those simd libs
This is actually a very nice question and the answer is that interpreted languages with a JIT benefit from this.
One example is Java, which will happily vectorize your code into AVX or SSE where possible.
Python just got a JIT compiler and we’ll start seeing the same thing soon.
But as someone else said here, some constructs don’t translate well and adding transformations to show vectorization would negate the perfomance gains.
Sad that the compiler (even Java) can’t explain you this and warn about it, but now with LLM, maybe they’ll start doing things like that soon.
Depends on the programming language. A good question is why we don't have more optimizable languages in mainstream use.
Are there any programming languages which change the data layout beyond naively sorting struct members by alignment? (which at best helps with reducing padding bytes but can be either good or bad for performance, depending on the code which accesses the data).
One simple optimization is to change arrays of struts into struts of arrays. To my knowledge, nothing even makes those changes, despite them being safe and having a huge potential performance benefit.
Now that you mention it... :)
Zig has MultiArrayList in the stdlib which does the SoA transform via comptime:
https://ziglang.org/documentation/master/std/#std.multi_arra...
Zig also sorts struct members by size/alignment, but has two escape hatches ('extern struct' which is for C compatibility, and 'packed struct' which offers an explicit bit-by-bit memory layout).
AFAIK Odin and Jai offer the SoA transform as specialized language features, e.g. in Odin:
https://odin-lang.org/docs/overview/#soa-data-types
I'd still always want such data layout transforms as an explicit language feature though, not the compiler making this decision for me.
Halide comes to mind (though it a DSL and not a full independent language).
I wonder if Futhark does? Eg https://futhark-lang.org/student-projects/pedersen-nelin-msc...
various SQLs and APLs come to mind :) the industry still has a lot to learn from them both
FORTRAN is used for a lot of numerical algorithms - today! installed on your computer right now in some library! - because it optimizes better than C because it doesn't have pointers.
As I understand it most of the difference can be made up by adding the restrict qualifier to everything in C.
> 1000x in AVX512+days of thought compared to the naive version written in a python loop
Out of this 1000x speedup you get 100x by just not using python though ;)
Also IIRC the main problem specifically with AVX512 was that mainstream CPUs simply didn't have it, so a smart compiler won't be of much use when the output code only runs on a handful devices.
> Now why can't compilers do this sort of thing automatically?
They do - they just can't assume GFNI instructions are present unless you explicitly say so: https://godbolt.org/z/eYasbKsse
> Now why can't compilers do this sort of thing automatically?
Because they are not query compilers, ie: They don't know the data.
For example a query compiler could swap index to full scan because it "see" (by runtime statistics) the data not benefit for it.
In the other hand, an optimization here can pessimism there. So optimizers in general should be very conservative because butterfly effects!
it is not easy for a compiler to vectorize
a pragmatic approach: write in a high level interpreted language that rhymes with modern CPUs, vector extensions, memory bandwidth
e.g. apl [0], bqn [1], k [2], kiwi [3]
[0] https://www.dyalog.com [1] https://mlochbaum.github.io/BQN/ [2] https://kx.com [3] https://kiwilang.comgreat article by marshall on BQN performance compared to C and how to think about it
https://mlochbaum.github.io/BQN/implementation/versusc.html
related:
Intents matter. Compilers can't see through your skull to infer your intents and thus behave very conservatively unless you override that behavior somehow. This inference, alas, also takes (much) time, so compilers have to balance the compilation time with quality of intents guessed as well. (This is why we can't exactly use LLMs in mainstream compilers, by the way.) So go and make a programming language that preserves your intents by every means; but making it practical would be very difficult.