The main targets for this are NLEs like Blender. Performance is a large part of the issue. Most users still just create TIFF files per frame before importing them into a "real editor" like Resolve.
Apple may have ASICs for ProRes decoding, and Resolve may be the standard editor that everyone uses.
But this goes beyond what even Apple has, by making it possible to work directly with compressed lossless video on consumer GPUs. You can get hundreds of FPS encoding or decoding 4K 16-bit FFv1 on a 4080, while only reading a few gigabits of video per second, rather than the tens or even hundreds of gigabits that SSDs can't keep up with. No need for image degradation when passing intermediate copies between CG programs and editing, either.
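To put rough numbers on that claim (the frame rates and compression ratio below are illustrative assumptions, not measurements):

```python
# Back-of-envelope bandwidth for 4K 16-bit RGB video.
# Assumed figures: 24 fps playback, 200 fps scrubbing, 3:1 FFv1 ratio.
width, height, channels, bytes_per_sample = 3840, 2160, 3, 2
frame_bytes = width * height * channels * bytes_per_sample   # ~47.5 MiB/frame

def gbps(fps: float) -> float:
    """Raw data rate in Gbit/s at a given frame rate."""
    return frame_bytes * fps * 8 / 1e9

ratio = 3.0                                  # assumed lossless compression ratio
print(f"raw @24fps:  {gbps(24):.1f} Gbit/s")           # realtime playback
print(f"ffv1 @24fps: {gbps(24) / ratio:.1f} Gbit/s")   # 'a few gigabits'
print(f"raw @200fps: {gbps(200):.1f} Gbit/s")          # scrubbing, uncompressed
```

At scrubbing speeds the uncompressed stream lands in the tens of Gbit/s, which is exactly where the SSD stops keeping up while a compressed stream still fits.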
Yep! Almost finished implementing support in https://ossia.io which is going to become the first open-source cross-platform real-time visuals software to support live scrubbing for VJ use cases, in 4K+ ProRes files on not that big of a GPU (tested on my laptop's 3060) :)
How to feed MilkDrop music visualizations? (MilkDrop3, projectm-visualizer/presets-cream-of-the-crop, westurner/vizscan for photosensitive epilepsy)
mapmapteam/mapmap does open source multi-projector mapping. How to integrate e.g. mapmap?
BespokeSynth is a C++ and JUCE based patch bay software modular synth with a "node-based UI" and VST3, LV2, AudioUnit audio plugin support. How to feed BespokeSynth audio and possibly someday video? Pipewire and e.g. Helvum?
I don’t understand the spread of thoughts in your post.
The reason to create image sequences is not that you need to send them to other apps; it's that you preserve quality and safeguard against crashes.
A crash mid video write out can corrupt a lengthy render. With image sequences you only lose the current frame.
People aren’t going to stop using image sequences even if they stayed in the same app.
And I’m not sure why “this goes beyond” what Apple has applies, because they do have hardware support for decoding several compressed codecs (I’ll also note that ProRes is itself compressed). Other than streaming, when are you going to need that kind of encode performance? Or what other codecs are you expecting will suddenly pop up by not requiring ASICs?
Also how does this remove degradation when going between apps? Are you envisioning this enables Blender to stream to an NLE without first writing a file to disk?
> A crash mid video write out can corrupt a lengthy render. With image sequences you only lose the current frame.
You wouldn't store FFv1 in MP4, the only container format incompetent enough for such corruption.
Apple has an interest against people using codecs that they get no fees from. And Apple doesn't have a lossless codec. So they don't offer lossless compressed video acceleration.
The idea is that when you're working as part of a team and get handed a CG render, you can avoid sending a huge .tar or .zip file full of TIFFs which you then decompress, or ProRes which loses quality, particularly in a linear colorspace like ACEScg.
I’m curious what kind of teams you’re working in that are handing around compressed archives of image sequences? And using TIFF vs EXR (unless you mean purely after compositing)?
Another reason to use image sequences is that it's easy to re-render just a portion of the sequence. Granted, this can be done with video too, but with higher overhead.
But even then, why does GPU encoding change the fact that you'd send it to another NLE? I just feel like there are a lot of jumps in the thought process here.
I thought an industry standard was to use proxy files. The open source editor Shotcut uses them, for example. Create a low-resolution, intra-frame-only version of the file for very fast scrubbing, make your edits on that, and when done the edit list is applied to the full-resolution rushes to produce the output.
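That round-trip can be sketched in a few lines: an edit list is just frame ranges, and since the proxy and the rushes share timecode, cuts chosen on the proxy conform exactly to the full-resolution source (a toy model, not any real NLE's EDL format):

```python
# Toy offline-edit conform: cuts are chosen on a light proxy, then the
# same frame ranges are applied to the full-resolution rushes.
def conform(frames, edit_list):
    """Assemble an output sequence from (in, out) frame ranges."""
    out = []
    for start, end in edit_list:
        out.extend(frames[start:end])
    return out

edits = [(120, 300), (451, 700)]            # decided while scrubbing the proxy
proxy = [("proxy", i) for i in range(1000)]
full  = [("full", i)  for i in range(1000)]

# Same timecode on both, so the cut conforms exactly:
assert [i for _, i in conform(proxy, edits)] == [i for _, i in conform(full, edits)]
print(len(conform(full, edits)), "frames in the final cut")
```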
Often but not always. Sometimes you're just working with proxies directly, for audio mixing and the like. VFX and finishing workflows will often be online at full res.
But even so, everybody is making their own proxies all the time. There's a lot of passing around of ProRes Proxy or another intermediate-quality format, and you still make even lighter proxies locally, so NLEs and workstation apps will still benefit from this.
Proxy files have issues when doing coloring, greenscreens, and effects shots. The bit depth, chroma resolution, and primaries/transfer/colorspace all get changed. They're basically only really usable for editing.
With this, you don't need proxy files at all.
A lot of the confusion in this thread feels like it comes from thinking in terms of web streaming rather than the workloads this post is targeting.
The article is pretty explicit that this is not about "make Twitch more efficient" or squeezing a bit more perf out of H.264. It is about mezzanine and archival formats that are already way beyond what a single CPU, even a decade old workstation CPU, handles comfortably in real time: 4K/6K/8K+ 16‑bit, FFv1-style lossless, ProRes RAW, huge DPX sequences, etc. People cutting multi‑camera timelines of that kind of material are already on the wrong side of the perf cliff and are often forced into very specific hardware or vendors.
What Vulkan compute buys you here is not "GPUs good, CPUs bad", it is the ability to keep the entire codec pipeline resident on the GPU once the bitstream is there, using the same device that is already doing color, compositing and FX, and to do it in a portable way. FFmpeg’s model is also important: all the hairy parts stay in software (parsing, threading, error handling), and only the hot pixel crunching is offloaded. That makes this much more maintainable than the usual fragile vendor API route and keeps a clean fallback path when hardware is not available.
From a practical angle, this is less about winning a benchmark over a good CPU encoder for 4K H.264, and more about changing what is feasible on commodity hardware: e.g., scrubbing multiple streams of 6K/8K ProRes or FFv1 on a consumer GPU instead of needing a fat workstation or dailies transcoded to lighter proxies. For people doing archival work or high end finishing on a budget, that is a real qualitative change, not just an incremental efficiency tweak.
I once asked on #ffmpeg@libera if the GPU could be used to encode h264, and apparently yes, but it's not really worth it compared to CPU.
I don't know much about video compression; does that mean that a codec like h264 is not parallelizable?
One of the choke points of all modern video codecs that focus on potential high compression ratios is the arithmetic entropy coding. CABAC for h264 and h265, 16-symbol arithmetic coding for AV1. There is no way to parallelize that AFAIK: the next symbol depends on the previous one. All you can do is a bit of speculative decoding, but that doesn't go very deep. Even when implemented in hardware, arithmetic decoding is hard to parallelize.
This is especially a choke point when you use these codecs for high quality settings. The prediction and filtering steps later in the decoding pipeline are relatively easy to parallelize.
High-throughput codecs like ProRes don't use arithmetic coding but a much simpler, table-based coding scheme.
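A minimal sketch of why this serializes, using an LZMA-style adaptive binary range coder (a toy model, not the actual CABAC or AV1 coder): the probability state used for bit i is produced by coding bits 0..i-1, so a decoder cannot start in the middle of the stream.

```python
# Toy adaptive binary range coder. The model update on every bit is the
# serial dependency: decoding bit i needs the state left by bits 0..i-1.
MASK, TOP = (1 << 32) - 1, 1 << 24

def encode(bits):
    low, rng, p, out = 0, MASK, 2048, bytearray([0])  # leading carry-guard byte
    for bit in bits:
        split = (rng >> 12) * p           # p = P(bit == 0), 12-bit fixed point
        if bit == 0:
            rng = split
            p += (4096 - p) >> 5          # adapt model toward zeros
        else:
            low += split
            rng -= split
            p -= p >> 5                   # adapt model toward ones
            if low > MASK:                # propagate carry into emitted bytes
                i = len(out) - 1
                while out[i] == 0xFF:
                    out[i] = 0
                    i -= 1
                out[i] += 1
                low &= MASK
        while rng < TOP:                  # renormalize: emit top byte of low
            out.append((low >> 24) & 0xFF)
            low = (low << 8) & MASK
            rng = (rng << 8) & MASK
    for _ in range(4):                    # flush remaining bytes of low
        out.append((low >> 24) & 0xFF)
        low = (low << 8) & MASK
    return bytes(out)

def decode(data, n):
    pos, code, rng, p, bits = 1, 0, MASK, 2048, []   # skip the guard byte
    for _ in range(4):
        code = (code << 8) | data[pos]
        pos += 1
    for _ in range(n):
        split = (rng >> 12) * p           # must match the encoder's state
        if code < split:
            bits.append(0)
            rng = split
            p += (4096 - p) >> 5
        else:
            bits.append(1)
            code -= split
            rng -= split
            p -= p >> 5
        while rng < TOP:                  # renormalize in lockstep
            nxt = data[pos] if pos < len(data) else 0
            code = ((code << 8) | nxt) & MASK
            pos += 1
            rng = (rng << 8) & MASK
    return bits

msg = ([0] * 9 + [1]) * 100               # biased stream compresses well
packed = encode(msg)
assert decode(packed, len(msg)) == msg
print(len(msg) // 8, "raw bytes ->", len(packed), "coded bytes")
```

Every quantity the decoder needs (`rng`, `p`, the byte position) is a function of all previously decoded bits, which is exactly the property that defeats parallel decode.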
FFv1's range coder has higher complexity than CABAC.
The issue is serialization. Mainstream codecs require that a block depend on previously decoded blocks. Tiles exist, but they're so much larger, and so rarely used, that they may as well not exist.
> the GPU could be used to encode h264, and apparently yes, but it's not really worth it compared to CPU.
It depends on what you're going for. If you're trying to do the absolute highest fidelity for archiving a Blu-ray disc, AMD Epyc reigns supreme. That's because you need a lot of flexibility to really dial in the quality settings. Pirates over at PassThePopcorn obsess over minute differences in quality that I absolutely cannot notice with my eyes, and I'm glad they do! Their encodings look gorgeous. This quality can't be achieved with the silicon of hardware-accelerated encoders, and due to driver limitations (not silicon limitations) also cannot be achieved by CUDA cores / execution engines / etc on GPUs.
But if you're okay with a small amount of quality loss, the optimum move for highest # of simultaneous encodes or fastest FPS encoding is to skip the CPU and GPU "general compute" entirely - going with hardware accelerated encoding can get you 8-30 1080p simultaneous encodes on a very cheap intel iGPU using QSV/VAAPI encoding. This means using special sections of silicon whose sole purpose is to perform H264/H265/etc encoding, or cropping / scaling / color adjustments ... the "hardware accelerators" I'm talking about are generally present in the CPU/iGPU/GPU/SOC, but are not general purpose - they can't be used for CUDA/ROCm/etc. Either they're being used for your video pipeline specifically, or they're not being used at all.
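For concreteness, this is the shape of an ffmpeg invocation that keeps decode, scale, and encode on those video engines via VAAPI; the device path and file names are placeholders:

```python
# Build a full-hardware VAAPI pipeline command: decode, scale, and
# H.264-encode on the iGPU video engines, never touching CPU pixels.
import shlex

cmd = [
    "ffmpeg",
    "-hwaccel", "vaapi",                       # decode on fixed-function HW
    "-hwaccel_device", "/dev/dri/renderD128",  # placeholder DRM render node
    "-hwaccel_output_format", "vaapi",         # keep frames in GPU memory
    "-i", "camera.ts",                         # placeholder input
    "-vf", "scale_vaapi=w=1280:h=720",         # scale without CPU readback
    "-c:v", "h264_vaapi",
    "-b:v", "2M",
    "out.mp4",
]
print(shlex.join(cmd))
```

The `-hwaccel_output_format vaapi` part is what keeps the decoded frames resident on the device between the decoder, scaler, and encoder.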
I'm doing this now for my startup and we've tuned it so it uses 0% of the CPU and 0% of the Render/3D engine of the iGPU (which is the most "general purpose" section of the GPU, leaving those completely free for ML models) and only utilizing the Video Engine and Video Enhance engines.
For something like Frigate NVR, that's perfect. You can support a large # of cameras on cheap hardware and your encoding/streaming tasks don't load any silicon used for YOLO, other than adding to overall thermal limits.
Video encoding is a very deep topic. You need to have benchmarks, you need to understand not just "CPU vs GPU" ... but down to which parts of the GPU you're using. There's an incredible amount of optimization you can do for your specific task if you take the time to truly understand the systems level of your video pipeline.
> But if you're okay with a small amount of quality loss,
I wouldn't call it a small quality loss. The hardware encoders are tuned for different priorities like live streaming. They have lower quality and/or much higher bitrate.
> If you're trying to do the absolute highest fidelity for archiving a blu-ray disk, AMD Epyc reigns supreme.
You don't need any special CPU to get the highest fidelity as long as you're willing to wait. For archiving purposes any CPU will do; just be prepared to let it run for a long time.
> You don't need any special CPU to get the highest fidelity as long as you're willing to wait.
Correct, but Epyc "reigns supreme" for anyone caring about performance / total FPS throughput, which is relevant for anyone who cares about TFA at all - the purpose of using GPU is to "go faster", and that's what Epyc offers for use cases that also care about extreme fidelity.
> I wouldn't call it a small quality loss. The hardware encoders are tuned for different priorities like live streaming. They have lower quality and/or much higher bitrate.
Sure. It absolutely depends on your use case. We're using it for RDP/KVM-type video, so for us the quality loss is indeed quite "small". Our users care more about "can I read the text clearly?" and less about color-banding. The hardware accelerators do a great job with text clarity so for our use-case it's not much of a noticeable quality loss. I will admit the colors are very noticeably distorted, but the shapes are correct and the contrast/sharpness is good.
Using 0% of the CPU and GPU for encoding is a HUGE win that's totally worth it for us - hardware costs stay super low. Using really old bottom-of-the-barrel CPUs for 30+ simultaneous encodes feels like cheating. Hardware-accelerated encoding also provides another massive win by tangibly reducing latency for our users vs CPU/GPU encoding (it's not just the throughput that's improved, each live frame gets through the pipeline faster too).
I wouldn't use COTS hardware accelerators for archiving Blu-ray videos. Hell, I'm not even aware of any COTS hardware accelerators that support HDR ... they probably exist but I've never stumbled across one. But hardware-accelerated encoding really is ideal for a lot of other stuff, especially when you care about CapEx at scale. If you're at the scale of Netflix or YouTube, you can get custom silicon made that can provide ASIC acceleration for any quality you like. That said, they seem to choose to degrade video quality to save money all the way to the point that 10-20% of their users hate the quality (myself included; quality is one of the primary reasons I use PassThePopcorn instead of the legal streaming services), but that's a business choice, not a technical limitation of ASIC acceleration (that's if you have the scale to pay for custom silicon... COTS solutions absolutely DO have a noticeable quality loss, as you argue).
> We're using it for RDP/KVM-type video, so for us the quality loss is indeed quite "small". Our users care more about "can I read the text clearly?" and less about color-banding. The hardware accelerators do a great job with text clarity so for our use-case it's not much of a noticeable quality loss.
This is a perfect use case for hardware video acceleration.
The hardware encoder blocks are great for anything live-streaming related. The video they produce uses a much higher bitrate and has lower quality than what you could get with a CPU encoder, but if doing a lot of real-time encodes is important then they deliver.
Common video codecs are often hardware accelerated.
This hardware is quite often on the CPU side, as there are a lot of systems without dedicated GPUs that still play video, like notebooks and smartphones.
So in the end it's less about whether it's parallelizable and more about whether it beats dedicated hardware, and the answer to that should almost always be no.
P.S.: In video decoding, speed is only relevant up to a certain point. That being: "Can I decode the next frame(s) in time to show it/them without stuttering?" Once that has been achieved, other factors such as power draw become more important.
It is my understanding that hardware accelerated video encoders (as in the fixed-function ones built into consumer GPUs) produce a lower quality output than software-based encoders. They're really only there for on-the-fly encoding like streaming to twitch or recording security camera footage. But if you're encoding your precious family memories or backing up your DVD collection, you want to use software encoders. Therefore, if a hypothetical software h264 encoder could be faster on the GPU, it would have value for anyone doing not-on-the-fly encoding of video where they care about the quality.
> ... That being: "Can I decode the next frame(s) in time to show it/them without stuttering".
Except when you are editing video, or rendering output. When you have multiple streams of very high definition input, you definitely need much more than realtime speed decoding of a single video.
And you would want to scrub around the video(s), jumping to any timecode, and preferably have the target frame showing as soon as your monitor refreshes.
A GPU's job is to take inputs at some resolution, transform them, and then output at that resolution. H.264/H.265 (and really, any playback format) needs a fundamentally different workflow: it needs to take as many frames as your framerate is set to, store the first frame as a full frame, and then store N-1 diffs describing only which pixels changed between successive frames. Something GPUs are terrible at. You could certainly use the GPU to calculate the full frame diff, but then you still need to send it back to the CPU or dedicated encoding hardware to turn that into an actual concise diff description. At that point, you might as well make the CPU or hardware encoder do the whole job; you're not saving any appreciable time by sending the data over to the GPU first, just to get it back in a way where you still go over every pixel afterwards.
Vulkan compute shaders make GPU acceleration practical for intensive codecs like FFv1, ProRes RAW, and DPX. Previous hybrid GPU+CPU approaches suffered from round-trip latency; these run fully on the GPU, hands off. A big deal for editing workflows.
> Most popular codecs were designed decades ago, when video resolutions were far smaller. As resolutions have exploded, those fixed-size minimum units now represent a much smaller fraction of a frame — which means far more of them can be processed in parallel. Modern GPUs have also gained features enabling cross-invocation communication, opening up further optimization opportunities.
One only needs to look at GPU driven rendering and ray tracing in shaders to deduce that shader cores and memory subsystems these days have become flexible enough to do work besides lock-step uniform parallelism where the only difference was the thread ID.
Nobody strives for random access memory read patterns, but the universal popularity of buffer device address and descriptor arrays can be taken somewhat as proof that these indirections are no longer the friction for GPU architectures that they were ten years ago.
At the same time, the languages are no longer as restrictive as they once were. People are recording commands on the GPU. This kind of fiddly serial work is an indication that the ergonomics of CPU programming have less of a relative advantage, and that cuts deeply into the tradeoff costs.
Yeah, Vulkan is shedding most of the abstractions off. Buffers are no longer needed - just device addresses. Shaders don't need to be baked into a pipeline - you can use shader objects. Even images rarely provide any speedup advantages over buffers, since texel cache is no longer separate from memory cache.
GPUs these days have massive caches, often hundreds of megabytes, on top of an already absurd number of registers. A random read will often load a full cacheline into a register and keep it there, reusing it as needed between invocations.
Yes, but no. No, in that these days GPUs are entirely scalar from the point of view of invocations. Using vectors in shaders is pointless; it will be as fast as scalar variables (dual-issue instruction dispatch on AMD GPUs is an exception).
But yes from the point of view that a collection of invocations all progressing in lockstep get arithmetic done by vector units. GPUs have just gotten really good at hiding what happens with branching paths between invocations.
SIMT is a distinct model. The ergonomics are wildly different. Instead of contracting a long iteration by packing its steps together to make them "wider", you rotate the iteration across cores.
The critical difference is that SIMD and parallel programming are totally different in terms of ergonomics while SIMT is almost exactly the same as parallel programming. You have to design for SIMD and parallelism separately while SIMT and parallelism are essentially the same skill set.
The fan-in / fan-out and iteration rotation are the key skills for SIMT.
Well, the problem with hardware decoding is that it cannot handle all the variations in data corruption, which can result in a hardware crash, sometimes not recoverable with a soft reset of the hardware block.
It is usually more reasonable to work with software decoders for really complex formats, to accelerate only some heavy parts of the decoding where data corruption is easy to deal with or benign, or to aim for the middle ground: _SIMPLE_ and _VERY CONSERVATIVE_ compute shaders.
Sometimes the software cannot even tell that the hardware has actually 'crashed' and is spitting out nonsense data. It gets even worse: some hardware-block hot resets don't actually work and require a power cycle... That's why a 'media player' able to use hardware decoding must always provide a clear and visible 'user button' to let that very user switch to full software decoding.
Then there is the next step of "corruption": some streams out there are "wrong", but this "wrong" will decode OK on some specific decoders and not others, even though all of them follow the same spec.
What a mess.
I hope those compute shaders are not written in that abomination GLSL (or the DX one), but are instead SPIR-V shaders generated from plain and simple C code.
These are all gripes you might have with Vulkan Video.
Unlike with Vulkan Video, in compute, bounds checking is the norm. Overreading a regular buffer will not result in a GPU hang or crash. If you use pointers it will, but then it's up to you to check whether overreads can happen.
The bitstream reader in FFmpeg for Vulkan Compute codecs is copied from the C code, along with bounds checking. The code which validates whether a block is corrupt or decodable is also taken from the C version. To date, I've never got a GPU hang while using the Compute codecs.
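The fail-safe behavior being described can be illustrated with a toy bounds-checked bitstream reader (not FFmpeg's actual GLSL reader): an out-of-range read returns zeros instead of faulting.

```python
# Minimal bounds-checked bitstream reader, in the spirit of the approach
# described above; a hypothetical illustration, not FFmpeg code.
class BitReader:
    def __init__(self, data: bytes):
        self.data, self.bitpos, self.nbits = data, 0, 8 * len(data)

    def get_bits(self, n: int) -> int:
        # Out-of-range reads return 0 instead of faulting, the same
        # fail-safe that a bounds-checked GPU buffer read gives you.
        if self.bitpos + n > self.nbits:
            self.bitpos = self.nbits
            return 0
        v = 0
        for _ in range(n):                      # MSB-first bit extraction
            byte = self.data[self.bitpos >> 3]
            v = (v << 1) | ((byte >> (7 - (self.bitpos & 7))) & 1)
            self.bitpos += 1
        return v

r = BitReader(bytes([0b10110100]))
print(r.get_bits(3), r.get_bits(5), r.get_bits(4))   # last read is past the end
```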
I wrote the Vulkan ProRes backend. The bitstream decoder was implemented from scratch, for a number of reasons.
First, the original code was reverse-engineered, before Apple published an SMPTE document describing the bitstream syntax. Second, I tried my best at optimizing the code for GPU hardware. And finally, I wanted to take the learning opportunity :)
What is the use case? Okay, ultra-low-latency streaming. That is good. But if you are sending the frames via some protocol over the network, like WebRTC, it will be touching the CPU anyway. Software encoding of 4K h264 is real time on a single thread on 65 W, decade-old CPUs, with low latency. The CPU encoders are much better quality and more flexible. So it's very difficult to justify the level of complexity needed for hardware video encoding. There's absolutely no need for it for TV streaming, for example. But people who have no need for it keep being obsessed with it.
IMO vendors should stop reinventing hardware video encoding and instead assign the programmer time to making libwebrtc and libvpx better suit their particular use case.
DaVinci Resolve is the only commercial NLE with any kind of Vulkan support, and even that is experimental.
ProRes decodes faster than realtime single-threaded on a decade-old CPU too.
It doesn't make sense. It's much different from, say, a video game, where a texture is loaded once into VRAM and then, yes, all the work is done on the GPU. A video will have CPU IO every frame; you are still doing a ton of CPU work. I don't know why people are talking about power efficiency: in a pro editing context, your CPU will be very, very busy with these IO threads, including and especially in ffmpeg with hardware encoding/decoding. It doesn't look anything like the video game workload this stack is designed for.
6k ProRes streams that consumer cameras record in are still too heavy for modern CPUs to decode in realtime. Not to mention 12k ProRes that professional cameras output.
> If you are sending the frames via some protocol over the network, like WebRTC, it will be touching the CPU anyway. Software encoding of 4K h264 is real time on a single thread on 65w, decade old CPUs, with low latency.
This is valid for a single stream, but the equation changes when you're trying to squeeze the highest # of simultaneous streams into the least amount of CapEx possible. Sure, you still have to transfer it to the CPU cache just before you send it over WebRTC/HTTP/whatever, but you unlock a lot of capacity by using all the rest of the silicon as much as you can. Being able to use a budget/midrange GPU instead of a high-end ultra-high-core-count CPU could make a big difference to a business with the right use-case.
That said, TFA doesn't seem to be targeting that kind of high stream density use-case either. I don't think e.g. Frigate NVR users are going to switch to any of the mentioned technologies in this blog post.
The article explicitly mentions that mainstream codecs like H264 are not the target.
This is for very high bitrate high resolution professional codecs.
I haven't actually looked into this, but it might well be within the realm of possibility. You are generating a frame on the GPU; if you can also encode it there (with NVENC or Vulkan, it doesn't matter), then DMA the frames to the NIC while using the CPU just to process the packet headers, assuming that cannot also be handled by the GPU/NIC.
It's hugely more efficient; if you're on a battery-powered device it could mean hours more of play time. It's pretty insane just how much better it is (I go through a bit of extra effort to make sure it's working for me, since hw decoding isn't included in some distros).
It’s a leftover mindset from the mid-2000s when GPGPU became possible, and additional performance was “unlocked” from an otherwise under-utilized silicon.
This article assumes all GPUs are on a PCIe bus, but some are part of the CPU, so the distance problem is minimal and offloading to the GPU might still be a net positive. "Might" because I haven't tested this.
47 comments:
The main target for this are NLEs like Blender. Performance is a large part of the issue. Most users still just create TIFF files per frame before importing them into a "real editor" like Resolve. Apple may have ASICs for ProRes decoding, and Resolve may be the standard editor that everyone uses.
But this goes beyond what even Apple has, by making it possible to work directly with compressed lossless video on consumer GPUs. You can get hundreds of FPS encoding or decoding 4k 16-bit FFv1 on a 4080, while only reading a few gigabits of video per second, rather than tens and even hundreds of gigabits that SSDs can't keep up. No need to have image degradation when passing intermediate copies between CG programs and editing either.
Yep! Almost finished implementing support in https://ossia.io which is going to become the first open-source cross-platform real-time visuals software to support live scrubbing for VJ use cases, in 4K+ prores files on not that big of a GPU (tested on my laptop 3060) :)
How to feed MilkDrop music visualizations?
(MilkDrop3, projectm-visualizer/presets-cream-of-the-crop, westurner/vizscan for photosensitive epilepsy)
mapmapteam/mapmap does open source multi-projector mapping. How to integrate e.g. mapmap?
BespokeSynth is a C++ and JUCE based patch bay software modular synth with a "node-based UI" and VST3, LV2, AudioUnit audio plugin support. How to feed BespokeSynth audio and possibly someday video? Pipewire and e.g. Helvum?
I don’t understand the spread of thoughts in your post.
The reason to create image sequences is not because you need to send it to other apps, it’s because you preserve quality and safeguard from crashes.
A crash mid video write out can corrupt a lengthy render. With image sequences you only lose the current frame.
People aren’t going to stop using image sequences even if they stayed in the same app.
And I’m not sure why this applies: “this goes beyond” what Apple has, because they do have hardware support for decoding several compressed codecs (also I’ll note that ProRes is also compressed). Other than streaming, when are you going to need that kind of encode performance? Or what other codecs are you expecting will suddenly pop up by not requiring ASICs?
Also how does this remove degradation when going between apps? Are you envisioning this enables Blender to stream to an NLE without first writing a file to disk?
> A crash mid video write out can corrupt a lengthy render. With image sequences you only lose the current frame.
You wouldn't contain FFv1 in MP4, the only format incompetent enough for such corruption.
Apple has an interest against people using codecs that they get no fees from. And Apple don't have a lossless codec. So they don't offer lossless compressed video acceleration.
The idea is that when working as a part of a team, and you get handed a CG render, you can avoid sending a huge .tar or .zip file full of TIFF which you then decompress, or ProRes which loses quality, particularly when in a linear colorspace like ACEScg.
I’m curious what kind of teams you’re working in that you’re handing compressed archives of image sequences? And using tiff vs EXR (unless you mean purely after compositing)?
Another reason to use image sequences is that it’s easier to re-render just a portion of the sequence easily. Granted this can be done with video too, but has higher overhead.
But even then why does the GPU encoding change the fact that you’d send it to another NLE? I just feel like there are a lots of jump in thought process here.
I thought an industry standard was to use proxy files. Open source editor Shotcut use them for example. Create a low resolution + intra-frame only version of the file for very fast scrubbing, make your edits on that, and when done the edit list is applied to the full resolution rushes to produce the output.
Often but not always. Sometimes you’re just working with proxies directly, audio mixing and the like. VFX workflows, finishing will be online full res often.
But even so everybody is often making their own proxies all the time. There’s a lot of passing around of ProRes Proxy or another intermediate quality format and you still make even lighter proxies locally so NLEs and workstation apps will still benefit from this
Proxy files have issues when doing coloring, greenscreens, effects shots. The bit depth, chroma resolution, primaries/transfer/colorspace gets changed. Basically only really usable when editing. With this, you don't need proxy files at all.
A lot of the confusion in this thread feels like it comes from thinking in terms of web streaming rather than the workloads this post is targeting.
The article is pretty explicit that this is not about "make Twitch more efficient" or squeezing a bit more perf out of H.264. It is about mezzanine and archival formats that are already way beyond what a single CPU, even a decade old workstation CPU, handles comfortably in real time: 4K/6K/8K+ 16‑bit, FFv1-style lossless, ProRes RAW, huge DPX sequences, etc. People cutting multi‑camera timelines of that kind of material are already on the wrong side of the perf cliff and are often forced into very specific hardware or vendors.
What Vulkan compute buys you here is not "GPUs good, CPUs bad", it is the ability to keep the entire codec pipeline resident on the GPU once the bitstream is there, using the same device that is already doing color, compositing and FX, and to do it in a portable way. FFmpeg’s model is also important: all the hairy parts stay in software (parsing, threading, error handling), and only the hot pixel crunching is offloaded. That makes this much more maintainable than the usual fragile vendor API route and keeps a clean fallback path when hardware is not available.
From a practical angle, this is less about winning a benchmark over a good CPU encoder for 4K H.264, and more about changing what is feasible on commodity hardware: e.g., scrubbing multiple streams of 6K/8K ProRes or FFv1 on a consumer GPU instead of needing a fat workstation or dailies transcoded to lighter proxies. For people doing archival work or high end finishing on a budget, that is a real qualitative change, not just an incremental efficiency tweak.
I once asked on #ffmpeg@libera if the GPU could be used to encode h264, and apparently yes, but it's not really worth it compared to CPU.
I don't know much about video compression, does that mean that a codec like h264 is not parallelizable?
One of the choke points of all modern video codecs that focus on potential high compression ratios is the arithmetic entropy coding. CABAC for h264 and h265, 16-symbol arithmetic coding for AV1. There is no way to parallelize that AFAIK: the next symbol depends on the previous one. All you can do is a bit of speculative decoding but that doesn’t go very deep. Even when implemented in hardware, the arithmetic decoding is hard to parallelize.
This is especially a choke point when you use these codecs for high quality settings. The prediction and filtering steps later in the decoding pipeline are relatively easy to parallelize.
High throughput CODECs like ProRes don’t use arithmetic coding but a much simpler, table based, coding scheme.
FFv1's range coder has higher complexity than CABAC. The issue is serialization. Mainstream codecs require that the a block depends on previously decoded blocks. Tiles exist, but they're so much larger, and so rarely used, that they may as well not exist.
> the GPU could be used to encode h264, and apparently yes, but it's not really worth it compared to CPU.
It depends on what you're going for. If you're trying to do the absolute highest fidelity for archiving a blu-ray disk, AMD Epyc reigns supreme. That's because you need a lot of flexibility to really dial in the quality settings. Pirates over at PassThePopcorn obsess over minute differences in quality that I absolutely cannot notice with my eyes, and I'm glad they do! Their encodings look gorgeous. This quality can't be achieved with the silicon of hardware-accelerated encoders, and due to driver limitations (not silicon limitations) also cannot be achieved by CUDA cores / execution engines / etc on GPUs.
But if you're okay with a small amount of quality loss, the optimum move for highest # of simultaneous encodes or fastest FPS encoding is to skip the CPU and GPU "general compute" entirely - going with hardware accelerated encoding can get you 8-30 1080p simultaneous encodes on a very cheap intel iGPU using QSV/VAAPI encoding. This means using special sections of silicon whose sole purpose is to perform H264/H265/etc encoding, or cropping / scaling / color adjustments ... the "hardware accelerators" I'm talking about are generally present in the CPU/iGPU/GPU/SOC, but are not general purpose - they can't be used for CUDA/ROCm/etc. Either they're being used for your video pipeline specifically, or they're not being used at all.
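As a concrete sketch of keeping an encode entirely on the fixed-function Video Engine: the command below assumes an ffmpeg build with QSV support and an Intel iGPU, and the flags are illustrative (they vary by build, driver, and whether you go through QSV or VAAPI), so treat it as a starting point rather than a recipe.

```python
# Hypothetical helper that builds an ffmpeg command offloading both decode
# and H.264 encode to Intel Quick Sync, leaving CPU and GPU "general
# compute" idle. Verify flag support with `ffmpeg -encoders` on your build.
def qsv_encode_cmd(src, dst, bitrate="6M"):
    return [
        "ffmpeg",
        "-hwaccel", "qsv",                # decode on the Video Engine too
        "-hwaccel_output_format", "qsv",  # keep frames in GPU memory
        "-i", src,
        "-c:v", "h264_qsv",               # fixed-function H.264 encoder
        "-preset", "veryfast",
        "-b:v", bitrate,
        dst,
    ]
```

You would run it with something like `subprocess.run(qsv_encode_cmd("in.mp4", "out.mp4"))`; the `-hwaccel_output_format qsv` part is what avoids bouncing decoded frames through system memory between the decode and encode blocks.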
I'm doing this now for my startup and we've tuned it so it uses 0% of the CPU and 0% of the Render/3D engine of the iGPU (which is the most "general purpose" section of the GPU, leaving those completely free for ML models) and only utilizing the Video Engine and Video Enhance engines.
For something like Frigate NVR, that's perfect. You can support a large # of cameras on cheap hardware and your encoding/streaming tasks don't load any silicon used for YOLO, other than adding to overall thermal limits.
Video encoding is a very deep topic. You need to have benchmarks, you need to understand not just "CPU vs GPU" ... but down to which parts of the GPU you're using. There's an incredible amount of optimization you can do for your specific task if you take the time to truly understand the systems level of your video pipeline.
> But if you're okay with a small amount of quality loss,
I wouldn't call it a small quality loss. The hardware encoders are tuned for different priorities like live streaming. They have lower quality and/or much higher bitrate.
> If you're trying to do the absolute highest fidelity for archiving a blu-ray disk, AMD Epyc reigns supreme.
You don't need any special CPU to get the highest fidelity as long as you're willing to wait. For archiving purposes any CPU will do, just be prepared to let it run for a long time.
> You don't need any special CPU to get the highest fidelity as long as you're willing to wait.
Correct, but Epyc "reigns supreme" for anyone caring about performance / total FPS throughput, which is relevant for anyone who cares about TFA at all - the purpose of using GPU is to "go faster", and that's what Epyc offers for use cases that also care about extreme fidelity.
> I wouldn't call it a small quality loss. The hardware encoders are tuned for different priorities like live streaming. They have lower quality and/or much higher bitrate.
Sure. It absolutely depends on your use case. We're using it for RDP/KVM-type video, so for us the quality loss is indeed quite "small". Our users care more about "can I read the text clearly?" and less about color-banding. The hardware accelerators do a great job with text clarity so for our use-case it's not much of a noticeable quality loss. I will admit the colors are very noticeably distorted, but the shapes are correct and the contrast/sharpness is good.
Using 0% of the CPU and GPU for encoding is a HUGE win that's totally worth it for us - hardware costs stay super low. Using really old bottom of the barrel CPU's for 30+ simultaneous encodes feels like cheating. Hardware-accelerated encoding also provides another massive win by tangibly reducing latency for our users vs CPU/GPU encoding (it's not just the throughput that's improved, each live frame gets through the pipeline faster too).
I wouldn't use COTS hardware accelerators for archiving Blu-ray videos. Hell, I'm not even aware of any COTS hardware accelerators that support HDR ... they probably exist, but I've never stumbled across one. But hardware-accelerated encoding really is ideal for a lot of other stuff, especially when you care about CapEx at scale. If you're at the scale of Netflix or YouTube, you can get custom silicon made that provides ASIC acceleration for any quality you like. That said, they seem to choose to degrade video quality to save money, to the point that 10-20% of their users hate the quality (myself included; quality is one of the primary reasons I use PassThePopcorn instead of the legal streaming services). But that's a business choice, not a technical limitation of ASIC acceleration, assuming you have the scale to pay for custom silicon. COTS solutions absolutely DO have a noticeable quality loss, as you argue.
> We're using it for RDP/KVM-type video, so for us the quality loss is indeed quite "small". Our users care more about "can I read the text clearly?" and less about color-banding. The hardware accelerators do a great job with text clarity so for our use-case it's not much of a noticeable quality loss.
This is a perfect use case for hardware video acceleration.
The hardware encoder blocks are great for anything live streaming related. The video they produce uses a lot higher bitrate and has lower quality than what you could get with a CPU encoder, but if doing a lot of real-time encodes is important then they deliver.
Common video codecs are often hardware accelerated, and that hardware frequently lives on the CPU side, since plenty of systems without dedicated GPUs still play video: notebooks and smartphones, for example. So in the end it's less about whether a codec is parallelizable and more about whether a GPU implementation beats dedicated hardware, to which the answer should almost always be no.
P.S.: In video decoding, speed is only relevant up to a certain point, namely: "Can I decode the next frame(s) in time to show it/them without stuttering?" Once that has been achieved, other factors such as power draw become more important.
It is my understanding that hardware accelerated video encoders (as in the fixed-function ones built into consumer GPUs) produce a lower quality output than software-based encoders. They're really only there for on-the-fly encoding like streaming to twitch or recording security camera footage. But if you're encoding your precious family memories or backing up your DVD collection, you want to use software encoders. Therefore, if a hypothetical software h264 encoder could be faster on the GPU, it would have value for anyone doing not-on-the-fly encoding of video where they care about the quality.
One source for the software encoder quality claim is the "transcoding" section of this article: https://chipsandcheese.com/i/138977355/transcoding
> ... That being: "Can I decode the next frame(s) in time to show it/them without stuttering".
Except when you are editing video, or rendering output. When you have multiple streams of very high definition input, you definitely need much more than realtime speed decoding of a single video.
And you would want to scrub around the video(s), jumping to any timecode, and get the target frame preferably showing as soon as your monitor refreshes.
I think it's mostly because most CPUs that can drive a GPU already ship with a dedicated H.264 encoder block, which is far more efficient in both energy and speed.
This is literally what the article is about. It answers your questions.
A GPU's job is to take an input at some resolution, transform it, and output it at that resolution. H.264/H.265 (and really, any playback format) needs a fundamentally different workflow: take a group of frames, store the first as a full frame, and then store N-1 diffs describing only which pixels changed between successive frames. Something GPUs are terrible at. You could certainly use the GPU to calculate the full frame diff, but then you still need to send it back to the CPU or dedicated encoding hardware that turns it into an actual concise diff description. At that point, you might as well make the CPU or hardware encoder do the whole job; you're not saving any appreciable time by sending the data over to the GPU first, just to get it back in a form where you still have to go over every pixel afterwards.
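For what it's worth, the keyframe-plus-deltas model described above can be sketched in a few lines (a deliberate toy: real codecs use motion-compensated prediction and a transform, not raw pixel diffs, and all names here are made up):

```python
def diff_frames(prev, cur):
    # Record only the pixels that changed since the previous frame.
    return {i: v for i, (p, v) in enumerate(zip(prev, cur)) if p != v}

def encode_gop(frames):
    key = list(frames[0])  # first frame stored whole ("keyframe")
    deltas = [diff_frames(a, b) for a, b in zip(frames, frames[1:])]
    return key, deltas

def decode_gop(key, deltas):
    # Rebuild every frame by applying deltas in order -- note this is
    # inherently serial: frame N needs frame N-1 fully reconstructed.
    out, cur = [list(key)], list(key)
    for d in deltas:
        for i, v in d.items():
            cur[i] = v
        out.append(list(cur))
    return out
```

Even in this toy form you can see the serial chain the comment is pointing at: nothing about frame N can be finalized before frame N-1 exists.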
Vulkan compute shaders make GPU acceleration practical for intensive codecs like FFv1, ProRes RAW, and DPX. Previous hybrid GPU+CPU approaches suffered from round-trip latency; these pipelines are fully GPU-resident, hands-off. A big deal for editing workflows.
Could this bring an AV1 decoder to low-power hardware without AV1 GPU-accelerated decoding? For my N4020 laptop, and maybe a Raspberry Pi 4 too.
> Most popular codecs were designed decades ago, when video resolutions were far smaller. As resolutions have exploded, those fixed-size minimum units now represent a much smaller fraction of a frame — which means far more of them can be processed in parallel. Modern GPUs have also gained features enabling cross-invocation communication, opening up further optimization opportunities.
One only needs to look at GPU driven rendering and ray tracing in shaders to deduce that shader cores and memory subsystems these days have become flexible enough to do work besides lock-step uniform parallelism where the only difference was the thread ID.
Nobody strives for random access memory read patterns, but the universal popularity of buffer device address and descriptor arrays can be taken somewhat as proof that these indirections are no longer the friction for GPU architectures that they were ten years ago.
At the same time, the languages are no longer as restrictive as they once were. People are recording commands on the GPU. This kind of fiddly serial work is an indication that the ergonomics of CPU programming have less of a relative advantage, and that cuts deeply into the tradeoff costs.
Yeah, Vulkan is shedding most of the abstractions off. Buffers are no longer needed - just device addresses. Shaders don't need to be baked into a pipeline - you can use shader objects. Even images rarely provide any speedup advantages over buffers, since texel cache is no longer separate from memory cache.
GPUs these days have massive caches, often hundreds of megabytes, on top of an already absurd number of registers. A random read will often load a full cacheline into a register and keep it there, reusing it as needed between invocations.
These GPUs are still big SIMD devices at their core though, no?
Yes, but no. No, in the sense that these days GPUs are entirely scalar from the point of view of invocations. Using vectors in shaders is pointless: it will be as fast as scalar variables (dual-issue instruction dispatch on AMD GPUs is an exception).
But yes from the point of view that a collection of invocations all progressing in lockstep get arithmetic done by vector units. GPUs have just gotten really good at hiding what happens with branching paths between invocations.
SIMT is a distinct model, and the ergonomics are wildly different. Instead of contracting a long iteration by packing its steps together to make them "wider", you rotate the iteration across cores.
The critical difference is that SIMD and parallel programming are totally different in terms of ergonomics while SIMT is almost exactly the same as parallel programming. You have to design for SIMD and parallelism separately while SIMT and parallelism are essentially the same skill set.
The fan-in / fan-out and iteration rotation are the key skills for SIMT.
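A rough Python sketch of that rotation (the `WARP` constant and the serial outer loop are stand-ins for what the hardware actually runs concurrently; this is the grid-stride pattern a GPU kernel typically uses):

```python
WARP = 4  # lanes progressing in lockstep (32 on NVIDIA, 32/64 on AMD)

def simt_map(f, data):
    # Rotate a serial loop across lanes: lane L handles elements
    # L, L+WARP, L+2*WARP, ... In SIMD you would instead pack adjacent
    # elements into one wide register and apply a single wide op.
    out = [None] * len(data)
    for lane in range(WARP):                    # conceptually concurrent
        for i in range(lane, len(data), WARP):  # this lane's rotation
            out[i] = f(data[i])
    return out
```

This is why SIMT feels like ordinary parallel programming: `f` is written as plain per-element scalar code, and the "width" is entirely the scheduler's problem.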
Well, the problem with hardware decoding is that it cannot handle all the variations in data corruption, which can result in a hardware crash, sometimes not recoverable with a soft reset of the hardware block.
It is usually more reasonable to work with software decoders for really complex formats, or only to accelerate some heavy parts of the decoding where data corruption is really easy to deal with or benign, or aim for the middle ground: _SIMPLE_ and _VERY CONSERVATIVE_ compute shaders.
Sometimes the software cannot even tell that the hardware has actually "crashed" and is spitting out nonsense data. It gets even worse: some hardware-block hot resets don't actually work and require a power cycle... So any media player able to use hardware decoding must always provide a clear, visible button letting the user switch to full software decoding.
Then there is the next step of "corruption": some streams out there are "wrong", but will decode fine on some specific decoders and not on others, even though all of them claim to follow the same spec.
What a mess.
I hope those compute shaders are not using that abomination GLSL (or the DX one), but are SPIR-V shaders generated from plain and simple C code.
These are all gripes you might have with Vulkan Video. Unlike with Vulkan Video, in compute, bounds checking is the norm: overreading a regular buffer will not result in a GPU hang or crash. If you use pointers it will, but then it's up to you to check whether overreads can happen.
The bitstream reader in FFmpeg for Vulkan Compute codecs is copied from the C code, along with bounds checking. The code which validates whether a block is corrupt or decodable is also taken from the C version. To date, I've never got a GPU hang while using the Compute codecs.
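For illustration, here's a toy bounds-checked bit reader in the spirit of what's described (FFmpeg's real reader is C and far more optimized; everything here is made up for the sketch). The property that matters for GPU safety: an overread raises a clean error instead of walking past the buffer.

```python
class BitReader:
    """Toy bounds-checked bitstream reader, MSB-first."""

    def __init__(self, data: bytes):
        self.data, self.pos, self.nbits = data, 0, len(data) * 8

    def get_bits(self, n: int) -> int:
        # The bounds check: refuse to read past the end of the buffer,
        # analogous to a compute shader refusing to index past a bound
        # buffer's size instead of hanging the GPU.
        if self.pos + n > self.nbits:
            raise EOFError("bitstream overread")
        v = 0
        for _ in range(n):
            byte = self.data[self.pos >> 3]
            v = (v << 1) | ((byte >> (7 - (self.pos & 7))) & 1)
            self.pos += 1
        return v
```

A corrupt stream then surfaces as a caught error in one block rather than undefined behavior for the whole device.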
I wrote the Vulkan ProRes backend. The bitstream decoder was implemented from scratch, for a number of reasons.
First, the original code was reverse-engineered before Apple published an SMPTE document describing the bitstream syntax. Second, I tried my best at optimizing the code for GPU hardware. And finally, I wanted to take the learning opportunity :)
And to answer the parent's question, the shaders are written in pure GLSL. For instance, this is the ProRes bitstream decoder in question: https://code.ffmpeg.org/FFmpeg/FFmpeg/src/branch/master/liba...
What is the use case? Okay, ultra-low-latency streaming, that is good. But if you are sending the frames via some protocol over the network, like WebRTC, it will be touching the CPU anyway. Software encoding of 4K H.264 is realtime on a single thread on 65 W, decade-old CPUs, with low latency. The CPU encoders are much better quality and more flexible, so it's very difficult to justify the level of complexity needed for hardware video encoding. There's absolutely no need for it for TV streaming, for example, but people who have no need for it keep obsessing over it.
IMO vendors should stop reinventing hardware video encoding and instead assign the programmer time to making libwebrtc and libvpx better suit their particular use case.
The article explains it. This is not for streaming over the web, but for editing professional grade video on consumer hardware.
DaVinci Resolve is the only commercial NLE with any kind of Vulkan support, and it is experimental.
ProRes decodes faster than realtime, single-threaded, on a decade-old CPU too.
It doesn't make sense. It's much different from, say, a video game, where a texture is loaded once into VRAM and then yes, all the work is done on the GPU. A video has CPU IO every frame; you are still doing a ton of CPU work. I don't know why people are talking about power efficiency: in a pro editing context, your CPU will be very, very busy with these IO threads, including and especially in ffmpeg with hardware encoding/decoding. It doesn't look anything like the video game workload this stack is designed for.
6k ProRes streams that consumer cameras record in are still too heavy for modern CPUs to decode in realtime. Not to mention 12k ProRes that professional cameras output.
That reduces power consumption, so it should improve battery life on laptops and help the environment a little.
> If you are sending the frames via some protocol over the network, like WebRTC, it will be touching the CPU anyway. Software encoding of 4K h264 is real time on a single thread on 65w, decade old CPUs, with low latency.
This is valid for a single stream, but the equation changes when you're trying to squeeze the highest # of simultaneous streams into the least amount of CapEx possible. Sure, you still have to transfer it to the CPU cache just before you send it over WebRTC/HTTP/whatever, but you unlock a lot of capacity by using all the rest of the silicon as much as you can. Being able to use a budget/midrange GPU instead of a high-end ultra-high-core-count CPU could make a big difference to a business with the right use-case.
That said, TFA doesn't seem to be targeting that kind of high stream density use-case either. I don't think e.g. Frigate NVR users are going to switch to any of the mentioned technologies in this blog post.
The article explicitly mentions that mainstream codecs like H264 are not the target. This is for very high bitrate high resolution professional codecs.
I'm not entirely sure that this is true.
I haven't actually looked into this, but it may well be within the realm of possibility. You're generating a frame on the GPU; if you can also encode it there (with NVENC or Vulkan, it doesn't matter), then DMA the result to the NIC while the CPU only processes the packet headers, assuming that can't also be handled by the GPU/NIC.
You can also often DMA video coming in through peripherals to get it straight into the GPU, skipping the CPU.
It will be more energy efficient. And the CPU is free to jit half a gig of javascript in the mean time.
It's hugely more efficient; if you're on a battery-powered device it could mean hours more of play time. It's pretty insane just how much better it is (I go through a bit of extra effort to make sure it's working for me, since hw decoding isn't included in some distros).
If the frames already live on the GPU, pulling them over PCIe just to feed a CPU encoder is wasted bandwidth and latency.
It’s a leftover mindset from the mid-2000s when GPGPU became possible, and additional performance was “unlocked” from an otherwise under-utilized silicon.
This article assumes all GPUs are on a PCIe bus, but some are part of the CPU, so the distance problem is minimal and offloading to the GPU might still be a net positive. "Might" because I haven't tested this.