i.MX515 Project
Profiling the i.MX515 GL stack

in category Graphics & 3D
proposed by blu on 30th August 2010 (accepted on 1st September 2010)
  Draw calls: the cost of living
posted by blu on 12th October 2010


Usually the first thing people try to keep in check in any draw scenario is the number of draw calls. And for a good reason - draw calls are the dirty, CPU-side plumbing of every beautiful GPU pipeline. They are the 'CPU wildcard' factor in any graphics pipeline timing statistic. That is even more true for small embedded systems, where CPU performance is not abundant.

So how expensive exactly are draw calls on our i.MX515, with the GL ES software stack dated 2010.07.11 and kernel 2.6.31.14 from gitorious? We are about to find out.

For that purpose we need a sound methodology. Let's devise one:

1. Choose what kind of GPU work we are going to measure - per-pixel, per-vertex, or something else. We here choose per-pixel, for reasons that will become obvious below.
2. Draw something of easily-calculable pixel area. We choose a full-screen rectangle.
3. Draw the same thing in a designated 'discard' mode, where the GPU pixel work is reduced to negligible levels or to nothing. For instance, cull the original primitive by inverting the polygon winding - that qualifies as 'nothing', and it is what we will do.

The time difference between (2) and (3) is then our actual GPU workload. Everything else outside of this time is cost likely associated with the CPU's dirty job of carrying out our draw calls.

So, let's do a test run first. Let's start with a relatively dense vertex grid for our pixel rectangle, a screen overdraw factor of 1 (i.e. we cover the screen active area just once), and use a single draw call per frame. Additionally, for purity of the test, we will use a 'pass-through' shader (pass vertex coords unmodified, output a fixed color), and a frame-skip ratio of 1:255 (eglSwapBuffers vs. glFinish). So:
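The 'pass-through' shader pair mentioned above might look like the following in GL ES 2.0 terms. The exact sources used in the test are not shown in the original, so treat this as an illustration; the attribute name `a_position` is an assumption.

```c
/* Minimal pass-through GLSL ES shader pair: vertex coords go through
 * unmodified, the fragment output is a fixed color. */
static const char *const vert_src =
    "attribute vec4 a_position;\n"
    "void main()\n"
    "{\n"
    "    gl_Position = a_position;\n"
    "}\n";

static const char *const frag_src =
    "void main()\n"
    "{\n"
    "    gl_FragColor = vec4(1.0, 0.0, 0.0, 1.0);\n"
    "}\n";
```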

  • viewport of 512x512 pixels
  • viewport-spanning grid mesh (we choose indexed triangle list)

    • number of vertices: 2145
    • number of indices: 12288 (that's 4,096 triangles - two per grid cell, on a grid of 32x64)

  • elapsed time for 1000 frames, drawing ON: 6.4s
  • elapsed time for 1000 frames, drawing OFF (culling active): 4.5s
  • effective GPU pixel-munching time: 1.9s
  • effective fillrate: 137,970,526 pixels/s
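For reference, the vertex and index counts above follow directly from the grid dimensions. A quick sanity check (a sketch with hypothetical helper names):

```c
/* Vertex count for a grid of w x h cells: one vertex per grid-line crossing. */
static int grid_vertices(int w, int h)
{
    return (w + 1) * (h + 1);
}

/* Index count for an indexed triangle list: 2 triangles x 3 indices per cell. */
static int grid_indices(int w, int h)
{
    return w * h * 6;
}

/* 32x64 grid: grid_vertices(32, 64) == 2145, grid_indices(32, 64) == 12288 */
/* 2x2 grid:   grid_vertices(2, 2)   == 9,    grid_indices(2, 2)   == 24    */
```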


Hmm... We currently waste some precious unified-shader time on vertex jiggling. Let's knock that mesh complexity down a bit and switch to a grid of 2x2 cells (8 triangles).

  • number of vertices: 9
  • number of indices: 24
  • elapsed time for 1000 frames, drawing ON: 5.43s
  • elapsed time for 1000 frames, drawing OFF (culling active): 3.81s
  • effective GPU pixel-munching time: 1.62s
  • effective fillrate: 161,817,284 pixels/s
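The effective fillrate figures above are computed as total pixels drawn divided by the GPU-only time. A sketch of that arithmetic (`fillrate` is a hypothetical helper):

```c
/* Effective fillrate in pixels/s: total pixels drawn over the GPU-only time
 * (elapsed time with drawing ON minus elapsed time with drawing OFF). */
static long long fillrate(int width, int height, int frames,
                          double elapsed_on, double elapsed_off)
{
    long long pixels = (long long)width * height * frames;
    return (long long)(pixels / (elapsed_on - elapsed_off));
}

/* 512x512, 1000 frames: fillrate(512, 512, 1000, 6.40, 4.50) ~= 138.0 Mpix/s */
/* 512x512, 1000 frames: fillrate(512, 512, 1000, 5.43, 3.81) ~= 161.8 Mpix/s */
```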


That's not so far from the theoretical 166 Mpix/s the i.MX515 is rated at. Taking into account that the timing above is app-based (through the 'time' utility) and that the app does some house-keeping, etc., I think we can assume our result is within a reasonable error margin of the theoretical maximum. Also, by now it should be clear why we chose to measure pixel workloads - because we can verify them against the specs.

So, at this stage we have a proven way to separate the GPU workload time from the time of the other workloads in our drawing pipeline. Let's track down that CPU workload.

Clearly, in the case of a-few-'dumb'-pixels-worth of GPU work, the draw-call costs are not pretty - that's 1.62s of GPU work vs. 3.81s of 'non-GPU' work, or 0.425:1 in favor of the CPU. Ouch. We really need to increase the amount of work we pass down per draw call. The simplest way to do that is, yep, you guessed it - increasing the resolution. So,

  • viewport of 1024x768 pixels
  • viewport-spanning grid mesh (same indexed triangle list)

    • number of vertices: 9
    • number of indices: 24

  • elapsed time for 1000 frames, drawing ON: 14.03s
  • elapsed time for 1000 frames, drawing OFF (culling active): 9.26s
  • effective GPU pixel-munching time: 4.77s
  • effective fillrate: 164,870,440 pixels/s


Now the ratio is 0.515:1 - slightly better, but still not good. Let's see what the issue might be. Let's do more than one draw call per frame, as that will give us some idea whether something else is taking place in our frame.

  • viewport of 1024x768 pixels
  • viewport-spanning grid mesh (same indexed triangle list), drawn 4 times (i.e. 4 draw calls per frame)

    • number of vertices: 9
    • number of indices: 24

  • elapsed time for 1000 frames, drawing ON: 28.37s
  • elapsed time for 1000 frames, drawing OFF (culling active): 9.39s
  • effective GPU pixel-munching time: 18.98s
  • effective fillrate: 165,739,093 pixels/s


Aha! The picture has changed drastically - the GPU-pixel-workload vs. other-stuff ratio is now 2.021:1 in favor of the GPU! And that happened after we increased the number of draw calls from 1 to 4 per frame.
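The GPU-vs-other ratios quoted in this post can all be reproduced from the measured timings the same way (a sketch; `gpu_to_other_ratio` is a hypothetical helper):

```c
/* Ratio of GPU pixel work to everything else in the frame:
 * (elapsed_on - elapsed_off) : elapsed_off */
static double gpu_to_other_ratio(double elapsed_on, double elapsed_off)
{
    return (elapsed_on - elapsed_off) / elapsed_off;
}

/* 512x512,  1 draw/frame:  gpu_to_other_ratio(5.43, 3.81)  ~= 0.425 */
/* 1024x768, 1 draw/frame:  gpu_to_other_ratio(14.03, 9.26) ~= 0.515 */
/* 1024x768, 4 draws/frame: gpu_to_other_ratio(28.37, 9.39) ~= 2.021 */
```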

That goes to show that we have other expenditures in our frame. It indicates that we should not bluntly decrease the number of our draw calls, but instead find the 'sweet spot' where a light GPU workload can be spread across a few draw calls and still come 'for free', or pretty cheap, in the timing of our frame. Unfortunately, here we leave the area of synthetic tests and hypothesizing, and step into real-world workloads. In other words, our small investigation ends here. We may return to it in the future, perhaps with some real-world data to analyze.