Saturday, February 9, 2013

Charlie & The Compute-Shader factory #2

If you wondered what Compute Shaders are, and read the first post, your question probably still isn't answered. Parallel computing, GPUs, Oompa Loompas, what else? Yet it's important to understand those fundamentals a bit. Sometimes you gotta know the whys before doing something. In the programming world there are too many techniques and ways to accomplish things, so before wasting time on yet another technique like these Compute Shaders, it's pretty useful to sort out why (or why not) you may need them. I suppose you don't just buy a thermo-nuclear particle accelerator without really knowing what it is either.

I'll be honest with you: so far, zero Compute Shaders are part of the Tower22 engine. I made several, but either my outdated hardware didn't support some specific feature, or I could replace them with other (simpler) methods. Like Geometry Shaders, CS (Compute Shaders) aren't exactly required for each and every situation. Most rendering works just fine with the existing OpenGL / DirectX Vertex/Geometry/Fragment shaders, so I wouldn't suddenly swap to another technique if it's not really needed. Certainly not while CS are still a bit immature and (slightly) harder to write. Old-fashioned shaders are easier to debug, and might even run a bit faster.


That said, now let's focus on Compute Shaders, and in particular on their advantages over traditional shaders. Yes, we're getting a bit more technical, so you can skip this dance if you don't give a damn about programming. Like most other programs, a CS takes input such as numeric parameters or buffers, and writes its output back into buffers. A CS doesn't draw polygons or anything (remember I said a CS doesn't have anything to do with rendering); it just fills buffers with numbers. That's it. Writing buffers is not exactly the definition of Cool, but realize that common shaders basically do the same, with the exception that those shaders are tightly integrated into the graphics rendering pipeline (to save you work, and to protect you from screwing things up).

These in- or output buffers are usually:
* arrays of numbers or vectors (like a vertex array)
* arrays of structs (multiple attributes (per vertex))
* 1D, 2D or 3D textures (OpenGL / DirectX)
Those structs or numbers could be anything, but in a 3D context it makes sense to work with OpenGL or DirectX buffers such as VBOs or textures, so the output is stored on the GPU in a way OpenGL or DirectX can proceed with. To give a practical example, you could do Vertex Skinning (animating with skeleton bones) in a Compute Shader:
- Make a VBO containing all vertices, texcoords, normals, weights and bone indices in OpenGL
- Let the CPU update a skeleton (= an array of bone matrices)
- Pass the VBO and Skeleton arrays to a Compute Shader
- Let the CS calculate the updated vertex positions by multiplying them with the bone matrices
- Let the CS stream out the results to (another) VBO
- Later on, render the updated VBO (the one with the end result vertex positions / normals)

For those who did animations before: you can do the same with Transform Feedback (OpenGL) or Stream Output (DirectX), so why use a Compute Shader instead? Well, you don't have to. I would actually stick with OpenGL or DirectX here. However, there are scenarios where a CS fits better, as they are more flexible. Down below I'll list the main features of CS that differ from common shaders. But first, good to know: you can implement CS in your app by using either OpenCL (by Khronos, the group also behind OpenGL) or nVidia's CUDA. Possibly there are more APIs, but these two seem to be the best known ones. So far I only tried OpenCL, so let's focus on that one; I guess CUDA isn't much different anyway. Like OpenGL, OpenCL comes as a DLL with a bunch of functions to get system information, compile kernels, make buffers, share interop buffers/textures between OpenGL & OpenCL, and to launch kernels.

For OpenGL / Delphi fans, they didn't forget about us; several libraries and examples were made:
http://code.google.com/p/delphi-opencl/
http://download.cnet.com/OpenCL-for-Borland-Delphi/3000-2070_4-11881405.html
http://www.brothersoft.com/opencl-for-borland-delphi-449951.html
Also, make sure to print these papers and use them as wallpaper:
OpenCL function card

Some very basic code examples
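For instance, adding two arrays: the "hello world" of OpenCL. A rough sketch in C (OpenCL 1.x host API; all error checking stripped, and you need an OpenCL SDK and driver to actually run this):

```c
/* The kernel source, a plain string. Each work-item handles one element. */
const char *src =
    "__kernel void vec_add(__global const float *a,   \n"
    "                      __global const float *b,   \n"
    "                      __global       float *out) \n"
    "{                                                \n"
    "    int i = get_global_id(0);                    \n"
    "    out[i] = a[i] + b[i];                        \n"
    "}                                                \n";

/* Host side: get a device, build the kernel, make buffers, launch. */
cl_platform_id   platform;  clGetPlatformIDs(1, &platform, NULL);
cl_device_id     device;    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
cl_context       ctx   = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(prog, "vec_add", NULL);

/* n floats per buffer; bufA/bufB get filled via clEnqueueWriteBuffer (not shown). */
size_t n = 10000, bytes = n * sizeof(float);
cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, NULL);
cl_mem bufB = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, NULL);
cl_mem bufO = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufB);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufO);

/* Launch n work-items, then read the result back to CPU memory. */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
clEnqueueReadBuffer(queue, bufO, CL_TRUE, 0, bytes, hostResult, 0, NULL, NULL);
```

Notice there's no rendering context, no polygons, no render targets; just buffers in, buffers out.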


OpenCL super powers
===============================================
* Simplicity
Can't speak for DirectX, but in GL it often takes quite some steps to set up a buffer, create a rendering context, get a shader doing something in a buffer, and so on. The OpenCL API is minimal. Once you've written the basic setup steps (by looking at an example) to support CS in your application, it's really simple to use them anywhere, anytime.


* More flexible shader coding
Although the tooling is immature and the debugger (at least for OpenCL) is shitty, the C-like code allows more tricks. Where common shaders are still quite strict with dynamic loops or pointers, a CS feels more like natural C. The disadvantage is that a lot of handy functions and syntax you're used to are missing or different in OpenCL, so your first writing attempts are probably going to be frustrating.


* Let the CPU and GPU work in parallel
This already happens with common shaders too, though honestly I'm not sure exactly how those two synchronize. Anyway, with a CS you can simply launch a task on the GPU (or another device), continue doing other stuff on the CPU, and check later whether it's done. As said, OpenCL keeps it simple.
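In OpenCL host code that "launch and check later" pattern looks roughly like this (a sketch; queue and kernel set up elsewhere, no error checks):

```c
cl_event done;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, &done);
clFlush(queue);                    /* push the work to the device, don't wait */

/* ... CPU continues with game logic, physics, whatever ... */

cl_int status;
clGetEventInfo(done, CL_EVENT_COMMAND_EXECUTION_STATUS,
               sizeof(status), &status, NULL);
if (status == CL_COMPLETE) { /* results ready, safe to read back */ }

/* Or just block until everything queued so far has finished: */
clFinish(queue);
```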


* Array indexing or Pointers
A powerful feature is that you can access any slot in an array via indexing or pointers (warning: scattered random access is slow, reading neighbours in a nice coalesced pattern is fast!). In common shaders this is not possible; while processing vertex[123], you can't peek into vertex[94] for some info. You're forced to use textures or UBOs for data lookup then. Advanced data structures such as octrees can be accessed much more easily. If complex data access is needed, this is one of the main reasons you may want a CS.


* A CS can also write to its input buffer
In a common shader you always need 2 buffers: one input, one output. By "ping-ponging" you can swap the buffers each cycle:
cycle1: input from buf1 , output to buf2
cycle2: input from buf2 , output to buf1
...
This costs double the memory, as you need two buffers. With the help of ReadWrite buffers in a CS, you don't have this problem. ReadWrite textures are pretty slow, or not even supported on all hardware, though.


* CS in- and output don't have to be GPU hardware buffers
You can stream the results directly back to the CPU if you like. OpenGL or DX can do that as well, but it's A: slow, and B: it requires crazy tricks like reading back pixels from a texture to push data back and forth between the CPU and GPU. It's probably just as slow with OpenCL, but at least it feels more natural, as it can be coded easily.


* Shared variables
In a common shader you can't declare a global variable like "myCounter" that gets incremented by each element being processed. But in a CS, you actually can. This comes in handy if you want to share the same data within a whole group of elements, count stuff, or filter out min/max values. I'll show an example later on (Tiled Deferred Rendering).
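A sketch of what that looks like in OpenCL C: counting "bright" pixels per work-group via a shared local counter. All names are made up; note that atomic_inc on local memory needs OpenCL 1.1, or the cl_khr_local_int32_base_atomics extension on 1.0 hardware.

```c
__kernel void count_bright(__global const float *luminance,
                           __global int *countPerGroup)
{
    __local int counter;                 /* shared by the whole work-group */

    if (get_local_id(0) == 0)
        counter = 0;
    barrier(CLK_LOCAL_MEM_FENCE);        /* everyone sees counter == 0 */

    if (luminance[get_global_id(0)] > 0.8f)
        atomic_inc(&counter);            /* safe concurrent increment */

    barrier(CLK_LOCAL_MEM_FENCE);        /* wait until everyone has voted */
    if (get_local_id(0) == 0)
        countPerGroup[get_group_id(0)] = counter;
}
```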


* Threading control / Synchronizing
Now this is the Nutty Professor part, and the reason why you have to know how Oompa Loompas roll. First, it's up to you how you launch a CS. If you have 10,000 elements in an input array, you could for example run 20 work-groups, each taking care of 500 elements (the hardware then chops those groups further into Warps or Wavefronts).

Since standard Vertex/Geom/Fragment shaders cannot access their neighbours in their buffers, each "work-item" runs isolated from the big bad world outside. So you don't have to care about synchronizing, mutexes, locks, semaphores, or whatsoever. But as shown above, in a CS you actually can bother the neighbours, or shared variables, via local or global memory. And not without risk. Your neighbour might attack you with a baseball bat if you interrupt him at the wrong time. Same troubles in CS land: if you read or write data being processed by another work-item, there is no guarantee that element has already been finished. Maybe it wasn't handled yet, or worse, maybe you caught it while it was being written. That's when you get the baseball bat in your face: corrupted values, tears and complete chaos.



Synchronizing the Multi-madness
===============================================
Fortunately, OpenCL provides some instructions to prevent this drama. But first of all, try to design your shaders in such a way that you don't have to read outside your comfort zone. Keep shared global variables, and access to other elements, to a minimum. You will learn that sometimes it's actually better to run a CS twice instead of screwing around with mutexes to fit everything in a single program. And otherwise:

* Barriers
You can create a "waiting point" in your shader that ensures all work-items within a work-group have been executed up to that point. Compare it with walking with your family: every 10 seconds you are a hundred yards ahead of grandpa, so you stop and wait till they catch up. Why would one task finish later than another? Within a single Warp the tasks do run in lockstep, but a work-group usually spans several Warps, and those can drift apart through scheduling, memory latency or branching. Anyhow, see here, the Barrier instruction:
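A classic use is a parallel max-reduction per work-group; without the barriers, a work-item could read tmp slots its neighbours haven't written yet. An OpenCL C sketch, with made-up names:

```c
__kernel void reduce_max(__global const float *input,
                         __global float *maxPerGroup,
                         __local  float *tmp)
{
    int lid = get_local_id(0);
    tmp[lid] = input[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);        /* all values written before anyone reads */

    for (int half = get_local_size(0) / 2; half > 0; half /= 2) {
        if (lid < half)
            tmp[lid] = fmax(tmp[lid], tmp[lid + half]);
        barrier(CLK_LOCAL_MEM_FENCE);    /* finish this round before the next one */
    }
    if (lid == 0)
        maxPerGroup[get_group_id(0)] = tmp[0];
}
```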


* Semaphores
This is to ensure you don't execute a specific block of code (usually involving reads or writes) while another work-item in the group has entered the same block. If so, wait until the other one is done first. Compare it to a ticket window: at some point, people have to line up and pass one by one. This is tricky shit though; do it wrong and your video card driver may hang & time out!
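OpenCL has no real semaphore primitive, but the usual hack is a spin-lock built from atomic_cmpxchg on a global int. A sketch, and a dangerous one: on some hardware, work-items inside the same Warp can deadlock on this, so consider yourself warned.

```c
/* lock points at a single int in __global memory, initialized to 0.
   WARNING: spinning inside a Warp can hang the driver on some GPUs. */
void acquire(volatile __global int *lock)
{
    while (atomic_cmpxchg(lock, 0, 1) != 0)
        ;                              /* spin until we grabbed the lock */
}

void release(volatile __global int *lock)
{
    atomic_xchg(lock, 0);              /* open the ticket window again */
}
```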

* Atomic operations
Sounds dangerous. OpenCL provides a couple of atomic operations (add, decrement, min, max, xor, ...). These do the same as their common equivalents, except that an "atomic write" ensures it won't conflict with another operation accessing the same variable. Sort of a built-in semaphore. Keep in mind that some older hardware (like my GPU) may not support atomic operations yet! You need extensions to enable them in OpenCL.



The next and last post will show a practical example with several techniques that wouldn't be possible (or only with stinky workarounds) in traditional shaders, as well as some of the synchronizing tricks explained above.
