07-26-2010
I stumbled upon a very interesting post on concurrency. The author criticizes the transactional memory approach to simplifying multi-threaded programming, compared to classical lock-based multi-threading models. Indeed, it is easy for a novice programmer to create a mess when using locks, the typical feared result being a system that exhibits random crashes and deadlocks that are very difficult to diagnose.
What caught my eye is the focus of the author on a single reason why the TM model is flawed: complex software systems do not merely operate on memory.
This strikes me as important because I've observed a similar issue with garbage collection systems and languages. Like multi-threaded programming, dealing with memory is difficult and it is easy for novice programmers to end up with an unmanageable mess of dangling pointers and memory leaks. And the solution? Simply don't worry about memory -- use as much as you need whenever you need it, the garbage collector will do its magic and it'll just work.
The problem with garbage collection systems is exactly the same: complex software systems do not merely operate on memory. Indeed, garbage collectors themselves do not manage memory. They manage object lifetimes.
In the twisted mind of a Java developer we shouldn't worry about when the object will be collected: it can lurk around until an opportune moment arrives when the system has nothing better to do, and then it'll take care of it; and because the only type of resource most objects acquire is memory, most of the time we don't care when the object will be collected.
Yet, in the real world we must deal with all kinds of resources: files, sockets, locks, etc. We can't be sloppy with their acquisition and release. Their lifetime is just as hard to manage as memory and it is just as easy for a novice programmer to create a mess.
Solve that problem with finally and you're back to square one.
07-06-2010
Recently, here at Reverge we have been adding DirectX 10/11 support to some of our internal libraries. In the process we discovered a couple of problems in the new API design.
In DirectX 9 there were three types of "surfaces":
With this system, pure textures can be used by the CPU to efficiently write data to video memory for the GPU to read, and render targets can be used to efficiently read data that was written by the GPU.
Textures that are also render targets cannot be accessed by the CPU at all, but that's fine because if the CPU needs to read the result of a GPU pass, that pass can render into a render target (that is not a texture) which the CPU can read.
In DirectX 10/11 render targets that are not textures are no longer supported, so while the CPU can efficiently write data for the GPU to read, there is no efficient way for the application to read data written by the GPU. Instead, it is required to copy the data from a render target texture into another (non-render-target) texture which can then be mapped by the CPU for reading.
In the good old DirectX 9 days, if the application needed to write data into a texture, it called the LockRect function to get a pointer. The data was written to the memory that pointer referred to, and then UnlockRect was called.
The nice thing about this API is that it does not specify what kind of memory the returned pointer points to. Depending on texture creation flags, the requested access (read/write/write-discard), the current state of the Direct3D device, and the capabilities of the graphics card, the driver might choose to map the texture video memory directly, or it might map a separate temporary video memory buffer, or it might even return a pointer to a system memory buffer which is copied to video memory when UnlockRect is called.
Instead, in DirectX 10 and 11 there seems to be only one choice: video memory is mapped directly if possible; if not, we get an error.
The new semantics are in fact in accord with a major overall architectural shift introduced with DirectX 10. It calls for removing the so-called "API magic": undocumented internal behavior designed to hide inefficiencies and/or lack of capabilities in the hardware. The rationale for avoiding API magic is that it makes application performance inconsistent across different hardware and driver versions.
It is true that depending on the actual behavior of LockRect, the application performance can vary significantly. That is indeed a downside, but I wonder if the designers of DirectX 10 and 11 are old enough to remember DirectX 3. It also lacked any API magic. The idea was that the application would create execute buffers that would reside on the graphics card, removing all abstraction and maximizing performance. Yet, because the interface was designed with hardware engineers in mind, it was basically unusable. The more abstract DirectX 5 interface was much easier to use yet (surprise!) didn't lead to lower frame rates.
Similarly, the more abstract interface of LockRect in DirectX 9 is more practical than the less sophisticated DirectX 10/11 Map behavior. Sure, in the simplest use cases the difference is negligible, but then again in the simplest cases LockRect wouldn't use API magic either, so the user would not be penalized by it. It is the non-trivial, less frequent yet performance-critical uses that need a more abstract interface.
For example, if the texture is created and locked with the correct flags, DirectX 9 could let the application download new data to a texture even while it is currently used by a shader, which isn't possible in DirectX 10/11. How does that work in DirectX 9? If write-discard locking is requested, the driver could map a separate temporary video memory buffer for access by the application. As soon as the device is done using the texture, the new memory buffer can be associated with the texture at the cost of a pointer swap.
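To make the pointer swap concrete, here is a toy model in plain C++. It is only a sketch under loud assumptions: heap buffers stand in for video memory, and toy_texture, lock_discard, unlock and bind_for_draw are hypothetical names, not Direct3D calls. Unlike the driver described above, which swaps once the device is done with the texture, the toy swaps at unlock and lets any in-flight "draw" keep the old buffer alive.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Toy model only: std::vector buffers stand in for video memory.
class toy_texture {
    typedef std::vector<unsigned char> buffer;
    std::shared_ptr<buffer> current_;  // what the "GPU" reads
    std::shared_ptr<buffer> pending_;  // what the CPU writes while locked
public:
    explicit toy_texture(std::size_t bytes)
        : current_(std::make_shared<buffer>(bytes)) {}

    // Analogous to a write-discard lock: hand out a fresh, zeroed buffer
    // instead of the one that may still be in use.
    unsigned char* lock_discard() {
        pending_ = std::make_shared<buffer>(current_->size());
        return pending_->data();
    }

    // Analogous to unlock: the new contents become current at the cost
    // of a pointer swap; no pixel data is copied.
    void unlock() {
        assert(pending_ && "unlock without a matching lock_discard");
        current_ = std::move(pending_);
        pending_.reset();
    }

    // "GPU" side: a draw keeps the buffer it started with alive.
    std::shared_ptr<buffer const> bind_for_draw() const { return current_; }
};
```

A draw that called bind_for_draw before the lock continues to see the old contents, while any later bind sees the new ones.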
It can be argued that this is done behind the user's back: if that's what the application wanted to do, it could do it itself, right? The problem is that this type of system is difficult to program. Besides, even if the user has the skill and resources needed to provide a high-quality implementation, they can't possibly compete with a driver programmer who is not only a specialist but also has the advantage of writing hardware-specific code and enjoys much lower-level access to the graphics card.
06-01-2010
One problem I've always had with MSDN is that whenever I search for something, it finds stuff that I really don't care about. Typically, I'd look for a C function from the Windows API or DirectX, and it'd find me some Visual Basic function which resembles the name of the C function.
I thought, OK, maybe few people use C any more. So I'd just shrug and try another search.
However, it seems that the problem is deeper than that. Just now I looked up D3D11_PRIMITIVE_TOPOLOGY on MSDN and I got the following result:
Obviously this search query is very specific. There is no way for D3D10_PRIMITIVE_TOPOLOGY to be a better match than the D3D11_PRIMITIVE_TOPOLOGY page that I know exists on MSDN (click here to see current MSDN search results for this query).
Is it that there is something wrong with my search? Google doesn't think so: with the same query, the D3D11_PRIMITIVE_TOPOLOGY page is at the top of its search results.
Is this a problem with MSDN itself? Maybe it's using old search technology that Microsoft hasn't been able to update yet? Surely, Bing can do better? Not really, here is what Bing found:
(Click here for current Bing search results for this query.)
So there you have it: Google seems better at searching the MSDN pages than Microsoft. :)
05-18-2010
Throughout my career I've been working with proprietary graphics APIs, but Apple platforms use OpenGL and since it is also available on Windows and on other OSes, it seems the right choice for graphics today.
It also feels good to support an open standard like that. I love open standards. For example I find it ridiculous that Windows is not POSIX; even though it does serve their corporate interests, Apple demonstrated that the same corporate interests can be served with a POSIX-compliant OS.
And then OpenGL reared its ugly head.
How has OpenGL survived for so many years? By extensions.
Every time a graphics chip manufacturer implements a cool new feature in hardware, OpenGL allows them to provide whatever functions they feel like in the form of extensions. For example, if ATI needed to implement a cool new function, they simply document it and compile it in the video driver shared libraries/DLLs, using a function name like glCoolNewFeatureATI.
If two or more graphics chip manufacturers agree on the same API for a cool new feature, then the OpenGL standard allows them to use the suffix EXT: glCoolNewFeatureEXT. That way, programs can look for and link to that function without having to know what graphics hardware they're running on.
Then there is the OpenGL Architecture Review Board which looks at various extensions that appear and if they like an EXT function, it gets the ARB blessing and vendors are allowed to use the suffix ARB with the name, so we get glCoolNewFeatureARB.
Finally, most (all?) ARB functions are made part of the next revision of the OpenGL standard, at which point all suffixes are dropped and we're left with glCoolNewFeature added to the rest of the OpenGL functions.
That's how committees work: because nobody is in charge, nobody can take the responsibility for making the correct architectural decision. Because nobody can take that responsibility, it is necessary to have a cumbersome system of checks and balances to minimize the possibility of breaking the standard.
But the result is a mess anyway. Users end up with potentially 4 different versions for many OpenGL functions. Which one should they call?
You might think that a reasonable approach is to look up the suffix-less version first, if that fails look up the ARB version, and if that fails look up the EXT version, and finally if that also fails just announce to the user that their OpenGL driver sucks and exit. But that's easier said than done.
For starters, some of these functions take enums as parameters. The names of the enums also feature the corresponding suffix. For example, a possible value for the first parameter of glBindBufferARB is GL_ARRAY_BUFFER_ARB, but if you're calling glBindBufferEXT you should be using GL_ARRAY_BUFFER_EXT instead, and if you're calling simply glBindBuffer then you'd use GL_ARRAY_BUFFER.
This arrangement makes it difficult for the graphics programmer to abstract the logic for selecting the correct function: it isn't sufficient to simply store a function pointer or something similar. The only viable solution is to create a wrapper layer over OpenGL with its own naming conventions and its own enums, which then get translated depending on which of the 3 or 4 variants of the function should be called on the target platform.
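A minimal sketch of such a wrapper layer, assuming nothing about any real OpenGL loader: gl_loader is a hypothetical callback that maps a function name to a pointer (or null), and resolve, gl_variant, buffer_target and target_name are names made up for illustration. The token names are returned as strings only to keep the sketch free of OpenGL headers; a real wrapper would return the enum values.

```cpp
#include <cassert>
#include <functional>
#include <string>

typedef void (*gl_proc)();
// Hypothetical loader: returns the function pointer for a name, or null.
typedef std::function<gl_proc(std::string const&)> gl_loader;

enum class gl_variant { core, arb, ext, missing };

struct resolved_proc {
    gl_proc proc;
    gl_variant variant;
};

// Tries the suffix-less name first, then ARB, then EXT.
resolved_proc resolve(gl_loader const& load, std::string const& base) {
    assert(load);
    static char const* const suffixes[] = { "", "ARB", "EXT" };
    static gl_variant const variants[] = {
        gl_variant::core, gl_variant::arb, gl_variant::ext };
    for (int i = 0; i != 3; ++i)
        if (gl_proc p = load(base + suffixes[i]))
            return resolved_proc{ p, variants[i] };
    return resolved_proc{ nullptr, gl_variant::missing };
}

// The wrapper's own enum, translated to match the variant that was found.
enum class buffer_target { array };

char const* target_name(buffer_target, gl_variant v) {
    switch (v) {
        case gl_variant::core: return "GL_ARRAY_BUFFER";
        case gl_variant::arb:  return "GL_ARRAY_BUFFER_ARB";
        case gl_variant::ext:  return "GL_ARRAY_BUFFER_EXT";
        default:               return "";
    }
}
```

Note that a lookup like this only tells you which variant is present, not whether it actually works.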
Or, you could just pick one variant and use only that or fail if it isn't available. For example, I thought that it is reasonable for my program to require vertex buffer support in the OpenGL driver, so I started writing code which simply calls glBindBuffer without any suffixes. And it turned out that on my system glBindBuffer is available but is broken. glBindBufferARB is also available, and it works fine.
So, I feel stuck. So much so that blogging about the problem seemed more appealing than trying to figure out what's the correct solution. :)
03-28-2010
What happens when assertions fail? It depends.
Java throws an exception.
C# and Visual Basic display a dialog box by default but provide an interface for extending and customizing program behavior when assertions fail.
In C and C++, the default behavior for assertion violations is to abort the program immediately. Not surprisingly, many C and C++ programmers create their own assertion-checking macros that are, well, "smarter".
What is the correct behavior then?
Some library developers would answer that there is no single correct behavior, and they'll point to all the different behaviors people find useful. Why make a difficult design decision when you can shift that responsibility to the user?
What this train of thought is missing is that when an assertion fires, we can no longer reason about the state of the program; there is no way to know what will happen if the program is not terminated.
"But it might be important to minimize the damage! In some instances, ignoring a particular assertion might be better than terminating the program!"
Certainly, except that when something the programmer considered impossible has just happened, all bets are off. Aborting seems drastic? Indeed it is, but any other alternative would be like making a zombie run a marathon.
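For illustration, here is a sketch of an assertion macro that takes this position: it reports what failed and then terminates, with no option to ignore. APP_ASSERT, assertion_failed and describe_failure are hypothetical names chosen for this example.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Builds the diagnostic text; kept separate so it can be checked on its own.
inline std::string describe_failure(char const* expr, char const* file, int line) {
    return std::string(file) + ":" + std::to_string(line) +
           ": assertion failed: " + expr;
}

// Reports the failure and terminates. It never returns, so the caller
// cannot continue in an unknown state.
[[noreturn]] inline void assertion_failed(char const* expr, char const* file, int line) {
    std::fputs(describe_failure(expr, file, line).c_str(), stderr);
    std::fputc('\n', stderr);
    std::abort();
}

// Hypothetical macro name; evaluates expr once and aborts if it is false.
#define APP_ASSERT(expr) \
    ((expr) ? (void)0 : assertion_failed(#expr, __FILE__, __LINE__))
```

The only customization point left is the message; the decision to stop is not delegated to the user.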
Besides, it isn't always realistic to assume that the user knows the correct answer. What would you reply to
Assertion failure: "Your nuclear reactor has overheated. Abort, Debug, Ignore?"
:)
02-04-2010
Here are some computer language memes for your enjoyment. :)



01-07-2010
As far as I can tell, many of the readers of my blog find it through my posts on a couple of mailing lists: the sweng-gamedev@midnightryder.com game development mailing list, and the Boost developer mailing list.
I've blogged a few times about shared_ptr in the hopes of bringing those two audiences to appreciate the exclusive low level, C-style features of shared_ptr. The Boost crowd tends to think of shared_ptr as just another smart pointer which simply calls delete automatically. The typical game development programmer basically ignores shared_ptr altogether, assuming that all it can do is call delete automatically (and they know when to call delete, thank you very much!)
I'll continue illustrating my point by sketching an interface design for socket programming using shared_ptr, based entirely on the standard C socket API structs.
The typical C++ approach would be to create an "object-oriented" system of classes, with all kinds of encapsulation and safety-net goodness throughout -- yet such interfaces have a fundamental problem. The whole point of defining a C++ socket programming layer is to provide type-safety for the low-level int handles. But to make the high level C++ interface practical, we have to punch a hole in the type system by providing a "getter" (and even a "setter") for the low level int handle!
And this means that we can no longer rely on type safety.
The important thing to consider is that the C interface is standard: sockets are represented by ints and that's that. The only reasonable way to improve the C API is to make it easier to use and harder to misuse. Below, I'm providing a few simple C++ functions which do just that.
First, here is how shared_ptr can be used to take care of closing a socket file descriptor when it is no longer needed:
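A minimal sketch of the idea, assuming std::shared_ptr (boost::shared_ptr would work the same way) and a POSIX system where close() releases the descriptor. The names socket_wrapper and socket_adopt come from the text; the bodies are my assumption of how they could be written.

```cpp
#include <cassert>
#include <memory>
#include <unistd.h>  // close()

namespace {
    // Internal type: owns the descriptor and closes it on destruction.
    struct socket_wrapper {
        int fd;
        explicit socket_wrapper(int fd) : fd(fd) {}
        ~socket_wrapper() { ::close(fd); }
        socket_wrapper(socket_wrapper const&) = delete;
        socket_wrapper& operator=(socket_wrapper const&) = delete;
    };
}

// Takes ownership of an already-open descriptor.
std::shared_ptr<int const> socket_adopt(int fd) {
    assert(fd >= 0);
    std::shared_ptr<socket_wrapper> w;
    try {
        w = std::make_shared<socket_wrapper>(fd);
    } catch (...) {
        ::close(fd);  // allocation failed: do not leak the descriptor
        throw;
    }
    // Aliasing constructor: shares ownership with w, but points at w->fd.
    return std::shared_ptr<int const>(w, &w->fd);
}
```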
The socket_wrapper class is an internal type, hidden in a CPP file where socket_adopt is defined. That function is intended to be called as soon as a socket file descriptor is returned from the standard C socket API, or from any other 3rd-party socket interface that opens a socket. It simply uses the shared_ptr aliasing constructor to rebind the initial shared_ptr<socket_wrapper> to point to the int stored in the socket_wrapper object. When the last shared_ptr instance expires, ~socket_wrapper() will be called automatically, yet the user sees a simple shared_ptr<int const> which points to the socket file descriptor!
The socket_create function below demonstrates how socket_adopt can be used:
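As a sketch, socket_create could look like the following. The parameter list simply mirrors the standard socket() call, and reporting failure with std::system_error is my assumption. So that the block compiles on its own, it carries a compact deleter-based stand-in for socket_adopt rather than the aliasing-constructor version described above.

```cpp
#include <cassert>
#include <cerrno>
#include <memory>
#include <system_error>
#include <sys/socket.h>
#include <unistd.h>

// Compact stand-in for socket_adopt (assumption: same contract as the
// function described in the text, implemented with a custom deleter).
std::shared_ptr<int const> socket_adopt(int fd) {
    assert(fd >= 0);
    std::shared_ptr<int> p(new int(fd), [](int* q) { ::close(*q); delete q; });
    return p;
}

// Opens a socket and immediately hands the descriptor to shared_ptr.
std::shared_ptr<int const> socket_create(int domain, int type, int protocol) {
    int const fd = ::socket(domain, type, protocol);
    if (fd == -1)
        throw std::system_error(errno, std::generic_category(), "socket");
    return socket_adopt(fd);
}
```

The descriptor is never visible to the caller as a raw int that has to be closed by hand.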
The old school gethostbyname function is now superseded by the POSIX function getaddrinfo. It is used to convert DNS names and IP addresses from their human-readable text form to a structured binary format for the OS:
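For reference, here is the POSIX declaration together with a small raw-API usage sketch. count_addresses is a hypothetical helper written for this example; it only counts the returned nodes and frees the list by hand.

```cpp
#include <cassert>
#include <cstring>
#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>

// POSIX declaration, for reference:
//   int getaddrinfo(const char *node, const char *service,
//                   const struct addrinfo *hints, struct addrinfo **res);

// Hypothetical helper: returns the number of results, or -1 on failure.
int count_addresses(char const* node, char const* service, int flags) {
    assert(node || service);  // at least one of the two must be given
    addrinfo hints;
    std::memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;      // accept IPv4 and IPv6
    hints.ai_socktype = SOCK_STREAM;  // stream sockets only
    hints.ai_flags = flags;

    addrinfo* res = nullptr;
    if (::getaddrinfo(node, service, &hints, &res) != 0)
        return -1;
    int n = 0;
    for (addrinfo const* p = res; p; p = p->ai_next)
        ++n;
    ::freeaddrinfo(res);  // manual cleanup: easy to forget on early returns
    return n;
}
```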
The caller passes an address and describes the type of acceptable protocols, and the function returns a linked list of addrinfo objects in res. Each returned addrinfo object's ai_family, ai_socktype and ai_protocol members can be passed to the socket() function (or to socket_create above) to open a matching new socket object. The linked list is allocated dynamically by the OS and must be disposed of by calling freeaddrinfo. As before, we define a simple wrapper function that returns shared_ptr:
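A sketch of such a wrapper, assuming std::shared_ptr and that errors are reported by throwing. The name socket_getaddrinfo appears in the text; the parameter list and the exception type are my assumptions.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>

// Resolves node/service and returns the head of the result list.
// The list is released with freeaddrinfo when the last owner goes away.
std::shared_ptr<addrinfo const> socket_getaddrinfo(char const* node,
                                                   char const* service,
                                                   addrinfo const& hints) {
    addrinfo* res = nullptr;
    int const err = ::getaddrinfo(node, service, &hints, &res);
    if (err != 0)
        throw std::runtime_error(::gai_strerror(err));
    assert(res);
    // If constructing the shared_ptr itself throws, the deleter still runs.
    return std::shared_ptr<addrinfo const>(
        res, [](addrinfo* p) { ::freeaddrinfo(p); });
}
```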
Remember, the returned shared_ptr object points to a linked list of addrinfo objects. We could keep the original shared_ptr around and walk the nodes using raw pointers, or we can advance the shared_ptr itself to the next node:
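Advancing the shared_ptr could be sketched like this. socket_next is a hypothetical name for the helper, and std::shared_ptr is assumed.

```cpp
#include <cassert>
#include <memory>
#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>

// Returns a shared_ptr to the node after `node`. It shares ownership of
// the whole list with `node`, so the list stays alive while any node of
// it is still referenced.
std::shared_ptr<addrinfo const> socket_next(std::shared_ptr<addrinfo const> const& node) {
    assert(node);
    // Aliasing constructor: same owner, different pointee. At the end of
    // the list the result compares equal to null.
    return std::shared_ptr<addrinfo const>(node, node->ai_next);
}
```

With this, a loop such as `for (auto p = list; p; p = socket_next(p))` visits every node without ever handling a raw pointer.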
The above function again takes advantage of the shared_ptr aliasing constructor. The returned shared_ptr object points to the next node in the addrinfo list, but when freeaddrinfo is called, it will be given the original addrinfo pointer, the one passed to the original shared_ptr object in socket_getaddrinfo.
The obvious benefit of using shared_ptr<int const> to manage the lifetime of sockets and file descriptors is that it decouples us from having to know how the file descriptor should be destroyed.
It is also possible to use type erasure to convert the shared_ptr<int const> to a shared_ptr<void const> and pass it to some kind of system which is concerned only with object lifetime management regardless of type.
Finally, shared_ptr comes with weak_ptr support. One example where this could be handy is in a logging system, which could easily detect if the log file has been closed by the application before attempting to write something in it.
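A sketch of the logging idea, assuming the log file is a C FILE* owned by a shared_ptr with fclose as its deleter. The logger class and its write function are hypothetical names.

```cpp
#include <cassert>
#include <cstdio>
#include <memory>

// Holds only a weak reference: the logger never keeps the file open
// after the application has released it.
class logger {
    std::weak_ptr<std::FILE> file_;
public:
    explicit logger(std::shared_ptr<std::FILE> const& f) : file_(f) {}

    // Returns false if the file has already been closed by its owner.
    bool write(char const* msg) {
        assert(msg);
        if (std::shared_ptr<std::FILE> f = file_.lock())
            return std::fputs(msg, f.get()) >= 0;
        return false;
    }
};
```

While a write is in progress, the temporary shared_ptr returned by lock() keeps the file open, so it cannot be closed in the middle of the call.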
11-19-2009
As I'm typing this post, there is a continuing discussion on the Boost developers mailing list about what to do with compiler warnings reported in Boost code. There seem to be two opinions on the subject.
In the blue corner, we have people arguing that the compiler is your friend and if it tells you that there might be a problem lurking in your code, you should appreciate that and take whatever action is needed to "fix" the warning.
In the red corner, people argue that warnings don't necessarily indicate problems, that some compilers are plain silly, and that each Boost developer should be free to pick whatever warnings level is appropriate for them.
But before anyone has an opinion on the subject of warnings, we need to be on the same page about
A common misconception is that compiler warnings indicate problems in the program that are not as severe as errors but should nevertheless be "fixed".
Actually, warnings report facts about the program being compiled for which the C/C++ standard does not require the compiler to issue an error. We can classify them in the following categories:
In a corporate environment, such classification helps managers craft a sensible warnings policy that maximizes the output of a particular development team targeting a particular set of platforms.
For a project like Boost, realistically, the classification is much simpler:
Except that by definition 1) is a subset of 2), since many Boost users compile in environments that enable all warnings.
Therefore, we're left with
which is another way of saying "any warning issued by any version of any compiler on any platform Boost is compiled on."
Realistically, the only sensible way to achieve this goal is to simply suppress (disable) all warnings in Boost releases, preferably without disabling them for Boost developers:
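One way this could be arranged is sketched below. MYLIB_ENABLE_WARNINGS is a hypothetical macro that library developers would define in their own builds; users would not define it, and so would get the suppressed version. The GCC branch silences a single warning only as an example; in a real header, `#pragma GCC system_header` is the blunt way to silence everything that follows.

```cpp
#include <cassert>

// Developers build with -DMYLIB_ENABLE_WARNINGS (hypothetical macro) to
// see every warning; everyone else gets the suppressed version.
#if !defined(MYLIB_ENABLE_WARNINGS)
#  if defined(_MSC_VER)
#    pragma warning(push, 1)  // drop to warning level 1 for this code
#  elif defined(__clang__)
#    pragma clang diagnostic push
#    pragma clang diagnostic ignored "-Weverything"
#  elif defined(__GNUC__)
#    pragma GCC diagnostic push
#    pragma GCC diagnostic ignored "-Wunused-variable"
#  endif
#endif

namespace mylib {
    // Deliberately contains code that some warning levels complain about.
    inline int twice(int n) {
        assert(n >= 0);
        int unused = 0;
        return 2 * n;
    }
}

// Restore the user's own warning settings for their code.
#if !defined(MYLIB_ENABLE_WARNINGS)
#  if defined(_MSC_VER)
#    pragma warning(pop)
#  elif defined(__clang__)
#    pragma clang diagnostic pop
#  elif defined(__GNUC__)
#    pragma GCC diagnostic pop
#  endif
#endif
```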
The only alternative is to require Boost developers to "fix" all warnings on all versions of all compilers on all platforms Boost is compiled on; (obviously?) "fixing" only some warnings that a committee of some sort labeled as "important" still requires disabling all (other) warnings, assuming we are committed to providing a warning-free user experience.
And if that's not our goal, then we should be satisfied with the status quo: Boost developers address most warnings reported by Boost users.
10-26-2009
I'm currently working on a library in CUDA that implements the basic operations for dynamic, very large vectors and matrices. You'd think that such a basic scientific library should be already available, but I haven't been able to find one (if you know of such a CUDA/GLSL/HLSL library please do let me know!)
Most scientists find it unacceptable to have to even think about matrix multiplication. Basic things like that should just work, and so LAPACK (which of course does a lot more than multiplying matrices) was developed many years ago to solve all of these problems once and for all.
Unfortunately LAPACK is incompatible with the current GPU platforms.
LAPACK is a software library for numerical linear algebra. It is implemented in FORTRAN but bindings to virtually any programming language are available.
Because LAPACK is a rather large piece of software, it defines a minimal set of lower level functions over which all LAPACK operations are implemented. They've named those functions BLAS, or Basic Linear Algebra Subprograms. The idea is that LAPACK would run efficiently on any platform that implements BLAS efficiently.
We don't need to dig any deeper than the Wikipedia LAPACK page to see what the problem is:
Current GPU architectures are anything but cache-based architectures. Moreover, they can't efficiently share data with the CPU. So a BLAS implementation that targets a current GPU platform has no choice but to copy data to GPU memory, perform the operation (very efficiently) on the GPU, then copy the result back to CPU memory for LAPACK use.
The net effect is that while a GPU architecture can multiply large matrices way faster than a CPU architecture, the required double memory copy cripples cuBLAS (the CUDA-based BLAS implementation from NVIDIA) and, well, makes it almost useless. I'm sure there are cases where it would outperform a simple CPU-based BLAS implementation, but I suspect that an SSE2-enabled BLAS will run circles around cuBLAS with any data set.
There are two ways this issue can be resolved. One, the scientific community could move away from LAPACK. Two, the GPU platforms can become friendlier to the LAPACK/BLAS architecture.
It would be difficult for the scientific community to abandon LAPACK. Perhaps they can be persuaded by the huge speedups they would get by targeting GPU platforms (through a different API, not LAPACK), but the problem is not only inertia: there is a lot of scientific code that will have to be rewritten if it is to be ported from LAPACK to anything else.
There is some hope for LAPACK in the future though: Intel's Larrabee seems to address all problems that cripple BLAS, yet promises to be able to run massively parallel programs as efficiently as current GPUs do. The GPU companies are also slowly moving away from a completely cacheless model, but this treads on CPU territory and the thing is, a laptop only needs one chip that can run "CPU code" efficiently.
Massive data computation is memory bandwidth bound. This is even more true on GPU platforms, which are designed to tolerate high memory access latency in exchange for maximum bandwidth.
Currently, a good optimization strategy is to squeeze more things into available bandwidth. This is especially easy to do in graphics and image processing: depending on your requirements, you could use a variety of pixel formats, and spending fewer bits per pixel is an easy performance booster.
Note that even when input images are of reduced quality, all processing and the produced results can still use high precision without increasing (input) bandwidth. Video games use this technique a lot: typically they produce a high-quality picture, yet some of the input textures can use as few as 2 bits per pixel.
The same should be possible when working with large vectors and matrices as well: if you know that your input or intermediate or output data doesn't need the full 32-bit floating point precision, you should be able to easily use 16-bit floats. In a bandwidth-bound computation this gives you a 2x speedup with no effort at all.
As well, there are use cases for large matrices or vectors that have all elements either 0.0 or 1.0, so even one-bit-per-element formats should be supported. Though such cases aren't very common, a 32x speedup should not be underestimated.
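To illustrate the one-bit-per-element case, here is a CPU-side sketch in plain C++ (not CUDA): a vector of 0.0/1.0 floats is packed into 32-bit words, and a dot product is then computed with a bitwise AND and a bit count. pack_bits and dot_packed are hypothetical names for this example.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Packs a vector whose elements are all 0.0f or 1.0f into one bit each,
// 32 elements per word: 1/32 of the memory traffic of 32-bit floats.
std::vector<std::uint32_t> pack_bits(std::vector<float> const& v) {
    std::vector<std::uint32_t> words((v.size() + 31) / 32, 0u);
    for (std::size_t i = 0; i != v.size(); ++i) {
        assert(v[i] == 0.0f || v[i] == 1.0f);
        if (v[i] == 1.0f)
            words[i / 32] |= std::uint32_t(1) << (i % 32);
    }
    return words;
}

// Dot product of two packed vectors: for 0/1 elements, multiplication is
// AND and the sum is the number of set bits.
unsigned dot_packed(std::vector<std::uint32_t> const& a,
                    std::vector<std::uint32_t> const& b) {
    assert(a.size() == b.size());
    unsigned sum = 0;
    for (std::size_t i = 0; i != a.size(); ++i)
        sum += static_cast<unsigned>(std::bitset<32>(a[i] & b[i]).count());
    return sum;
}
```

The arithmetic stays exact; only the storage format changes.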
10-20-2009
I am guessing there is some trace of "proper" game programming mentality in me after all. :) Going against my previous rant, I've designed "yet another" vector/matrix math library and I've submitted it for a preliminary Boost review.
Why do I think that writing this library was a good idea? Because it makes no sense for any of us to have to spell out how a 3x3 matrix is multiplied by another 3x3 matrix; there should be a way to express that algorithm generically and apply it to any and all 3x3 matrix types in the world.
The only tricky part is that operations such as matrix multiplication should use operator overloads (seriously, it's retarded not to) and that presents a challenge: how do you define type-safe operator overloads without using specific matrix types? That's basically what (Boost) LA pulls off, using SFINAE.
A user-defined vector type float3 can be introduced to (Boost) LA like this:
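The following is a self-contained miniature of the technique, not the actual (Boost) LA headers: the la_sketch namespace, the is_vector flag and the r/w accessor signatures are assumptions made for this illustration. It shows a user-defined float3 being registered through a vector_traits specialization, and a generic operator+ that SFINAE enables only for registered types.

```cpp
#include <cassert>
#include <type_traits>

namespace la_sketch {
    // Primary template: an arbitrary type is not a vector.
    template <class V>
    struct vector_traits { static bool const is_vector = false; };

    // Generic operator+: participates in overload resolution only for
    // types that have been registered through vector_traits.
    template <class V>
    typename std::enable_if<vector_traits<V>::is_vector, V>::type
    operator+(V const& a, V const& b) {
        typedef vector_traits<V> t;
        V out;
        for (int i = 0; i != t::dim; ++i)
            t::w(out, i) = t::r(a, i) + t::r(b, i);
        return out;
    }
}

// A user-defined type that knows nothing about the library.
struct float3 { float a[3]; };

namespace la_sketch {
    // Registration: tells the generic code how to read and write float3.
    template <>
    struct vector_traits<float3> {
        static bool const is_vector = true;
        static int const dim = 3;
        typedef float scalar_type;
        static scalar_type r(float3 const& v, int i) {
            assert(i >= 0 && i < dim);
            return v.a[i];
        }
        static scalar_type& w(float3& v, int i) {
            assert(i >= 0 && i < dim);
            return v.a[i];
        }
    };
}
```

Because float3 lives outside the sketch namespace, the caller needs `using namespace la_sketch;` for the operator to be found.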
After a similar specialization of the matrix_traits template for a user-defined 3x3 matrix type float33, a full range of vector and matrix operations defined in (Boost) LA headers become available automatically:
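Again as a self-contained miniature rather than the library's real interface (la_sketch, is_matrix and the r/w accessors are assumed names), this sketch registers a user-defined float33 through matrix_traits and gets a generic operator* in return.

```cpp
#include <cassert>
#include <type_traits>

namespace la_sketch {
    // Primary template: an arbitrary type is not a matrix.
    template <class M>
    struct matrix_traits { static bool const is_matrix = false; };

    // Generic matrix product, enabled only for registered matrix types.
    template <class M>
    typename std::enable_if<matrix_traits<M>::is_matrix, M>::type
    operator*(M const& a, M const& b) {
        typedef matrix_traits<M> t;
        static_assert(t::rows == t::cols, "this sketch handles square matrices only");
        M out;
        for (int i = 0; i != t::rows; ++i)
            for (int j = 0; j != t::cols; ++j) {
                typename t::scalar_type sum = 0;
                for (int k = 0; k != t::cols; ++k)
                    sum += t::r(a, i, k) * t::r(b, k, j);
                t::w(out, i, j) = sum;
            }
        return out;
    }
}

// A user-defined 3x3 matrix type that knows nothing about the library.
struct float33 { float m[3][3]; };

namespace la_sketch {
    template <>
    struct matrix_traits<float33> {
        static bool const is_matrix = true;
        static int const rows = 3;
        static int const cols = 3;
        typedef float scalar_type;
        static scalar_type r(float33 const& x, int i, int j) {
            assert(i >= 0 && i < rows && j >= 0 && j < cols);
            return x.m[i][j];
        }
        static scalar_type& w(float33& x, int i, int j) {
            assert(i >= 0 && i < rows && j >= 0 && j < cols);
            return x.m[i][j];
        }
    };
}
```

The multiplication algorithm is written once and applies to any matrix type that provides the traits.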
The full documentation and source code, released under the Boost Software License, is here.