Josh Simmons

I do bad programming and complain about things.

Dec 30, 2012

Over the holidays I decided it might be fun to have another poke at my game/engine project that is in a permanent state of unfinishedness and flux. Of course I haven't managed to solve that fundamental problem but I did manage to get a reasonable first pass down on a task parallelism system. So I guess we can mark that down as a success!

The whole thing is pretty basic and the design almost entirely ripped off from Niklas Frykholm's blog post on Bitsquid's system. I also read a lot of Charles Bloom's thoughts on the matter, and promptly ignored most of his good advice with the vague idea that I'll fix up the remaining issues eventually.

Design

So anyway the task system revolves around a global 'staging area' of tasks combined with a priority queue of task IDs (32 bit handles also ripped off from Bitsquid) that workers draw from. Each task consists of a work function and a user data pointer. This leaves the interesting bits of multithreading up to the programmer.

The entire dependency chain is built on-the-fly as in Bitsquid's system. In order to allow interesting dependency graphs each task can have a parent task and/or a task to schedule when the task completes.

When submitting a task with a parent we increment the parent's open_children counter and whenever a task with a parent completes we decrement the parent's counter. Once the counter reaches zero the task is complete. Note that the task's own work item is considered a child in order to keep the code simple.

Whenever a task finishes (i.e the open_children counter reaches zero) we check whether the task has an allows_to_start field in which case we enqueue the linked task. This kind of backwards dependency keeps the code nice and and simple at the unfortunate cost of making code that uses the task system read backwards.

API

You submit a number of tasks, optionally giving a parent, and get a number of handles back. This stages the tasks but does not queue them for execution.

typedef struct RdxTask
{
    void *user_data;
    void (*work_fn)(struct RdxTaskPool *pool, void *user_data);
    RdxTaskHandle allows_to_start;
} RdxTask;

void
rdx_task_submit(RdxTaskPool *self, RdxTaskHandle parent, RdxTask *tasks,
        RdxTaskHandle *out_task_handles, size_t num_tasks);

Note that we pass the parent task only once per submit call, but the allows_to_start is defined once per task. This is just simpler to code, it means I can lookup a single task in submit and add num_tasks to the open_children counter. It also covers the main use-case where you'd be using a single parent to control a number of children. allows_to_start does not make much sense as applied to a group of tasks however, so we pass that as part of the task. We also pass the task pool to the worker function so workers can themselves stage, enqueue and even wait on more tasks.

Once we've submitted enough tasks to get running we can enqueue tasks which schedules them for execution in the priority queue.

void
rdx_task_enqueue(RdxTaskPool *self, int priority, RdxTaskHandle task);

We give the priority here once again to simplify the implementation. Each task scheduled from another task inherits that task's priority. It's yet to be seen whether this is a reasonable approach or even if a priority is worth having at all.

Now we have some tasks and they're probably off getting executed so we might want to block this thread until they do complete.

void
rdx_task_wait(RdxTaskPool *self, RdxTaskHandle task);

Wait just spins checking if the task given still exists and if so, if there are work items to dequeue and execute. This busy loop has a nasty case if a thread is waiting on the last enqueued task and another thread is executing but it works well enough for now.

Issues

At the moment the entire staging area and handle pool are protected by a single mutex which lends to quite a bit of contention (and hence context-switching). The reason for the single lock is that we keep the staging area array tightly packed. The packing means that any task finishing could lead to any other task having its position in the array changed and hence its handle being updated. This is probably just a bad design choice. I kept the idea of the packed array from an earlier implementation where I needed to iterate over the staging area but without that requirement the only benefit is a very simple submit procedure. I could simplify the handle system and improve the locking situtation by flattening all that code and using a sparse array and freelist.

The task queue is also a potential performance bottleneck although since I'm only storing small bits of data (64 bits per entry) and it's locked separately it's not a huge issue yet. Potentially worthwhile just ditching the priority rubbish and simplifying the whole thing.

I'm also only using a thin wrapper over the OS' semaphores, this shouldn't be an issue on linux but as I understand it semaphores in Windows are pretty heavy-weight items. It may eventually be worth wrapping those up benaphore style as mentioned by Charles Bloom.

Finally

The other thing that's completely ignored at the moment is thread affinity. I'd like to design the rest of the engine so that hopefully the need for such is low however there's always the frustrating bits of OpenGL where you have to render on the same thread which you create the window. For the moment I plan to serialise the final render and just run the important bits on the main thread outside the task system. Eventually though thread affinity support would be nice.

Well there's my post-implementation brain dump, maybe it makes some sense. At this stage I'm likely going to keep the system as-is and move onto other things since it seems to work OK and it's not really worth doing anything else until I can collect some science on it performs in practice.

Aug 15, 2012

The precise tragedy of build systems is that while writing one is a fun problem, using one never is. @rygorous

Yet another discussion regarding how bad all build systems are led as they often do to yet another build system discovery, in this instance tup. Since the author's prose suggests he actually enjoys what he's doing and since it doesn't use XML I thought I'd check it out. What follows are my initial thoughts from reading the documentation and using it in a small project. I'm not going to evaluate the large scale performance but it claims to be radical fast and scale at O(log²n) and I certainly don't have any evidence contrary to that claim.

Tup is a little bit different to your standard build system in the way it builds the dependency graph and hence the author calls it a 'Beta' build system. The basic gist of the whole thing is that instead of creating and walking the entire dependency graph on each run it maintains a persistent world view and creates an update list based on this and the file changes detected. This allows tup to scale incremental builds better than O(n) in regards to total project size. The persistent world-view also allows tup to do things other build systems dare not, most notably delete targets that are no-longer generated by the build system. This turns out to be a really nice feature as changes to the scripts never leave cruft, and you (theoretically) never require a clean build.

Interestingly as well as conventional declarative dependencies tup also intercepts the files that a command accesses and records these as dependencies too. This means that a correct dependency graph can be generated without resorting to command specific methods like pre-processing files for include directives. Tup also claims to be able to detect concurrent build non-detiminism deriving from lackluster dependency listing on the part of the programmer although since all my build scripts are perfect I'm yet unable to confirm. :)

There's also an option system that uses a kconfig compatible options file to pre-fill configuration variables used by the build scripts. The option system is made more interesting by the concept of a build variant which allows multiple builds to be made with different configurations. It's an exceptionally easy tool to use, all you need to do is create a new directory with it's own options file and call tup upd as usual. These variants are all handled properly with regards to incremental updates.

A conventional include system is implemented, as well as a thin layer on top that simplifies the use of a rules file at the project root. This works quite well in my small experience although I was initially hesitant about the number of build files accumulating in my project.

Lacking is any built-in support for targeting multiple platforms, there are two configuration variables pre-filled by the implementation @TUP_PLATFORM and @TUP_ARCH but these are simply set according to how tup itself was built. That said implementing such a system should be pretty simple given the powerful configuration and variant systems.

Tup also has a nice little monitor command which sets up a watch on the project filesystem and updates as files change. This can be set to re-build as necessary or to simply update the depdendency graph so a later tup upd will be as fast as the build commands can be. I have a permanent terminal setup now that runs tup monitor -f -a -j8 and it's a great way to write code. It's easy to forget how nice it is when you're not using a language like Java or C# where auto-build is common. It's especially nice when playing with the build scripts, I find it hard to overstate how nice it is that tup handles changes to build scrips gracefully.

As an example below are the tup script files that I've been playing with in my project.

Tuprules.tup

# This directive makes tup output a .gitignore file according to the files
# that it generates. Rather intelligent.
.gitignore

CC = clang -fcolor-diagnostics

CFLAGS += -Wall -Wextra -std=c99 -I.
LDFLAGS +=

ifeq (@(DEBUG),y)
    CFLAGS += -g -O0
else
    CFLAGS += -O2 -DNDEBUG
endif

# Variables in macros are evaluated at the place of macro instanciation.
!cc = |> ^ CC %f^ $(CC) $(CFLAGS) -c %f -o %o |> %B.o
!ld = |> ^ LINK %f^ $(CC) %f -o %o $(LDFLAGS) |>
!ar = |> ^ AR %f^ ar rcs %o %f |>

include @(TUP_PLATFORM).tup

linux.tup

SGL_BACKEND = glx
LDFLAGS += -lX11 -lXi -lGL

tup.config

CONFIG_DEBUG=y
CONFIG_BLAH=n

Tupfile

# This includes the top level Tuprules.tup from anywhere in the src tree.
include_rules

CFLAGS += -I./sgl/

: foreach *.c |> !cc |> {objects}
: {objects} sgl/libsgl.a |> !ld |> tek

sgl/Tupfile

include_rules

: foreach $(SGL_BACKEND)/*.c |> !cc |> {sgl_objects}
: {sgl_objects} |> !ar |> libsgl.a
Jun 16, 2012

I don't really write much of value on the internet (or at all really) and this is my latest attempt to change that. scriptogram is the latest tool I've employed to further this goal; it's a gloriously simple Markdown fed, jekyll-like (if not jekyll based?) hosted static blog platform backed by Dropbox.

The long story short is that you write posts in markdown, put them in your dropbox and voila.

Knowing myself as I do, this post will probably be all that comes of the effort but hey it's better than nothing. Kinda.