Pushing to the PS3 SPUs with Offload

By Colin Riley

July 12th 2011 at 4:27PM

Codeplay's games technology director Colin Riley on iterative optimisation

The current PlayStation 3 programmer workflow for moving code from the main PPU core to the fast and compact SPU cores is error-prone, and is very hard to do without making changes to several other game subsystems.

You need to create small self-contained sub-programs, commonly referred to as 'jobs' or 'tasks'. Job communication and main memory access need to be done via Direct Memory Access (DMA). As each SPU has only 256KB of local store for code and data, the way you manage data access is key to performance. Manually migrating code to SPU requires lots of pre-planning, editing data structures, and moving data via DMA so it is local to the SPU. Editing data structures is usually the nastiest part: if they are in wide use elsewhere in the game, changes can impact other programmers on already tight schedules.
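To make the cost of the manual route concrete, here is a minimal sketch of the chunked "get, compute, put" pattern in plain C++. It is not PS3 SDK code: `fake_dma`, `scale_in_chunks` and the chunk size are assumptions made up for illustration, and a `memcpy` stands in for what would be an asynchronous DMA transfer on real hardware.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a DMA transfer. On a real SPU this would be an
// asynchronous DMA request, not a memcpy.
inline void fake_dma(void* dst, const void* src, std::size_t bytes) {
    std::memcpy(dst, src, bytes);
}

// Scales every element of `data` (which lives in "main memory") by `factor`,
// staging the work through a small fixed-size buffer that plays the role of
// SPU local store.
inline void scale_in_chunks(std::vector<float>& data, float factor) {
    constexpr std::size_t kChunk = 1024;  // elements per transfer (assumed)
    float local[kChunk];                  // pretend this is in local store

    for (std::size_t base = 0; base < data.size(); base += kChunk) {
        const std::size_t n = std::min(kChunk, data.size() - base);
        assert(n <= kChunk);  // a chunk must always fit in the local buffer

        fake_dma(local, data.data() + base, n * sizeof(float));  // get
        for (std::size_t i = 0; i < n; ++i) {
            local[i] *= factor;                                  // compute
        }
        fake_dma(data.data() + base, local, n * sizeof(float));  // put
    }
}
```

Even in this toy form the job has to know the layout and size of the data it touches, which is why changes to shared data structures ripple out to other subsystems.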

The Offload Toolkit is the result of over five years of research from Codeplay. With the Offload Toolkit we take another approach, one that most programmers know in some form already - iterative optimisation.

GO WITH THE WORKFLOW

The typical workflow for migrating code to SPU with Offload consists of finding a block of C++ code we want on SPU, via existing methods such as profiling. The Offload Toolkit at its heart contains a single-source compiler, which can output both SPU and PPU object binaries. This allows programmers to simply move code onto the SPU using an offload block structure.

The compiler will do all the hard work: duplicating functions, types, and pointers for use on the SPU; accessing main memory via an auto-generated software cache for simplicity; and changing any use of PPU AltiVec intrinsics into optimised SPU alternatives. A massive amount of extra work is undertaken by Offload on the programmer's behalf.

Even the scourge of the SPU programmer - virtual functions - will be handled by Offload. Virtual functions are fully supported - even when the object is not in local store - as long as programmers specify a list of functions that can be invoked within that SPU job. The runtime helps here: if a function is called which doesn't exist in this list, the runtime alerts the developer, who can then add it to the list.
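The idea of a per-job function list can be modelled in a few lines of ordinary C++. The sketch below is only a model of the concept, not Offload's syntax or runtime: `JobDomain`, its methods and the string-keyed table are all hypothetical names chosen for illustration.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical model of the list of functions made available to one SPU job.
class JobDomain {
public:
    // Registers a function so the job is allowed to call it.
    void add(const std::string& name, std::function<int(int)> fn) {
        table_[name] = std::move(fn);
    }

    // Calls `name` if it is in the list and stores the return value in
    // `result`. Otherwise the name is recorded so the developer can be told
    // what to add, and false is returned.
    bool call(const std::string& name, int arg, int& result) {
        auto it = table_.find(name);
        if (it == table_.end()) {
            missing_.push_back(name);
            return false;
        }
        result = it->second(arg);
        return true;
    }

    // Names that were called but never registered.
    const std::vector<std::string>& missing() const { return missing_; }

private:
    std::unordered_map<std::string, std::function<int(int)>> table_;
    std::vector<std::string> missing_;
};
```

A job that registers only `Enemy::update` and then tries to call `Boss::update` would find that name in `missing()`, which is the kind of feedback that lets the list be grown iteratively rather than planned up front.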

Once the code is running on SPU, we can then actively profile on the target hardware, which gives Offload a great advantage. Instead of estimating and pre-planning without any profiling data, as you must when migrating manually, you can quickly test the feasibility of moving a piece of code. The job will run slowly on SPU while it goes through the software cache, but this allows analysis to be performed on the memory accesses within the code blocks.
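To show why a software cache makes memory access patterns measurable, here is a minimal direct-mapped, read-only cache in portable C++ that counts hits and misses. It is an illustrative sketch, not the cache Offload generates: the class name, line size and line count are assumptions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical read-only, direct-mapped software cache in front of a block
// of "main memory". It counts hits and misses so access patterns can be
// inspected.
class SoftwareCache {
public:
    explicit SoftwareCache(const std::vector<std::uint32_t>& main_memory)
        : mem_(main_memory) {}

    std::uint32_t read(std::size_t index) {
        assert(index < mem_.size());
        const std::size_t line = index / kLineWords;  // which memory line
        Line& slot = lines_[line % kLines];           // where it must live

        if (!slot.valid || slot.tag != line) {
            ++misses_;
            // Fetch the whole line; on real hardware this would be a DMA.
            const std::size_t base = line * kLineWords;
            for (std::size_t i = 0; i < kLineWords; ++i) {
                slot.words[i] = (base + i < mem_.size()) ? mem_[base + i] : 0;
            }
            slot.tag = line;
            slot.valid = true;
        } else {
            ++hits_;
        }
        return slot.words[index % kLineWords];
    }

    std::size_t hits() const { return hits_; }
    std::size_t misses() const { return misses_; }

private:
    static constexpr std::size_t kLineWords = 32;  // words per line (assumed)
    static constexpr std::size_t kLines = 16;      // lines in cache (assumed)

    struct Line {
        bool valid = false;
        std::size_t tag = 0;
        std::array<std::uint32_t, kLineWords> words{};
    };

    const std::vector<std::uint32_t>& mem_;
    std::array<Line, kLines> lines_{};
    std::size_t hits_ = 0;
    std::size_t misses_ = 0;
};
```

Sequential reads mostly hit, because each miss pulls in a whole line. Reads that stride across memory so that they keep landing on the same slot miss every time. Counters like these are what point to the data worth moving into local store explicitly.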

TOOLS OF THE TRADE

Within the Offload Toolkit is a library of template classes for moving data seamlessly into SPU local store without going through the software cache. These templates optimise away on other platforms, to aid cross-platform development.

We can profile the code, finding the areas that are memory bound and use the library classes to move the data onto the SPU, leaving the software cache to handle only a few random reads and writes that would be impractical to migrate over.
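A rough sketch of how such a template might behave is given below. It is not one of the toolkit's actual classes: `LocalArray` and the `SKETCH_TARGET_SPU` macro are hypothetical, and `memcpy` again stands in for DMA. With the macro defined, the data is copied into a local buffer on construction and written back on destruction; without it, the wrapper is a plain pass-through to the original array.

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical macro, assumed to be set only for SPU builds.
// Remove this line to get the pass-through version used on other platforms.
#define SKETCH_TARGET_SPU 1

// Hypothetical wrapper that stages N elements of `outer` in a local buffer
// for the lifetime of the object.
template <typename T, std::size_t N>
class LocalArray {
    static_assert(std::is_trivially_copyable<T>::value,
                  "staging uses raw byte copies");

public:
    explicit LocalArray(T* outer) : outer_(outer) {
#ifdef SKETCH_TARGET_SPU
        std::memcpy(local_, outer_, N * sizeof(T));  // stand-in for DMA get
#endif
    }

    ~LocalArray() {
#ifdef SKETCH_TARGET_SPU
        std::memcpy(outer_, local_, N * sizeof(T));  // stand-in for DMA put
#endif
    }

    LocalArray(const LocalArray&) = delete;
    LocalArray& operator=(const LocalArray&) = delete;

    T& operator[](std::size_t i) {
#ifdef SKETCH_TARGET_SPU
        return local_[i];   // fast path: data already in local store
#else
        return outer_[i];   // other platforms: direct access, no copy
#endif
    }

private:
    T* outer_;
#ifdef SKETCH_TARGET_SPU
    T local_[N];
#endif
};
```

Code that uses the wrapper reads the same on every platform, for example `LocalArray<float, 4> p(positions); p[0] *= 2.0f;`, so only the hot data identified by profiling needs to be touched in the source.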

That last point is important - manually migrating code to SPU does not give programmers this option. You must patch all data, or manually DMA it over, or implement your own caching system. These approaches can introduce multiple points of failure, and they require more testing and programmer time. Usually this work needs to be done by PlayStation 3 platform experts, as it can involve some low-level pitfalls - whilst Offload uses standard, well-known profiling methods.

What Offload really helps with is getting more code off the PPU and onto the SPU. We've seen complex AI code migrated onto SPU within an hour. With no source code changes apart from the introduction of the offload block, and running through a software cache, the code ran three times slower than the original PPU version.

However, given another seven hours of iterative optimisation work, profiling memory access and then editing 20 lines of source with cross-platform templates, we managed to get the code running nearly four times faster than the PPU version. With Offload, instead of stabbing in the dark, you can make real judgements, because the toolkit allows rapid deployment of code to where it is needed: the SPU.

Codeplay offer the Offload Toolkit to all registered developers with a wide range of licence options, ranging from small indies all the way up to publisher level. Those wanting more information can visit our website or contact us.