Criterion: How we can kill crunch
Thursday, 8th September 2011 at 3:00 pm
Paul Ross, CTO of EA's flagship UK studio, explains how tech can cut overtime...
WHEN I JOINED the games industry in 1996 crunch was already an established part of industry culture.
While crunch is by no means entirely peculiar to the games industry – junior doctors regularly quote 100-hour weeks – tales still abounded of people pulling all-nighters, and everybody seemed to know somebody who had worked 48 hours straight.
It’s perhaps unsurprising that an industry built and pioneered by young, single people without families developed the crunch culture.
In the mid-‘80s projects ran anywhere from weeks to a couple of months. A two-week intensive crunch at the end, to our early forefathers, really wasn’t that big a deal.
Unfortunately, as budgets increased and teams grew, so did crunch, until it started to get very painful towards the late ‘90s and 2000s. Crunch suddenly started to last as long as, and in many cases longer than, those early projects that spawned our industry.
Today the discussion around crunch seems to centre around three main themes.
Theme one, often proposed, is that crunch comes from bad management. If only the management could be better then we wouldn’t crunch. Depending on the context, ‘better’ usually means improved scheduling or superior design briefs.
Theme two, the counterpoint, holds that crunch occurs because people don’t deliver what they say they’ll deliver, when they say they’ll deliver it. You set up a vision, you build out the resourcing, things start to slip, and so everybody ends up crunching to get to where you thought you would be.
The final idea, theme three, is that crunch is some sort of cynical business strategy designed to maximise profits for shareholders.
The only thing that everybody can agree on is that crunch is bad. It’s bad for the team crunching, for whom it’s very unpleasant.
It’s bad for the management running that studio who are suddenly facing the double whammy of a tarnished reputation coupled with a recruitment challenge to cover attrition.
It’s also bad business practice because the workforce you invested in leave to be ‘cheese makers’ or whatever therapeutic balm they decide to apply.
The thing is, very few people ever offer practical advice on what you can do to avoid crunch, outside of ‘better scheduling’.
But how can you schedule something as creative as game making? Sure, scheduling works for building bridges, tunnels and houses where it’s a known quantity, but the creative games industry doesn’t work like that.
Amid the blame game, very few people offer an ‘if you only did x, y, z then it would all be alright’.
Here at Criterion we’ve made significant inroads in how we can, if not eliminate, certainly ease the crunch process.
Our secret to alleviating crunch is to make every second of development count. The more you can keep the production line running the fewer wasted hours your team has to make up with crunch.
Criterion’s answer to this was to invest massively in our build infrastructure.
We invested heavily in build servers, auto testing and tools to figure out what was happening during our production processes.
By doing this we made our production far more efficient, resulting in savings from our QA bill of around $700,000 for example, as well as limiting the amount of overtime we had to put in to get the BAFTA-winning game Need For Speed: Hot Pursuit done.
So what form did this take? Well, before you do anything you need to keep the game building. The code, the data and all of the tools have to build at all times. Doing this avoids members of your team getting the latest version and finding they can’t work.
Every hour of downtime that you avoid is an hour that you can put into game quality and an hour that you’ve eaten away from crunch.
In fact, if you have a fatal breakage that goes in and takes out 40 members of the team for an hour each, well that’s one week of development for somebody that you’ve lost instantly.
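As a back-of-the-envelope check on that figure, here is a small sketch in Python. The 40-hour working week is an assumption made for illustration, not a number quoted by Criterion.

```python
# Rough cost of a broken build.
# HOURS_PER_WEEK is an assumed working week, not a figure from the studio.
HOURS_PER_WEEK = 40


def lost_person_weeks(people_blocked: int, hours_blocked: float) -> float:
    """Return the person-weeks lost when a breakage blocks part of the team."""
    return people_blocked * hours_blocked / HOURS_PER_WEEK


# 40 people blocked for one hour each is one person-week of development.
print(lost_person_weeks(40, 1))
```

The same function makes it easy to see how quickly longer outages add up: under the same assumption, a half-day breakage for the same 40 people costs four person-weeks.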
MECHANISMS OF CHANGE
To keep things building we used two mechanisms. The first is top-of-the-range build servers that constantly churn through the game and all of its branches.
Once you have the game constantly building, you need to inform people when they make mistakes. We had a very simple mechanism that involved our build servers emailing people who had submitted changes to a broken build.
They could then see what had gone wrong and submit a fix. This basic mechanism keeps the build building at all times.
In the example from our codecheck above you can see we’ve got our total clean build time down to 22 minutes and, in this particular case, somebody has checked in some code that fails to build. This can be picked up quickly and a fix applied so that downtime on the team is minimal.
Once you have the code and data constantly building, the next challenge is to keep everything running together so that when developers, QA and reviewers get drops of the game they have confidence that what they are working with is stable and works.
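The notification step can be sketched in a few lines of Python. This is an illustration only, not Criterion's actual tooling: the `Change` record, the `breakage_notices` function, the sender address and the sample data are all hypothetical, and the sketch only composes the messages rather than sending them.

```python
from dataclasses import dataclass
from email.message import EmailMessage


@dataclass
class Change:
    changelist: int    # hypothetical change list number
    author_email: str  # hypothetical submitter address


def breakage_notices(changes_since_green, build_log_tail,
                     sender="build-server@example.com"):
    """Compose one email per author with a change in the broken build."""
    by_author = {}
    for change in changes_since_green:
        by_author.setdefault(change.author_email, []).append(change.changelist)

    notices = []
    for author, changelists in by_author.items():
        msg = EmailMessage()
        msg["From"] = sender
        msg["To"] = author
        msg["Subject"] = ("Build broken: please check change list(s) "
                          + ", ".join(str(c) for c in changelists))
        msg.set_content("The build failed after your submission.\n\n"
                        + build_log_tail)
        notices.append(msg)
    return notices


# Made-up data: three submissions since the last good build.
changes = [Change(101, "a@example.com"),
           Change(102, "b@example.com"),
           Change(103, "a@example.com")]
for notice in breakage_notices(changes, "error: undefined symbol 'Foo'"):
    print(notice["To"], "|", notice["Subject"])
```

Grouping by author means somebody with several changes in the broken range gets one message rather than several, which keeps the signal clear when the build is already red.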
This is where building out large test farms is the answer. A 22-minute build turnaround gives you 15 minutes of tests to run. The larger the test farm, the more diverse the tests you can run in such a short period of time.
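To illustrate why farm size matters, here is a hedged sketch of spreading tests across machines inside a fixed time budget. The scheduling approach (longest test first, onto the least-loaded machine), the test names and the durations are assumptions chosen for the example, not a description of how Criterion's farm worked.

```python
import heapq


def schedule_tests(test_minutes, machines, budget_minutes=15):
    """Assign tests to machines, longest first, least-loaded machine first.

    Returns (assignment, fits_budget), where assignment maps a machine
    index to the list of test names it runs.
    """
    if machines < 1:
        raise ValueError("need at least one machine")

    heap = [(0.0, i) for i in range(machines)]  # (minutes of work, machine)
    heapq.heapify(heap)
    assignment = {i: [] for i in range(machines)}

    longest_first = sorted(test_minutes.items(),
                           key=lambda item: item[1], reverse=True)
    for name, minutes in longest_first:
        load, machine = heapq.heappop(heap)
        assignment[machine].append(name)
        heapq.heappush(heap, (load + minutes, machine))

    slowest_machine = max(load for load, _ in heap)
    return assignment, slowest_machine <= budget_minutes


# Hypothetical tests and durations in minutes.
tests = {"boot": 3, "race_8_players": 12, "menus": 5, "replay": 10, "garage": 4}
for farm_size in (2, 3):
    _, fits = schedule_tests(tests, machines=farm_size)
    print(farm_size, "machines fit the 15-minute window:", fits)
```

With these made-up numbers the suite overruns on two machines and fits on three, which is the trade-off described above: more machines let you run a wider set of tests in the same window.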
We found that peer-to-peer networking was the most brittle area, and the most likely to break, so we set up our test farm so that we could easily test maximum players on every build.
Doing this we managed to test to a higher stress level than QA or indeed, anybody who was playing the game.
The net result was that what we produced was stable. We were even finding bugs before our QA team could find them.
We utilised the email mechanism again to inform developers when something they had checked in threw an assert. Due to this we were always aware of the state of our build and it pretty much ran all of the way through development.
In this case an assert has been thrown; once it is fixed, the smoke tests return to ‘green’.
If you develop with this mentality, then:
* Your producers can always review the game in its entirety
* QA can fast-forward straight to finding exploratory bugs
* Your content creators always have a working build that they can sync to and work on.
All of this builds development efficiency, saves time and helps to ease crunch throughout the project.
Finally, once you have this infrastructure in place you can start to utilise the greater intelligence you have about your production. We logged all of the asserts, and exceptions, to a central database.
We then had an idea of how often ‘seen once’ bugs actually occurred, because the auto testing was generating thousands of hours of testing every day.
We’d know when it went in, and we’d know when it was gone. Due to this we could eliminate those ‘hard’ – read: expensive – bugs from production.
This meant the final run was simply fixing exploratory bugs and the class B and C bugs, which are usually easier to fix. We also had many thousands of test hours to mine to see how often something had been seen.
All of this saves programmers’ time: because we have active telemetry, they are not hunting for low-repro ‘seen once’ bugs.
This saving increases quality and eases crunch. That extra programmer time ripples out to better tools for artists, which make content creators’ lives easier, and to better features implemented in game, which help producers achieve quality goals.
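A central log of this kind can be sketched with Python's built-in `sqlite3` module. The table layout, the function names, the assert signatures and the build numbers below are all hypothetical; the point is only to show how recording every hit lets you ask when an assert first appeared, when it was last seen and how often it fired.

```python
import sqlite3


def open_log(path=":memory:"):
    """Open (or create) the assert log. Schema is an illustrative assumption."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS hits (signature TEXT, build INTEGER)")
    return db


def record(db, signature, build):
    """Store one assert or exception hit against the build that produced it."""
    db.execute("INSERT INTO hits VALUES (?, ?)", (signature, build))


def summary(db):
    """Per signature: first build seen, last build seen, total hits.

    Rarest signatures come first, so 'seen once' entries are easy to spot.
    """
    return db.execute(
        "SELECT signature, MIN(build), MAX(build), COUNT(*) "
        "FROM hits GROUP BY signature ORDER BY COUNT(*) ASC, signature"
    ).fetchall()


# Made-up hits reported by automated test runs.
db = open_log()
for signature, build in [("net/session.cpp:212", 1040),
                         ("net/session.cpp:212", 1041),
                         ("net/session.cpp:212", 1047),
                         ("ai/traffic.cpp:88", 1043)]:
    record(db, signature, build)

for row in summary(db):
    print(row)
```

In this made-up data the networking assert spans builds 1040 to 1047 with three hits, while the traffic assert appears exactly once, in build 1043. That gives a programmer a concrete build to start from.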
Finally we had a system of working out our team velocity. We simply counted change lists being submitted into the build. It’s a coarse-grain measure of how quickly the team is moving, but history has shown the trend to be revealing. Before major presentations we would see a spike in changes.
During the 18 months of production we only went to a six-day week during the last six weeks of the project.
During that time we found that the rate of changes collapsed during the weekend. People were in at different times, some couldn’t make it, and collaboration was hard under those circumstances. We then adjusted the weekend pattern so that we simply fixed any bugs in the bug database. Change lists shot up again. It’s another example of how we made every second of development count based on the increased intelligence that we had on the project.
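Counting change lists is simple enough to sketch in a few lines of Python. The function names and the submission dates are invented for illustration and assume you can export one date per submitted change list from your version control system.

```python
from collections import Counter
from datetime import date


def velocity(submit_dates):
    """Return [(day, change lists submitted that day)] in date order."""
    return sorted(Counter(submit_dates).items())


def weekend_share(submit_dates):
    """Fraction of change lists submitted on a Saturday or Sunday."""
    if not submit_dates:
        return 0.0
    weekend = sum(1 for d in submit_dates if d.weekday() >= 5)  # 5=Sat, 6=Sun
    return weekend / len(submit_dates)


# Made-up submissions: a Friday, a Saturday and the following Monday.
submits = ([date(2010, 10, 1)] * 3
           + [date(2010, 10, 2)] * 1
           + [date(2010, 10, 4)] * 4)

for day, count in velocity(submits):
    print(day.isoformat(), day.strftime("%a"), count)
print("weekend share:", weekend_share(submits))
```

It is a coarse measure, as noted above, but plotted over a project it is enough to show spikes before presentations or a drop in weekend output.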
The results? Well, attrition at Criterion has been at a steady two per cent over the past five years. Whilst I won’t say that we’ve completely eliminated crunch, we can now manage it far better.
We have a studio with a higher average staff age, where many people are married and have families. We cut our QA budget in half, and we went alpha with only 100 bugs in the database.
We managed to focus almost all of our development time on quality. Everybody had a source base that they could work with at all times. We shipped a great game, and picked up a BAFTA for our multiplayer along the way.
So much talent is being shipped out to Canada, and the US, that the UK games industry is struggling. We’re not creating enough top quality grads to replace those leaving the industry or the country. Despite being a massive exchequer earner for UK PLC, we are in decline.
Not due to the quality of our offerings, but due to the decline in numbers of top quality people making great games. Working out how we make games, whilst supporting our talent, is one of the greatest questions facing the UK games industry.
Due to taxation we simply can’t compete on an equal footing with North America, and so we have to diversify. Providing healthy development environments seems like a good way of attracting our British talent back to the UK.
Paul Ross is a veteran at Criterion Games with over 15 years at the studio, working on award-winning titles such as the Burnout series and most recently Need for Speed: Hot Pursuit. As CTO, Paul is responsible for ensuring Criterion stays on the cutting edge of gaming technology.
© Develop 2013. All rights reserved.