As the Unity Ads engineering team, we were invited to the DEVOPS 2018 conference in Helsinki in December 2018 to present how we approached the challenge of scaling our systems to handle a 10x increase in load over a period of four years.
Among other things, we discussed what we learned from our first incomplete migration attempt and how we applied that knowledge when successfully migrating our large-scale infrastructure to the Google Cloud platform. This blog post follows up on our talk with more details on the topic, and some general advice for similar big projects.
Video from the presentation is available here: DEVOPS 2018 – Unity Technologies Keynote, Day 2
Our mission at Unity is to help you become successful with your games, whether you’re a publisher building a business around games through ad monetization, or an advertiser aiming to acquire new users. Over the last four years, the number of ads shown in mobile games integrating our Unity Ads Monetization SDK has increased from 250 million to 2.5 billion views per week.
For an ad platform like Unity Ads, the amount and quality of data we have access to improves our ability to show relevant ads to users. This helps advertisers reach their intended audience and helps publishers earn higher ad revenue from their games.
The above slide shows the growth in traffic and the architectural changes that engineering introduced to allow our system to scale accordingly. Like many other successful software products, we started out with a monolithic system: all the logic for handling publishers and advertisers, as well as the ad delivery itself, was in the same codebase. With the increased load, we had to address different aspects of scaling our services, from splitting the codebase into a microservices-based architecture to migrating and consolidating our legacy cloud infrastructure.
The primary goal for the engineering teams at Unity Ads is, simply put, to enable fast growth in our monetization business. So when we experienced a 10x increase in the number of requests to our backends, we had to scale our systems accordingly to continue serving ads, and also improve the tools that allow publishers, advertisers and our account managers to handle the growing number of games using Unity Ads.
Solving a seemingly steep technical challenge, especially on a larger scale, most often involves addressing the people and organizational aspects of the problem. In our case, the challenge was to regain ownership of an infrastructure which was originally built when our service had a much smaller load than is the case today, as shown in the diagram above.
Our first attempt to migrate infrastructure
To ensure our infrastructure and services would be able to scale in the future, and to reduce bottlenecks in the organization, we knew we needed to migrate our cloud infrastructure. The migration had to work technically, but it also had to allow development teams to fully own their services, including the cloud infrastructure parts, so that teams would have fewer external dependencies.
Our first attempt to migrate was kicked off in mid-2017. However, relatively soon after starting the project, we noticed the following problems in how the migration was set up. These eventually led to the decision to stop the project, as it was clear it would not result in the outcome we were looking for:
- Lack of clear ownership of the migration, causing individual projects to stall, as there simply wasn’t a clear process for deciding who would drive the migration: the operations (Ops) team or the development team.
- The project was primarily Ops based, and the development teams’ involvement was limited, so the required knowledge about infrastructure did not spread to the development teams.
- Development teams continued developing features, so migration was a constant moving target.
- We did get some smaller services (in terms of traffic volume) migrated but failed to move the major high-traffic components over, as critical issues were identified too late in the migration process, and it was unclear who was responsible for addressing them.
- The project wasn’t connected to business goals, which meant migration would often get prioritized lower than the development of new features.
Development teams driving the cloud infrastructure migration
The primary learning from our first attempt was that we needed to address the lack of clear ownership in the migration process by giving the development teams ownership and control of the infrastructure, which reduces their external dependencies. Based on this, we started our current cloud infrastructure migration project in August 2018, and at the time of writing we’re almost done migrating all of our services to the Google Cloud infrastructure. To learn more about our partnership with Google Cloud, see this blog post.
We wanted to empower individual development teams so they could control and own as much of their infrastructure as possible. At the same time, we needed to reuse shared modules. So we came up with an “internal open-source model”, meaning that typically one team is the maintainer (in open source terminology) of a module, keeping track of changes, reviewing PRs and basically ensuring that other teams are able to easily contribute to the module.
We do have a small DevOps team. However, the role of this team is to work with the development teams on identifying common infrastructure requirements across teams, and to always work closely with the teams to help them without blocking their work.
Based on our learnings, we have addressed the problems we faced in our first migration project in the following ways:
- “Lack of clear ownership” and “Primarily Ops based approach”: Each service is owned by a single development team, which is driving the migration, reaching out to others when help is needed. Migration projects aren’t handed over to other teams in the middle of the process.
- “Not connected to business goals” and “Development teams continued developing features”: Migrating to Google Cloud became a business objective, meaning that the majority of development team members would work solely on migrating, avoiding context-switching between migration and feature work.
With the above model, we have found a healthy balance between having each team own feature delivery and deployment of their services, and at the same time reducing duplicate work across teams by reusing shared components used in multiple services.
We’re working on a follow-up in-depth blog post describing the tools we’re using, i.e. Kubernetes, Terraform, Jenkins/GitLab CI and Helm. If you love working on cloud infrastructure just as much as we do, consider joining the team! Check out open positions at https://careers.unity.com.