PokéForce Development Diary #4: Systems Architecture
First and foremost, a word of warning: this development diary is aimed at an audience interested in the more technical aspects of PokéForce, specifically the back-end code and architecture that makes everything tick, rather than the intricacies of game mechanics or the more visually pleasing aspects of the game. In this article I intend to shed some light on our development process so far, and on what will change in the coming months as we work towards our next iteration of development in the form of snapshots and a streamlined publishing process.

The Spring Demo
Let's face it: the spring demo was a tumultuous experience, not just for the players but for us as a development team. While many factors contributed to this, it boils down to one central piece: rushed development and technical debt. Technical debt refers to technical decisions, such as shortcuts in programming, that prove difficult to work with or refactor later: systems that serve a specific purpose but weren't designed to be malleable once interconnected with the greater product. It happens to every developer and every application at some point (and often, at every point), and the Spring Demo was our most recent culmination of it. Throughout much of development, code was rushed to bring elements of the game to life, often without any form of automated unit or integration testing.

That is to say: when a feature was developed for the game, it would be briefly tested by the developer who authored the code, handed out to the beta testers (often in large, infrequent update batches, and often only a short period before an expected demo), and then typically interacted with on a very small scale to ensure it worked as intended. This method served us relatively well when we started, as our beta program was intended primarily as an avenue for members of the community to give feedback on how the game mechanics looked and felt, rather than testing every aspect of the game and how to break it. That is not strictly a bad thing: it provided valuable feedback to the development team on how a subset of the community felt about the direction we were taking the game, and gave members a platform where they could challenge our decisions and argue for alternatives.
The recent demo brought to light the fact that we have not been testing the game thoroughly, and that many points in our development lifecycle need to change. Our battle engine, for example, was still being written in the hours leading up to the demo. Not only did it not undergo sufficient testing, but the communication layer between our game server and the battle engine was rushed, not very well thought out, and, as the demo showed, it crumpled under pressure. Individually, every part of the back-end stack worked very well: we didn't experience any crashes from any service, the battle engine operated surprisingly efficiently despite what it may have looked like to players, our game server held stable and was only interrupted for code updates, and our login server had 13 days of continuous uptime before the close of the demo. The issues players experienced were far more subtle than crashes: disagreements over the network protocol between the game server and the battle engine. These are issues that could have been avoided with a better-planned development approach, and the demo experience has sparked conversations amongst the development team on how we can improve going forward.
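To give a concrete picture of the kind of safeguard we skipped, here's a minimal sketch in Rust (our new language of choice, more on that below) of a protocol version handshake. Every name here is hypothetical, but the idea is simple: if both services compile against a single shared version constant, a mismatch fails loudly at connect time instead of surfacing as subtle breakage mid-battle.

```rust
// Hypothetical sketch: both the game server and the battle engine are
// built against this one constant, bumped whenever a message changes.
const PROTOCOL_VERSION: u16 = 4;

#[derive(Debug)]
enum HandshakeError {
    VersionMismatch { ours: u16, theirs: u16 },
}

/// Called by the battle engine when the game server connects.
fn accept_handshake(client_version: u16) -> Result<(), HandshakeError> {
    if client_version != PROTOCOL_VERSION {
        return Err(HandshakeError::VersionMismatch {
            ours: PROTOCOL_VERSION,
            theirs: client_version,
        });
    }
    Ok(())
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn mismatched_versions_are_rejected() {
        assert!(accept_handshake(PROTOCOL_VERSION).is_ok());
        assert!(accept_handshake(PROTOCOL_VERSION - 1).is_err());
    }
}
```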
In the first few hours of its launch, the game server's memory usage spiked drastically. We came dangerously close to maxing out the single machine we were using to host it, with the game server consuming over 20 GB of memory and sitting at 80% CPU usage. A 'tick' is the game server's unit of time: every piece of game logic, such as interacting with Pokémon or NPCs, using items, and movement, happens within a tick. Our target tick time is 100ms, and the spring demo was consistently ticking at 76ms. That means we coped well enough with our peak load, but if we want to support more players we need to do better, especially as future features grow more complex and take more compute time to execute.
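For the curious, the tick loop itself is conceptually simple. Here's a rough sketch in Rust of a fixed-timestep server loop; the `World` type and its `update` method are stand-ins for the real game logic:

```rust
use std::time::{Duration, Instant};

/// Stand-in for the real game state; `update` runs one tick of game logic.
struct World;

impl World {
    fn update(&mut self) {
        // Movement, NPC and Pokémon interactions, item use, etc.
    }
}

const TICK: Duration = Duration::from_millis(100); // our 100 ms budget

fn run_server(mut world: World) {
    loop {
        let start = Instant::now();

        // All game logic for this tick happens inside this single call.
        world.update();

        let elapsed = start.elapsed(); // e.g. ~76 ms at the demo's peak load
        if let Some(remaining) = TICK.checked_sub(elapsed) {
            // Under budget: sleep off the remainder before the next tick.
            std::thread::sleep(remaining);
        } else {
            // Over budget: the server is now "ticking behind" and players
            // feel it as lag, so this is the number we watch most closely.
            eprintln!("tick overran budget: {elapsed:?}");
        }
    }
}

fn main() {
    run_server(World);
}
```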
The Back-end Stack
Put simply, it's a mess. Early on, we as developers decided that to be our most productive, we'd each pick the languages we were most familiar with. Make no mistake, this is a great decision for productivity and prototyping, but it presents issues when code needs to interact with other services, or when a developer needs to commit changes to a part of the game they weren't initially responsible for, as it means re-familiarizing themselves with the language and the development environment. That's fine when a single developer is working on the code, but not ideal when multiple developers need to collaborate. Admittedly, this was an issue initially brought upon by myself, and going forward we've decided to simplify our stack and do better. Our previous tech stack was a mixture of languages:

- The game server, written in Java. Java is the language of choice of Alycia and Rebecca, and it served as a platform for rapid prototyping thanks to a vast ecosystem and many years of the developers' combined experience. It lets them write code they are confident in and familiar with, without needing to manage memory by hand thanks to the garbage collector.
- The login server, written in Elixir. A choice made by myself, as I knew it well enough to be productive in, and it's an environment I know will perform well under pressure thanks to the fault-tolerant design of the Open Telecom Platform, with its roots nested deeply in telecommunications stacks such as landline phone systems.
- The battle engine, written in C#, almost exclusively by Benoon. When offering Benoon a project to undertake within our development team, we gave them free rein to choose the language and tools they were most familiar with, in an effort to increase productivity. Admittedly, the battle engine performed remarkably well and held stable.
- The game client, with most of the code written in GDScript. The engine's provided scripting language is very easy to pick up, which allows for rapid prototyping and lets us interact with elements of the game incredibly easily.
None of these projects or languages is a bad choice. On their own merits, they all performed very well, and each is a fine choice for building high-performance applications. The issue lies in code sharing and communication: any time one service wanted to speak to another, we needed to duplicate code in multiple places, which often led to a pile of spaghetti code, because we weren't able to broadly test all aspects of the game at a fixed point in time. Each project had its own code repository and tooling, and was tested purely in isolation.
So what's the solution?
The solution is simple in theory: simplify the tech stack. Rather than using multiple languages, we all agree on one, and we place all of our projects in a single repository where they can share code and where tests can target the entire ecosystem of our game. For our use case, we decided to go with Rust. It's a language that some of our development team is at least moderately experienced with, and it compiles down to native machine code with support for targeting the C ABI. That is to say, Rust can compile to a format that can be natively executed by almost any other language or platform without external dependencies. This is a game changer for our development process! Godot exposes a system called "GDExtension", a technology for interacting with native code from inside Godot, and there is a Rust library for using it. This is great: it means we can efficiently share code between our servers and our game client.

It started with a network serialization library: the code that converts information into small chunks of data to be transmitted across the internet between the player's client and our services. With our new development process, we can define that code in a single place and use it across both the client and server, without needing to maintain it in multiple separate areas. If we write automated tests that target the shared code, we can be sure both implementations are correct and largely free of errors or sneaky bugs. All of our developers have come together and agreed that this is the way forward. It's a major change for some of us and will require a shift in how we tackle problems, but we are confident it's for the greater success of this project.
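To illustrate, here's a simplified sketch of what a shared message definition might look like. The `MovePlayer` message and its byte layout are hypothetical (the real library covers many more message types), but the principle is exactly this: one definition, one encoder, one decoder, and one round-trip test that both the client and server builds rely on.

```rust
/// A hypothetical shared message. The same definition is compiled into the
/// game server and, via GDExtension, into the client, so both sides always
/// agree on the wire format.
#[derive(Debug, PartialEq)]
pub struct MovePlayer {
    pub player_id: u32,
    pub x: i16,
    pub y: i16,
}

impl MovePlayer {
    /// Serialize into a fixed-size little-endian byte layout.
    pub fn encode(&self) -> [u8; 8] {
        let mut buf = [0u8; 8];
        buf[0..4].copy_from_slice(&self.player_id.to_le_bytes());
        buf[4..6].copy_from_slice(&self.x.to_le_bytes());
        buf[6..8].copy_from_slice(&self.y.to_le_bytes());
        buf
    }

    /// Deserialize; a single implementation means the client and server can
    /// never drift apart the way our old hand-duplicated code did.
    pub fn decode(buf: &[u8; 8]) -> Self {
        Self {
            player_id: u32::from_le_bytes(buf[0..4].try_into().unwrap()),
            x: i16::from_le_bytes(buf[4..6].try_into().unwrap()),
            y: i16::from_le_bytes(buf[6..8].try_into().unwrap()),
        }
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn round_trip() {
        let msg = MovePlayer { player_id: 42, x: -3, y: 17 };
        assert_eq!(MovePlayer::decode(&msg.encode()), msg);
    }
}
```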
Now this isn't just a change in language, but a change in developer mindset. We have committed to taking our time, reviewing each other's code, and writing appropriate automated tests to ensure the code does what it says it should. We are also looking to be more diligent about continuous integration pipelines, including scripts for automating releases. Since the start, whenever we've wanted to deploy a new version of the game client or server, it has required a developer to manually update the service to the latest version, and in the case of our game client, to manually export the game build and upload it to a remote server. Our aim for snapshots and the next iteration of PokéForce is to make this process as simple as possible through scripts that create reproducible builds of our game, so that when we release a new snapshot the process is seamless and stress-free, and nothing gets missed due to human error.
What's to come?
Perhaps the most interesting part about all of this is that none of it directly matters to you as a player. You'll experience almost no immediate change from our shift in development practices, but you'll notice it in the little things: the game should feel a bit more polished and higher quality, updates will be easier for developers to publish, and once the majority of the groundwork is laid down, we should be quite a bit more productive as a development team.

Leading up to our next snapshot, we will rewrite PokéForce's back end in its entirety, from the ground up, learning from the lessons these past few years have taught us. This is similar to when we switched to Godot, except now it's the server's turn to shine and get some much-needed love and care. We will also be putting a big focus on observability in this new iteration of the game: we want to be able to see as much data as possible, from exactly which Pokémon spawned and when, to how a player managed to accidentally faint a Shiny, to what movements a player took. We want to record everything so we can analyse the data and use it to provide a more well-rounded experience.
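As a taste of what that could look like, here's a rough sketch of a structured event log in Rust. The event names and fields are hypothetical, and in production these records would flow into a proper database or log pipeline rather than stdout, but the shape is the point: every interesting moment becomes a structured, timestamped record we can query later.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical gameplay events we'd like to capture. The real set will be
/// much larger, but the idea is the same: everything interesting that
/// happens becomes a structured record.
#[derive(Debug)]
enum GameEvent {
    PokemonSpawned { species: String, x: i32, y: i32 },
    PokemonFainted { species: String, shiny: bool },
    PlayerMoved { player_id: u32, x: i32, y: i32 },
}

/// Append one event as a timestamped log line.
fn record(event: &GameEvent) {
    let ts = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("clock before 1970")
        .as_millis();
    println!("{ts} {event:?}");
}

fn main() {
    record(&GameEvent::PokemonSpawned {
        species: "Pikachu".into(),
        x: 120,
        y: 45,
    });
    record(&GameEvent::PokemonFainted {
        species: "Charmander".into(),
        shiny: true,
    });
    record(&GameEvent::PlayerMoved { player_id: 7, x: 121, y: 45 });
}
```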
Our spring demo showed us that if we want to support more players, we need to look at methods of scaling horizontally. As such, we will also be looking at implementing "channels": sharded instances of the game server that players can freely switch between, provided the target channel is not at capacity. This means we can add or remove persistent instances of the game server as player numbers vary, letting us handle large influxes of players.
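Here's a rough sketch of what a channel registry might look like; the numbers and names are illustrative only:

```rust
/// A hypothetical registry of "channels": independent game-server shards a
/// player can hop between, as long as the target isn't full.
struct Channel {
    id: u8,
    players: u32,
    capacity: u32,
}

struct ChannelRegistry {
    channels: Vec<Channel>,
}

impl ChannelRegistry {
    /// A player may join any channel with free slots.
    fn try_join(&mut self, channel_id: u8) -> Result<(), &'static str> {
        let ch = self
            .channels
            .iter_mut()
            .find(|c| c.id == channel_id)
            .ok_or("no such channel")?;
        if ch.players >= ch.capacity {
            return Err("channel at capacity");
        }
        ch.players += 1;
        Ok(())
    }

    /// Scaling out is just a matter of appending another shard when the
    /// fleet is running hot.
    fn add_channel(&mut self, id: u8, capacity: u32) {
        self.channels.push(Channel { id, players: 0, capacity });
    }
}

fn main() {
    let mut registry = ChannelRegistry { channels: Vec::new() };
    registry.add_channel(1, 500);
    assert!(registry.try_join(1).is_ok());
}
```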
All of this is to say that while the demo was not perfect, it was a valuable learning experience for us as a team, and we are more committed than ever to providing the highest quality experience we can. We appreciate you, the community: you help us grow and learn from our mistakes, and we are going to come back better than ever, with a fire burning in our bellies to constantly outperform ourselves and bring you a game that you can enjoy and feel safe in, with a team you can trust to listen to feedback and be vocal about our plans.
Much love,
The PokéForce Development Team