04-07-2023, 04:58 PM
Over the past four days, as you may have witnessed, our server stability has been hectic. And there are many reasons why. Get comfortable because it's a long one.
Some time ago we were made aware that the BYOND developer, Lummox JR, was making good progress on multi-threading the map processing. This was huge for us because the vast majority of our performance woes were due to this. On a typical night, opening the CPU profiler would show our actual code (PCPU) in the 3% range, whereas the map (MCPU) would be at 100%+, scaling higher the more players there are because there are more objects to render within each player's view. This wasn't a problem prior to Meranthe, since Esshar's map was far less dynamic and used less costly 'turfs', whereas Meranthe has a ton more going on within the map that is costly for performance (player building, custom props, etc).
Anyway, the engine is single-threaded, meaning our dedicated server could only use one of its cores. Multi-threading would change this: for example, 120% map CPU spread across 10 processing threads works out to roughly 12% per thread. That pretty much eliminates the problem.
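As a rough back-of-the-envelope sketch of that arithmetic: this assumes the map work splits perfectly evenly across threads, which is an idealization and not a measured result, and the function name is purely illustrative.

```python
def per_thread_map_cpu(map_cpu_percent: float, threads: int) -> float:
    """Ideal-case load per thread if map work is divided evenly (an assumption)."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return map_cpu_percent / threads


single = per_thread_map_cpu(120.0, 1)   # everything on one thread
spread = per_thread_map_cpu(120.0, 10)  # the same work over 10 threads
print(f"1 thread: {single:.0f}% | 10 threads: {spread:.0f}% each")
```

In practice threads never share work this cleanly, so treat 12% as a best case and not a promise.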
Naturally, we were very eager to do what we could to help make sure the threading update is stable. Testing at the time was Windows-only, with no Linux build available, and our server ran Linux. Some SS13 servers were testing the Windows build and reported promising results, not just in performance gains but stability-wise too. So we decided to migrate to a Windows server to try out the test build ASAP, and also to see if there were any bugs to report early, given that the two games are very different.
This didn't go as well as we'd hoped. The server would typically crash within 15 minutes of going live, which is absolutely fine, expected even, given the nature of the testing. But the real problem was that the crash logs it spat out were mostly useless to the BYOND dev, so the entire night was spent trying to get something to work with, to no avail. Matters became worse when another bug was discovered that corrupted savefiles. This one was specific to the 515 build itself, not multi-threading; we had been on 514 for a long time before this. Again, the issue couldn't be reproduced and was pretty nonsensical, so we were left with a guessing game in the debugging process.
We decided to downgrade to version 514 the day after and try to reproduce these issues on a separate test server (which is still an ongoing process). But it didn't end there, because we discovered problem #3. Windows is slower at reading and writing files than Linux, and our heavy savefile system for chests caused freezes every time someone moved an item back and forth. It wasn't really noticeable on Linux, but it was probably there as micro-stutters.
Anyway, the new dedicated server was a massive upgrade on our old one. The original server was an i7-7700K, and the new machine is a Ryzen 5 5600X. Ignoring the Windows-specific freezes, the performance upgrade was staggering: we went from easily tipping over 120% CPU with 150 players to now hovering at the 30-40% mark. I expected maybe a 20% increase at best (or even a more neutral result), since there isn't much difference in single-core CPU speed and what you're really paying for with AMD is the 12 threads. Instead it looks to be more like a 300% increase, for reasons I'm still not sure of, but I'll gladly take it. Players have also reported better ping times, so the connection itself is faster.
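To show where a figure like 300% comes from, here is a quick sketch of the arithmetic. It assumes the same player load on both machines and that CPU percentage scales linearly with the work done, neither of which was measured carefully; the function name is just for illustration.

```python
def speedup(old_cpu_percent: float, new_cpu_percent: float) -> float:
    """How many times less CPU the same workload uses on the new machine."""
    return old_cpu_percent / new_cpu_percent


# Old server: about 120% CPU. New server: hovering between 30% and 40%.
for new in (30.0, 40.0):
    factor = speedup(120.0, new)
    gain = (factor - 1) * 100
    print(f"120% -> {new:.0f}%: {factor:.1f}x, i.e. a {gain:.0f}% increase")
```

So the 30% end of the range matches a 300% increase, and the 40% end is closer to 200%.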
Back to the freezes. The solution was to optimize our questionable savefile routines for chests. With how the construction system was set up, when one chest saved, every single chest in the building saved with it, and it didn't help that a building can have up to 10 chests and that even empty slots contribute to the 'list' that Windows has to read and write. Now each chest saves individually, and a nasty issue where a chest saved whenever an item was moved has been fixed. Big thanks to Nadrew for his help throughout the entire process I've described here. He's been extremely helpful, and his efforts are the reason this long, stressful 4-day journey actually turned out amazing for us.
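For the curious, the idea of saving chests individually can be sketched roughly like this. This is not the game's actual code (which is written for BYOND); the `Chest` class, the `dirty` flag, and the JSON files are all hypothetical stand-ins, and the dirty flag in particular is just one assumed way to avoid writing to disk on every item move.

```python
import json
import tempfile
from pathlib import Path


class Chest:
    """Hypothetical stand-in for a chest; not the game's real data model."""

    def __init__(self, chest_id: str, items=None):
        self.chest_id = chest_id
        self.items = list(items or [])
        self.dirty = False  # True when contents changed since the last save

    def move_item_in(self, item: str) -> None:
        # Moving an item only marks the chest; it does not touch the disk.
        self.items.append(item)
        self.dirty = True


def save_dirty_chests(chests, save_dir: Path) -> int:
    """Write only the chests that changed, one file each. Returns files written."""
    written = 0
    for chest in chests:
        if not chest.dirty:
            continue  # untouched chests cost no disk I/O
        path = save_dir / f"{chest.chest_id}.json"
        # Only occupied slots are stored, so empty slots add nothing to the file.
        path.write_text(json.dumps(chest.items))
        chest.dirty = False
        written += 1
    return written


if __name__ == "__main__":
    building = [Chest(f"chest_{i}") for i in range(10)]
    building[3].move_item_in("iron sword")
    with tempfile.TemporaryDirectory() as tmp:
        count = save_dirty_chests(building, Path(tmp))
        print(f"files written: {count}")
```

The point of the sketch is that one changed chest leads to one small write, where previously all ten chests in the building were rewritten together.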
With these upgrades, the previous map performance issues are no longer a problem for us, and despite the bumps along the way, the game is the smoothest it's ever been. That's without the multi-threading engine change, which is still going to be huge for us.
It's been a good week.