Today I learned something new: I finally found enough time to figure out why my Node Exporter dashboard in Grafana couldn’t pull from my remote server.
TL;DR on the Grafana issue: the dashboard that I use has the ability to switch between data sources. Unfortunately, the dashboard hard-codes the data sources into the graphs on import, which means switching data sources will update the host but not the data source that the graphs pull the host data from. Very frustrating. Instead of contributing to an open source dashboard and fixing the issue, I just have two dashboards, because I’m time poor.
Moving on: I noticed two things in my graphs that piqued my interest:
Firstly, we have this curious spiking-yet-not-spiking of the CPU cores. That’s not really all that interesting in and of itself, because these Docker images get hammered constantly; the interesting part is the context switching. I haven’t had to deal with context switching before because I haven’t really run a lot of Sandy Bridge gear; however, I have been informed by Pikachu that Coffee Lake and earlier processors suffer performance issues with context switching following the Meltdown/Spectre mitigations.
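If you want a number rather than a feeling for how much switching is actually going on, something like pidstat (from the sysstat package) can report per-process context switches. A rough sketch, using the same PID that shows up later in this post:

# report voluntary (cswch/s) and involuntary (nvcswch/s) context
# switches for one PID, sampled every second
pidstat -w -p 70982 1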
Having to actually care about this is new to me because I have only ever had to care about short-lived child processes that are created/killed as needed; long-running processes aren’t something that I have generally had to concern myself with. Now that the Ark population is growing and players are doing boss battles and increasing entity counts, the servers are starting to test that 1Gb connection and older-generation CPU architecture (and I bought the slow one too). Because Ark processes are long-lived things, it’s going to be important that I know how to care for these beloved processes in order to reduce the ‘noise’ in my Discord whenever there’s excessive lag.
One of the major pain points is sadly Genesis and Genesis 2. If you’re an Ark player or server admin, you’ll be (totally not) surprised to know that the boss battles on these maps are extremely lag sensitive and lag inducing. Interestingly, boss battles don’t seem to cause a lot of lag across the rest of the map. I don’t know if this is because multithreading was released since I started writing this article or because of something else, but it’s a little bit of a life saver.
Firstly, I need to confirm that my Docker containers are not pinning to a single CPU core. If they’re hopping cores, that alone might be enough to increase the in-game latency (theoretically speaking). You can check a task’s affinity by pulling the process ID with ps and using taskset to output that PID’s affinity:
root@dodo:~# taskset -c -p 70982
pid 70982's current affinity list: 0-11
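As an aside, taskset can also set the affinity of a running PID, which is handy if you just want to experiment with pinning without touching Docker at all. A quick sketch; the core number is arbitrary and the change only lasts until the process restarts:

# pin the already-running process to core 4 only (core choice is illustrative)
taskset -cp 4 70982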
Secondly, I need to confirm that Ark processes are indeed jumping CPU cores, which means I need to catch them in the act. This took me all of one second because it happened right before my eyes:
root@dodo:~# ps -o pid,psr,comm -p 70982
PID PSR COMMAND
70982 5 ShooterGameServ
root@dodo:~# ps -o pid,psr,comm -p 70982
PID PSR COMMAND
70982 6 ShooterGameServ
root@dodo:~# ps -o pid,psr,comm -p 70982
PID PSR COMMAND
70982 6 ShooterGameServ
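If you’re not lucky enough to catch it live, the lazy option is to run the same ps check on a loop, for example:

# re-run the PSR check every second and watch the core number hop
watch -n 1 'ps -o pid,psr,comm -p 70982'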
So we have proven that the Arks are playing musical vCPU cores while we’re trying to enjoy a night of being brutally punished by a game. For bonus points, you can actually use htop to watch this behavior. While writing this, the CPU changed every second. Simply press F2 (Setup) and add PROCESSOR to the columns.
The original idea was to tie each Docker container to a single CPU and impose a memory limit on a per-container basis. I had issues getting the restrictions to stick, and because it’s just a personal game server, I quite frankly couldn’t give two shits if they lag when someone loads into the Manticore. I never really bothered putting this in place, and I figured that Docker would just sort itself out. In fairness, it kind of is.
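For reference, the per-container restrictions are just standard Docker flags if you want to test them outside of Compose; the image and container names here are made up:

# start a container restricted to core 0 and 6 GB of RAM (values are illustrative)
docker run -d --name ark-island --cpuset-cpus="0" --memory="6g" my-ark-image

# or adjust a container that is already running
docker update --cpuset-cpus="1" ark-ragnarok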
We have two possible scenarios that could occur from CPU pinning: either the containers stop playing musical cores and the in-game lag improves, or a busy server gets starved on its one core and the lag gets worse. I’m gunning for the former.
There are two ways to do this: the easy way, and the proper way. I am obviously going to opt for the easy way because I’m about six shots deep; however, you can google ‘cgroups’ if you want to do things properly.
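For the curious, the ‘proper’ way on a cgroup v2 system looks roughly like the sketch below. This is from memory and I haven’t run it against these servers; the group name and core number are made up, and Docker already manages its own cgroups, so treat it as illustration only:

# create a cgroup, give it the cpuset controller, restrict it to core 0,
# then move the game server PID into it
mkdir /sys/fs/cgroup/ark-island
echo "+cpuset" > /sys/fs/cgroup/cgroup.subtree_control
echo 0 > /sys/fs/cgroup/ark-island/cpuset.cpus
echo 70982 > /sys/fs/cgroup/ark-island/cgroup.procs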
The easy way is to just edit my docker-compose file:
services:
  service:
    cpuset: "n"
In context:
services:
  service-1:
    cpuset: "0"
  service-2:
    cpuset: "1"
  service-3:
    cpuset: "2"
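Once the containers have been recreated, it’s worth sanity-checking that the restriction actually stuck; the container name below is a placeholder, since Compose will generate its own:

# recreate the containers so the new cpuset values take effect
docker compose up -d

# confirm the cpuset landed in the container config
docker inspect --format '{{.HostConfig.CpusetCpus}}' service-1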
If you’re doing something trickier than this then I cannot help you. I am specifically running 12 vCPUs on this system to run 12 long-running processes. My environment is extremely controlled in this regard. It would be better to have 13 vCPUs, one for the system and one for each container, however I don’t have time to suffer a VM reboot; a Docker restart is probably the limit of my tolerance at 10pm.
THE LAG GOT WORSE.