Online games no longer working since yesterday

Yesterday we noticed that the online games in our game stopped working; we can't get into matches because the connection is closed before we join. We thought it was a problem on our end, but then we tried older builds that had been working fine, and they stopped working as well.
Is there something in the Starter environment that would cause matchmaking behavior to change? Or have there been recent changes on AccelByte's end?

Please let me know if you need more info.

Thanks!

Hi @xalbertus1,

Sorry for the inconvenience you’re experiencing. Could you provide more details about what’s happening? We’ve tested our Byte Wars sample in the starter environment and were able to successfully perform matchmaking and join the DS without issues.

It seems from a quick look at your namespace that you were able to perform matchmaking successfully. For instance, session ID 44d896ca859545dc9ea29d6e9d530abc was able to claim a dedicated server (DS), as shown in the logs:

Additionally, the user was able to join the session:

You should be able to access this information by going to the Session & Party History page.

Hey @Damar, the issue is that the client attempted to connect to the dedicated server, but the dedicated server had yet to receive the serverClaimed message. Did something change that would cause either a) the server to not receive that message, or b) the server to receive that message later than it normally would? Is there a race condition here, in that clients can be told about the game session before the server knows it is claimed? We haven't run into this issue before; it just started happening in the past 48 hours, and we've been running on AB Starter for at least a month.

Looks like the hub stopped responding to pings after ~17 minutes:

[2024-09-10T06:31:52.840Z][WARN] LogHubConnection: [1] ping pong timeout
[2024-09-10T06:31:52.840Z][WARN] LogHubConnection: [1] closing with error code 3000
[2024-09-10T06:32:22.842Z][INFO] LogHubConnection: [1] closed with code 1006

…but our dedicated server doesn’t handle that well and doesn’t attempt to reconnect. I’ll make that change, but it’d also be nice to know why the hub is timing out all of a sudden.
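
For reference, here's roughly the shape of the change I'm planning on our end. Treat it as a sketch: ReconnectToHub() stands in for whatever our own hub wrapper exposes (it's not an AccelByte SDK call), and the retry count is arbitrary.

#include "CoreMinimal.h"

DEFINE_LOG_CATEGORY_STATIC(LogDSHubRecovery, Log, All);

// Hypothetical wrapper around our DS Hub connection; implemented in our own DS code.
bool ReconnectToHub(int32 MaxAttempts);

void HandleHubConnectionClosed(int32 CloseCode)
{
    UE_LOG(LogDSHubRecovery, Warning,
        TEXT("DS Hub closed (code %d), attempting to reconnect"), CloseCode);

    if (!ReconnectToHub(/*MaxAttempts=*/5))
    {
        // Without the hub connection this server will never receive the
        // serverClaimed message, even though it can still be assigned.
        UE_LOG(LogDSHubRecovery, Error, TEXT("Failed to reconnect to DS Hub"));
    }
}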

This is strange: how was our server still able to be claimed even though our connection to the hub had been closed 13 hours prior?

// Hub connection open
[2024-09-10T04:16:18.883Z][INFO] LogHubConnection: [1] open
[2024-09-10T04:16:18.885Z][INFO] LogHub: received topic: DSHUB_CONNECTED

...

// Hub connection closed due to timeout (~2 hours later)
[2024-09-10T06:31:52.840Z][WARN] LogHubConnection: [1] ping pong timeout
[2024-09-10T06:31:52.840Z][WARN] LogHubConnection: [1] closing with error code 3000
[2024-09-10T06:32:22.842Z][INFO] LogHubConnection: [1] closed with code 1006

...

// Client connection arrives before the server has been claimed (~13 hours later)
[2024-09-10T19:37:31.649Z][WARN] LogListenServer: connection established prior to being claimed

Hi @njupshot,

Thank you for the information. Disconnection from DS Hub can occur for various reasons, such as:

  1. Poor network connection
  2. Service pod relocation to another node

Both the Unreal and Unity SDKs already have mechanisms to handle this issue. To give further detail on how the Unreal SDK handles the websocket disconnection event: there is automatic reconnection in the Websocket class, and the current behavior is as follows:

  1. OnConnectionClosed is triggered by the disconnection. The SDK reads the close code, and there are three different categories:
  • Normal closure (1000): the websocket connection is closed by request from the client/server. The SDK will not trigger any reconnection.
  • Abnormal closure (1001 - 1015): the websocket connection is closed for unexpected reasons and follows the mechanism defined in the RFC (link). The SDK will trigger auto reconnection.
  • Backend-defined closure (3000 - 4099): the websocket connection is closed by the backend for certain reasons. The SDK will trigger auto reconnection for close codes 3000 to 4000; 4001 and above will not trigger any reconnection.
  2. Auto reconnection uses backoff retry: when a reconnect attempt fails, the SDK applies a delay that grows exponentially from the previous delay before trying again.
  3. Auto reconnection has a timeout duration, 60 seconds by default. Once the SDK has retried for those 60 seconds without success, it gives up and triggers the OnDisconnected delegate.

There may be cases where the DS is unable to reconnect to the DS Hub; in that situation, the DS should shut down immediately to avoid further issues.
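
To make that easier to follow, here is a simplified illustration of the decision logic in plain C++. This is not the SDK source, just the rules from points 1-3 above; the initial retry delay is illustrative.

#include <cstdint>

// Close-code categories: decide whether auto reconnection should happen.
bool ShouldAutoReconnect(int32_t CloseCode)
{
    if (CloseCode == 1000)                      return false; // normal closure
    if (CloseCode >= 1001 && CloseCode <= 1015) return true;  // abnormal closure (RFC)
    if (CloseCode >= 3000 && CloseCode <= 4000) return true;  // backend-defined closure
    return false;                                             // 4001 and above: no reconnection
}

// Backoff retry: each failed attempt doubles the previous delay.
// The 1-second initial delay here is illustrative, not the SDK's actual value.
double NextReconnectDelaySec(double PreviousDelaySec)
{
    return (PreviousDelaySec <= 0.0) ? 1.0 : PreviousDelaySec * 2.0;
}

// After the reconnect timeout (60 seconds by default) the SDK gives up
// and triggers the OnDisconnected delegate.
bool HasReconnectTimedOut(double ElapsedSec, double TimeoutSec = 60.0)
{
    return ElapsedSec >= TimeoutSec;
}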


This is great info, thanks @Damar

@Damar I still wanted to get clarification on something: my understanding is that the dedi's connection to the hub is what allows AB to know that it's available to host games. If that connection is closed, does that not inform AB that the server is no longer available for hosting? In other words, why in this case did AB think the server could host the game if the realtime connection it had to the hub was closed?

@njupshot the communication between DS and DS Hub is primarily for our service to send notifications to the DS. However, the primary source of truth regarding the DS’s health comes from AMS. As long as AMS considers the DS active, it will continue to be regarded as operational.

If the DS doesn’t want to be assigned or can’t recover from a persistent failure (such as failing to reconnect after retries), it should unregister itself from AMS. This can be achieved by simply shutting down the DS, which will ensure that AMS no longer treats it as active and will stop assigning new sessions to it.

Thanks again @Damar, that all makes a lot of sense. I assume, then, that AMS knows about the health of the DS via the watchdog connection. This is all great to know, and I'll update our implementation appropriately. Thanks!
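
For anyone who finds this thread later, here's roughly what we'll hook up: once hub reconnection has permanently failed (the SDK's OnDisconnected-style event), shut the DS down so AMS deregisters it and stops assigning sessions to it. The handler name and its binding are our own (binding not shown); the shutdown call is stock Unreal.

#include "CoreMinimal.h"
#include "HAL/PlatformMisc.h"

DEFINE_LOG_CATEGORY_STATIC(LogDSLifecycle, Log, All);

// Bound to whatever "hub permanently disconnected" event our wrapper exposes.
void HandleHubPermanentlyDisconnected()
{
    UE_LOG(LogDSLifecycle, Error,
        TEXT("DS Hub reconnection gave up; shutting down so AMS stops assigning sessions to this server"));

    // Shutting the process down is what deregisters the DS from AMS.
    FPlatformMisc::RequestExit(/*Force=*/false);
}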