Troubleshooting

Common failure modes and recovery procedures for the Virtufin WebSocketManager.

See also: Development guide → Troubleshooting for service-startup, port-conflict, Dapr sidecar, and protoc-generation issues. Deployment guide → Troubleshooting for Kubernetes pod, image-pull, and readiness/liveness probe issues.

Operation-level failures

"Connection not found" when sending a message

Symptom: Send, SendRaw, StartPublish, or StopPublish returns success=false, error="Connection not found".

Likely causes:

  • The connection ID is wrong (a typo, or stale state from a previous instance)
  • The connection was reclaimed by the ConnectionReclaimerHostedService because the owning instance died
  • The connection was explicitly disconnected and removed from the state store

Diagnose:

# List all active connections
curl http://localhost:5001/connections

# Check the reclaim sweeper logs
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-websocketmanager | grep -i "reclaim"

# Check if the owning instance is still running
kubectl get pods -n virtufin -l app.kubernetes.io/name=virtufin-websocketmanager

"Connection owned by different instance" error

Symptom: Operations return error="Connection owned by different instance: <instance-id>".

Likely cause: The current pod's HOSTNAME does not match the connection's InstanceId. This happens after a rolling update where the new pod receives a new HOSTNAME before the old pod's connections have been reclaimed.

Fix: Wait for the ConnectionReclaimerHostedService (30-second interval) to reclaim the connection. The new pod can then issue a Connect to recreate the connection under the new InstanceId.
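The ownership check behind this error can be pictured as a comparison between the pod's HOSTNAME and the InstanceId stored with the connection. The following is a minimal sketch, not the service's actual code: the `check_ownership` helper and the dict-shaped connection record are assumptions made for illustration.

```python
import os
from typing import Optional


def check_ownership(connection: dict, hostname: Optional[str] = None) -> Optional[str]:
    """Return an error string if this pod does not own the connection, else None.

    `connection` is a hypothetical record with an "InstanceId" key; the real
    state-store schema may differ.
    """
    # Fall back to the pod's HOSTNAME environment variable when none is passed in.
    current = hostname if hostname is not None else os.environ.get("HOSTNAME", "")
    owner = connection.get("InstanceId")
    if owner != current:
        return f"Connection owned by different instance: {owner}"
    return None
```

After a rolling update the new pod has a new HOSTNAME, so every record still carrying the old pod's InstanceId fails this comparison until it is reclaimed.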

"WebSocket is not connected" error

Symptom: Send or SendRaw returns error="WebSocket is not connected".

Likely cause: The underlying ClientWebSocket has been closed or the remote endpoint has dropped the connection.

Fix:

  1. If autoReconnect=true, the wrapper attempts reconnection with exponential backoff (up to ReconnectMaxAttempts retries).
  2. If autoReconnect=false, the connection must be re-established manually by calling Connect again.
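To give a feel for how long an exponential-backoff reconnect can take, here is a sketch of a delay schedule. Only the retry limit (ReconnectMaxAttempts) comes from this guide; the base delay, growth factor, and cap below are assumed values, and the wrapper's real schedule may differ.

```python
from typing import List


def backoff_delays(max_attempts: int, base_seconds: float = 1.0,
                   factor: float = 2.0, cap_seconds: float = 30.0) -> List[float]:
    """Delay before each retry: base * factor**attempt, capped at cap_seconds.

    base_seconds, factor, and cap_seconds are illustrative assumptions.
    """
    return [min(cap_seconds, base_seconds * factor ** attempt)
            for attempt in range(max_attempts)]


# Total wait before giving up, under the assumed parameters.
total_wait = sum(backoff_delays(5))
```

With these assumed parameters, five attempts wait 1, 2, 4, 8, and 16 seconds, so a connection can stay in the "not connected" state for roughly half a minute before the wrapper gives up.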

"Request timeout" from SendAsync

Symptom: SendAsync returns error="Request timeout" after the configured timeout.

Likely cause: The remote WebSocket server did not send a response with a matching correlation UUID (16-byte prefix) within the timeout window.
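The correlation mechanism can be sketched as follows: the request carries a 16-byte UUID prefix, and a response only counts if its first 16 bytes match a pending request. The function names and the byte layout (`uuid.UUID.bytes`, big-endian) are assumptions for illustration; the actual wire encoding used by the service may differ (for example, .NET's `Guid.ToByteArray` uses a mixed-endian layout).

```python
import uuid
from typing import Optional, Set, Tuple

PREFIX_LEN = 16  # the correlation UUID occupies the first 16 bytes of a frame


def frame_request(payload: bytes) -> Tuple[uuid.UUID, bytes]:
    """Prefix the payload with a fresh correlation UUID (hypothetical layout)."""
    correlation_id = uuid.uuid4()
    return correlation_id, correlation_id.bytes + payload


def match_response(frame: bytes, pending: Set[uuid.UUID]) -> Optional[Tuple[uuid.UUID, bytes]]:
    """Return (correlation_id, body) if the frame answers a pending request."""
    if len(frame) < PREFIX_LEN:
        return None  # too short to carry a correlation prefix
    correlation_id = uuid.UUID(bytes=frame[:PREFIX_LEN])
    if correlation_id not in pending:
        return None  # unrelated or late response; the request keeps waiting
    return correlation_id, frame[PREFIX_LEN:]
```

A server that replies without echoing the prefix, or echoes it in a different byte order, produces frames that never match, and the caller sees "Request timeout" even though data arrived.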

Diagnose:

# Check the receive-loop logs
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-websocketmanager | grep -i "correlation"

# Increase the timeout in the client
await client.send(connection_id, message, timeoutMs=60000)  # 60s instead of default
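The four error strings covered so far can be mapped to a first recovery step in client code. This is a sketch only: the `triage` helper is hypothetical, and it assumes the error strings arrive exactly as quoted in this guide.

```python
def triage(error: str) -> str:
    """Map an error string from this guide to a suggested next step."""
    # The owning instance ID is appended to this message, so match on the prefix.
    if error.startswith("Connection owned by different instance"):
        return "wait for the reclaim sweep, then Connect again"
    if error == "Connection not found":
        return "verify the connection ID and check the reclaim logs"
    if error == "WebSocket is not connected":
        return "rely on autoReconnect or Connect again manually"
    if error == "Request timeout":
        return "check the correlation logs or raise the timeout"
    return "unrecognized error: gather details and report"
```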

Connection lifecycle failures

Pod restarts and orphaned connections

Symptom: After a pod restart, the new pod reports connections that don't match its InstanceId.

Expected behavior: The ConnectionReclaimerHostedService (30-second interval) detects dead instances and reassigns their connections. The new pod then sees the reclaimed connections with its own InstanceId.
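One sweep of the reclaimer can be pictured as the loop below. This is a simplified sketch under assumptions: the record shape, the `reclaim` function, and the way live instances are supplied are all hypothetical, and this guide does not describe how the real service detects dead instances.

```python
from typing import Dict, List, Set


def reclaim(connections: Dict[str, dict], live_instances: Set[str],
            new_owner: str) -> List[str]:
    """Reassign connections whose owner is not alive; return the reclaimed IDs.

    `connections` maps connection ID -> record with an "InstanceId" key
    (an assumed shape, not the real state-store schema).
    """
    reclaimed = []
    for connection_id, record in connections.items():
        if record["InstanceId"] not in live_instances:
            record["InstanceId"] = new_owner  # hand the orphan to the sweeping pod
            reclaimed.append(connection_id)
    return sorted(reclaimed)
```

Because the sweep runs on an interval, connections owned by a dead pod remain mismatched until the next pass completes.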

Force a clean slate:

# Set CLEAR_ALL_ON_START=TRUE in the deployment env.
# On startup this clears ALL connections from the state store
# instead of reclaiming them.

kubectl set env deployment/virtufin-websocketmanager CLEAR_ALL_ON_START=TRUE -n virtufin
kubectl rollout restart deployment/virtufin-websocketmanager -n virtufin

Warning: CLEAR_ALL_ON_START=TRUE is destructive — it disconnects all active WebSocket clients. Use only during planned maintenance or recovery from a known-bad state.

State store is out of sync

Symptom: The local cache shows connections that don't appear in the distributed state store, or vice versa.

Diagnose:

# Inspect the state store directly
docker exec -it dapr_redis_1 redis-cli KEYS 'websocket-*'
docker exec -it dapr_redis_1 redis-cli SMEMBERS websocket-index

Fix: Restart all WebSocketManager replicas to force a fresh IWebSocketConnectionStore.GetAllConnectionsAsync call and rebuild the local cache.
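Before restarting, it can help to quantify the drift. The sketch below compares two lists of connection IDs, for example one taken from the /connections endpoint and one from SMEMBERS websocket-index. It assumes both sources yield IDs in the same textual form, which this guide does not confirm, and `diff_connection_ids` is a hypothetical helper.

```python
from typing import Dict, Iterable, List


def diff_connection_ids(local_ids: Iterable[str],
                        store_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Report IDs present on only one side of the cache/store pair."""
    local, store = set(local_ids), set(store_ids)
    return {
        "only_local": sorted(local - store),   # cached but missing from the store
        "only_store": sorted(store - local),   # stored but missing from the cache
    }
```

Two empty lists mean the cache and the store agree on the set of IDs, in which case a restart is unlikely to change anything.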

Pub/Sub failures

Published messages not reaching subscribers

Symptom: StartPublish returns success but downstream subscribers never receive the messages.

Likely cause: The Dapr pub/sub component (Redis/Valkey pubsub) is not configured, or the topic name in StartPublish doesn't match the subscriber's expected topic.

Diagnose:

# Check the publisher logs
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-websocketmanager | grep -i "publish\|topic"

# Check the Dapr pub/sub component
kubectl get components -n virtufin pubsub -o yaml

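# Verify the topic exists in Redis
# (Dapr's Redis pub/sub component is backed by Redis Streams, so also check the topic key's type)
docker exec -it dapr_redis_1 redis-cli PUBSUB CHANNELS '*'
docker exec -it dapr_redis_1 redis-cli TYPE <topic-name>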

Recovery procedures

Restart the WebSocketManager

kubectl rollout restart deployment/virtufin-websocketmanager -n virtufin
kubectl rollout status deployment/virtufin-websocketmanager -n virtufin

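A rolling restart is designed to be non-disruptive: the ConnectionReclaimerHostedService moves connections to surviving replicas. Until the next reclaim sweep (30-second interval), operations on affected connections may return "Connection owned by different instance".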

Wipe all connections

# Manual: clear the state store directly
docker exec -it dapr_redis_1 redis-cli DEL websocket-index
# No -t on the scan side: a TTY emits CRLF line endings, which can leave a trailing \r on each key
docker exec dapr_redis_1 redis-cli --scan --pattern 'websocket-*' | xargs -L 100 docker exec -i dapr_redis_1 redis-cli DEL

Note: This is destructive and will disconnect all active WebSocket clients. Use only for recovery from a corrupted state.

Reporting issues

If none of the above resolves your issue, gather the following and file a Gitea issue:

  • WebSocketManager version (LIBRARY_VERSION from the build info endpoint)
  • Connection ID(s) affected
  • Owning instance ID (from the connection's InstanceId field)
  • Full error message and gRPC status code
  • Relevant log lines from the WebSocketManager and any affected backend