Troubleshooting

Common failure modes and recovery procedures for the Virtufin WebSocketManager.

See also: Development guide → Troubleshooting for service-startup, port-conflict, Dapr sidecar, and protoc-generation issues. Deployment guide → Troubleshooting for Kubernetes pod, image-pull, and readiness/liveness probe issues.

Operation-level failures

"Connection not found" when sending a message

Symptom: Send, SendRaw, StartPublish, or StopPublish returns success=false, error="Connection not found".

Likely causes:
- The connection ID is wrong (typo, stale state from a previous instance)
- The connection was reclaimed by the ConnectionReclaimerHostedService because the owning instance died
- The connection was explicitly disconnected and removed from the state store

Diagnose:

# List all active connections
curl http://localhost:5001/connections

# Check the reclaim sweeper logs
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-websocketmanager | grep -i "reclaim"

# Check if the owning instance is still running
kubectl get pods -n virtufin -l app.kubernetes.io/name=virtufin-websocketmanager

"Connection owned by different instance" error

Symptom: Operations return error="Connection owned by different instance: <instance-id>".

Likely cause: The current pod's HOSTNAME does not match the connection's InstanceId. This happens after a rolling update where the new pod receives a new HOSTNAME before the old pod's connections have been reclaimed.

Fix: Wait for the ConnectionReclaimerHostedService (30-second interval) to reclaim the connection. The new pod can then issue a Connect to recreate the connection under the new InstanceId.
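The ownership check behind this error can be pictured as a comparison between the pod's HOSTNAME and the connection's InstanceId. The sketch below is an illustration only: the function name `ownership_error` is hypothetical, and it assumes the `<instance-id>` in the message is the owning instance's ID, which this guide does not state explicitly.

```python
import os
from typing import Optional


def ownership_error(connection_instance_id: str,
                    current_instance_id: Optional[str] = None) -> Optional[str]:
    """Return the error text if this pod does not own the connection, else None.

    Hypothetical helper: the real service performs this check internally.
    """
    if current_instance_id is None:
        # Assumption: the pod derives its instance ID from the HOSTNAME variable.
        current_instance_id = os.environ.get("HOSTNAME", "")
    if connection_instance_id != current_instance_id:
        return f"Connection owned by different instance: {connection_instance_id}"
    return None


if __name__ == "__main__":
    # The old pod still owns the connection, so the new pod's operation is rejected.
    print(ownership_error("websocketmanager-old", "websocketmanager-new"))
```

Comparing the ID printed in the error with the pod names from kubectl get pods shows whether the owner is a pod that no longer exists.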

"WebSocket is not connected" error

Symptom: Send or SendRaw returns error="WebSocket is not connected".

Likely cause: The underlying ClientWebSocket has been closed or the remote endpoint has dropped the connection.

Fix:
1. If autoReconnect=true, the wrapper attempts reconnection with exponential backoff, up to ReconnectMaxAttempts retries.
2. If autoReconnect=false, reconnect manually by issuing a new Connect.
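To get a feel for how long an automatic reconnect can take, the exponential backoff schedule can be sketched as below. This is not the wrapper's actual code: the base delay, the cap, and the name `backoff_delays` are assumptions made for the illustration, and only the ReconnectMaxAttempts limit comes from this guide.

```python
from typing import List


def backoff_delays(max_attempts: int, base_s: float = 1.0, cap_s: float = 30.0) -> List[float]:
    """Return the wait before each retry: base, 2*base, 4*base, ... capped at cap_s.

    max_attempts stands in for ReconnectMaxAttempts; base_s and cap_s are
    hypothetical values, not the service's real settings.
    """
    return [min(cap_s, base_s * (2 ** attempt)) for attempt in range(max_attempts)]


if __name__ == "__main__":
    delays = backoff_delays(6)
    for attempt, delay in enumerate(delays, start=1):
        print(f"retry {attempt}: wait {delay:.0f}s")
    print(f"worst case before giving up: {sum(delays):.0f}s")
```

Summing the delays gives an upper bound on how long "WebSocket is not connected" can persist before the retries are exhausted and a manual Connect is required.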

"Request timeout" from SendAsync

Symptom: SendAsync returns error="Request timeout" after the configured timeout.

Likely cause: The remote WebSocket server did not send a response with a matching correlation UUID (16-byte prefix) within the timeout window.
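The correlation scheme can be sketched as a 16-byte UUID placed in front of the payload. The helper names `frame_request` and `split_response` are hypothetical, and the byte order is an assumption: the sketch uses the big-endian layout of Python's `UUID.bytes`, whereas a .NET service that serializes with `Guid.ToByteArray` produces a mixed-endian layout (`bytes_le` in Python). A byte-order mismatch on the remote side is one way to end up with replies that never match.

```python
import uuid
from typing import Tuple

PREFIX_LEN = 16  # a UUID occupies 16 bytes in binary form


def frame_request(payload: bytes) -> Tuple[uuid.UUID, bytes]:
    """Prefix the payload with a fresh correlation UUID (assumed big-endian)."""
    correlation_id = uuid.uuid4()
    return correlation_id, correlation_id.bytes + payload


def split_response(frame: bytes) -> Tuple[uuid.UUID, bytes]:
    """Split a received frame into its correlation UUID and the remaining payload."""
    if len(frame) < PREFIX_LEN:
        raise ValueError("frame too short to carry a correlation UUID")
    return uuid.UUID(bytes=frame[:PREFIX_LEN]), frame[PREFIX_LEN:]


if __name__ == "__main__":
    sent_id, frame = frame_request(b'{"op":"ping"}')
    # A well-behaved server echoes the same 16-byte prefix on its reply.
    received_id, body = split_response(frame)
    print(sent_id == received_id, body)
```

When checking the remote server, confirm that it copies the prefix back unchanged rather than generating its own ID or re-encoding it as text.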

Diagnose:

# Check the receive-loop logs
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-websocketmanager | grep -i "correlation"

# Increase the timeout in the client
await client.send(connection_id, message, timeoutMs=60000)  # 60s instead of default
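How a reply is matched to a waiting request, and why an unmatched reply ends in "Request timeout", can be sketched with a small registry of pending futures. The class `PendingRequests` and its methods are hypothetical and only illustrate the pattern; they are not the client's or the service's API.

```python
import asyncio
import uuid
from typing import Dict


class PendingRequests:
    """Hypothetical registry mapping correlation UUIDs to waiting callers."""

    def __init__(self) -> None:
        self._waiters: Dict[uuid.UUID, asyncio.Future] = {}

    async def wait_for_reply(self, correlation_id: uuid.UUID, timeout_ms: int) -> bytes:
        future = asyncio.get_running_loop().create_future()
        self._waiters[correlation_id] = future
        try:
            return await asyncio.wait_for(future, timeout_ms / 1000)
        except asyncio.TimeoutError:
            raise TimeoutError("Request timeout") from None
        finally:
            # Always unregister, so a late reply is dropped instead of leaking.
            self._waiters.pop(correlation_id, None)

    def resolve(self, correlation_id: uuid.UUID, payload: bytes) -> bool:
        """Called by the receive loop; returns False if nobody is waiting."""
        future = self._waiters.get(correlation_id)
        if future is None or future.done():
            return False
        future.set_result(payload)
        return True
```

A reply that arrives after the timeout, or with a different UUID, makes `resolve` return False. Replies of that kind in the receive-loop logs suggest a slow or non-echoing server rather than a lost message.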

Connection lifecycle failures

Pod restarts and orphaned connections

Symptom: After a pod restart, the new pod reports connections that don't match its InstanceId.

Expected behavior: The ConnectionReclaimerHostedService (30-second interval) detects dead instances and reassigns their connections. The new pod then sees the reclaimed connections with its own InstanceId.
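The reassignment step can be pictured as follows. This is a sketch under stated assumptions: the function `reclaim_orphans` is hypothetical, the set of live instances is passed in because this guide does not describe how dead instances are detected, and the real sweeper works against the state store rather than a dictionary.

```python
from typing import Dict, Set


def reclaim_orphans(owners: Dict[str, str], live_instances: Set[str],
                    new_owner: str) -> Dict[str, str]:
    """Return a copy of the connection-ID -> InstanceId map with dead owners replaced.

    Connections whose owner is still alive are left untouched.
    """
    return {
        connection_id: (instance_id if instance_id in live_instances else new_owner)
        for connection_id, instance_id in owners.items()
    }


if __name__ == "__main__":
    owners = {"conn-1": "pod-old", "conn-2": "pod-b"}
    # pod-old is gone; pod-new runs the sweep and takes over its connection.
    print(reclaim_orphans(owners, live_instances={"pod-b", "pod-new"}, new_owner="pod-new"))
```

Until the next sweep runs, the map still points at the dead pod, which is why mismatched InstanceIds are expected for up to one sweep interval after a restart.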

Force a clean slate:

# Set CLEAR_ALL_ON_START=TRUE in the deployment env.
# On startup this clears ALL connections from the state store
# instead of reclaiming the existing connection state.

kubectl set env deployment/virtufin-websocketmanager -n virtufin CLEAR_ALL_ON_START=TRUE
kubectl rollout restart deployment/virtufin-websocketmanager -n virtufin

Warning: CLEAR_ALL_ON_START=TRUE is destructive — it disconnects all active WebSocket clients. Use only during planned maintenance or recovery from a known-bad state.

State store is out of sync

Symptom: The local cache shows connections that don't appear in the distributed state store, or vice versa.

Diagnose:

# Inspect the state store directly
docker exec -it dapr_redis_1 redis-cli KEYS 'websocket-*'
docker exec -it dapr_redis_1 redis-cli SMEMBERS websocket-index

Fix: Restart all WebSocketManager replicas to force a fresh IWebSocketConnectionStore.GetAllConnectionsAsync call and rebuild the local cache.
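Before restarting, it can help to see exactly which connection IDs disagree. The sketch below compares two lists of IDs, for example the output of the /connections endpoint and the members of websocket-index. The function name `diff_connections` is hypothetical, and extracting the IDs from either source is left out because their exact formats are not described in this guide.

```python
from typing import Iterable, Set, Tuple


def diff_connections(local_cache: Iterable[str],
                     state_store: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (IDs only in the local cache, IDs only in the state store)."""
    local, remote = set(local_cache), set(state_store)
    return local - remote, remote - local


if __name__ == "__main__":
    only_local, only_store = diff_connections(
        ["conn-1", "conn-2"],  # IDs reported by the replica
        ["conn-2", "conn-3"],  # IDs present in the state store
    )
    print("only in local cache:", sorted(only_local))
    print("only in state store:", sorted(only_store))
```

IDs that exist only in the local cache point at a stale cache, while IDs that exist only in the state store may belong to another replica or to an instance that has not been reclaimed yet.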

Pub/Sub failures

Published messages not reaching subscribers

Symptom: StartPublish returns success but downstream subscribers never receive the messages.

Likely cause: The Dapr pub/sub component (Redis/Valkey pubsub) is not configured, or the topic name in StartPublish doesn't match the subscriber's expected topic.

Diagnose:

# Check the publisher logs
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-websocketmanager | grep -i "publish\|topic"

# Check the Dapr pub/sub component
kubectl get components -n virtufin pubsub -o yaml

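# Verify the topic exists in Redis (Dapr's Redis pub/sub component stores each
# topic as a stream; TYPE prints "stream" if it is present, "none" if not)
docker exec -it dapr_redis_1 redis-cli TYPE '<topic-name>'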

Recovery procedures

Restart the WebSocketManager

kubectl rollout restart deployment/virtufin-websocketmanager -n virtufin
kubectl rollout status deployment/virtufin-websocketmanager -n virtufin

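A rolling restart requires no downtime of the service itself, but it is not invisible to callers: operations on connections owned by a terminated pod can fail with "Connection owned by different instance" until the ConnectionReclaimerHostedService (30-second interval) reclaims them.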

Wipe all connections

# Manual: clear the state store directly
docker exec -it dapr_redis_1 redis-cli DEL websocket-index
# No -t here: a TTY would append carriage returns to the piped key names
docker exec dapr_redis_1 redis-cli --scan --pattern 'websocket-*' | xargs -L 100 docker exec dapr_redis_1 redis-cli DEL

Note: This is destructive and will disconnect all active WebSocket clients. Use only for recovery from a corrupted state.

Reporting issues

If none of the above resolves your issue, gather the following and file a Gitea issue:

- WebSocketManager version (LIBRARY_VERSION from the build info endpoint)
- Connection ID(s) affected
- Owning instance ID (from the connection's InstanceId field)
- Full error message and gRPC status code
- Relevant log lines from the WebSocketManager and any affected backend