Troubleshooting
Common failure modes and recovery procedures for the Virtufin WebSocketManager.
See also: Development guide → Troubleshooting for service-startup, port-conflict, Dapr sidecar, and protoc-generation issues. Deployment guide → Troubleshooting for Kubernetes pod, image-pull, and readiness/liveness probe issues.
Operation-level failures
"Connection not found" when sending a message
Symptom: Send, SendRaw, StartPublish, or StopPublish returns
success=false, error="Connection not found".
Likely causes:
- The connection ID is wrong (typo, stale state from a previous instance)
- The connection was reclaimed by the ConnectionReclaimerHostedService
because the owning instance died
- The connection was explicitly disconnected and removed from the state store
Diagnose:
# List all active connections
curl http://localhost:5001/connections
# Check the reclaim sweeper logs
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-websocketmanager | grep -i "reclaim"
# Check if the owning instance is still running
kubectl get pods -n virtufin -l app.kubernetes.io/name=virtufin-websocketmanager
"Connection owned by different instance" error
Symptom: Operations return error="Connection owned by different instance: <instance-id>".
Likely cause: The current pod's HOSTNAME does not match the connection's
InstanceId. This happens after a rolling update where the new pod receives a
new HOSTNAME before the old pod's connections have been reclaimed.
Fix: Wait for the ConnectionReclaimerHostedService (30-second interval) to
reclaim the connection. The new pod can then issue a Connect to recreate the
connection under the new InstanceId.
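A client can tell this error apart from other failures by extracting the owning instance ID from the message. The following is a minimal sketch, assuming the error text has exactly the `Connection owned by different instance: <instance-id>` shape shown in the symptom; `parse_owner` and `should_wait_for_reclaim` are hypothetical helpers, not part of the WebSocketManager client.

```python
import re
from typing import Optional

# Assumed message shape, copied from the symptom text above.
_OWNER_RE = re.compile(r"^Connection owned by different instance: (?P<owner>\S+)$")


def parse_owner(error: str) -> Optional[str]:
    """Return the owning instance ID, or None if this is a different error."""
    match = _OWNER_RE.match(error.strip())
    return match.group("owner") if match else None


def should_wait_for_reclaim(error: str, my_instance_id: str) -> bool:
    """True when another instance owns the connection, so waiting for a sweep makes sense."""
    owner = parse_owner(error)
    return owner is not None and owner != my_instance_id
```

When `should_wait_for_reclaim` returns true, waiting at least one reclaim interval before issuing Connect again matches the fix described above.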
"WebSocket is not connected" error
Symptom: Send or SendRaw returns error="WebSocket is not connected".
Likely cause: The underlying ClientWebSocket has been closed or the
remote endpoint has dropped the connection.
Fix:
1. If autoReconnect=true, the wrapper will attempt reconnection with
exponential backoff (up to ReconnectMaxAttempts retries)
2. If autoReconnect=false, call Connect again manually to re-establish the connection
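To estimate how long automatic reconnection might take before it gives up, it can help to write out the delay schedule. This is a sketch of a generic capped exponential backoff, not the wrapper's actual implementation: the base delay, growth factor, and cap are illustrative assumptions, and only the retry limit (ReconnectMaxAttempts) comes from the text above.

```python
from typing import List


def backoff_delays(max_attempts: int, base_seconds: float = 1.0,
                   factor: float = 2.0, cap_seconds: float = 30.0) -> List[float]:
    """Delay before each retry: base * factor**attempt, capped at cap_seconds.

    base_seconds, factor, and cap_seconds are assumed values for illustration.
    """
    return [min(base_seconds * factor ** attempt, cap_seconds)
            for attempt in range(max_attempts)]
```

Under these assumed values, `sum(backoff_delays(n))` gives a rough upper bound on how long a connection with `n` allowed attempts stays in the reconnecting state.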
"Request timeout" from SendAsync
Symptom: SendAsync returns error="Request timeout" after the configured
timeout.
Likely cause: The remote WebSocket server did not send a response with a matching correlation UUID (16-byte prefix) within the timeout window.
Diagnose:
# Check the receive-loop logs
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-websocketmanager | grep -i "correlation"
# Increase the timeout in the client
await client.send(connection_id, message, timeoutMs=60000) # 60s instead of default
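The correlation mechanism behind this timeout can be illustrated with a small sketch. It assumes the 16-byte prefix is the raw byte form of a UUID and that the remote server echoes the same prefix on its response; both are assumptions about the wire format, and `frame_request` and `match_response` are hypothetical names.

```python
import uuid
from typing import Dict, Optional, Tuple

PREFIX_LEN = 16  # correlation UUID length in bytes, per the description above


def frame_request(payload: bytes) -> Tuple[uuid.UUID, bytes]:
    """Prepend a fresh correlation UUID (16 raw bytes) to the payload."""
    correlation_id = uuid.uuid4()
    return correlation_id, correlation_id.bytes + payload


def match_response(frame: bytes,
                   pending: Dict[uuid.UUID, str]) -> Optional[Tuple[str, bytes]]:
    """Return (request label, body) if the frame's prefix matches a pending request."""
    if len(frame) < PREFIX_LEN:
        return None  # too short to carry a correlation prefix
    correlation_id = uuid.UUID(bytes=frame[:PREFIX_LEN])
    label = pending.pop(correlation_id, None)
    if label is None:
        return None  # unknown prefix: the waiting caller would eventually time out
    return label, frame[PREFIX_LEN:]
```

If the remote server replies without the prefix, or with a different one, no pending request matches and the caller sees "Request timeout" even though data arrived.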
Connection lifecycle failures
Pod restarts and orphaned connections
Symptom: After a pod restart, the new pod reports connections that don't
match its InstanceId.
Expected behavior: The ConnectionReclaimerHostedService (30-second
interval) detects dead instances and reassigns their connections. The new pod
then sees the reclaimed connections with its own InstanceId.
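The expected reassignment can be pictured with the sketch below. It is not the ConnectionReclaimerHostedService code: how the service decides that an instance is dead is not described here, so the set of live instances is simply passed in, and the `Connection` type and `reclaim` function are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Connection:
    connection_id: str
    instance_id: str  # corresponds to the InstanceId field


def reclaim(connections: Iterable[Connection], live_instances: Set[str],
            my_instance_id: str) -> List[Connection]:
    """Reassign connections whose owner is no longer live to my_instance_id."""
    return [
        conn if conn.instance_id in live_instances
        else replace(conn, instance_id=my_instance_id)
        for conn in connections
    ]
```

Connections owned by live instances are left untouched, which is why a mismatch that persists for much longer than one sweep interval points to a problem rather than to normal behavior.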
Force a clean slate:
# Set CLEAR_ALL_ON_START=TRUE in the deployment env.
# This clears ALL connections from the state store on startup
# instead of reclaiming existing connection state.
kubectl set env deployment/virtufin-websocketmanager -n virtufin CLEAR_ALL_ON_START=TRUE
kubectl rollout restart deployment/virtufin-websocketmanager -n virtufin
Warning: CLEAR_ALL_ON_START=TRUE is destructive: it disconnects all active WebSocket clients. Use only during planned maintenance or recovery from a known-bad state.
State store is out of sync
Symptom: The local cache shows connections that don't appear in the distributed state store, or vice versa.
Diagnose:
# Inspect the state store directly
docker exec -it dapr_redis_1 redis-cli KEYS 'websocket-*'
docker exec -it dapr_redis_1 redis-cli SMEMBERS websocket-index
Fix: Restart all WebSocketManager replicas to force a fresh
IWebSocketConnectionStore.GetAllConnectionsAsync call and rebuild the local
cache.
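Before restarting every replica, it can be useful to confirm which side is stale. The sketch below compares two sets of connection IDs; it assumes you have already collected the IDs from the local cache (for example via the /connections endpoint) and from the state store index, and `diff_connection_ids` is a hypothetical helper.

```python
from typing import Dict, Set


def diff_connection_ids(cache_ids: Set[str], store_ids: Set[str]) -> Dict[str, Set[str]]:
    """Report connection IDs that appear on only one side."""
    return {
        "only_in_cache": cache_ids - store_ids,  # stale local entries
        "only_in_store": store_ids - cache_ids,  # entries the cache has not loaded
    }
```

Two empty sets in the result mean the cache and the store agree, and a restart is unlikely to change anything.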
Pub/Sub failures
Published messages not reaching subscribers
Symptom: StartPublish returns success but downstream subscribers never
receive the messages.
Likely cause: The Dapr pub/sub component (Redis/Valkey pubsub) is not
configured, or the topic name in StartPublish doesn't match the subscriber's
expected topic.
Diagnose:
# Check the publisher logs
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-websocketmanager | grep -i "publish\|topic"
# Check the Dapr pub/sub component
kubectl get components -n virtufin pubsub -o yaml
# Verify the topic exists in Redis
docker exec -it dapr_redis_1 redis-cli PUBSUB CHANNELS '*'
Recovery procedures
Restart the WebSocketManager
kubectl rollout restart deployment/virtufin-websocketmanager -n virtufin
kubectl rollout status deployment/virtufin-websocketmanager -n virtufin
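A rolling restart is intended to be non-disruptive: the ConnectionReclaimerHostedService
moves connections from terminated pods to surviving replicas. Until the next reclaim
sweep (30-second interval), operations on those connections may fail with
"Connection owned by different instance".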
Wipe all connections
# Manual: clear the state store directly
docker exec -it dapr_redis_1 redis-cli DEL websocket-index
# No -t on the scanning side: a TTY would append carriage returns to the key names
docker exec dapr_redis_1 redis-cli --scan --pattern 'websocket-*' | xargs -L 100 docker exec -i dapr_redis_1 redis-cli DEL
Note: This is destructive and will disconnect all active WebSocket clients. Use only for recovery from a corrupted state.
Reporting issues
If none of the above resolves your issue, gather the following and file a Gitea issue:
- WebSocketManager version (LIBRARY_VERSION from the build info endpoint)
- Connection ID(s) affected
- Owning instance ID (from the connection's InstanceId field)
- Full error message and gRPC status code
- Relevant log lines from the WebSocketManager and any affected backend