usda-vision/docs/CONTAINER_CRASH_DEBUGGING.md
salirezav 933d4417a5 Update Docker configuration, enhance error handling, and improve logging
- Added health check to the camera management API service in docker-compose.yml for better container reliability.
- Updated installation scripts in Dockerfile to check for existing dependencies before installation, improving efficiency.
- Enhanced error handling in the USDAVisionSystem class to allow partial operation if some components fail to start, preventing immediate shutdown.
- Improved logging throughout the application, including more detailed error messages and critical error handling in the main loop.
- Refactored WebSocketManager and CameraMonitor classes to use debug logging for connection events, reducing log noise.
2025-12-03 17:23:31 -05:00


Container Crash Debugging Guide

Overview

This guide helps diagnose and fix crashes in the usda-vision-api container.

Quick Diagnostic

Run the diagnostic script:

./scripts/diagnose_container_crashes.sh

Common Causes of Crashes

1. MQTT Connection Failure

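Symptoms: Container exits immediately after startup.

Check (docker logs writes the container's stderr to stderr, so redirect it with 2>&1 before piping to grep):

docker logs usda-vision-api 2>&1 | grep -i mqtt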

Fix: Ensure the MQTT broker at 192.168.1.110:1883 is running and reachable from the container.
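To test reachability from a shell, a minimal sketch can use bash's /dev/tcp pseudo-device and coreutils timeout. The check_tcp helper is hypothetical, not part of this repository, and it only proves that a TCP connection to the port opens; it does not test MQTT credentials or the protocol itself.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) if a TCP connection to host:port
# opens within the timeout. Requires bash (for /dev/tcp) and coreutils timeout.
check_tcp() {
  local host="$1" port="$2" secs="${3:-3}"
  # Redirecting to /dev/tcp/HOST/PORT makes bash open a TCP connection.
  timeout "$secs" bash -c ": > /dev/tcp/${host}/${port}" 2> /dev/null
}
```

For example: check_tcp 192.168.1.110 1883 && echo reachable. Because the container's network view can differ from the host's, the same test can be run inside the container, assuming bash and timeout exist in the image: docker exec usda-vision-api timeout 3 bash -c ': > /dev/tcp/192.168.1.110/1883' && echo reachable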

2. Camera SDK Initialization Failure

Symptoms: Container crashes during camera initialization.

Check:

docker logs usda-vision-api 2>&1 | grep -iE "camera|sdk"

Fix: Check the camera hardware connection and the camera SDK installation.

3. Storage Path Issues

Symptoms: Container fails to start the storage manager.

Check:

docker exec usda-vision-api ls -la /mnt/nfs_share

Fix: Ensure /mnt/nfs_share is mounted and writable
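A mount can be present yet still reject writes, so actually creating and deleting a file is more conclusive than listing the directory. A minimal sketch; check_writable_dir is a hypothetical helper:

```bash
#!/usr/bin/env bash
# Hypothetical helper: verifies that a directory exists and accepts writes
# by creating and removing a uniquely named probe file.
check_writable_dir() {
  local dir="$1" probe
  if [ ! -d "$dir" ]; then
    echo "missing: $dir" >&2
    return 1
  fi
  # mktemp fails if the directory is read-only or permission-restricted.
  if ! probe="$(mktemp "$dir/.write_test.XXXXXX" 2> /dev/null)"; then
    echo "not writable: $dir" >&2
    return 1
  fi
  rm -f "$probe"
  echo "ok: $dir"
}
```

Inside the container, where this function is not defined, an equivalent one-liner is docker exec usda-vision-api sh -c 'f=$(mktemp /mnt/nfs_share/.write_test.XXXXXX) && rm -f "$f" && echo writable'. A writable directory does not prove the share is mounted: if the mount is missing, writes can land in the container's own filesystem. On the host, mountpoint /mnt/nfs_share (from util-linux) shows whether the share itself is mounted.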

4. Out of Memory (OOM)

Symptoms: Container is killed by the system (exit code 137).

Check (reading the kernel log may require sudo):

dmesg | grep -i "killed process"
docker stats usda-vision-api

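Fix: Make more memory available to the container (raise a memory limit that is set too low, or add host memory), or reduce the application's memory usage. A memory limit protects the rest of the host but does not keep the container itself from being OOM-killed.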
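Exit code 137 alone does not prove an OOM kill, since docker kill also sends SIGKILL. Docker records OOM kills in the container state as .State.OOMKilled, which a small hypothetical helper can query:

```bash
#!/usr/bin/env bash
# Hypothetical helper: exit 0 if Docker reports the container as OOM-killed,
# 1 if not, 2 if the container could not be inspected.
was_oom_killed() {
  local container="${1:-usda-vision-api}" flag
  if ! flag="$(docker inspect --format '{{.State.OOMKilled}}' "$container" 2> /dev/null)"; then
    return 2
  fi
  [ "$flag" = "true" ]
}
```

For example: was_oom_killed && echo "killed by the OOM killer"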

5. Missing Configuration File

Symptoms: Container starts but exits quickly.

Check:

docker exec usda-vision-api cat /app/config.compose.json

Fix: Ensure /app/config.compose.json exists in the container and contains valid JSON.
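A file that exists but contains malformed JSON will likely stop startup as well. One way to check both conditions, assuming a Python interpreter is available, uses the standard library's json.tool module; check_json_config is a hypothetical helper:

```bash
#!/usr/bin/env bash
# Hypothetical helper: confirms a config file exists and parses as JSON.
check_json_config() {
  local file="$1"
  if [ ! -f "$file" ]; then
    echo "missing: $file" >&2
    return 1
  fi
  # json.tool exits non-zero and reports the parse error on stderr.
  if ! python3 -m json.tool "$file" > /dev/null; then
    echo "invalid JSON: $file" >&2
    return 1
  fi
  echo "ok: $file"
}
```

Using the container's own interpreter, the same check is: docker exec usda-vision-api python -m json.tool /app/config.compose.json > /dev/null && echo valid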

6. Python Exception Not Caught

Symptoms: Container exits with a Python traceback.

Check:

docker logs usda-vision-api --tail 100

Fix: Locate the unhandled exception in the traceback and fix its cause; the last traceback before the exit is usually the relevant one.
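When the log is long, a minimal awk sketch can isolate the most recent traceback. It prints everything from the last "Traceback (most recent call last):" header to the end of its input; the last_traceback name is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read log text on stdin and print from the last
# Python traceback header to the end of the input.
last_traceback() {
  awk '
    /Traceback \(most recent call last\):/ { buf = ""; capture = 1 }  # restart at each header
    capture { buf = buf $0 "\n" }                                     # keep lines after it
    END { printf "%s", buf }
  '
}
```

Usage: docker logs usda-vision-api 2>&1 | last_traceback. The 2>&1 matters because Python writes tracebacks to stderr, and docker logs keeps the container's stdout and stderr on separate streams.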

Recent Improvements

Enhanced Error Handling

  • System now continues running even if some non-critical components fail
  • Added consecutive error tracking to prevent infinite crash loops
  • Better logging with full exception traces

Improved Startup Command

  • Only installs packages if missing (faster startup)
  • Better error messages during startup
  • Graceful exit with delay for log flushing
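The install-only-if-missing step and the delayed exit can be sketched as a startup script. This is an illustration under assumptions, not the project's actual command: the module and package names (paho.mqtt, paho-mqtt) are placeholders, and the 5-second delay is an arbitrary choice.

```bash
#!/usr/bin/env bash
# Hypothetical startup sketch: install a Python dependency only when it is
# missing, and pause before exiting on failure so the error reaches the logs.
ensure_python_module() {
  local module="$1" package="${2:-$1}"
  if python3 -c "import ${module}" > /dev/null 2>&1; then
    echo "${module}: already installed"
    return 0
  fi
  echo "${module}: missing, installing ${package}"
  python3 -m pip install --no-cache-dir "${package}"
}

start_app() {
  # Placeholder module/package names; substitute the real dependency list.
  if ! ensure_python_module paho.mqtt paho-mqtt; then
    echo "ERROR: dependency installation failed" >&2
    sleep 5  # arbitrary delay so the message has time to reach the container logs
    exit 1
  fi
  # exec replaces the shell, so the application receives Docker's stop signals directly.
  exec python3 main.py --config config.compose.json
}
```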

Health Checks

  • Added Docker healthcheck to monitor container health
  • Health endpoint: http://localhost:8000/health
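To probe the endpoint from a script, for example after a restart, a polling loop with curl is enough. A minimal sketch; wait_for_healthy is a hypothetical helper. Because curl -f fails only on connection errors and HTTP status 400 or higher, any other response counts as healthy here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: poll a health URL until it answers or the timeout expires.
wait_for_healthy() {
  local url="${1:-http://localhost:8000/health}" timeout_s="${2:-60}"
  local deadline=$(( $(date +%s) + timeout_s ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    # -f: fail on HTTP >= 400; -s: no progress output; --max-time: per-request cap.
    if curl -fs --max-time 2 "$url" > /dev/null 2>&1; then
      echo "healthy: $url"
      return 0
    fi
    sleep 1
  done
  echo "not healthy after ${timeout_s}s: $url" >&2
  return 1
}
```

For example: wait_for_healthy http://localhost:8000/health 30. Docker's own view of the healthcheck is available with docker inspect usda-vision-api --format='{{.State.Health.Status}}', which reports starting, healthy, or unhealthy.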

Debugging Steps

  1. Check container status:

    docker ps -a | grep usda-vision-api
    
  2. View recent logs:

    docker logs usda-vision-api --tail 100 -f
    
  3. Check exit code:

    docker inspect usda-vision-api --format='{{.State.ExitCode}}'
    
    • 0 = Normal exit
    • 1 = Application error
    • 137 = Killed by SIGKILL (often the OOM killer, but docker kill or a stop timeout also cause it)
  4. Check restart count:

    docker inspect usda-vision-api --format='{{.RestartCount}}'
    
  5. Run with debug logging: edit docker-compose.yml, change the command to the following, and recreate the container (for example with docker compose up -d):

    python main.py --config config.compose.json --debug --verbose
    
  6. Check resource usage:

    docker stats usda-vision-api
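Exit codes above 128 follow the shell convention of 128 plus the signal number, so 137 is SIGKILL (9) and 143 is SIGTERM (15). A small hypothetical helper turns the exit code from step 3 into a readable hint:

```bash
#!/usr/bin/env bash
# Hypothetical helper: translate a container exit code into a short hint.
describe_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "0: normal exit" ;;
    1)   echo "1: application error (check the logs for a traceback)" ;;
    137) echo "137: SIGKILL (OOM killer, docker kill, or stop timeout)" ;;
    143) echo "143: SIGTERM (usually docker stop)" ;;
    *)
      # Codes above 128 mean "terminated by signal (code - 128)".
      if [ "$code" -gt 128 ] 2> /dev/null; then
        echo "$code: terminated by signal $((code - 128))"
      else
        echo "$code: application-specific exit status"
      fi
      ;;
  esac
}
```

Usage: describe_exit_code "$(docker inspect usda-vision-api --format='{{.State.ExitCode}}')"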
    

Manual Testing

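To test the application manually, open a shell in the running container, then start a second instance with debug logging from that shell:

docker exec -it usda-vision-api bash
python main.py --config config.compose.json --debug

Because the container's main process is still running, it may already hold resources such as the camera or the API port, so failures from this second instance do not necessarily reproduce the original crash.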

Prevention

The container now has:

  • Automatic restart policy (restart: unless-stopped)
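  • Health checks (Docker reports the health status, but outside Swarm mode it does not restart a container for being unhealthy)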
  • Better error handling
  • Graceful shutdown on signals
  • Partial operation if some components fail

Getting Help

If crashes persist:

  1. Run the diagnostic script
  2. Collect logs, including stderr: docker logs usda-vision-api > crash_logs.txt 2>&1
  3. Capture a resource snapshot: docker stats --no-stream usda-vision-api
  4. Review recent changes to configuration or code
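The collection steps above can be bundled into one directory to attach to a report. This is a hypothetical helper, separate from the diagnostic script (whose contents are not covered here), and the file names are arbitrary:

```bash
#!/usr/bin/env bash
# Hypothetical helper: gather logs, container state, a resource snapshot,
# and kernel OOM messages into one directory, then print its path.
collect_crash_report() {
  local container="${1:-usda-vision-api}"
  local outdir="${2:-crash_report_$(date +%Y%m%d_%H%M%S)}"
  mkdir -p "$outdir" || return 1
  # docker logs replays the container's stderr on stderr, so capture both streams.
  docker logs --timestamps "$container" > "$outdir/logs.txt" 2>&1
  docker inspect "$container" > "$outdir/inspect.json" 2>&1
  # --no-stream prints a single snapshot instead of refreshing forever.
  docker stats --no-stream "$container" > "$outdir/stats.txt" 2>&1
  # dmesg may need root; a missing match is not an error here.
  dmesg 2> /dev/null | grep -i "killed process" > "$outdir/oom.txt" || true
  echo "$outdir"
}
```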