Cluster Unavailability Troubleshooting

2020-02-16

Before you begin

Make sure you have already completed Under-Replication Troubleshooting and have a cluster of 3 nodes running.

Step 1. Simulate the problem

  1. In the terminal where node 2 is running, press CTRL-C.

  2. In the terminal where node 3 is running, press CTRL-C. You may need to press CTRL-C a second time to force this node to stop.

Step 2. Troubleshoot the problem

  1. Go back to the Admin UI:

    [Figure: CockroachDB Admin UI]

    You'll notice that an error is shown and timeseries metrics are no longer being reported.

  2. In a new terminal, try to query the one node that was not stopped:

    $ ./cockroach sql \
    --insecure \
    --host=localhost:26257 \
    --execute="SHOW DATABASES;" \
    --logtostderr=WARNING
    

    Because every range in the cluster, including the system ranges, has lost a majority of its replicas, the cluster as a whole cannot make progress, so the query hangs indefinitely. Leave the query running in this terminal; you will come back to it in Step 3.
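
The majority rule behind this behavior can be worked through with a little arithmetic. The following is only an illustrative sketch: it assumes each range has 3 replicas (substitute your cluster's actual replication factor), and the variable names are made up for this example rather than taken from CockroachDB.

```shell
# Illustrative quorum arithmetic (assumption: 3 replicas per range).
replicas=3   # assumed replication factor; adjust to match your cluster
live=1       # nodes still running after stopping nodes 2 and 3

# A range can make progress only while a strict majority of replicas is live.
quorum=$(( replicas / 2 + 1 ))

if [ "$live" -ge "$quorum" ]; then
  status="available"
else
  status="unavailable"
fi

echo "quorum=$quorum live=$live status=$status"
```

Under these assumptions the quorum is 2, so a single live node cannot serve reads or writes, while restarting just one of the stopped nodes would already restore a majority.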

Step 3. Resolve the problem

  1. In the terminal where node 2 was running, restart the node:

    $ ./cockroach start \
    --insecure \
    --store=node2 \
    --listen-addr=localhost:26258 \
    --http-addr=localhost:8081 \
    --join=localhost:26257,localhost:26258,localhost:26259
    
  2. In the terminal where node 3 was running, restart the node:

    $ ./cockroach start \
    --insecure \
    --store=node3 \
    --listen-addr=localhost:26259 \
    --http-addr=localhost:8082 \
    --join=localhost:26257,localhost:26258,localhost:26259
    
  3. Go back to the terminal where you issued the query.

    All ranges have a majority of their replicas again, and so the query executes and succeeds:

      database_name
    +---------------+
      defaultdb
      postgres
      system
    (3 rows)
    

Clean up

In the next module, you'll start a new cluster from scratch, so take a moment to clean things up.

  1. Stop all CockroachDB nodes:

    $ pkill -9 cockroach
    
  2. Remove the nodes' data directories:

    $ rm -rf node1 node2 node3
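
If you want to confirm that the cleanup worked, a check along these lines may help. It is a minimal sketch, not part of the official lab: it assumes `pgrep` is available on your system and that you run it from the directory that held the `node1`, `node2`, and `node3` stores. The variable names are hypothetical.

```shell
# Check whether any process named "cockroach" is still running.
# Assumption: pgrep is installed; if it is missing, this reports "no".
if pgrep -x cockroach > /dev/null 2>&1; then
  remaining="yes"
else
  remaining="no"
fi
echo "cockroach processes remaining: $remaining"

# Count store directories that still exist in the current directory.
leftover=0
for dir in node1 node2 node3; do
  if [ -e "$dir" ]; then
    echo "still present: $dir"
    leftover=$(( leftover + 1 ))
  fi
done
echo "leftover store directories: $leftover"
```

Both checks should report nothing left behind before you start the next module's cluster.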
    

What's next?

Data Unavailability Troubleshooting