This page shows you how to use the cockroach quit command to temporarily stop a node that you plan to restart, for example, when upgrading your cluster's version of CockroachDB or performing planned maintenance (e.g., upgrading system software).
For information about permanently removing nodes to downsize a cluster or react to hardware failures, see Remove Nodes.
How it works
When you stop a node, it performs the following steps:
- Finishes in-flight requests. Note that this is a best effort that times out after the duration specified by the
- Transfers all range leases and Raft leadership to other nodes.
- Gossips its draining state to the cluster, so that other nodes do not try to distribute query planning to the draining node, and no leases are transferred to the draining node. Note that this is a best effort that times out after the duration specified by the server.shutdown.drain_wait cluster setting, so other nodes may not receive the gossip info in time.
- No new ranges are transferred to the draining node, to avoid a possible loss of quorum after the node shuts down.
If the node then stays offline for a certain amount of time (5 minutes by default), the cluster considers the node dead and starts to transfer its range replicas to other nodes as well.
After that, if the node comes back online, its range replicas will determine whether or not they are still valid members of replica groups. If a range replica is still valid and any data in its range has changed, it will receive updates from another replica in the group. If a range replica is no longer valid, it will be removed from the node.
- Range: CockroachDB stores all user data and almost all system data in a giant sorted map of key-value pairs. This keyspace is divided into "ranges", contiguous chunks of the keyspace, so that every key can always be found in a single range.
- Range Replica: CockroachDB replicates each range (3 times by default) and stores each replica on a different node.
- Range Lease: For each range, one of the replicas holds the "range lease". This replica, referred to as the "leaseholder", is the one that receives and coordinates all read and write requests for the range.
By default, if a node stays offline for more than 5 minutes, the cluster will consider it dead and will rebalance its data to other nodes. Before temporarily stopping nodes for planned maintenance (e.g., upgrading system software), if you expect any nodes to be offline for longer than 5 minutes, you can prevent the cluster from unnecessarily rebalancing data off the nodes by increasing the
server.time_until_store_dead cluster setting to match the estimated maintenance window.
For example, let's say you want to maintain a group of servers, and the nodes running on the servers may be offline for up to 15 minutes as a result. Before shutting down the nodes, you would change the
server.time_until_store_dead cluster setting as follows:
> SET CLUSTER SETTING server.time_until_store_dead = '15m0s';
After completing the maintenance work and restarting the nodes, you would then change the setting back to its default:
> SET CLUSTER SETTING server.time_until_store_dead = '5m0s';
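If you script your maintenance, the two statements above can be generated from the length of the maintenance window. The following is a minimal sketch, not an official tool: the function name store_dead_sql is hypothetical, and it only prints the statement so that you can review it before running it in a SQL shell.

```shell
# Hypothetical helper: print the SET CLUSTER SETTING statement for a
# maintenance window given in whole minutes.
store_dead_sql() {
  minutes=$1
  # Reject an empty argument or anything that is not a whole number.
  case "$minutes" in
    ''|*[!0-9]*) echo "usage: store_dead_sql <minutes>" >&2; return 1 ;;
  esac
  printf "SET CLUSTER SETTING server.time_until_store_dead = '%sm0s';\n" "$minutes"
}
```

Calling store_dead_sql 15 prints the statement used before the maintenance window in the example above, and store_dead_sql 5 prints the statement that restores the default.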
It's also important to ensure that load balancers do not send client traffic to a node about to be shut down, even if it will only be down for a few seconds. If you find that your load balancer's health check is not always recognizing a node as unready before the node shuts down, you can increase the
server.shutdown.drain_wait setting, which tells the node to wait in an unready state for the specified duration. For example:
> SET CLUSTER SETTING server.shutdown.drain_wait = '10s';
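To judge how long the wait needs to be, it can help to observe when a node starts reporting itself as unready. The sketch below rests on assumptions that are not described on this page: that the node serves an HTTP readiness check at /health?ready=1, that its HTTP port is 8080, and that curl is installed. Verify all three against your deployment. The function name node_is_ready is hypothetical, and a secure cluster would need https and certificate options instead of plain http.

```shell
# Hypothetical helper: succeed only if the node answers its readiness check.
# Assumes an HTTP endpoint at /health?ready=1 (not described on this page).
node_is_ready() {
  host=$1
  port=${2:-8080}   # assumed default HTTP port
  # --fail makes curl exit non-zero on HTTP error statuses, and
  # --max-time keeps an unreachable node from blocking the check.
  curl --fail --silent --output /dev/null --max-time 2 \
    "http://${host}:${port}/health?ready=1"
}
```

Polling this function from the load balancer's host while a node shuts down gives a rough idea of how quickly the node is seen as unready, which you can compare with your load balancer's health check interval.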
# Temporarily stop a node:
$ cockroach quit <flags>

# View help:
$ cockroach quit --help
The quit command supports the following general-use, client connection, and logging flags.
||If specified, the node will be permanently removed instead of temporarily stopped. See Remove Nodes for more details.|
|| The server host and port number to connect to. This can be the address of any node in the cluster.
|| The server port to connect to. Note: The port number can also be specified via
|| The SQL user that will own the client session.
|| Use an insecure connection.
|| The path to the certificate directory containing the CA and client certificates and client key.
|| A connection URL to use instead of the other arguments.
Default: no URL
See Client Connection Parameters for more details.
By default, the
quit command logs errors to
If you need to troubleshoot this command's behavior, you can change its logging behavior.
Stop a node from the machine where it's running
SSH to the machine where the node is running.
If the node is running in the background and you are using a process manager for automatic restarts, use the process manager to stop the cockroach process without restarting it.
If the node is running in the background and you are not using a process manager, send a kill signal to the cockroach process, for example:
$ pkill cockroach
If the node is running in the foreground, press
Verify that the cockroach process has stopped:
$ ps aux | grep cockroach
Alternatively, you can check the node's logs for the message "server drained and shutdown completed".
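In a script, the verification step can be turned into a wait loop. The following is a minimal sketch under the assumption that pgrep is available on the machine; the function name wait_for_exit and the 60-second default timeout are hypothetical choices, not part of CockroachDB. Unlike ps aux | grep cockroach, pgrep does not list itself among the matches.

```shell
# Hypothetical helper: wait until no process with the given exact name is
# running, or give up after a timeout in seconds (default 60).
wait_for_exit() {
  name=$1
  timeout=${2:-60}
  waited=0
  # pgrep -x matches the process name exactly and exits non-zero
  # when nothing matches.
  while pgrep -x "$name" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1   # still running after the timeout
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0       # no matching process found
}
```

For example, running wait_for_exit cockroach 120 after sending the kill signal returns success once the process is gone and returns failure if it is still running after two minutes.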