Before you begin
In this lab, you'll start with a fresh cluster, so make sure you've stopped and cleaned up the cluster from the previous labs.
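As a quick sanity check before starting, the sketch below lists store directories left over from earlier labs. It assumes those labs used directories named node1, node2, and node3 in the current directory; the function name leftover_stores is hypothetical and not part of the cockroach CLI. It does not check for cockroach processes that are still running.

```shell
#!/bin/sh
# List store directories from earlier labs that still exist under a base
# directory (default: current directory).
# Assumption: the stores are named node1, node2, and node3.
leftover_stores() {
  base="${1:-.}"
  for name in node1 node2 node3; do
    if [ -d "$base/$name" ]; then
      echo "$base/$name"
    fi
  done
}

leftovers="$(leftover_stores)"
if [ -n "$leftovers" ]; then
  echo "Remove these before starting the lab:"
  echo "$leftovers"
else
  echo "No leftover store directories found."
fi
```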
Step 1. Start a 3-node cluster
- In a new terminal, start node 1:
$ ./cockroach start \
--insecure \
--store=node1 \
--listen-addr=localhost:26257 \
--http-addr=localhost:8080 \
--join=localhost:26257,localhost:26258,localhost:26259 \
--logtostderr=WARNING
- In a new terminal, start node 2:
$ ./cockroach start \
--insecure \
--store=node2 \
--listen-addr=localhost:26258 \
--http-addr=localhost:8081 \
--join=localhost:26257,localhost:26258,localhost:26259 \
--logtostderr=WARNING
- In a new terminal, start node 3:
$ ./cockroach start \
--insecure \
--store=node3 \
--listen-addr=localhost:26259 \
--http-addr=localhost:8082 \
--join=localhost:26257,localhost:26258,localhost:26259 \
--logtostderr=WARNING
- In a new terminal, perform a one-time initialization of the cluster:
$ ./cockroach init --insecure --host=localhost:26257
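The three start commands in this step differ only in the store name and the two ports. As a rough sketch, a small helper can print them from one template so the pattern is easier to see. The function name node_start_cmd is hypothetical; the flags and port numbers are the ones used in this lab.

```shell
#!/bin/sh
# Print the `cockroach start` command for node N (1, 2, or 3).
# Ports follow this lab's pattern: 26257, 26258, 26259 for --listen-addr
# and 8080, 8081, 8082 for --http-addr.
node_start_cmd() {
  n="$1"
  listen_port=$((26256 + n))
  http_port=$((8079 + n))
  printf '%s\n' "./cockroach start --insecure --store=node${n} --listen-addr=localhost:${listen_port} --http-addr=localhost:${http_port} --join=localhost:26257,localhost:26258,localhost:26259 --logtostderr=WARNING"
}

# Print all three commands for review; each one still needs its own terminal.
for n in 1 2 3; do
  node_start_cmd "$n"
done
```

The helper only prints the commands. Running each one in a separate terminal, as the steps describe, keeps every node's stderr output visible.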
Step 2. Prepare to simulate the problem
Before you can manually corrupt data, you need to import enough data so that the cluster creates persistent .sst files.
- Create a database into which you'll import a new table:
$ ./cockroach sql \
--insecure \
--host=localhost:26257 \
--execute="CREATE DATABASE import_test;"
- Run the IMPORT command, using schema and data files we've made publicly available on Google Cloud Storage:
$ ./cockroach sql \
--insecure \
--host=localhost:26257 \
--database="import_test" \
--execute="IMPORT TABLE orders CREATE USING 'https://storage.googleapis.com/cockroach-fixtures/tpch-csv/schema/orders.sql' CSV DATA ('https://storage.googleapis.com/cockroach-fixtures/tpch-csv/sf-1/orders.tbl.1') WITH delimiter = '|';"
The import will take a minute or two. Once it completes, you'll see a confirmation with details:
        job_id       |  status   | fraction_completed |  rows  | index_entries | system_records |  bytes
+--------------------+-----------+--------------------+--------+---------------+----------------+----------+
  378521252933861377 | succeeded |                  1 | 187500 |        375000 |              0 | 26346739
(1 row)
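Since the next step depends on .sst files being present, it may help to confirm that the import produced some. The sketch below counts the .sst files at the top level of each store directory. It assumes the commands are run from the directory that contains node1, node2, and node3; the function name count_sst_files is hypothetical.

```shell
#!/bin/sh
# Count the .sst files directly inside a store directory.
count_sst_files() {
  find "$1" -maxdepth 1 -type f -name '*.sst' | wc -l | tr -d ' '
}

# Report the count for each store used in this lab.
for store in node1 node2 node3; do
  if [ -d "$store" ]; then
    echo "$store: $(count_sst_files "$store") .sst file(s)"
  else
    echo "$store: directory not found"
  fi
done
```

If a store reports zero files, the data may not have been flushed to disk yet. Waiting briefly and checking again is a reasonable first step.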
Step 3. Simulate the problem
- In the same terminal, look in the data directory of node3:
$ ls node3
000003.log           IDENTITY         OPTIONS-000005            cockroach.http-addr
000006.sst           LOCK             auxiliary                 cockroach.listen-addr
COCKROACHDB_VERSION  MANIFEST-000001  cockroach-temp478417278   logs
CURRENT              MANIFEST-000007  cockroach.advertise-addr  temp-dirs-record.txt
- Delete one of the .sst files.
- In the terminal where node 3 is running, press CTRL-C to stop it.
- Try to restart node 3:
$ ./cockroach start \
--insecure \
--store=node3 \
--listen-addr=localhost:26259 \
--http-addr=localhost:8082 \
--join=localhost:26257,localhost:26258,localhost:26259 \
--logtostderr=WARNING
The startup process will fail, and you'll see the following printed to stderr:
W180209 10:45:03.684512 1 cli/start.go:697 Using the default setting for --cache (128 MiB). A significantly larger value is usually needed for good performance. If you have a dedicated server a reasonable setting is --cache=25% (2.0 GiB).
W180209 10:45:03.805541 37 gossip/gossip.go:1241 [n?] no incoming or outgoing connections
E180209 10:45:03.808537 1 cli/error.go:68 cockroach server exited with error: failed to create engines: could not open rocksdb instance: Corruption: Sst file size mismatch: /Users/jesseseldess/cockroachdb-training/cockroach-{{page.release_info.version}}.darwin-10.9-amd64/node3/000006.sst. Size recorded in manifest 2626945, actual size 2626210
*
* ERROR: cockroach server exited with error: failed to create engines: could not open rocksdb instance: Corruption: Sst file size mismatch: /Users/jesseseldess/cockroachdb-training/cockroach-{{page.release_info.version}}.darwin-10.9-amd64/node3/000006.sst. Size recorded in manifest 2626945, actual size 2626210
*
*
Failed running "start"
The error tells you that the failure has to do with RocksDB-level (i.e., storage-level) corruption. Because the node's data is corrupt, the node will not restart.
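The corruption message names the affected file, which is useful when the output is long. The sketch below pulls that path out of saved stderr output. It assumes the message keeps the "Corruption: Sst file size mismatch: <path>. Size recorded ..." shape shown in this lab; other corruption errors may be worded differently. The function name corrupt_sst_files is hypothetical, and the path in the sample line is shortened for illustration.

```shell
#!/bin/sh
# Read cockroach stderr output on stdin and print each distinct file named
# in a "Corruption: Sst file size mismatch" message.
# Assumption: the message format matches the one shown in this lab.
corrupt_sst_files() {
  sed -n 's/.*Corruption: Sst file size mismatch: \(.*\.sst\)\. Size recorded.*/\1/p' | sort -u
}

# Sample line modeled on the lab's error, with the path shortened.
sample='E180209 10:45:03.808537 1 cli/error.go:68 cockroach server exited with error: failed to create engines: could not open rocksdb instance: Corruption: Sst file size mismatch: node3/000006.sst. Size recorded in manifest 2626945, actual size 2626210'

printf '%s\n' "$sample" | corrupt_sst_files
# prints: node3/000006.sst
```

To use it on a real failure, one option is to redirect the failed start's stderr to a file and feed that file to the function on stdin.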
Step 4. Resolve the problem
Because only 1 node's data is corrupt, the solution is to completely remove the node's data directory and restart the node.
- Remove the node3 data directory:
$ rm -rf node3
- In the terminal where node 3 was running, restart the node:
$ ./cockroach start \
--insecure \
--store=node3 \
--listen-addr=localhost:26259 \
--http-addr=localhost:8082 \
--join=localhost:26257,localhost:26258,localhost:26259 \
--logtostderr=WARNING
In this case, the cluster repairs the node using data from the other nodes. In more severe emergencies where multiple disks are corrupted, tools like cockroach debug rocksdb let you inspect the files in more detail and try to repair them. If enough nodes or files are corrupted, restoring from an enterprise backup is the best option.
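Because rm -rf is unforgiving, a guarded wrapper can reduce the risk of deleting the wrong directory. The sketch below removes a directory only if it contains a COCKROACHDB_VERSION file, which appears in the ls node3 listing earlier in this lab. Treating that file as the marker of a store directory is an assumption of this sketch, and the function name remove_store is hypothetical.

```shell
#!/bin/sh
# Remove a store directory only if it looks like a CockroachDB store.
# Assumption: a store directory contains a COCKROACHDB_VERSION file,
# as in the `ls node3` listing shown in this lab.
remove_store() {
  dir="$1"
  if [ -z "$dir" ] || [ ! -d "$dir" ]; then
    echo "remove_store: '$dir' is not a directory" >&2
    return 1
  fi
  if [ ! -f "$dir/COCKROACHDB_VERSION" ]; then
    echo "remove_store: '$dir' has no COCKROACHDB_VERSION file; refusing to delete" >&2
    return 1
  fi
  rm -rf -- "$dir"
}

# Usage (with node 3 stopped): remove_store node3
```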
{{site.data.alerts.callout_danger}} In all cases of data corruption, you should get support from Cockroach Labs. {{site.data.alerts.end}}