- We have discussed snapshots, transaction logs, and storage devices. In this section, we look at how these are laid out on the file system.
- There are two kinds of data files: transaction log files and snapshot files. Both are stored as ordinary files on the local file system. The transaction log is written on the critical path of transaction processing, so we strongly recommend placing it on a dedicated device, which matters for good throughput and latency. Keeping the transaction log on a shared device does not cause any correctness problems, but it does hurt performance. In a virtualized environment, a dedicated device may not be available. Snapshot files do not need a dedicated device because they are written lazily by a background thread.
- Snapshot files are written to the directory specified by the dataDir parameter, and transaction log files are written to the directory specified by dataLogDir. Let's look at the transaction log directory first. If you list its contents, you will see a subdirectory named version-2. The format of logs and snapshots has gone through a major change, and when that change was made it proved useful to separate data by file version, so that data migration between versions is easier to handle.
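- As a rough illustration of how the two parameters map to directories, the following Python sketch reads zoo.cfg-style text and derives the two version-2 directories. The function name storage_dirs is hypothetical, and the paths are the ones used in the examples of this section. The fallback to dataDir when dataLogDir is absent reflects ZooKeeper's default behavior.

```python
from pathlib import PurePosixPath


def storage_dirs(zoo_cfg_text, version_dir="version-2"):
    """Return (snapshot_dir, txn_log_dir) derived from zoo.cfg-style text."""
    props = {}
    for raw in zoo_cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    data_dir = props["dataDir"]
    # When dataLogDir is not set, the transaction logs go under dataDir as well.
    log_dir = props.get("dataLogDir", data_dir)
    return PurePosixPath(data_dir) / version_dir, PurePosixPath(log_dir) / version_dir


cfg = """
tickTime=2000
dataDir=/tmp/zookeeper/data
dataLogDir=/tmp/zookeeper/log
"""
snap_dir, log_dir = storage_dirs(cfg)
print(snap_dir)  # /tmp/zookeeper/data/version-2
print(log_dir)   # /tmp/zookeeper/log/version-2
```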
1. Transaction log
- After running "create /hh", the transaction log directory contains a single log file.
--zkCli.sh
[zk: localhost:2181(CONNECTED) 0] create /hh
Created /hh
--linux file system
]# ll /tmp/zookeeper/log/version-2/
-rw-r--r-- 1 root root 67108880 Jan 18 10:52 log.100000001
- Two things stand out about this file. First, although we created only one znode, the transaction log file is very large (64 MB). Second, the file name has a large number as its suffix.
- ZooKeeper preallocates disk space for transaction log files in large chunks to avoid the metadata management overhead of growing the file on every write. If you dump one of these files in hexadecimal, you will see that it is filled with null characters (\0), with only a small amount of binary data at the beginning. As the server runs, the null characters are gradually replaced by log data.
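- To see how much of a preallocated file holds real data, one option is to look for the last non-null byte. The Python sketch below does this on a small stand-in file rather than a real log. The helper name used_bytes is hypothetical, and the result is only an estimate, because genuine log records can themselves end in zero bytes.

```python
import os
import tempfile


def used_bytes(path, chunk_size=1 << 20):
    """Return the offset just past the last non-null byte of the file."""
    end = os.path.getsize(path)
    with open(path, "rb") as f:
        # Scan backwards chunk by chunk so large files are not loaded at once.
        while end > 0:
            start = max(0, end - chunk_size)
            f.seek(start)
            chunk = f.read(end - start)
            stripped = chunk.rstrip(b"\0")
            if stripped:
                return start + len(stripped)
            end = start
    return 0


with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "log.demo")
    with open(demo, "wb") as f:
        f.write(b"\x01" * 100)          # stand-in for real log records
        f.write(b"\0" * (4096 - 100))   # stand-in for the preallocated padding
    demo_size = os.path.getsize(demo)
    demo_used = used_bytes(demo)
    demo_used_small_chunks = used_bytes(demo, chunk_size=1024)

print(demo_size, demo_used)  # 4096 100
```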
- Every transaction in a log file carries its transaction identifier (zxid), but to make recovery easier and lookups faster, the suffix of each log file name is the first zxid written to that file, expressed in hexadecimal. One advantage of writing the zxid in hexadecimal is that the epoch part is easy to tell apart from the counter part: the epoch is the high 32 bits and the counter is the low 32 bits. The transaction log file in the previous example is therefore from epoch 1.
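- The split of a zxid into its two halves can be checked with a few lines of Python. This is a minimal sketch; the function names are made up for illustration, and the file names are the ones that appear in this section.

```python
def zxid_from_filename(name):
    """Parse the hexadecimal suffix of a name such as 'log.100000001'."""
    return int(name.rsplit(".", 1)[1], 16)


def split_zxid(zxid):
    """Split a 64-bit zxid into (high 32 bits, low 32 bits)."""
    high = zxid >> 32          # the part that changes with each new leader
    counter = zxid & 0xFFFFFFFF
    return high, counter


for name in ("log.100000001", "snapshot.100000000", "snapshot.0"):
    print(name, split_zxid(zxid_from_filename(name)))
# log.100000001 (1, 1)
# snapshot.100000000 (1, 0)
# snapshot.0 (0, 0)
```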
- It is also useful to see what is actually stored in the file, which helps a great deal when diagnosing problems. For example, developers sometimes claim that ZooKeeper has lost some znodes; only by examining the transaction log can we find out which znodes a client in fact deleted. The transaction log file can be dumped with the following command:
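--Add the following to /etc/profile, then source it (java.ext.dirs is honored only by Java 8 and earlier; on newer JVMs, pass -cp "$ZOOKEEPER_HOME/lib/*" instead)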
export ZOOKEEPER_HOME=/usr/local/apache-zookeeper-3.5.9-bin/
JAVA_OPTS="$JAVA_OPTS -Djava.ext.dirs=$ZOOKEEPER_HOME:$ZOOKEEPER_HOME/lib"
--Commands for viewing transaction logs
]# java $JAVA_OPTS org.apache.zookeeper.server.LogFormatter /tmp/zookeeper/log/version-2/log.100000001
--Output of the transaction log command
ZooKeeper Transactional Log File with dbid 0 txnlog format version 2
1/18/22 10:52:12 AM CST session 0x2000019637d0000 cxid 0x0 zxid 0x100000001 createSession 30000
1/18/22 10:52:23 AM CST session 0x2000019637d0000 cxid 0x1 zxid 0x100000002 create '/hh,,v{s{31,s{'world,'anyone}}},F,1
EOF reached after 2 txns.
- Each line of this output is one transaction. Because the transaction log records only operations that change state, read operations never appear in it.
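- When a dump is long, filtering it by operation type is quicker than reading it line by line. The Python sketch below parses the textual output shown above. The regular expression is inferred from those sample lines only, so treat it as an assumption about the format rather than a specification, and parse_transactions is a hypothetical helper name.

```python
import re

# Inferred from the sample output above: "... session X cxid Y zxid Z <op> <details>"
TXN_LINE = re.compile(
    r"session (0x[0-9a-fA-F]+) cxid (0x[0-9a-fA-F]+) zxid (0x[0-9a-fA-F]+) (\w+)\s*(.*)$"
)


def parse_transactions(lines):
    """Return one dict per transaction line; other lines are skipped."""
    txns = []
    for line in lines:
        match = TXN_LINE.search(line)
        if match is None:
            continue  # header line and the trailing "EOF reached" line
        session, cxid, zxid, op, detail = match.groups()
        txns.append({
            "session": session,
            "cxid": int(cxid, 16),
            "zxid": int(zxid, 16),
            "op": op,
            "detail": detail,
        })
    return txns


sample = """ZooKeeper Transactional Log File with dbid 0 txnlog format version 2
1/18/22 10:52:12 AM CST session 0x2000019637d0000 cxid 0x0 zxid 0x100000001 createSession 30000
1/18/22 10:52:23 AM CST session 0x2000019637d0000 cxid 0x1 zxid 0x100000002 create '/hh,,v{s{31,s{'world,'anyone}}},F,1
EOF reached after 2 txns."""

txns = parse_transactions(sample.splitlines())
creates = [t for t in txns if t["op"] == "create"]
print(len(txns), hex(creates[0]["zxid"]))  # 2 0x100000002
```

- The same filter can be applied with whatever operation name the tool prints for the transactions you are looking for.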
2. Snapshot
- The naming rules of snapshot files are similar to those of transaction log files. The following is the snapshot list information of the server in the previous example:
]# ll /tmp/zookeeper/data/version-2/
-rw-r--r-- 1 root root 1 Jan 18 22:16 acceptedEpoch
-rw-r--r-- 1 root root 1 Jan 18 22:16 currentEpoch
-rw-r--r-- 1 root root 556 Jan 18 22:16 snapshot.0
-rw-r--r-- 1 root root 556 Jan 18 22:16 snapshot.100000000
- Snapshot files are not preallocated, so their size more accurately reflects the amount of data they contain. The suffix is the zxid that was current when the snapshot started. As discussed earlier, a snapshot file is actually a fuzzy snapshot, so it is not a valid snapshot on its own until the transaction log has been replayed over it. Specifically, to restore the system, you must replay the transaction log starting at the zxid in the snapshot suffix or earlier.
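- The replay rule can be turned into a small file-selection routine. The Python sketch below picks the newest snapshot and the log files that need to be replayed over it, relying only on the naming scheme described in this section. It is an illustration, not ZooKeeper's own recovery code: it does not validate snapshots or fall back to older ones. The name recovery_plan is hypothetical, as are the log file names in the second call.

```python
def zxid_suffix(name):
    """Return the hexadecimal suffix of 'log.<zxid>' or 'snapshot.<zxid>' as an int."""
    return int(name.rsplit(".", 1)[1], 16)


def recovery_plan(snapshot_names, log_names):
    """Pick the newest snapshot and the logs to replay over it, oldest first."""
    snaps = sorted((n for n in snapshot_names if n.startswith("snapshot.")), key=zxid_suffix)
    logs = sorted((n for n in log_names if n.startswith("log.")), key=zxid_suffix)
    if not snaps:
        raise ValueError("no snapshot file found")
    snapshot = snaps[-1]
    snap_zxid = zxid_suffix(snapshot)
    # A log is named after its FIRST zxid, so the last log that starts at or
    # before the snapshot zxid may still contain transactions newer than it.
    older = [n for n in logs if zxid_suffix(n) <= snap_zxid]
    newer = [n for n in logs if zxid_suffix(n) > snap_zxid]
    return snapshot, older[-1:] + newer


# Files from the listings in this section.
plan = recovery_plan(["snapshot.0", "snapshot.100000000"], ["log.100000001"])
print(plan)  # ('snapshot.100000000', ['log.100000001'])

# Hypothetical names, to show a log that starts before the snapshot zxid.
plan2 = recovery_plan(["snapshot.100000005"], ["log.1", "log.100000001", "log.200000001"])
print(plan2)  # ('snapshot.100000005', ['log.100000001', 'log.200000001'])
```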
- Snapshot files also store their data in binary form, so there is a second tool for viewing their contents:
--Add the following to /etc/profile, then source it (same settings as for the transaction log tool)
export ZOOKEEPER_HOME=/usr/local/apache-zookeeper-3.5.9-bin/
JAVA_OPTS="$JAVA_OPTS -Djava.ext.dirs=$ZOOKEEPER_HOME:$ZOOKEEPER_HOME/lib"
--Commands for viewing snapshot contents
]# java $JAVA_OPTS org.apache.zookeeper.server.SnapshotFormatter /tmp/zookeeper/data/version-2/snapshot.100000000
--Output of the snapshot content command
ZNode Details (count=5):
----
/
cZxid = 0x00000000000000
ctime = Thu Jan 01 08:00:00 CST 1970
mZxid = 0x00000000000000
mtime = Thu Jan 01 08:00:00 CST 1970
pZxid = 0x00000000000000
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x00000000000000
dataLength = 0
----
/zookeeper
cZxid = 0x00000000000000
ctime = Thu Jan 01 08:00:00 CST 1970
mZxid = 0x00000000000000
mtime = Thu Jan 01 08:00:00 CST 1970
pZxid = 0x00000000000000
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x00000000000000
dataLength = 0
----
/zookeeper/config
cZxid = 0x00000000000000
ctime = Thu Jan 01 08:00:00 CST 1970
mZxid = 0x00000000000000
mtime = Tue Jan 18 22:16:17 CST 2022
pZxid = 0x00000000000000
cversion = 0
dataVersion = -1
aclVersion = -1
ephemeralOwner = 0x00000000000000
dataLength = 132
----
/zookeeper/quota
cZxid = 0x00000000000000
ctime = Thu Jan 01 08:00:00 CST 1970
mZxid = 0x00000000000000
mtime = Thu Jan 01 08:00:00 CST 1970
pZxid = 0x00000000000000
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x00000000000000
dataLength = 0
----
Session Details (sid, timeout, ephemeralCount):
- The tool prints only the metadata of each znode. This lets an administrator find out when a znode was changed and which znodes take up a lot of memory. Unfortunately, the data and ACLs do not appear in this output. Also remember that, when diagnosing a problem, the snapshot information must be combined with the transaction log information to work out what happened.
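- Finding the znodes with the most data in a long dump is easier with a small parser. The Python sketch below reads the textual format shown above, shortened here to two znodes from that output, and sorts the znodes by dataLength. The layout assumptions (records separated by "----", a path line followed by "key = value" lines) are inferred from the sample only, and parse_znodes is a hypothetical helper name.

```python
def parse_znodes(text):
    """Parse snapshot dump text into a list of dicts, one per znode."""
    znodes = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Session Details"):
            break  # the znode section is over
        if line == "----":
            current = None  # record separator
        elif line.startswith("/"):
            current = {"path": line}
            znodes.append(current)
        elif current is not None and " = " in line:
            key, value = line.split(" = ", 1)
            current[key] = value
    return znodes


sample = """ZNode Details (count=5):
----
/
cZxid = 0x00000000000000
mtime = Thu Jan 01 08:00:00 CST 1970
dataLength = 0
----
/zookeeper/config
cZxid = 0x00000000000000
mtime = Tue Jan 18 22:16:17 CST 2022
dataLength = 132
----
Session Details (sid, timeout, ephemeralCount):"""

znodes = parse_znodes(sample)
largest = sorted(znodes, key=lambda z: int(z["dataLength"]), reverse=True)
print(largest[0]["path"], largest[0]["dataLength"])  # /zookeeper/config 132
```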
3. Epoch files
- Two more small files complete the persistent state of ZooKeeper: the epoch files named acceptedEpoch and currentEpoch. We discussed the concept of an epoch earlier; these two files record the epoch numbers that the server process has accepted and is currently participating in. Although they contain no application data, they are essential for data consistency, so do not leave them out when backing up the raw data files of a ZooKeeper server.
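- A quick way to inspect the two files is to read them as text. The Python sketch below assumes that each file holds its number as plain decimal text, which is consistent with the 1-byte file sizes in the earlier listing but is not confirmed in this section. The helper name read_epochs is hypothetical, and the demonstration runs against a temporary directory rather than a real server.

```python
import tempfile
from pathlib import Path


def read_epochs(version_dir):
    """Read both files, ASSUMING each contains one decimal number as text."""
    directory = Path(version_dir)
    return {
        name: int((directory / name).read_text().strip())
        for name in ("acceptedEpoch", "currentEpoch")
    }


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in files; on a real server they sit next to the snapshot files.
    (Path(tmp) / "acceptedEpoch").write_text("1")
    (Path(tmp) / "currentEpoch").write_text("1")
    epochs = read_epochs(tmp)

print(epochs)  # {'acceptedEpoch': 1, 'currentEpoch': 1}
```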
4. Use data stored in ZooKeeper
- One convenient property of ZooKeeper's storage is that servers store data in the same way in standalone mode and in cluster mode. We just mentioned that an accurate view of the data requires merging the transaction logs with the snapshots. To do this, copy the transaction log files and snapshot files into the empty data directory of a standalone server on another machine and start that server; it will then reflect the state of the server the files were copied from. This technique lets you capture the state of a production server for later review and other purposes.
- It also means that you can back up a ZooKeeper server simply by copying these data files. If you back up this way, there are a few things to keep in mind. First, ZooKeeper is a replicated service, so the system already contains redundant copies of the data. A backup therefore needs to cover the data of only one server.
- It is important to remember that when a ZooKeeper server acknowledges a transaction, it promises to remember that state from then on. If you restore a server from an old backup, you cause that server to break its promise. If you have just lost the data on all servers, this may not be a big problem. But if the cluster is working normally and you roll one server back to an older state, the other servers may lose some information as a result.
- If you are recovering from data loss on all or most of your servers, the best approach is to take the most recent state you have captured (a backup from the most up-to-date surviving server) and copy it to all the other servers before starting any of them.
- To back up ZooKeeper data, then, you only need to back up the snapshot directory (which also holds the epoch files) and the transaction log directory of one server.
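- Such a backup amounts to two directory copies. The Python sketch below copies both directories into a timestamped folder and is demonstrated on a temporary stand-in tree, not on a real server. The function name backup_zookeeper and the backup folder layout are hypothetical, and the sketch does not coordinate with a running server, so a file that is being written during the copy may be captured in a partial state.

```python
import shutil
import tempfile
import time
from pathlib import Path


def backup_zookeeper(data_dir, data_log_dir, backup_root):
    """Copy the snapshot and transaction log directories into a new backup folder."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"zk-backup-{stamp}"
    # copytree creates the missing parent directories of its destination.
    shutil.copytree(data_dir, target / "data")
    shutil.copytree(data_log_dir, target / "log")
    return target


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Stand-in tree that mirrors the listings in this section.
    (root / "data" / "version-2").mkdir(parents=True)
    (root / "log" / "version-2").mkdir(parents=True)
    for name in ("acceptedEpoch", "currentEpoch", "snapshot.0"):
        (root / "data" / "version-2" / name).write_text("demo")
    (root / "log" / "version-2" / "log.100000001").write_text("demo")

    target = backup_zookeeper(root / "data", root / "log", root / "backups")
    copied = sorted(p.relative_to(target).as_posix() for p in target.rglob("*") if p.is_file())

print(copied)
```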