Brief introduction
As a key part of Alibaba's strategy for APM and IoT, the time series database is a vanguard product for Alibaba in the IoT and application-monitoring markets. InfluxDB, the industry's leading time series database, has a large user base both in China and abroad, and Alibaba has accordingly launched Aliyun InfluxDB®.
This article focuses on a single InfluxDB module, the snapshot: how its mechanism works, and how we optimized its memory usage.
Why snapshot
InfluxDB uses the TSM storage engine, which consists of several main parts: the cache, the WAL, TSM files, and the compactor.
The core idea of the TSM engine is similar to that of an LSM tree: the most recently written data is cached in memory, and once a preset threshold is reached, a snapshot is triggered that flushes the cache to disk.
The in-memory cache exists to buffer writes and speed up queries; the snapshot mainly solves the problem of data persistence.
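To make that division of labor concrete, here is a minimal, runnable sketch of the write path. All types here are hypothetical stand-ins for InfluxDB's real WAL, cache, and TSM files, reduced to the bare mechanism:

```go
package main

import "fmt"

type point struct {
	key   string
	ts    int64
	value float64
}

type engine struct {
	wal       []point // stands in for the write-ahead log (durability)
	cache     []point // stands in for the in-memory cache (fast queries)
	threshold int     // by analogy with cache-snapshot-memory-size
}

func (e *engine) write(p point) {
	e.wal = append(e.wal, p) // durability first
	e.cache = append(e.cache, p)
	if len(e.cache) >= e.threshold {
		e.snapshot()
	}
}

// snapshot flushes the cached points to an (imaginary) TSM file; once the
// file is durable, the cache and the corresponding WAL segments can be freed.
func (e *engine) snapshot() {
	fmt.Printf("flushing %d cached points to a new TSM file\n", len(e.cache))
	e.cache = e.cache[:0]
	e.wal = e.wal[:0]
}

func main() {
	e := &engine{threshold: 3}
	for i := 0; i < 7; i++ {
		e.write(point{key: "cpu", ts: int64(i), value: float64(i)})
	}
}
```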
The working mechanism of Snapshot
Since a snapshot flushes data from the cache to disk, let's first look at the internal structure of the cache.
The internal structure of the Cache
As shown in the figure above, each cache is internally divided into 16 partitions. Each partition contains a map whose keys are SeriesKeys and whose values are entries.
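Partitioning the cache this way reduces lock contention: a write only needs the lock of the partition its SeriesKey hashes to. The sketch below shows the routing idea; FNV-1a from the standard library is an assumption for illustration, the real ring has its own hash:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const partitions = 16

// partitionFor routes a SeriesKey to one of the 16 cache partitions.
func partitionFor(seriesKey []byte) int {
	h := fnv.New64a()
	h.Write(seriesKey)
	return int(h.Sum64() % partitions)
}

func main() {
	fmt.Println(partitionFor([]byte("cpu,host=server01")))
}
```

The Value and entry types stored in each partition's map are defined as follows: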
```go
// Value represents a TSM-encoded value.
type Value interface {
	// UnixNano returns the timestamp of the value in nanoseconds since unix epoch.
	UnixNano() int64

	// Value returns the underlying value.
	Value() interface{}

	// Size returns the number of bytes necessary to represent the value and its timestamp.
	Size() int

	// String returns the string representation of the value and its timestamp.
	String() string

	// internalOnly is unexported to ensure implementations of Value
	// can only originate in this package.
	internalOnly()
}

// Values represents a slice of values.
type Values []Value

// entry is a set of values and some metadata.
type entry struct {
	mu     sync.RWMutex
	values Values // All stored values.
	vtype  byte
}
```
An entry is an array of Values. Value itself is an interface with concrete types FloatValue, IntegerValue, UnsignedValue, BooleanValue, and StringValue, depending on the type of the stored value.
Taking FloatValue as an example, each Value type pairs an int64 timestamp with a concrete value:
```go
type FloatValue struct {
	unixnano int64
	value    float64
}
```
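As a small, self-contained illustration (hypothetical types, not the tsm1 package), the following shows an entry accumulating timestamped values for one series. The real cache defers sorting until read time, but the snapshot encoder depends on time-ordered values, since it takes minTime and maxTime from the two ends of each block:

```go
package main

import (
	"fmt"
	"sort"
)

type FloatValue struct {
	unixnano int64
	value    float64
}

type entry struct {
	values []FloatValue
}

// add appends a value and restores timestamp order, the invariant the
// snapshot encoder relies on when slicing blocks by time range.
func (e *entry) add(v FloatValue) {
	e.values = append(e.values, v)
	sort.Slice(e.values, func(i, j int) bool {
		return e.values[i].unixnano < e.values[j].unixnano
	})
}

func main() {
	var e entry
	e.add(FloatValue{unixnano: 2e9, value: 0.5})
	e.add(FloatValue{unixnano: 1e9, value: 0.3})
	minTime, maxTime := e.values[0].unixnano, e.values[len(e.values)-1].unixnano
	fmt.Println(minTime, maxTime) // 1000000000 2000000000
}
```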
Snapshot process
At the code level, the overall process is as follows:
Let's analyze it step by step.
The entry point of Snapshot
When the storage engine opens, compactions (including cache snapshots) are enabled, which starts the background check described in the next section:
```go
if e.enableCompactionsOnOpen {
	e.SetCompactionsEnabled(true)
}
```
Mechanism of Snapshot
```go
// compactCache continually checks if the WAL cache should be written to disk.
func (e *Engine) compactCache() {
	t := time.NewTicker(time.Second)
	defer t.Stop()
	for {
		e.mu.RLock()
		quit := e.snapDone
		e.mu.RUnlock()

		select {
		case <-quit:
			tsdb.UpdateCacheSize(e.id, 0, e.logger)
			return

		case <-t.C:
			e.Cache.UpdateAge()
			tsdb.UpdateCacheSize(e.id, e.Cache.Size(), e.logger)
			if e.ShouldCompactCache(time.Now()) {
				start := time.Now()
				e.traceLogger.Info("Compacting cache", zap.String("path", e.path))
				err := e.WriteSnapshot()
				if err != nil && err != errCompactionsDisabled {
					e.logger.Info("Error writing snapshot", zap.Error(err))
					atomic.AddInt64(&e.stats.CacheCompactionErrors, 1)
				} else {
					atomic.AddInt64(&e.stats.CacheCompactions, 1)
				}
				atomic.AddInt64(&e.stats.CacheCompactionDuration, time.Since(start).Nanoseconds())
			}
		}
	}
}
```
The engine checks once per second whether a snapshot should be taken.
There are two conditions, either of which triggers a snapshot (a sketch of the combined check follows the list):
The cache size has reached the configured threshold, cache-snapshot-memory-size (25 MB by default).
The interval since the last snapshot exceeds cache-snapshot-write-cold-duration (10 minutes by default).
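Here is a simplified sketch of that combined check. The function shape is an assumption for illustration; in InfluxDB the logic lives behind ShouldCompactCache, which the loop above calls:

```go
package main

import (
	"fmt"
	"time"
)

// shouldSnapshot reports whether a snapshot should fire: either the cache
// has outgrown the memory threshold, or it has been cold for long enough.
func shouldSnapshot(cacheSize, threshold uint64, sinceLastSnapshot, coldDuration time.Duration) bool {
	if cacheSize > threshold {
		return true // cache-snapshot-memory-size reached (25 MB by default)
	}
	// cache-snapshot-write-cold-duration reached (10 minutes by default)
	return sinceLastSnapshot > coldDuration
}

func main() {
	fmt.Println(shouldSnapshot(30<<20, 25<<20, time.Minute, 10*time.Minute))   // true: size threshold hit
	fmt.Println(shouldSnapshot(1<<20, 25<<20, 11*time.Minute, 10*time.Minute)) // true: gone cold
}
```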
Specific implementation of Snapshot
So here are two questions:
1. What is the on-disk file format?
2. How does the snapshot flush itself proceed?
Let's first look at the on-disk file format:
![image.png](https://intranetproxy.alipay.com/skylark/lark/0/2019/png/90583/1564739463888-fba3e6ef-e794-4d9f-bd9d-359a351fe37a.png)
The TSM file consists of three parts: Series Data Section, Series Index Section, and Footer.
1. Series Data Section generation:
The Series Data Section consists of several Series Data Blocks.
Each Series Data Block is assembled in memory by the cacheKeyIterator.encode function, as follows:
```go
func (c *cacheKeyIterator) encode() {
	concurrency := runtime.GOMAXPROCS(0)
	n := len(c.ready)

	// Divide the keyset across each CPU
	chunkSize := 1
	idx := uint64(0)

	for i := 0; i < concurrency; i++ {
		// Run one goroutine per CPU and encode a section of the key space concurrently
		go func() {
			tenc := getTimeEncoder(tsdb.DefaultMaxPointsPerBlock)
			fenc := getFloatEncoder(tsdb.DefaultMaxPointsPerBlock)
			benc := getBooleanEncoder(tsdb.DefaultMaxPointsPerBlock)
			uenc := getUnsignedEncoder(tsdb.DefaultMaxPointsPerBlock)
			senc := getStringEncoder(tsdb.DefaultMaxPointsPerBlock)
			ienc := getIntegerEncoder(tsdb.DefaultMaxPointsPerBlock)

			defer putTimeEncoder(tenc)
			defer putFloatEncoder(fenc)
			defer putBooleanEncoder(benc)
			defer putUnsignedEncoder(uenc)
			defer putStringEncoder(senc)
			defer putIntegerEncoder(ienc)

			for {
				i := int(atomic.AddUint64(&idx, uint64(chunkSize))) - chunkSize
				if i >= n {
					break
				}

				key := c.order[i]
				values := c.cache.values(key)

				for len(values) > 0 {
					end := len(values)
					if end > c.size {
						end = c.size
					}

					minTime, maxTime := values[0].UnixNano(), values[end-1].UnixNano()
					var b []byte
					var err error

					switch values[0].(type) {
					case FloatValue:
						b, err = encodeFloatBlockUsing(nil, values[:end], tenc, fenc)
					case IntegerValue:
						b, err = encodeIntegerBlockUsing(nil, values[:end], tenc, ienc)
					case UnsignedValue:
						b, err = encodeUnsignedBlockUsing(nil, values[:end], tenc, uenc)
					case BooleanValue:
						b, err = encodeBooleanBlockUsing(nil, values[:end], tenc, benc)
					case StringValue:
						b, err = encodeStringBlockUsing(nil, values[:end], tenc, senc)
					default:
						b, err = Values(values[:end]).Encode(nil)
					}

					values = values[end:]

					c.blocks[i] = append(c.blocks[i], cacheBlock{
						k:       key,
						minTime: minTime,
						maxTime: maxTime,
						b:       b,
						err:     err,
					})

					if err != nil {
						c.err = err
					}
				}

				// Notify this key is fully encoded
				c.ready[i] <- struct{}{}
			}
		}()
	}
}
```
Each data type is encoded with its own Encoder, and the result is a two-dimensional array of cacheBlocks held in the iterator.
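One detail worth noting in encode is how work is shared: rather than pre-splitting the key range, each goroutine claims the next index from a shared atomic counter, which keeps all CPUs busy even when keys encode at different speeds. A stripped-down, runnable sketch of just that pattern:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

func main() {
	keys := []string{"cpu", "mem", "disk", "net", "swap"}
	n := len(keys)

	var idx uint64 // next index to claim; atomic.AddUint64 makes claims race-free
	var wg sync.WaitGroup

	for c := 0; c < runtime.GOMAXPROCS(0); c++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				i := int(atomic.AddUint64(&idx, 1)) - 1 // claim the next key
				if i >= n {
					return
				}
				fmt.Println("encoding", keys[i]) // stands in for block encoding
			}
		}()
	}
	wg.Wait()
}
```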
The next question is how these two-dimensional arrays are flushed to disk.
2. Flushing to disk in turn
We saw above that the cacheBlocks are held in iterators, so flushing the data is just a matter of walking the iterators. But one question remains: how is the Series Index Section generated?
The Series Index Section is ultimately composed of IndexEntry records. minTime, maxTime, and Size can all be obtained directly from the cacheBlock data; the key piece is the Offset.
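For reference, an IndexEntry in the tsm1 package looks roughly like this; everything but Offset is known as soon as a cacheBlock has been encoded:

```go
// IndexEntry describes one block of one series in the Series Index Section.
type IndexEntry struct {
	// The min and max time of all points stored in the block.
	MinTime, MaxTime int64

	// The absolute position in the file where the block is located.
	Offset int64

	// The size in bytes of the block in the file.
	Size uint32
}
```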
In practice, the Offset advances as the iteration progresses: each time a Series Data Block is written out, the Offset is updated.
The key code is as follows:
```go
n, err := t.w.Write(block) // during Write, t.n (the offset) is updated
if err != nil {
	return err
}
n += len(checksum)

// Record this block in the index; t.n is the offset
t.index.Add(key, blockType, minTime, maxTime, t.n, uint32(n))
```
To summarize: the raw data is encoded into Series Data Blocks, the Series Index Section is generated during the iteration, and the final TSM file is produced by appending the Series Index Section after the Series Data Section.
The problem is that the Series Index Section has to be held somewhere until it is written. If it occupies too much memory, it increases the risk of the process running out of memory (OOM).
Optimizing memory usage for snapshot
If n caches are snapshotted at the same time, the memory used is n × IndexSize.
For example, with 5 databases, each with 4 retention policies, and an IndexSize of 50 MB, that amounts to 5 × 4 × 50 MB = 1000 MB, roughly 1 GB of memory.
Therefore, an obvious optimization is to use files during the snapshot process to temporarily hold the Series Index Section instead of keeping it in memory; a sketch of the idea follows.
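The sketch below illustrates the approach; it is not the actual Aliyun InfluxDB implementation, just the shape of the idea: index entries are streamed to a temporary file as blocks are flushed, and the temporary file is copied in after the data section at the end, so index memory stays constant instead of growing with the number of series:

```go
package main

import (
	"encoding/binary"
	"io"
	"os"
)

// writeIndexEntry appends one index record (minTime, maxTime, offset, size)
// to the on-disk index buffer instead of an in-memory slice.
func writeIndexEntry(w io.Writer, minTime, maxTime, offset int64, size uint32) error {
	for _, v := range []interface{}{minTime, maxTime, offset, size} {
		if err := binary.Write(w, binary.BigEndian, v); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	tsm, _ := os.Create("demo.tsm")
	defer tsm.Close()

	idx, _ := os.CreateTemp("", "tsm-index-*.tmp")
	defer os.Remove(idx.Name())
	defer idx.Close()

	// 1. Flush data blocks to the TSM file, streaming each block's index
	//    entry to the temporary file as we go.
	tsm.Write([]byte("...series data blocks...")) // stands in for the data section
	writeIndexEntry(idx, 1e9, 2e9, 0, 24)

	// 2. Once all blocks are written, copy the buffered index in after the
	//    data section to form the Series Index Section.
	idx.Seek(0, io.SeekStart)
	io.Copy(tsm, idx)
}
```

The trade-off is a little extra disk I/O during the snapshot in exchange for a bounded memory footprint.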
Verification
When the disk is used to buffer the Index, the system generates a temporary index file during the snapshot process, as shown in the following figure.
When the disk is not used to buffer the Index, this file is not generated, as shown in the following figure.
Our long-term stability tests show that using the disk as an Index buffer effectively reduces the probability of OOM when the system is under high pressure.
Commercialization
* Aliyun InfluxDB® is now officially available. Visit the purchase page (https://common-buy.aliyun.com/?commodityCode=hitsdb_influxdb_pre#/buy) and the documentation (https://help.aliyun.com/document_detail/113093.html?spm=a2c4e.11153940.0.0.57b04a02biWzGa).