Index file of Lucene in Solr source code analysis (10)

Posted by Slashscape on Sun, 26 Dec 2021 02:59:06 +0100

2021SC@SDUSC

1. .dvd and .dvm files

.dvm stores the metadata of the DocValues fields, such as the offsets of the DocValues data.

.dvd stores the DocValues data itself.

In Solr 4.8.0, the .dvd and .dvm files use the Lucene encoding format Lucene45DocValuesFormat. As with the file formats covered earlier, it provides Lucene45DocValuesProducer and Lucene45DocValuesConsumer to read and write the files.

@Override
  public DocValuesConsumer fieldsConsumer(SegmentWriteState state) throws IOException {
    return new Lucene45DocValuesConsumer(state, DATA_CODEC, DATA_EXTENSION, META_CODEC, META_EXTENSION);
  }

  @Override
  public DocValuesProducer fieldsProducer(SegmentReadState state) throws IOException {
    return new Lucene45DocValuesProducer(state, DATA_CODEC, DATA_EXTENSION, META_CODEC, META_EXTENSION);
  }

Lucene 4.5 DocValues format encodes four types through the following strategies:

NUMERIC
- Delta-compressed: the per-document integers are written in blocks of 16k values. For each block, the minimum value is encoded, and each entry is stored as a delta from that minimum. The deltas of a block are compressed with bit packing. See BlockPackedWriter for more information.
- Table-compressed: when the number of unique values is very small (< 256) and there are unused gaps in the range of values, Lucene writes a lookup table instead. Each per-document entry is replaced by its ordinal in the table, and these ordinals are also compressed with bit packing (PackedInts).
- GCD-compressed: when all numbers share a common divisor, the greatest common divisor (GCD) is computed and the quotients are stored using the delta-compressed strategy.
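
To make the delta and GCD strategies more concrete, here is a minimal sketch. It is not Lucene's BlockPackedWriter code: the class NumericCompressionSketch and all of its methods are hypothetical names chosen for illustration, and it only computes what would be stored (deltas, quotients) and how many bits each stored value would need.

```java
import java.util.Arrays;

// Hypothetical illustration; not part of Lucene.
final class NumericCompressionSketch {

    // Smallest value in the block; every entry is stored relative to it.
    static long min(long[] values) {
        long m = values[0];
        for (long v : values) {
            m = Math.min(m, v);
        }
        return m;
    }

    // Delta-compressed: each entry becomes (value - min), which is never negative.
    // Assumes the subtraction does not overflow a long.
    static long[] deltas(long[] values) {
        long m = min(values);
        long[] out = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = values[i] - m;
        }
        return out;
    }

    // Euclid's algorithm; gcd(0, x) == x.
    static long gcd(long a, long b) {
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    // GCD-compressed: divide every delta by the divisor they all share.
    static long[] quotients(long[] values) {
        long[] d = deltas(values);
        long g = 0;
        for (long x : d) {
            g = gcd(g, x);
        }
        if (g > 1) {
            for (int i = 0; i < d.length; i++) {
                d[i] /= g;
            }
        }
        return d;
    }

    // Bits needed to bit-pack the largest (non-negative) entry.
    static int bitsRequired(long[] nonNegative) {
        long max = 0;
        for (long v : nonNegative) {
            max = Math.max(max, v);
        }
        return Math.max(1, 64 - Long.numberOfLeadingZeros(max));
    }

    public static void main(String[] args) {
        long[] values = {1000, 1300, 1900, 1000};
        long[] d = deltas(values);
        long[] q = quotients(values);
        System.out.println(Arrays.toString(d) + " needs " + bitsRequired(d) + " bits per value");
        System.out.println(Arrays.toString(q) + " needs " + bitsRequired(q) + " bits per value");
    }
}
```

In this example the raw values would need 11 bits each, the deltas need 10, and the quotients only 2, which is why the GCD strategy pays off for values such as dates that share a large common divisor.
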
BINARY
- Fixed-width binary: one large concatenated byte[] is written, along with the fixed length. Each document's value can be addressed directly with docID * length.
- Variable-width binary: one large concatenated byte[] is also written, together with the end address of each document's value. These addresses are written in blocks of 16k entries; each block records its absolute start and the average (expected) delta per entry. For each document, the deviation from that delta (actual - average) is recorded.
- Prefix-compressed binary: the values are written in chunks of 16 values. The first value of a chunk is written completely, while the other values share prefixes with it. The chunk addresses are written in blocks of 16k entries; each block records its absolute start and the average (expected) delta per entry. For each chunk, the deviation from that delta (actual - average) is recorded.
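
The "average delta plus deviation" idea used for the addresses can be sketched as follows. This is only an illustration under simplifying assumptions, not the MonotonicBlockPackedWriter implementation: MonotonicAddressSketch and its methods are hypothetical, a single block is handled, and the deviations are kept as plain longs instead of being bit-packed.

```java
import java.util.Arrays;

// Hypothetical illustration; not part of Lucene.
final class MonotonicAddressSketch {

    // End address of each document's bytes inside one concatenated byte[].
    static long[] endAddresses(byte[][] docValues) {
        long[] ends = new long[docValues.length];
        long pos = 0;
        for (int i = 0; i < docValues.length; i++) {
            pos += docValues[i].length;
            ends[i] = pos;
        }
        return ends;
    }

    // Average (expected) delta between consecutive addresses in the block.
    static float averageDelta(long[] addresses) {
        if (addresses.length < 2) {
            return 0f;
        }
        long span = addresses[addresses.length - 1] - addresses[0];
        return (float) span / (addresses.length - 1);
    }

    // Deviation of each address from the straight line start + avg * i.
    // These small numbers are what would be bit-packed.
    static long[] deviations(long[] addresses) {
        float avg = averageDelta(addresses);
        long[] dev = new long[addresses.length];
        for (int i = 0; i < addresses.length; i++) {
            long expected = addresses[0] + (long) (avg * i);
            dev[i] = addresses[i] - expected;
        }
        return dev;
    }

    // Rebuilds address i from the block start, the average delta and the deviation.
    static long decode(long start, float avg, long[] dev, int i) {
        return start + (long) (avg * i) + dev[i];
    }

    public static void main(String[] args) {
        byte[][] docs = {new byte[3], new byte[5], new byte[4], new byte[3]};
        long[] ends = endAddresses(docs);
        System.out.println("end addresses: " + Arrays.toString(ends));
        System.out.println("average delta: " + averageDelta(ends));
        System.out.println("deviations:    " + Arrays.toString(deviations(ends)));
    }
}
```

Because addresses grow almost linearly when values have similar lengths, the deviations stay close to zero and need far fewer bits than the absolute addresses.
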
SORTED
- A mapping from ordinals to deduplicated terms is written as prefix-compressed binary, and the per-document ordinals are written using one of the numeric strategies above.
SORTED_SET
- A mapping from ordinals to deduplicated terms is written as prefix-compressed binary. In addition, a list of ordinals and each document's index into this list are written using the numeric strategies above.
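
The relationship between terms and ordinals can be illustrated with a short sketch. It assumes plain Java strings compared in their natural order (Lucene compares the terms as bytes), and SortedOrdinalsSketch with its methods is a hypothetical name, not a Lucene class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration; not part of Lucene.
final class SortedOrdinalsSketch {

    // Sorted, deduplicated terms: the position of a term in this list is its ordinal.
    static List<String> dictionary(String[] perDocValues) {
        return new ArrayList<>(new TreeSet<>(Arrays.asList(perDocValues)));
    }

    // One ordinal per document; these small integers are what the numeric strategies compress.
    static int[] ordinals(String[] perDocValues) {
        List<String> dict = dictionary(perDocValues);
        int[] ords = new int[perDocValues.length];
        for (int doc = 0; doc < perDocValues.length; doc++) {
            ords[doc] = Collections.binarySearch(dict, perDocValues[doc]);
        }
        return ords;
    }

    public static void main(String[] args) {
        String[] values = {"pear", "apple", "pear", "fig"};
        System.out.println("terms:    " + dictionary(values));
        System.out.println("ordinals: " + Arrays.toString(ordinals(values)));
    }
}
```

Each distinct term is stored once, however many documents contain it, and sorting by value only requires comparing ordinals.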

1.1 .dvm and .dvd file formats

First, let's look at the file format of .dvm.

The .dvm file structure is divided into several layers:

Layer 1: .dvm consists of Header, NumFields entries (<Entry>^NumFields) and Footer
- Header and Footer are the same as in the file formats covered earlier
- NumFields is the number of Entry records, one per DocValues field
Layer 2: Entry has four types: NumericEntry | BinaryEntry | SortedEntry | SortedSetEntry
Layer 3:
- NumericEntry: there are three types: GCDNumericEntry | TableNumericEntry | DeltaNumericEntry
  - GCDNumericEntry: contains NumericHeader, MinValue and GCD
  - TableNumericEntry: contains NumericHeader, TableSize and Int64^TableSize (TableSize 64-bit values)
  - DeltaNumericEntry: contains a NumericHeader
- BinaryEntry: there are three types: FixedBinaryEntry | VariableBinaryEntry | PrefixBinaryEntry
  - FixedBinaryEntry: contains BinaryHeader
  - VariableBinaryEntry: contains BinaryHeader, AddressOffset, PackedVersion and BlockSize
  - PrefixBinaryEntry: contains BinaryHeader, AddressInterval, AddressOffset, PackedVersion and BlockSize
- SortedEntry: contains FieldNumber, EntryType, BinaryEntry and NumericEntry
- SortedSetEntry: contains EntryType, BinaryEntry, NumericEntry and NumericEntry

The .dvd file likewise has several layers:

Layer 1: .dvd consists of Header, <NumericData | BinaryData | SortedData>^NumFields and Footer. As in .dvm, there are NumFields records, and each record is one Data item (NumericData, BinaryData or SortedData)
Layer 2:
- NumericData: DeltaCompressedNumerics | TableCompressedNumerics | GCDCompressedNumerics, corresponding to the numeric compression strategies described above
- BinaryData: Byte^DataLength (DataLength raw bytes), Addresses
- SortedData: FST
Layer 3:
- DeltaCompressedNumerics: BlockPackedInts(blockSize=16k)
- TableCompressedNumerics: PackedInts
- GCDCompressedNumerics: BlockPackedInts(blockSize=16k)
- Addresses: MonotonicBlockPackedInts(blockSize=16k)
A SortedSet entry stores its list of ordinals in BinaryData as a sequence of increasing vLongs, delta-encoded.
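
A sequence of increasing values that is delta-encoded as vLongs can be sketched like this. It is an illustration only: DeltaVLongSketch is a hypothetical class, and writeVLong here is a local helper that mimics the usual variable-length layout (7 data bits per byte, high bit set when more bytes follow) rather than a call into Lucene.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical illustration; not part of Lucene.
final class DeltaVLongSketch {

    // Variable-length encoding of a non-negative long: 7 data bits per byte,
    // high bit set means "more bytes follow".
    static void writeVLong(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // Stores each ordinal as the difference to the previous one.
    static byte[] encode(long[] increasingOrds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long previous = 0;
        for (long ord : increasingOrds) {
            writeVLong(out, ord - previous);
            previous = ord;
        }
        return out.toByteArray();
    }

    // Reads `count` vLong deltas back and accumulates them into ordinals.
    static long[] decode(byte[] bytes, int count) {
        long[] ords = new long[count];
        int pos = 0;
        long previous = 0;
        for (int i = 0; i < count; i++) {
            long delta = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                delta |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += delta;
            ords[i] = previous;
        }
        return ords;
    }

    public static void main(String[] args) {
        long[] ords = {3, 130, 131, 1000};
        byte[] encoded = encode(ords);
        System.out.println(encoded.length + " bytes for " + ords.length + " ordinals");
        System.out.println(Arrays.toString(decode(encoded, ords.length)));
    }
}
```

Because the list is increasing, the deltas are small and most of them fit in a single byte.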

1.2 .dvm and .dvd code implementation

As mentioned earlier, Lucene45DocValuesFormat provides Lucene45DocValuesProducer and Lucene45DocValuesConsumer to read and write the files, so this section uses Lucene45DocValuesProducer as the example for studying .dvm and .dvd.

First, consider how Lucene45DocValuesProducer is initialized. The constructor's main job is to read the .dvm file and open the .dvd stream. While reading the .dvm file, Lucene45DocValuesProducer calls readFields(in, state.fieldInfos) to obtain the entry information.

protected Lucene45DocValuesProducer(SegmentReadState state, String dataCodec, String dataExtension, String metaCodec, String metaExtension) throws IOException {
    // .dvm file name
    String metaName = IndexFileNames.segmentFileName(state.segmentInfo.name, state.segmentSuffix, metaExtension);
    // read in the entries from the metadata file.
    // Open the .dvm file as a checksum-verifying input stream
    ChecksumIndexInput in = state.directory.openChecksumInput(metaName, state.context);
    // Get the number of documents in the segment
    this.maxDoc = state.segmentInfo.getDocCount();
    boolean success = false;
    try {
      // Read and validate the .dvm header
      version = CodecUtil.checkHeader(in, metaCodec,
                                      Lucene45DocValuesFormat.VERSION_START,
                                      Lucene45DocValuesFormat.VERSION_CURRENT);
      numerics = new HashMap<>();
      ords = new HashMap<>();
      ordIndexes = new HashMap<>();
      binaries = new HashMap<>();
      sortedSets = new HashMap<>();

      // Read the NumFields entries (<Entry>^NumFields)
      readFields(in, state.fieldInfos);

      // Verify the Footer (checksum); older versions only check for end of file
      if (version >= Lucene45DocValuesFormat.VERSION_CHECKSUM) {
        CodecUtil.checkFooter(in);
      } else {
        CodecUtil.checkEOF(in);
      }

      success = true;
    } finally {
      if (success) {
        IOUtils.close(in);
      } else {
        IOUtils.closeWhileHandlingException(in);
      }
    }

    success = false;
    try {
      // .dvd file name
      String dataName = IndexFileNames.segmentFileName(state.segmentInfo.name, state.segmentSuffix, dataExtension);
      // Open the .dvd file
      data = state.directory.openInput(dataName, state.context);
      // Read and validate the .dvd header
      final int version2 = CodecUtil.checkHeader(data, dataCodec,
                                                 Lucene45DocValuesFormat.VERSION_START,
                                                 Lucene45DocValuesFormat.VERSION_CURRENT);
      if (version != version2) {
        throw new CorruptIndexException("Format versions mismatch");
      }

      success = true;
    } finally {
      if (!success) {
        IOUtils.closeWhileHandlingException(this.data);
      }
    }
    // Initial RAM estimate: the shallow size of this producer instance (not the size of the .dvd stream)
    ramBytesUsed = new AtomicLong(RamUsageEstimator.shallowSizeOfInstance(getClass()));
  }

readFields(in, state.fieldInfos) reads the EntryType of each Entry and, depending on its value, chooses the method that reads the rest of the Entry.

The function relies on the following methods:

1. NumericEntry type: readNumericEntry()

2. BinaryEntry type: readBinaryEntry()

3. SortedEntry type: readSortedField()

4. SortedSetEntry type: readSortedSetEntry(). For this type, readFields additionally calls readSortedSetFieldWithAddresses() or readSortedField(), depending on the entry's format

private void readFields(IndexInput meta, FieldInfos infos) throws IOException {
    // Read the field number of the Entry; -1 marks the end of the entry list.
    int fieldNumber = meta.readVInt();
    while (fieldNumber != -1) {
      // check should be: infos.fieldInfo(fieldNumber) != null, which incorporates negative check
      // but docvalues updates are currently buggy here (loading extra stuff, etc): LUCENE-5616
      if (fieldNumber < 0) {
        // trickier to validate more: because we re-use for norms, because we use multiple entries
        // for "composite" types like sortedset, etc.
        throw new CorruptIndexException("Invalid field number: " + fieldNumber + " (resource=" + meta + ")");
      }
      // Read the EntryType to distinguish the type of Entry: 0 = NUMERIC, 1 = BINARY, 2 = SORTED, 3 = SORTED_SET
      byte type = meta.readByte();
      if (type == Lucene45DocValuesFormat.NUMERIC) {
        //Get the specific NumericEntry content and put it into the map with number as the key and NumericEntry as the value
        numerics.put(fieldNumber, readNumericEntry(meta));
      } else if (type == Lucene45DocValuesFormat.BINARY) {
        //Get the specific BinaryEntry content and put it into the map with number as the key and BinaryEntry as the value
        BinaryEntry b = readBinaryEntry(meta);
        binaries.put(fieldNumber, b);
      } else if (type == Lucene45DocValuesFormat.SORTED) {
        //Read SortedEntry
        readSortedField(fieldNumber, meta, infos);
      } else if (type == Lucene45DocValuesFormat.SORTED_SET) {
        //Read SortedSetEntry and put it into the map with number as the key and SortedSetEntry as the value
        SortedSetEntry ss = readSortedSetEntry(meta);
        sortedSets.put(fieldNumber, ss);
        // SORTED_SET_WITH_ADDRESSES: the standard sorted-set encoding, which goes through an address indirection (docID -> address -> ord)
        if (ss.format == SORTED_SET_WITH_ADDRESSES) {
          readSortedSetFieldWithAddresses(fieldNumber, meta, infos);
          // SORTED_SET_SINGLE_VALUED_SORTED: single-valued case, which stores only a direct docID -> ord mapping
        } else if (ss.format == SORTED_SET_SINGLE_VALUED_SORTED) {
          if (meta.readVInt() != fieldNumber) {
            throw new CorruptIndexException("sortedset entry for field: " + fieldNumber + " is corrupt (resource=" + meta + ")");
          }
          if (meta.readByte() != Lucene45DocValuesFormat.SORTED) {
            throw new CorruptIndexException("sortedset entry for field: " + fieldNumber + " is corrupt (resource=" + meta + ")");
          }
          readSortedField(fieldNumber, meta, infos);
        } else {
          throw new AssertionError();
        }
      } else {
        throw new CorruptIndexException("invalid type: " + type + ", resource=" + meta);
      }
      // Read the field number of the next Entry
      fieldNumber = meta.readVInt();
    }
  }
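
The control flow of readFields (read a field number, stop at -1, dispatch on a type byte) can be reproduced in a small stand-alone sketch. Everything here is an assumption made for illustration: MetaLoopSketch is a hypothetical class, the stream is built in memory, fixed-width ints from java.io replace Lucene's vInt encoding, and only two entry types are recognised.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not part of Lucene.
final class MetaLoopSketch {
    static final byte NUMERIC = 0;
    static final byte BINARY = 1;
    static final int END_OF_ENTRIES = -1;

    // Builds a tiny fake metadata stream: (fieldNumber, type) pairs followed by -1.
    static byte[] sampleMeta() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(0);
            out.writeByte(NUMERIC);
            out.writeInt(3);
            out.writeByte(BINARY);
            out.writeInt(END_OF_ENTRIES);
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads entries until the sentinel and records the kind of each field.
    static Map<Integer, String> readFields(byte[] meta) {
        Map<Integer, String> entries = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(meta))) {
            int fieldNumber = in.readInt();
            while (fieldNumber != END_OF_ENTRIES) {
                if (fieldNumber < 0) {
                    throw new IllegalStateException("Invalid field number: " + fieldNumber);
                }
                byte type = in.readByte();
                if (type == NUMERIC) {
                    entries.put(fieldNumber, "numeric");
                } else if (type == BINARY) {
                    entries.put(fieldNumber, "binary");
                } else {
                    throw new IllegalStateException("invalid type: " + type);
                }
                fieldNumber = in.readInt();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }

    public static void main(String[] args) {
        System.out.println(readFields(sampleMeta()));
    }
}
```

The sentinel value means the writer never has to know the number of entries in advance; it simply appends -1 after the last one.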

readNumericEntry()

static NumericEntry readNumericEntry(IndexInput meta) throws IOException {
    NumericEntry entry = new NumericEntry();
    entry.format = meta.readVInt();                // NumericType: the encoding used for the numeric values
    entry.missingOffset = meta.readLong();         // MissingOffset: pointer to the bitset of documents that have a value; -1 means no document is missing a value
    entry.packedIntsVersion = meta.readVInt();     // PackedVersion: the version of the packed-integer format
    entry.offset = meta.readLong();                // DataOffset: pointer to the start of the data in the .dvd file
    entry.count = meta.readVLong();                // Count: the number of values written
    entry.blockSize = meta.readVInt();             // BlockSize: the block size of the packed integers
    switch(entry.format) {
      case GCD_COMPRESSED:                         // GCD-compressed (greatest common divisor)
        entry.minValue = meta.readLong();          // MinValue
        entry.gcd = meta.readLong();               // GCD
        break;
      case TABLE_COMPRESSED:                       // Table-compressed
        if (entry.count > Integer.MAX_VALUE) {
          throw new CorruptIndexException("Cannot use TABLE_COMPRESSED with more than MAX_VALUE values, input=" + meta);
        }
        final int uniqueValues = meta.readVInt();  // TableSize
        if (uniqueValues > 256) {                  // TableSize must not exceed 256
          throw new CorruptIndexException("TABLE_COMPRESSED cannot have more than 256 distinct values, input=" + meta);
        }
        entry.table = new long[uniqueValues];      // Int64^TableSize: TableSize 64-bit values
        for (int i = 0; i < uniqueValues; ++i) {
          entry.table[i] = meta.readLong();
        }
        break;
      case DELTA_COMPRESSED:                      //Delta compressed
        break;
      default:
        throw new CorruptIndexException("Unknown format: " + entry.format + ", input=" + meta);
    }
    return entry;
  }
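
Once the table (or MinValue and GCD) has been read from the metadata, decoding a document's value is a simple lookup or multiplication. The sketch below shows that idea under stated assumptions: TableCompressionSketch and its methods are hypothetical, the table is sorted here purely so that ordinals can be found with a binary search, and the per-document ordinals are kept in a plain int[] instead of PackedInts.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical illustration; not part of Lucene.
final class TableCompressionSketch {

    // Distinct values in sorted order; rejects inputs with more than 256 of them.
    static long[] buildTable(long[] values) {
        TreeSet<Long> unique = new TreeSet<>();
        for (long v : values) {
            unique.add(v);
        }
        if (unique.size() > 256) {
            throw new IllegalArgumentException("more than 256 distinct values");
        }
        long[] table = new long[unique.size()];
        int i = 0;
        for (long v : unique) {
            table[i++] = v;
        }
        return table;
    }

    // Replaces each value by its position (ordinal) in the table.
    static int[] encode(long[] values, long[] table) {
        int[] ords = new int[values.length];
        for (int doc = 0; doc < values.length; doc++) {
            ords[doc] = Arrays.binarySearch(table, values[doc]);
        }
        return ords;
    }

    // Table-compressed lookup: ordinal -> original value.
    static long decode(long[] table, int[] ords, int docID) {
        return table[ords[docID]];
    }

    // GCD-compressed lookup: rebuild the value from MinValue, GCD and the stored quotient.
    static long decodeGcd(long minValue, long gcd, long quotient) {
        return minValue + gcd * quotient;
    }

    public static void main(String[] args) {
        long[] values = {10, 1_000_000, 10, 500};
        long[] table = buildTable(values);
        int[] ords = encode(values, table);
        System.out.println("table: " + Arrays.toString(table));
        System.out.println("ords:  " + Arrays.toString(ords));
        System.out.println("doc 1: " + decode(table, ords, 1));
    }
}
```

With at most 256 table entries, every ordinal fits in 8 bits regardless of how large the original values are.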

readBinaryEntry()

static BinaryEntry readBinaryEntry(IndexInput meta) throws IOException {
    BinaryEntry entry = new BinaryEntry();
    entry.format = meta.readVInt();                 // BinaryType: the encoding used for the binary values
    entry.missingOffset = meta.readLong();          // MissingOffset: same meaning as in NumericEntry
    entry.minLength = meta.readVInt();              // MinLength and MaxLength: the minimum and maximum length of the byte[] values.
                                                    // If the two are equal, all values have a fixed size
                                                    // and can be addressed with DataOffset + (docID * length).
                                                    // Otherwise the binary values are variable-length.
    entry.maxLength = meta.readVInt();
    entry.count = meta.readVLong();
    entry.offset = meta.readLong();                 // DataOffset: offset of the actual binary data
    switch(entry.format) {
      case BINARY_FIXED_UNCOMPRESSED:               // Fixed-width Binary
        break;
      case BINARY_PREFIX_COMPRESSED:                // Prefix-compressed Binary
        entry.addressInterval = meta.readVInt();
        entry.addressesOffset = meta.readLong();
        entry.packedIntsVersion = meta.readVInt();
        entry.blockSize = meta.readVInt();
        break;
      case BINARY_VARIABLE_UNCOMPRESSED:            // Variable-width Binary
        entry.addressesOffset = meta.readLong();
        entry.packedIntsVersion = meta.readVInt();
        entry.blockSize = meta.readVInt();
        break;
      default:
        throw new CorruptIndexException("Unknown format: " + entry.format + ", input=" + meta);
    }
    return entry;
  }
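
When MinLength equals MaxLength, no address table is needed, because the position of a value follows directly from DataOffset + (docID * length). A minimal sketch of that addressing, with an in-memory byte[] standing in for the .dvd data (FixedBinarySketch and its methods are hypothetical names):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration; not part of Lucene.
final class FixedBinarySketch {

    // Concatenates equally sized values into one large byte[].
    static byte[] concat(byte[][] values, int length) {
        byte[] data = new byte[values.length * length];
        for (int doc = 0; doc < values.length; doc++) {
            if (values[doc].length != length) {
                throw new IllegalArgumentException("value " + doc + " does not have length " + length);
            }
            System.arraycopy(values[doc], 0, data, doc * length, length);
        }
        return data;
    }

    // Fixed-width lookup: the value of docID starts at dataOffset + docID * length.
    static byte[] get(byte[] data, int dataOffset, int length, int docID) {
        int start = dataOffset + docID * length;
        return Arrays.copyOfRange(data, start, start + length);
    }

    public static void main(String[] args) {
        byte[][] values = {
            "abc".getBytes(StandardCharsets.US_ASCII),
            "def".getBytes(StandardCharsets.US_ASCII),
            "ghi".getBytes(StandardCharsets.US_ASCII)
        };
        byte[] data = concat(values, 3);
        System.out.println(new String(get(data, 0, 3, 1), StandardCharsets.US_ASCII));
    }
}
```

This is why FixedBinaryEntry contains nothing beyond the BinaryHeader, whereas the variable-width and prefix-compressed entries carry extra address information.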

readSortedSetFieldWithAddresses()

private void readSortedSetFieldWithAddresses(int fieldNumber, IndexInput meta, FieldInfos infos) throws IOException {
    // sortedset = binary + numeric (addresses) + ordIndex
    if (meta.readVInt() != fieldNumber) {
      throw new CorruptIndexException("sortedset entry for field: " + fieldNumber + " is corrupt (resource=" + meta + ")");
    }
    if (meta.readByte() != Lucene45DocValuesFormat.BINARY) {
      throw new CorruptIndexException("sortedset entry for field: " + fieldNumber + " is corrupt (resource=" + meta + ")");
    }
    BinaryEntry b = readBinaryEntry(meta);
    binaries.put(fieldNumber, b);

    if (meta.readVInt() != fieldNumber) {
      throw new CorruptIndexException("sortedset entry for field: " + fieldNumber + " is corrupt (resource=" + meta + ")");
    }
    if (meta.readByte() != Lucene45DocValuesFormat.NUMERIC) {
      throw new CorruptIndexException("sortedset entry for field: " + fieldNumber + " is corrupt (resource=" + meta + ")");
    }
    NumericEntry n1 = readNumericEntry(meta);
    ords.put(fieldNumber, n1);

    if (meta.readVInt() != fieldNumber) {
      throw new CorruptIndexException("sortedset entry for field: " + fieldNumber + " is corrupt (resource=" + meta + ")");
    }
    if (meta.readByte() != Lucene45DocValuesFormat.NUMERIC) {
      throw new CorruptIndexException("sortedset entry for field: " + fieldNumber + " is corrupt (resource=" + meta + ")");
    }
    NumericEntry n2 = readNumericEntry(meta);
    ordIndexes.put(fieldNumber, n2);
  }
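
The three entries read here (binary, ords, ordIndex) fit together through the docID -> address -> ord indirection. The following sketch assumes that ordIndex holds, for each document, the end offset of its ordinals in one flat ordinal list; plain arrays stand in for the packed structures, and SortedSetAddressesSketch is a hypothetical class rather than Lucene code.

```java
import java.util.Arrays;

// Hypothetical illustration; not part of Lucene.
final class SortedSetAddressesSketch {

    // ordIndex[doc] is assumed to be the END offset of doc's ordinals in the flat ords array,
    // so a document's ordinals span from the previous document's end offset up to its own.
    static long[] ordsForDoc(long[] ords, long[] ordIndex, int docID) {
        int start = docID == 0 ? 0 : (int) ordIndex[docID - 1];
        int end = (int) ordIndex[docID];
        return Arrays.copyOfRange(ords, start, end);
    }

    public static void main(String[] args) {
        long[] ords = {0, 2, 1, 0, 1, 2};   // ordinals of all documents, concatenated
        long[] ordIndex = {2, 2, 3, 6};     // doc 0 has two ords, doc 1 none, doc 2 one, doc 3 three
        for (int doc = 0; doc < ordIndex.length; doc++) {
            System.out.println("doc " + doc + " -> " + Arrays.toString(ordsForDoc(ords, ordIndex, doc)));
        }
    }
}
```

Each ordinal obtained this way would then be resolved to its term through the binary entry, just as for a SORTED field.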

readSortedField()

private void readSortedField(int fieldNumber, IndexInput meta, FieldInfos infos) throws IOException {
    // sorted = binary + numeric
    if (meta.readVInt() != fieldNumber) {
      throw new CorruptIndexException("sorted entry for field: " + fieldNumber + " is corrupt (resource=" + meta + ")");
    }
    if (meta.readByte() != Lucene45DocValuesFormat.BINARY) {
      throw new CorruptIndexException("sorted entry for field: " + fieldNumber + " is corrupt (resource=" + meta + ")");
    }
    BinaryEntry b = readBinaryEntry(meta);
    binaries.put(fieldNumber, b);

    if (meta.readVInt() != fieldNumber) {
      throw new CorruptIndexException("sorted entry for field: " + fieldNumber + " is corrupt (resource=" + meta + ")");
    }
    if (meta.readByte() != Lucene45DocValuesFormat.NUMERIC) {
      throw new CorruptIndexException("sorted entry for field: " + fieldNumber + " is corrupt (resource=" + meta + ")");
    }
    NumericEntry n = readNumericEntry(meta);
    ords.put(fieldNumber, n);
  }

That covers how the .dvm file is read. The next step is to look at how the .dvd file is read.
