Solr source code analysis: Lucene index files

Posted by DaiWelsh on Thu, 09 Dec 2021 18:49:16 +0100

2021SC@SDUSC
1 The segments_N file
An index corresponds to a directory, and all of its index files are stored in that directory. Solr stores its index files in the core/data/index directory under Solr/Home, and each core corresponds to one index.

segments_N lists all live segments of the index together with their deletion information. An index can contain several segments_N files, but the one in effect is always the one with the largest N. Several segments_N files can coexist because older ones cannot be deleted yet, because an IndexWriter commit is in progress, or because an IndexDeletionPolicy is still holding on to them. The code that handles segments_N is mainly in SegmentInfos.java.

1.1 Selecting segments_N
How the segments_N file to read is selected:

List the index directory, look at every file whose name starts with segments except segments.gen, and take the largest N as genA.

String[] files = null;
long genA = -1;
files = directory.listAll();
if (files != null) {
    genA = getLastCommitGeneration(files);
}

...

public static long getLastCommitGeneration(String[] files) {
    if (files == null) {
      return -1;
    }
    long max = -1;
    for (String file : files) {
      if (file.startsWith(IndexFileNames.SEGMENTS) && !file.equals(IndexFileNames.SEGMENTS_GEN)) {
        long gen = generationFromSegmentsFileName(file);
        if (gen > max) {
          max = gen;
        }
      }
    }
    return max;
  }
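To make the scan above concrete, here is a minimal sketch that does not depend on Lucene. The class SegmentsFileNames and its method names are hypothetical stand-ins for IndexFileNames and SegmentInfos. It relies on the fact that the N in segments_N is written in base 36 (Character.MAX_RADIX), and it is stricter than the original in that it rejects names without an underscore.

```java
final class SegmentsFileNames {
    static final String SEGMENTS = "segments";
    static final String SEGMENTS_GEN = "segments.gen";

    // "segments" alone counts as generation 0; "segments_N" stores N in base 36.
    static long generationFromFileName(String fileName) {
        if (fileName.equals(SEGMENTS)) {
            return 0;
        }
        if (fileName.startsWith(SEGMENTS + "_")) {
            String suffix = fileName.substring(SEGMENTS.length() + 1);
            return Long.parseLong(suffix, Character.MAX_RADIX);
        }
        throw new IllegalArgumentException("not a segments file: " + fileName);
    }

    // Same scan as getLastCommitGeneration: keep the largest N, or -1 if none is found.
    static long lastCommitGeneration(String[] files) {
        long max = -1;
        if (files == null) {
            return max;
        }
        for (String file : files) {
            if (file.startsWith(SEGMENTS) && !file.equals(SEGMENTS_GEN)) {
                max = Math.max(max, generationFromFileName(file));
            }
        }
        return max;
    }
}
```

For a directory listing of segments_1, segments_a, segments.gen and _0.cfs, the sketch picks generation 10, because "a" is 10 in base 36.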

Open segments.gen, which stores the current value of N. Read the version number first, then read the two copies of N. If the two copies are equal, that value is taken as genB.

long genB = -1;
ChecksumIndexInput genInput = null;
try {
    genInput = directory.openChecksumInput(IndexFileNames.SEGMENTS_GEN, IOContext.READONCE);
} catch (IOException e) {
...
int version = genInput.readInt();
long gen0 = genInput.readLong();
long gen1 = genInput.readLong();
if (gen0 == gen1) {
    genB = gen0;
}

Take the larger of genA and genB obtained above as the current N, and then open the corresponding segments_N file.

gen = Math.max(genA, genB);
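The genB step and the final choice can be tried out without Lucene. The sketch below is only an approximation: the class GenerationChooser and its methods are hypothetical, the byte layout (one int followed by two longs, big-endian) is a simplification of the generation file (IndexFileNames.SEGMENTS_GEN), and any header or checksum handling is left out.

```java
import java.nio.ByteBuffer;

final class GenerationChooser {
    private static final int LENGTH = Integer.BYTES + 2 * Long.BYTES;

    // Builds the simplified layout: version, then the generation written twice.
    static byte[] encodeSegmentsGen(int version, long gen0, long gen1) {
        return ByteBuffer.allocate(LENGTH).putInt(version).putLong(gen0).putLong(gen1).array();
    }

    // Returns the stored generation, or -1 if the file is truncated or the two copies disagree.
    static long readGenB(byte[] bytes) {
        if (bytes == null || bytes.length < LENGTH) {
            return -1;
        }
        ByteBuffer in = ByteBuffer.wrap(bytes);
        in.getInt();                 // version, ignored in this sketch
        long gen0 = in.getLong();
        long gen1 = in.getLong();
        return gen0 == gen1 ? gen0 : -1;
    }

    // genA comes from the directory listing, genB from the generation file.
    static long chooseGeneration(long genA, long genB) {
        return Math.max(genA, genB);
    }
}
```

Writing the generation twice is what lets a reader notice a half-written file: if the two copies differ, genB stays at -1 and the directory listing alone decides.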

1.2 Structure of segments_N
File structure:

Header, Version, NameCounter, SegCount, <SegName, SegCodec, DelGen, DeletionCount, FieldInfosGen, UpdatesFiles>^SegCount, CommitUserData, Footer

Here <SegName, SegCodec, DelGen, DeletionCount, FieldInfosGen, UpdatesFiles> is the information for one segment and SegCount is the number of segments, so <SegName, SegCodec, DelGen, DeletionCount, FieldInfosGen, UpdatesFiles>^SegCount means that SegCount such segment records are stored one after another.

Header: a CodecHeader made up of Magic, CodecName and Version.
Magic is the start marker, normally 1071082519 (CODEC_MAGIC).

CodecName is the identifier of the file.

Version is the format version of the index file. When an IndexReader built for one version reads an index generated by a different, unsupported version, an error is reported because the values do not match.

public static int checkHeader(DataInput in, String codec, int minVersion, int maxVersion)
    throws IOException {

    // Safety to guard against reading a bogus string:
    final int actualHeader = in.readInt();   //Read Magic
    if (actualHeader != CODEC_MAGIC) {
      throw new CorruptIndexException("codec header mismatch: actual header=" + actualHeader + " vs expected header=" + CODEC_MAGIC + " (resource: " + in + ")");
    }
    return checkHeaderNoMagic(in, codec, minVersion, maxVersion); //Read CodecName and Version and judge
  }
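The header check can be illustrated with a small round trip. This is a sketch under stated assumptions, not the CodecUtil implementation: the class CodecHeaderSketch is hypothetical, the codec name is stored with a single length byte (so it only handles names up to 127 bytes), and errors are reported with IllegalStateException instead of CorruptIndexException.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class CodecHeaderSketch {
    static final int CODEC_MAGIC = 0x3fd76c17; // 1071082519

    // Layout used here: Magic (int), name length (1 byte), name bytes, Version (int).
    static byte[] writeHeader(String codec, int version) {
        byte[] name = codec.getBytes(StandardCharsets.UTF_8);
        if (name.length > 127) {
            throw new IllegalArgumentException("codec name too long for this sketch");
        }
        return ByteBuffer.allocate(4 + 1 + name.length + 4)
                .putInt(CODEC_MAGIC).put((byte) name.length).put(name).putInt(version).array();
    }

    // Checks Magic, then CodecName, then that Version lies in [minVersion, maxVersion].
    static int checkHeader(byte[] bytes, String codec, int minVersion, int maxVersion) {
        ByteBuffer in = ByteBuffer.wrap(bytes);
        int magic = in.getInt();
        if (magic != CODEC_MAGIC) {
            throw new IllegalStateException("codec header mismatch: actual header=" + magic);
        }
        byte[] name = new byte[in.get()];
        in.get(name);
        String actualCodec = new String(name, StandardCharsets.UTF_8);
        if (!actualCodec.equals(codec)) {
            throw new IllegalStateException("codec mismatch: actual codec=" + actualCodec);
        }
        int version = in.getInt();
        if (version < minVersion || version > maxVersion) {
            throw new IllegalStateException("unsupported version: " + version);
        }
        return version;
    }
}
```

Checking the magic number first means a file that is not an index file at all is rejected before any string is read from it, which is the point of the "guard against reading a bogus string" comment.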

Version:
The version number of the index. It records how many times IndexWriter has committed changes to the index files.
In most cases, the initial value is read from the index file.
The exact number of commits is not what matters; what matters is which version is the latest. An IndexReader compares its own version with the version in the index file to determine whether IndexWriter has updated the index since the reader was opened.

public boolean isCurrent() throws IOException {
    ensureOpen();
    if (writer == null || writer.isClosed()) {
      // Fully read the segments file: this ensures that it's
      // completely written so that if
      // IndexWriter.prepareCommit has been called (but not
      // yet commit), then the reader will still see itself as
      // current:
      SegmentInfos sis = new SegmentInfos();
      sis.read(directory);

      // we loaded SegmentInfos from the directory
      return sis.getVersion() == segmentInfos.getVersion();
    } else {
      return writer.nrtIsCurrent(segmentInfos);
    }
  }
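The idea behind isCurrent() is just a comparison of two version numbers. The toy model below shows it in isolation. The classes VersionedIndexSketch and Reader are hypothetical, and the index is reduced to a single counter, so nothing here touches the real SegmentInfos.

```java
import java.util.concurrent.atomic.AtomicLong;

final class VersionedIndexSketch {
    // Stand-in for the Version field: bumped once per commit.
    private final AtomicLong version = new AtomicLong();

    long commit() {
        return version.incrementAndGet();
    }

    long currentVersion() {
        return version.get();
    }

    static final class Reader {
        private final VersionedIndexSketch index;
        private final long versionWhenOpened;

        Reader(VersionedIndexSketch index) {
            this.index = index;
            this.versionWhenOpened = index.currentVersion();
        }

        // The reader is current as long as no commit has happened since it was opened.
        boolean isCurrent() {
            return versionWhenOpened == index.currentVersion();
        }
    }
}
```

A reader opened after the first commit reports isCurrent() as true until the next commit() call, after which it reports false and would need to be reopened.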

NameCounter
The counter used to build the name of the next new segment.
All index files that belong to the same segment use the segment name as their file name, typically _0.xxx, _0.yyy, _1.xxx, _1.yyy ...
The name of a newly generated segment is generally the current largest segment name plus one.
SegCount
Number of segments.
Metadata of the SegCount segments:
SegName: the segment name. All files that belong to the same segment use the segment name as their file name.
SegCodec: the name of the codec used to encode the segment.
DelGen: the generation of the .del file.
In Lucene, until the index is optimized, deleted documents are recorded in the .del file.
DelGen is incremented by 1 each time IndexWriter commits a delete operation to the index and generates a new .del file.
A value of -1 means that no documents have been deleted.
DeletionCount: the number of documents deleted in this segment.
FieldInfosGen: the generation of the field infos file of the segment. A value of -1 means the field infos file has not been updated; a value greater than 0 means it has been updated.
UpdatesFiles: a list of files that store updates to this segment.
CommitUserData: the user data attached to the commit, read as a string-to-string map.
Footer: the codec footer, which contains the checksum and the ID of the checksum algorithm.
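The field list can be mirrored as a plain in-memory model, which I find easier to keep in my head than the format string. Everything in this sketch is hypothetical naming (SegmentsFileModel, SegmentEntry, newSegmentName). The only behaviour taken from the description above is that -1 means "none" for DelGen and FieldInfosGen, and that segment names are an underscore followed by the counter, here written in base 36.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SegmentsFileModel {
    // One <SegName, SegCodec, DelGen, DeletionCount, FieldInfosGen, UpdatesFiles> record.
    static final class SegmentEntry {
        final String segName;
        final String segCodec;
        final long delGen;
        final int deletionCount;
        final long fieldInfosGen;
        final Map<Long, Set<String>> updatesFiles;

        SegmentEntry(String segName, String segCodec, long delGen, int deletionCount,
                     long fieldInfosGen, Map<Long, Set<String>> updatesFiles) {
            this.segName = segName;
            this.segCodec = segCodec;
            this.delGen = delGen;
            this.deletionCount = deletionCount;
            this.fieldInfosGen = fieldInfosGen;
            this.updatesFiles = updatesFiles;
        }

        boolean hasDeletions() {
            return delGen != -1;
        }

        boolean hasFieldUpdates() {
            return fieldInfosGen != -1;
        }
    }

    long version;                                                   // Version
    int nameCounter;                                                // NameCounter
    final List<SegmentEntry> segments = new ArrayList<>();          // SegCount = segments.size()
    final Map<String, String> commitUserData = new HashMap<>();     // CommitUserData

    // Builds the next segment name and advances the counter: _0, _1, ..., _z, _10, ...
    String newSegmentName() {
        return "_" + Integer.toString(nameCounter++, Character.MAX_RADIX);
    }
}
```

With nameCounter at 35, two consecutive calls to newSegmentName() give _z and then _10, which is why segment names in a long-lived index contain letters.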
1.3 read()
The read() function can be checked line by line against the segments_N format described above.

public final void read(Directory directory, String segmentFileName) throws IOException {
    boolean success = false;

    // Clear any previous segments:
    this.clear();
    // Get the current generation, i.e. the N in segments_N
    generation = generationFromSegmentsFileName(segmentFileName);

    lastGeneration = generation;
    // Open the file through a checksumming input
    ChecksumIndexInput input = directory.openChecksumInput(segmentFileName, IOContext.READ);
    try {
      // The Header consists of Magic, CodecName and Version; Magic is normally 1071082519
      final int format = input.readInt();
      final int actualFormat;
      if (format == CodecUtil.CODEC_MAGIC) {
        // 4.0+: check the CodecName and Version of the Header
        actualFormat = CodecUtil.checkHeaderNoMagic(input, "segments", VERSION_40, VERSION_48);
        version = input.readLong();         // Version
        counter = input.readInt();          // NameCounter, used to name the next new segment
        int numSegments = input.readInt();  // SegCount, the number of segments
        if (numSegments < 0) {
          throw new CorruptIndexException("invalid segment count: " + numSegments + " (resource: " + input + ")");
        }
        //Traverse SegCount segment data
        for(int seg=0;seg<numSegments;seg++) {
          String segName = input.readString();                  //SegName
          Codec codec = Codec.forName(input.readString());      //SegCodec
          //System.out.println("SIS.read seg=" + seg + " codec=" + codec);
          SegmentInfo info = codec.segmentInfoFormat().getSegmentInfoReader().read(directory, segName, IOContext.READ);
          info.setCodec(codec);
          long delGen = input.readLong();                       //DelGen
          int delCount = input.readInt();                       //DeletionCount
          if (delCount < 0 || delCount > info.getDocCount()) {
            throw new CorruptIndexException("invalid deletion count: " + delCount + " vs docCount=" + info.getDocCount() + " (resource: " + input + ")");
          }
          long fieldInfosGen = -1;
          if (actualFormat >= VERSION_46) {
            fieldInfosGen = input.readLong();                   //FieldInfosGen
          }
          SegmentCommitInfo siPerCommit = new SegmentCommitInfo(info, delCount, delGen, fieldInfosGen);
          if (actualFormat >= VERSION_46) {
            // UpdatesFiles: read the number of entries first. If it is 0, there are no update files;
            // otherwise read every (generation, file names) pair and store them in SegmentCommitInfo.
            int numGensUpdatesFiles = input.readInt();
            final Map<Long,Set<String>> genUpdatesFiles;
            if (numGensUpdatesFiles == 0) {
              genUpdatesFiles = Collections.emptyMap();
            } else {
              genUpdatesFiles = new HashMap<>(numGensUpdatesFiles);
              for (int i = 0; i < numGensUpdatesFiles; i++) {
                genUpdatesFiles.put(input.readLong(), input.readStringSet());
              }
            }
            siPerCommit.setGenUpdatesFiles(genUpdatesFiles);
          }
          add(siPerCommit);
        }
        userData = input.readStringStringMap();                  //CommitUserData
      } else {
        actualFormat = -1;
        Lucene3xSegmentInfoReader.readLegacyInfos(this, directory, input, format);
        Codec codec = Codec.forName("Lucene3x");
        for (SegmentCommitInfo info : this) {
          info.info.setCodec(codec);
        }
      }
      //Footer
      if (actualFormat >= VERSION_48) {
        CodecUtil.checkFooter(input);
      } else {
        final long checksumNow = input.getChecksum();
        final long checksumThen = input.readLong();
        if (checksumNow != checksumThen) {
          throw new CorruptIndexException("checksum mismatch in segments file (resource: " + input + ")");
        }
        CodecUtil.checkEOF(input);
      }

      success = true;
    } finally {
      if (!success) {
        // Clear any segment infos we had loaded so we
        // have a clean slate on retry:
        this.clear();
        IOUtils.closeWhileHandlingException(input);
      } else {
        input.close();
      }
    }
  }
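The footer branch at the end of read() is easier to follow with a tiny checksum example. This is a sketch based on my understanding of the footer, so treat the layout as an assumption: a footer magic (the bitwise complement of CODEC_MAGIC), an algorithm ID of 0, and a CRC32 of everything written before the checksum. The class FooterSketch and its methods are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

final class FooterSketch {
    static final int FOOTER_MAGIC = ~0x3fd76c17;   // assumed: complement of CODEC_MAGIC
    static final int FOOTER_LENGTH = 16;           // magic (4) + algorithm ID (4) + checksum (8)

    // Appends magic, algorithm ID 0, and a CRC32 over all bytes written before the checksum.
    static byte[] appendFooter(byte[] body) {
        ByteBuffer out = ByteBuffer.allocate(body.length + FOOTER_LENGTH);
        out.put(body).putInt(FOOTER_MAGIC).putInt(0);
        CRC32 crc = new CRC32();
        crc.update(out.array(), 0, out.position());
        out.putLong(crc.getValue());
        return out.array();
    }

    // Recomputes the CRC32 and compares it, together with the magic and the algorithm ID.
    static boolean checkFooter(byte[] file) {
        if (file.length < FOOTER_LENGTH) {
            return false;
        }
        int footerStart = file.length - FOOTER_LENGTH;
        int checksumStart = file.length - Long.BYTES;
        ByteBuffer in = ByteBuffer.wrap(file);
        CRC32 crc = new CRC32();
        crc.update(file, 0, checksumStart);
        return in.getInt(footerStart) == FOOTER_MAGIC
                && in.getInt(footerStart + 4) == 0
                && in.getLong(checksumStart) == crc.getValue();
    }
}
```

Flipping any single byte of the body makes checkFooter return false, which corresponds to the "checksum mismatch" path in read().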


Comparing read() and write(), we can basically see that write is the reverse process of read.
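The symmetry can be seen in a reduced round trip. The sketch below is not Lucene code: SegmentsRoundTrip and SegmentRecord are hypothetical, strings are written with writeUTF rather than Lucene's own string encoding, and the header, UpdatesFiles, CommitUserData and footer are left out to keep it short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class SegmentsRoundTrip {
    // Hypothetical stand-in for one per-segment record.
    static final class SegmentRecord {
        final String name;
        final String codec;
        final long delGen;
        final int delCount;
        final long fieldInfosGen;

        SegmentRecord(String name, String codec, long delGen, int delCount, long fieldInfosGen) {
            this.name = name;
            this.codec = codec;
            this.delGen = delGen;
            this.delCount = delCount;
            this.fieldInfosGen = fieldInfosGen;
        }
    }

    // Writes Version, NameCounter, SegCount, then each record field by field.
    static byte[] write(long version, int counter, List<SegmentRecord> records) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(version);
            out.writeInt(counter);
            out.writeInt(records.size());
            for (SegmentRecord r : records) {
                out.writeUTF(r.name);
                out.writeUTF(r.codec);
                out.writeLong(r.delGen);
                out.writeInt(r.delCount);
                out.writeLong(r.fieldInfosGen);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the same fields back in exactly the same order.
    static List<SegmentRecord> read(byte[] data) {
        List<SegmentRecord> records = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            in.readLong();                       // Version
            in.readInt();                        // NameCounter
            int numSegments = in.readInt();      // SegCount
            if (numSegments < 0) {
                throw new IOException("invalid segment count: " + numSegments);
            }
            for (int seg = 0; seg < numSegments; seg++) {
                String name = in.readUTF();
                String codec = in.readUTF();
                long delGen = in.readLong();
                int delCount = in.readInt();
                long fieldInfosGen = in.readLong();
                records.add(new SegmentRecord(name, codec, delGen, delCount, fieldInfosGen));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }
}
```

Because the format has no field tags, the order of the write calls is the format. Any field added on the write side has to be read at the same position, which is what the actualFormat >= VERSION_46 checks in read() take care of.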

The main guarantee of the read process is that the segment information we read is up to date. Reading is therefore a process of repeatedly trying to load the latest SegmentInfos. If an IOException occurs, it may mean that a commit is in progress and that the segment information obtained at that moment is not the latest. Lucene provides three methods for trying to obtain the latest segment information:

1. The first is to obtain the maximum gen (generation) as described above. After two attempts, if the maximum gen is greater than lastGen, the segment information has been updated; otherwise it has not been updated or this method is not applicable, so switch to the second method.

2. If the first method fails, advance directly with gen++, that is, try to parse the segments_N file of the next generation.

3. If that parsing fails, step back with gen-- and try to parse the segments_N file of that generation, which means the segment information has not been updated.
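The fallback order can be sketched as a small retry loop. This is only my simplified reading of the three methods, not the actual Lucene logic: the class GenerationRetrySketch and the SegmentsLoader interface are hypothetical, each generation is tried once, and the order (listed generation, next generation, previous generation) is an assumption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

final class GenerationRetrySketch {
    // Hypothetical callback that parses segments_N for the given generation.
    interface SegmentsLoader {
        String load(long generation) throws IOException;
    }

    // Tries the listed generation first, then gen + 1, then gen - 1.
    static String loadLatest(long listedGen, SegmentsLoader loader) {
        long[] candidates = {listedGen, listedGen + 1, listedGen - 1};
        IOException last = null;
        for (long candidate : candidates) {
            if (candidate < 0) {
                continue; // there is no such generation
            }
            try {
                return loader.load(candidate);
            } catch (IOException e) {
                last = e; // possibly a commit in progress; fall through to the next candidate
            }
        }
        if (last == null) {
            last = new IOException("no candidate generation to try");
        }
        throw new UncheckedIOException("no readable segments_N file", last);
    }
}
```

If the listing reported generation 4 but a commit has just replaced it with generation 5, the first load fails, the second succeeds, and the caller gets the newer commit.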

1.4 write()
The process of write is opposite to that of read(). Here we mainly want to know about SegmentCommitInfo and genUpdatesFiles.

First, let's look at how write() is called. The commit operation is divided into two parts: prepareCommit and finishCommit. prepareCommit calls write() to write the new segments_N file. The actual commit is then completed in finishCommit; if it fails, the commit is rolled back. After the commit succeeds, the gen information is written to segments.gen.

final void prepareCommit(Directory dir) throws IOException {
    if (pendingSegnOutput != null) {
      throw new IllegalStateException("prepareCommit was already called");
    }
    write(dir);
  }

for (SegmentCommitInfo siPerCommit : this) {
        SegmentInfo si = siPerCommit.info;
        segnOutput.writeString(si.name);
        segnOutput.writeString(si.getCodec().getName());
        segnOutput.writeLong(siPerCommit.getDelGen());
        int delCount = siPerCommit.getDelCount();
        if (delCount < 0 || delCount > si.getDocCount()) {
          throw new IllegalStateException("cannot write segment: invalid docCount segment=" + si.name + " docCount=" + si.getDocCount() + " delCount=" + delCount);
        }
        segnOutput.writeInt(delCount);
        segnOutput.writeLong(siPerCommit.getFieldInfosGen());
        final Map<Long,Set<String>> genUpdatesFiles = siPerCommit.getUpdatesFiles();
        segnOutput.writeInt(genUpdatesFiles.size());
        for (Entry<Long,Set<String>> e : genUpdatesFiles.entrySet()) {
          segnOutput.writeLong(e.getKey());
          segnOutput.writeStringSet(e.getValue());
        }
...
}

finishCommit mainly performs three operations:

1. Add the footer to segments_N. If this fails, roll back.

2. Call directory.sync() to flush what has been written to disk. If this fails, roll back.

3. Generate segments.gen.

Between prepareCommit and finishCommit there is an IndexOutput stream, pendingSegnOutput, which is responsible for writing the state of SegmentInfos for the current commit into segments_N. It exists only between prepareCommit and finishCommit and is null at all other times.
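The two-phase shape and the role of the pending stream can be modelled in memory. The sketch below makes several assumptions that differ from Lucene: the class TwoPhaseCommitSketch is hypothetical, a HashMap stands in for the Directory (so there is no real sync), the footer is reduced to a bare CRC32, and the generation is advanced in finishCommit rather than when the file is first written.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.CRC32;

final class TwoPhaseCommitSketch {
    final Map<String, byte[]> directory = new HashMap<>(); // stand-in for a Directory
    private ByteArrayOutputStream pending;                 // non-null only between prepare and finish
    private long generation = 0;

    // Phase 1: write the body of the new segments_N, but do not publish it yet.
    void prepareCommit(byte[] body) {
        if (pending != null) {
            throw new IllegalStateException("prepareCommit was already called");
        }
        pending = new ByteArrayOutputStream();
        pending.write(body, 0, body.length);
    }

    // Phase 2: append the checksum, publish segments_N, then record the generation.
    void finishCommit() {
        if (pending == null) {
            throw new IllegalStateException("prepareCommit was not called");
        }
        try {
            byte[] body = pending.toByteArray();
            CRC32 crc = new CRC32();
            crc.update(body, 0, body.length);
            byte[] file = ByteBuffer.allocate(body.length + Long.BYTES)
                    .put(body).putLong(crc.getValue()).array();
            long next = generation + 1;
            // Publishing into the map stands in for closing the file and syncing it to disk.
            directory.put("segments_" + Long.toString(next, Character.MAX_RADIX), file);
            directory.put("segments.gen",
                    ByteBuffer.allocate(2 * Long.BYTES).putLong(next).putLong(next).array());
            generation = next;
        } finally {
            pending = null; // the pending output never outlives the commit attempt
        }
    }

    void rollbackCommit() {
        pending = null;
    }

    boolean isPending() {
        return pending != null;
    }

    long generation() {
        return generation;
    }
}
```

Calling prepareCommit twice without finishing throws the same "prepareCommit was already called" error as the original, and after finishCommit the map contains both segments_1 and segments.gen.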

So far we have a preliminary understanding of the SegmentInfos.java class, as well as of the structure of segments_N and segments.gen. One question remains for me: each segment in the segments_N structure has an UpdatesFiles entry. What is it used for? I will leave that for later study.

Topics: solr search engine lucene
