C++ Project: Boost website site search

Posted by lajollabob on Sat, 26 Feb 2022 01:29:25 +0100

I Project requirements

The Boost website has no built-in search or navigation function. This project provides a site search over the documentation of the Boost website.

II Forward index and inverted index

A forward index is similar to the table of contents of a book: given a page number (here, a document id), we can find the corresponding content.
An inverted index is the opposite concept: given a word from the content, we can find the documents in which that word appears.
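The two directions can be illustrated with a toy example. This is only a sketch: it splits documents on whitespace and stores bare document ids, and the names ForwardIndex, InvertedIndex, buildToyInverted and forwardLookup are made up for illustration and are not part of the project.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Forward index: document id -> document text (the id is the vector subscript).
using ForwardIndex = std::vector<std::string>;
// Inverted index: word -> ids of the documents that contain the word.
using InvertedIndex = std::unordered_map<std::string, std::vector<std::size_t>>;

// Forward lookup: from a document id to its content.
const std::string& forwardLookup(const ForwardIndex& docs, std::size_t docId) {
    assert(docId < docs.size());  // ids are valid subscripts by construction
    return docs[docId];
}

// Build a toy inverted index by splitting every document on whitespace.
InvertedIndex buildToyInverted(const ForwardIndex& docs) {
    InvertedIndex inverted;
    for (std::size_t docId = 0; docId < docs.size(); ++docId) {
        std::istringstream in(docs[docId]);
        std::string word;
        while (in >> word) {
            std::vector<std::size_t>& ids = inverted[word];
            // Record each document at most once per word.
            if (ids.empty() || ids.back() != docId) {
                ids.push_back(docId);
            }
        }
    }
    return inverted;
}
```

With the documents "boost asio tutorial", "boost filesystem path" and "asio timer", looking up "boost" in the inverted index gives documents 0 and 1, while forwardLookup(docs, 2) gives back "asio timer".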

III Project module

1. Preprocessing module

Convert all offline HTML documents into a single text file with one line per document.
Specific process:
Use the filesystem library provided by Boost to enumerate the paths of all HTML documents so that they can be opened later;
Read each document, parse out its title, url and body, organize them into one line of data, and write the lines for all documents into one output file for subsequent processing.
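The per-document step could look roughly like the sketch below. It is an assumption about how the parsing might be done, not the project's code: extractTitle, stripTags and makeLine are hypothetical helpers, the tag stripping is deliberately naive, and the '\3' separator is taken from the index module described later.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: return the text between <title> and </title>,
// or an empty string if the document has no title element.
std::string extractTitle(const std::string& html) {
    const std::string open = "<title>";
    const std::string close = "</title>";
    std::size_t begin = html.find(open);
    if (begin == std::string::npos) return "";
    begin += open.size();
    std::size_t end = html.find(close, begin);
    if (end == std::string::npos) return "";
    return html.substr(begin, end - begin);
}

// Hypothetical helper: drop everything between '<' and '>' and turn line
// breaks into spaces, so that the body fits on a single line.
std::string stripTags(const std::string& html) {
    std::string text;
    bool inTag = false;
    for (char c : html) {
        if (c == '<') {
            inTag = true;
        } else if (c == '>') {
            inTag = false;
        } else if (!inTag) {
            text.push_back((c == '\n' || c == '\r') ? ' ' : c);
        }
    }
    return text;
}

// Join the three fields with the non-printing separator '\3'.
// The caller appends '\n' when writing the record to the output file.
std::string makeLine(const std::string& title, const std::string& url,
                     const std::string& content) {
    const char kSep = '\3';
    // A field containing the separator would break the later split into three.
    assert(title.find(kSep) == std::string::npos);
    assert(url.find(kSep) == std::string::npos);
    assert(content.find(kSep) == std::string::npos);
    return title + kSep + url + kSep + content;
}
```

A non-printing separator is a reasonable choice here because it is very unlikely to occur in the text of an HTML page, unlike a comma or a tab.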

2. Index module

Parse the line file produced by the preprocessing module, build the forward index and the inverted index, and provide forward lookup and inverted lookup interfaces for external use.
Specific process:
1. Design the data structures for the forward index and the inverted index.
The forward index is stored in an array whose subscript is the document id. Each element is a DocInfo structure holding the document information: document id, title, online url and content.
The inverted index is stored in a hash table. The key is a word that appears in the documents, and the value is the set of documents in which that word appears. This set is called the inverted list (posting list) and is represented by an array of Weight structures. Each Weight contains the document id, the weight of the word in that document, and a copy of the word.

	//document information
	struct DocInfo{
		int64_t docId;//Document id
		string title;//title
		string url;//url
		string content;//content
	};

	//One entry of an inverted list: a document in which the word appears
	struct Weight{
		int64_t docId;//Document id
		int weight;//weight
		string word;//copy of the word, kept for later use
	};

	//Inverted list (posting list)
	typedef vector<Weight> InvertedList;

	vector<DocInfo> forwardIndex;//Forward index: document id -> document information
	unordered_map<string, InvertedList> invertedIndex;//Inverted index: word -> documents in which it appears

2. Read the preprocessed line file line by line, parse each line, and build the forward index and the inverted index.

  • Building forward index
    Each line is split on the non-printing character '\3' to obtain the title, url and body of the document. These are organized into a DocInfo object, which is inserted into the forward index array, so that the array subscript maps to the DocInfo information.
//Create forward index
DocInfo* Index::buildForward(const string& line){
	//1. Split the line on \3
	vector<string> tokens;
	common::Util::split(line,"\3",&tokens);
	if(tokens.size() != 3){
		//Malformed line: title, url and content are expected
		return nullptr;
	}
	
	//2. Fill a DocInfo object with the split result
	DocInfo docInfo;
	docInfo.docId=forwardIndex.size();
	docInfo.title=tokens[0];
	docInfo.url=tokens[1];
	docInfo.content=tokens[2];
	//Move docInfo into the array instead of copying it (C++11 move semantics)
	forwardIndex.push_back(std::move(docInfo));
	//3. Return a pointer to the stored DocInfo object for the construction of the inverted index
	//&docInfo must not be returned: the local object is destroyed when the function returns, so the pointer would dangle
	//The returned pointer is only valid until a later push_back reallocates the vector
	return &forwardIndex.back();
}
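The helper common::Util::split is not shown in this post. A minimal stand-in, assuming it simply cuts the line at every separator and keeps empty fields, might look like this; splitLine is a hypothetical name, and it takes a single character rather than a string as the separator.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for common::Util::split: cut `line` at every `sep`
// and append the pieces to *out. Empty fields are kept, so a well-formed
// record always yields exactly three tokens.
void splitLine(const std::string& line, char sep, std::vector<std::string>* out) {
    assert(out != nullptr);
    std::size_t begin = 0;
    while (true) {
        std::size_t end = line.find(sep, begin);
        if (end == std::string::npos) {
            // Last field: everything after the final separator.
            out->push_back(line.substr(begin));
            break;
        }
        out->push_back(line.substr(begin, end - begin));
        begin = end + 1;
    }
}
```

Keeping empty fields matters for the tokens.size() != 3 check: a document with an empty title should still produce three tokens, while a line with no separators produces one and is rejected.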

  • Build inverted index
    Segment the title of the document and use a hash table to count how many times each resulting word appears in the title;
    Segment the body of the document and use the same hash table to count how many times each resulting word appears in the body;
    The frequency information for each word is stored in a structure with two members: the number of occurrences in the title and the number of occurrences in the body.
    Traverse the hash table of frequency information and, for each word, compute its weight as (occurrences in the title) * 10 + (occurrences in the body) and fill in a Weight structure. Then take a reference to the value stored under that word in the inverted index, that is, its inverted list, and append the Weight structure to it. Once the hash table has been traversed, the inverted index entries for this document are complete.
//Build inverted index
void Index::buildInverted(const DocInfo& docInfo){
	//Create a structure for counting word frequency
	struct WordCnt{
		int titleCnt;  //Number of occurrences in the title
		int contentCnt;//Number of occurrences in text
		WordCnt() : titleCnt(0), contentCnt(0){}
	};
	//Use hash table for word frequency statistics
	unordered_map<string,WordCnt> wordCntMap;

	//1. Word segmentation for document title
	vector<string> titleTokens;
	cutWord(docInfo.title,&titleTokens);
	//2. Count the number of times each word appears in the title according to the word segmentation results
	for(string word:titleTokens){
		//Case insensitive, all converted to lowercase
		boost::to_lower(word);
		++wordCntMap[word].titleCnt;
	}
	//3. Word segmentation for document body
	vector<string> contentTokens;
	cutWord(docInfo.content,&contentTokens);
	//4. Count the number of occurrences of each word in the text according to the word segmentation results
	for(string word:contentTokens){
		//Case insensitive, all converted to lowercase
		boost::to_lower(word);
		++wordCntMap[word].contentCnt;
	}
	//5. Traverse the statistical results and build the inverted index
	//(key is the word and value is its frequency information)
	//Each element is a pair; bind it by const reference to avoid copying
	for(const auto& wordPair : wordCntMap){
		Weight weight;
		weight.docId=docInfo.docId;
		//Weight algorithm: the number of occurrences in the title * 10 + the number of occurrences in the text
		weight.weight=wordPair.second.titleCnt * 10 + wordPair.second.contentCnt;
		//Store a copy of this word in the weight object for later use
		weight.word=wordPair.first;
		//Update inverted index
		//operator[] returns a reference to the inverted list of the current word;
		//if the word is not in the index yet, an empty list is created first
		InvertedList& invertedList = invertedIndex[wordPair.first];
		//Append the weight object to the inverted list
		invertedList.push_back(weight);
	}
}
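As a worked example of the weight formula, the sketch below computes the weights for a single document in isolation. It is a simplification under stated assumptions: words are split on whitespace instead of going through cutWord, lowercasing uses std::tolower rather than boost::to_lower, and computeWeights and toLower are hypothetical names.

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>
#include <unordered_map>

// Per-word frequency information, as in buildInverted.
struct WordCnt {
    int titleCnt = 0;    // occurrences in the title
    int contentCnt = 0;  // occurrences in the body
};

// Lowercase an ASCII string so that counting is case-insensitive.
std::string toLower(std::string s) {
    for (char& c : s) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return s;
}

// Return word -> weight for one document, using weight = titleCnt * 10 + contentCnt.
std::unordered_map<std::string, int> computeWeights(const std::string& title,
                                                    const std::string& content) {
    std::unordered_map<std::string, WordCnt> counts;
    std::string word;
    std::istringstream titleIn(title);
    while (titleIn >> word) ++counts[toLower(word)].titleCnt;
    std::istringstream contentIn(content);
    while (contentIn >> word) ++counts[toLower(word)].contentCnt;

    std::unordered_map<std::string, int> weights;
    for (const auto& entry : counts) {
        int weight = entry.second.titleCnt * 10 + entry.second.contentCnt;
        assert(weight > 0);  // every counted word occurs at least once
        weights[entry.first] = weight;
    }
    return weights;
}
```

For the title "Boost Asio" and the body "asio is part of boost asio", the word "asio" gets 1 * 10 + 2 = 12 and "boost" gets 1 * 10 + 1 = 11, while words that appear only once in the body get 1. The factor of 10 makes a title match count far more than a body match.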

3. Implement the forward lookup and inverted lookup interfaces
A forward lookup accesses the vector by subscript, since the document id is the subscript of the vector;
An inverted lookup uses the word as the key of the hash table and returns the corresponding value, that is, the inverted list for that word.
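The two interfaces are not shown in this post. Judging from how the search module calls them (getDocInfo and getInverted both return pointers that may be nullptr), they could be sketched as follows. MiniIndex, addDoc and addPosting are hypothetical names used only to make the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct DocInfo {
    std::int64_t docId;
    std::string title;
    std::string url;
    std::string content;
};

struct Weight {
    std::int64_t docId;
    int weight;
    std::string word;
};

typedef std::vector<Weight> InvertedList;

class MiniIndex {
public:
    // Forward lookup: the document id is the subscript; nullptr if out of range.
    const DocInfo* getDocInfo(std::int64_t docId) const {
        if (docId < 0 || static_cast<std::size_t>(docId) >= forwardIndex.size()) {
            return nullptr;
        }
        return &forwardIndex[static_cast<std::size_t>(docId)];
    }

    // Inverted lookup: find() is used instead of operator[] so that a miss
    // does not insert an empty list into the index.
    const InvertedList* getInverted(const std::string& word) const {
        auto it = invertedIndex.find(word);
        if (it == invertedIndex.end()) {
            return nullptr;
        }
        return &it->second;
    }

    // Hypothetical helpers for filling the index in this sketch.
    void addDoc(const DocInfo& doc) {
        // Invariant from the text: the document id equals the array subscript.
        assert(doc.docId == static_cast<std::int64_t>(forwardIndex.size()));
        forwardIndex.push_back(doc);
    }
    void addPosting(const Weight& w) { invertedIndex[w.word].push_back(w); }

private:
    std::vector<DocInfo> forwardIndex;
    std::unordered_map<std::string, InvertedList> invertedIndex;
};
```

Returning a pointer rather than a reference lets the caller distinguish "not found" from an empty result, which is what the nullptr check in the search module relies on.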

3. Search module

Query the inverted index and the forward index according to the query string and organize the query result.
The specific process is:
Segment the query string into words;
Look up each word in the inverted index and merge all the inverted lists into one large array, which holds every document in which a query word appears;
Sort the elements of this array in descending order of weight;
Do a forward lookup for each element of the sorted array and organize each hit into a JSON object with three members: title, url and desc (a summary of the body). Append the objects for all elements to one JSON array and serialize it.

	//Process search
	bool Searcher::search(const std::string& query, std::string* output){
		//1. Word segmentation: segment query words
		vector<string> tokens;
		index->cutWord(query,&tokens);
		//2. Retrieve: look up each word in the inverted index to find the relevant document ids
		vector<Weight> allTokens;//Store all the information queried
		for(string word : tokens){
			//Ignore case before inverting
			boost::to_lower(word);
			const auto* invertedList = index->getInverted(word);
			if(invertedList == nullptr){
				//The word was not found
				continue;
			}
			//Word found: merge its inverted list into the large array
			//A query may be segmented into several words, so the inverted list of each word is appended to the same array
			//Note: a document that matches several query words appears once per matching word
			allTokens.insert(allTokens.end(),invertedList->begin(),invertedList->end());
		}	
		//3. Sort: order the results by weight,
		//in descending order
		std::sort(allTokens.begin(),allTokens.end(),
			[](const Weight& w1, const Weight& w2){
				return w1.weight > w2.weight;
			});
		//4. Build the result: do a forward lookup for each hit and construct the data in JSON format
		//Json::Value can act as an array or as an object (map)
		Json::Value results;
		for(const auto& weight : allTokens){
			//Forward lookup by the docId stored in weight
			const auto* docInfo = index->getDocInfo(weight.docId);
			if(docInfo == nullptr){
				//Should not happen for ids taken from the index; skip defensively
				continue;
			}
			//Fill one JSON object per hit
			Json::Value result;
			result["title"]=docInfo->title;
			result["url"]=docInfo->url;
			result["desc"]=generateDesc(docInfo->content,weight.word);//Get body summary
			results.append(result);
		}

		//Serialize the Json::Value object into a string and write it into the string output
		Json::FastWriter writer;
		*output=writer.write(results);
		return true;
	}
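The function generateDesc is called above but not shown in this post. One plausible way to produce the summary, assuming it is a window of body text around the first occurrence of the word, is sketched below. The name generateDescSketch and the window sizes of 60 and 160 bytes are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Sketch of a summary generator: return a window of `content` around the
// first occurrence of `word` (expected in lowercase, as stored in the index).
std::string generateDescSketch(const std::string& content, const std::string& word) {
    const std::size_t kBefore = 60;   // bytes kept before the word (assumed)
    const std::size_t kAfter = 160;   // bytes kept from the word onward (assumed)

    // Search case-insensitively; byte-wise lowercasing keeps offsets identical.
    std::string lowered = content;
    for (char& c : lowered) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    std::size_t pos = word.empty() ? std::string::npos : lowered.find(word);

    if (pos == std::string::npos) {
        // The word may occur only in the title: fall back to the start of the body.
        if (content.size() <= kAfter) return content;
        return content.substr(0, kAfter) + "...";
    }

    std::size_t begin = pos > kBefore ? pos - kBefore : 0;
    std::size_t end = std::min(content.size(), pos + kAfter);
    assert(begin <= pos && pos < end);

    std::string desc = content.substr(begin, end - begin);
    if (begin > 0) desc = "..." + desc;           // text was cut on the left
    if (end < content.size()) desc += "...";      // text was cut on the right
    return desc;
}
```

The window is measured in bytes, so for multi-byte encodings such as UTF-8 a real implementation would also need to avoid cutting in the middle of a character.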

4. Server module

Build an HTTP server that handles query requests from the browser, calls the search module to get the query result, and organizes the result into a static web page. The page displays the title of each matching page and a summary of its content, and the user can click a result to jump to the corresponding page.

IV Technical points

Forward index and inverted index;
C++11 move semantics;
IO multiplexing;
Thread pool;

V Project source code

https://gitee.com/xigongxiaoche/project/tree/master/boostSearch

Topics: C++ server