Multi-keyword highlighting and positioning when previewing PDF files with kkFileView

Posted by minc on Sun, 16 Jan 2022 16:10:03 +0100

Recently I have been working on a file search feature (using ElasticSearch for search). After a document is found from the front end, I need to preview it in the browser and highlight and locate the search keywords. kkFileView, the open source project I chose for document preview, previews documents well, but it cannot highlight keywords in the preview. With guidance from the experts in the kkFileView technical exchange group, I learned that pdf.js needs to be modified. The specific steps are below. [Note: only highlighting and positioning for kkFileView's PDF preview are implemented here]

Keywords: ElasticSearch; kkFileView; pdf.js; multi-keyword highlighting; keyword positioning

Reference blog: https://blog.csdn.net/a973685825/article/details/81285819

GitHub address of kkfileview: https://github.com/kekingcn/kkFileView

GitHub address of pdf.js: https://github.com/mozilla/pdf.js

Contents

I. Obtain the highlighted keywords

II. kkfileview passes the keywords

III. pdf.js gets the keywords

IV. pdf.js highlights the keywords

V. Packaging pdf.js

Results

Epilogue

I. Obtain the highlighted keywords

First of all, we need the keywords to highlight. For our search, I use the _analyze API to obtain the word segmentation result of the query statement and use those tokens as the highlighted keywords. There is plenty of material about ElasticSearch online, so I won't repeat it here. Join the obtained keywords into a string separated by spaces (you can use another separator, but then keep it in mind when converting the string back into an array). Name the string keyString and pass it as the keyword parameter when calling kkfileview (see its official website for how to call kkfileview).

// keyString is a string of keywords separated by spaces. Example: "knowledge map Overview"
// Compared with before, the keyword parameter is appended at the end
'http://127.0.0.1:8012/onlinePreview?url=' + encodeURIComponent(Base64.encode(url)) + '&keyword=' + encodeURIComponent(keyString);
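The step above can be sketched as follows. This is a minimal sketch, not the code used in my project: the helper names buildKeyString and buildPreviewUrl are made up for illustration, the response shape { tokens: [{ token }] } is what ElasticSearch's _analyze API returns, and Buffer stands in for Base64.encode so that the snippet also runs under Node. It assumes the keywords travel in a query parameter named keyword, which is the name the controller in part II reads.

```javascript
// Hypothetical helper: join the tokens of an _analyze response into keyString.
// Duplicate tokens are dropped so each keyword is sent only once.
function buildKeyString(analyzeResponse, separator = " ") {
  const tokens = (analyzeResponse && analyzeResponse.tokens) || [];
  const seen = new Set();
  const words = [];
  for (const { token } of tokens) {
    if (token && !seen.has(token)) {
      seen.add(token);
      words.push(token);
    }
  }
  return words.join(separator);
}

// Hypothetical helper: build the kkfileview preview URL.
// Buffer is Node-specific; in the browser the article uses Base64.encode(url).
function buildPreviewUrl(previewBase, fileUrl, keyString) {
  const encodedUrl = Buffer.from(fileUrl, "utf8").toString("base64");
  return (
    previewBase +
    "?url=" + encodeURIComponent(encodedUrl) +
    "&keyword=" + encodeURIComponent(keyString)
  );
}

const keyString = buildKeyString({
  tokens: [{ token: "knowledge" }, { token: "map" }, { token: "knowledge" }],
});
const previewUrl = buildPreviewUrl(
  "http://127.0.0.1:8012/onlinePreview",
  "http://example.com/file.pdf",
  keyString
);
console.log(previewUrl);
```

Encoding keyString with encodeURIComponent keeps spaces and characters such as & from breaking the query string.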

II. kkfileview passes the keywords

kkfileview mainly acts as the link between the two ends: it receives the keyString obtained from ES and then passes it on to pdf.js. Let's see how it receives it. The project has a lot of files, but rest assured, we only need to modify two of them: server/src/main/resources/web/pdf.ftl and server/src/main/java/cn/keking/web/controller/OnlinePreviewController.java.

kkfileview reads the parameters in the URL through the onlinePreview function in OnlinePreviewController.java, so we need to modify this function to receive the keywords. Two modifications are required: 1. add a keyword parameter to the function; 2. add the keyword attribute to the model. For details, see the comments in the code block below.

    @RequestMapping(value = "/onlinePreview")
    // A keyword parameter has been added to this function's parameters
    public String onlinePreview(String url, String keyword, Model model, HttpServletRequest req) {
        String fileUrl;
        try {
            fileUrl = new String(Base64.decodeBase64(url), StandardCharsets.UTF_8);
        } catch (Exception ex) {
            String errorMsg = String.format(BASE64_DECODE_ERROR_MSG, "url");
            return otherFilePreview.notSupportedFile(model, errorMsg);
        }
        FileAttribute fileAttribute = fileHandlerService.getFileAttribute(fileUrl, req);
        model.addAttribute("file", fileAttribute);
        model.addAttribute("keyword", keyword);  // Added this line
        FilePreview filePreview = previewFactory.get(fileAttribute);
        logger.info("Preview Area  url: {},previewType: {}", fileUrl, fileAttribute.getType());
        return filePreview.filePreviewHandle(fileUrl, model, fileAttribute);
    }

After the modification above, kkfileview has the highlighted keywords. Now it needs to pass them to pdf.js, which means modifying the script part of the pdf.ftl file: first get the keyword, then pass it to pdf.js through the URL. Notice the pdfExt here. This is a new folder I created, which is discussed in part V.

    var url = '${finalUrl}';
    var baseUrl = '${baseUrl}'.endsWith('/') ? '${baseUrl}' : '${baseUrl}' + '/';
    if (!url.startsWith(baseUrl)) {
        url = baseUrl + 'getCorsFile?urlPath=' + encodeURIComponent(url);
    }
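    // First, get the keyword from our model.
    // The default value ('' when no keyword was passed) and the js_string escaping are
    // FreeMarker features; without a default, a missing keyword makes the template fail.
    var keyword = '${(keyword!"")?js_string}';

    // Then append the keyword parameter at the end, URL-encoded so that spaces and
    // special characters survive (pdf.js decodes it with decodeURIComponent).
    // The kkfileview source also has a disabledownload parameter here, which I deleted.
    // Note the pdfExt here. This is a new folder I created, which is covered in part V.
    document.getElementsByTagName('iframe')[0].src = "${baseUrl}pdfExt/web/viewer.html?file=" + encodeURIComponent(url) + "&keyword=" + encodeURIComponent(keyword);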

    document.getElementsByTagName('iframe')[0].height = document.documentElement.clientHeight - 10;
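The URL construction in the script above can be isolated into a small function for experimenting outside the template. This is only a sketch: buildViewerSrc is a hypothetical name, and the values that FreeMarker would normally fill in (baseUrl, the file URL and the keyword) are passed as plain arguments. It encodes the keyword with encodeURIComponent, which pairs with the decodeURIComponent call that pdf.js applies in part III.

```javascript
// Hypothetical helper: builds the iframe src that points at the modified viewer.
function buildViewerSrc(baseUrl, fileUrl, keyword) {
  // Make sure the base ends with exactly one slash before appending the path.
  const base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
  return (
    base + "pdfExt/web/viewer.html" +
    "?file=" + encodeURIComponent(fileUrl) +
    "&keyword=" + encodeURIComponent(keyword || "")
  );
}

const src = buildViewerSrc(
  "http://127.0.0.1:8012",
  "http://example.com/a&b.pdf",
  "knowledge map"
);
console.log(src);
```

Because both values are encoded, an & inside the file URL or the keyword cannot be mistaken for a parameter separator.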

      

III. pdf.js gets the keywords

pdf.js can already get the file address parameter. We first find where that code is written, and then add code to get the keyword parameter. In pdf.js, the following code in web/viewer.js gets the address of the file.

  (function rewriteUrlClosure() {
    // Run this code outside DOMContentLoaded to make sure that the URL
    // is rewritten as soon as possible.
    const queryString = document.location.search.slice(1);
    const m = /(^|&)file=([^&]*)/.exec(queryString);
    defaultUrl = m ? decodeURIComponent(m[2]) : "";

    // Example: chrome-extension://.../http://example.com/file.pdf
    const humanReadableUrl = "/" + defaultUrl + location.hash;
    history.replaceState(history.state, "", humanReadableUrl);
    if (top === window) {
      // eslint-disable-next-line no-undef
      chrome.runtime.sendMessage("showPageAction");
    }
  })();

Then we just need to follow this example to obtain the keyword parameter. I put the code to obtain the keyword parameter in the getViewerConfiguration function of this file.

function getViewerConfiguration() {
  
  // Added this part of the code
  const queryString = document.location.search.slice(1);
  const m = /(^|&)keyword=([^&]*)/.exec(queryString);
  const keyword = m ? decodeURIComponent(m[2]) : "";
  console.log("keyword", keyword);

  let errorWrapper = null;
  if (typeof PDFJSDev === "undefined" || !PDFJSDev.test("MOZCENTRAL")){
    errorWrapper = {
      container: document.getElementById("errorWrapper"),
      errorMessage: document.getElementById("errorMessage"),
      closeButton: document.getElementById("errorClose"),
      errorMoreInfo: document.getElementById("errorMoreInfo"),
      moreInfoButton: document.getElementById("errorShowMore"),
      lessInfoButton: document.getElementById("errorShowLess"),
    };
  }

  // There are too many contents, which are omitted here
}
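As an alternative to the regular expression, the standard URLSearchParams class can read the same parameter. The sketch below is an assumption on my part, not what pdf.js itself does: getQueryParam is a hypothetical helper, and its search argument stands for the value of document.location.search.

```javascript
// Hypothetical helper: read one parameter from a query string such as
// "?file=a.pdf&keyword=knowledge%20map". Returns "" when it is absent.
function getQueryParam(search, name) {
  const params = new URLSearchParams(search);
  const value = params.get(name);
  return value === null ? "" : value;
}

console.log(getQueryParam("?file=a.pdf&keyword=knowledge%20map", "keyword"));
```

One difference to keep in mind: URLSearchParams decodes a literal + as a space, while decodeURIComponent leaves it alone, so the sender should encode keywords with encodeURIComponent in either case.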

Well, now pdf.js finally has the keyword. Next, let's talk about how to highlight it.

IV. pdf.js highlights the keywords

Now let's look at the pdf.js files. Before reading this part, I strongly recommend reading the reference blog listed at the top of this article. As mentioned there, our way to highlight keywords is to call the search function built into pdf.js and pass in the keywords. Let's see how to do this part.

(1) We need to put the keywords into the input box with id="findInput" in the web/viewer.html file, that is, the element shown below.

          <div id="findbarInputContainer">
            <input id="findInput" class="toolbarField" title="Find" placeholder="Find in document..." tabindex="91" data-l10n-id="find_input">
            <div class="splitToolbarButton">
              <button id="findPrevious" class="toolbarButton findPrevious" title="Find the previous occurrence of the phrase" tabindex="92" data-l10n-id="find_previous">
                <span data-l10n-id="find_previous_label">Previous</span>
              </button>
              <div class="splitToolbarButtonSeparator"></div>
              <button id="findNext" class="toolbarButton findNext" title="Find the next occurrence of the phrase" tabindex="93" data-l10n-id="find_next">
                <span data-l10n-id="find_next_label">Next</span>
              </button>
            </div>
          </div>

Once the id is found, it is easy to modify the value. I choose to modify it in the getViewerConfiguration function, as shown below

function getViewerConfiguration() {

  //Here is the code just added
  const queryString = document.location.search.slice(1);
  const m = /(^|&)keyword=([^&]*)/.exec(queryString);
  const keyword = m ? decodeURIComponent(m[2]) : "";
  console.log("keyword", keyword);

  let errorWrapper = null;
  if (typeof PDFJSDev === "undefined" || !PDFJSDev.test("MOZCENTRAL")) {
    errorWrapper = {
      container: document.getElementById("errorWrapper"),
      errorMessage: document.getElementById("errorMessage"),
      closeButton: document.getElementById("errorClose"),
      errorMoreInfo: document.getElementById("errorMoreInfo"),
      moreInfoButton: document.getElementById("errorShowMore"),
      lessInfoButton: document.getElementById("errorShowLess"),
    };
  }

  // This part of code is added to pass the value of keyword to input
  document.getElementById("findInput").value = keyword;
  // To disable findbar, see the reference blog for specific reasons
  document.getElementById("findbar").style.display = "none";

  return {
      //There are too many contents, which are omitted here
  }
}

(2) Then we need to handle the value passed to findInput, by modifying web/app.js.

  async _initializeViewerComponents() {
    // ellipsis
    
    //Modify the content of this judgment
    if (!this.supportsIntegratedFind) {
      this.findBar = new PDFFindBar(appConfig.findBar, eventBus, this.l10n); // Instantiate PDFFindBar

      // Get the value
      const highLightStr = appConfig.findBar.findField.value;
      // Convert the string into an array (if the keyString separator you pass to kkfileview is not a space, remember to change it here).
      // Empty entries are filtered out so that an empty keyword does not reach the search code.
      const highLightWords = highLightStr.split(" ").filter(word => word.length > 0);
      // Pass the array to our newly added function (covered in the next step)
      wordHighLight(highLightWords);
    }

    // ellipsis
  }
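The string-to-array conversion is easy to get subtly wrong when the keyword string is empty or contains repeated separators, so here is a small sketch of it on its own. The name splitKeywords and the separator parameter are hypothetical; the only assumption is that keywords arrive as one separator-delimited string, as described in part I.

```javascript
// Hypothetical helper: split the keyword string and drop empty entries.
function splitKeywords(keyString, separator = " ") {
  return String(keyString || "")
    .split(separator)
    .map(word => word.trim())
    .filter(word => word.length > 0);
}

console.log(splitKeywords("knowledge  map "));
console.log(splitKeywords(""));
```

Filtering matters because "".split(" ") yields an array containing one empty string rather than an empty array.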

(3) Then, still in the web/app.js file, we add the wordHighLight function mentioned just now. This is where the search execution function is called.

function wordHighLight(highLightWords) {
  // Adapted from the reference blog

  const evt = {
    // source: PDFFindBar, // the PDFFindBar instance; I'm not sure what it is for
    type: "", // Empty by default
    // The viewer jumps to the match location by default, which is exactly what we need
    query: highLightWords, // Keywords to highlight
    phraseSearch: false, // Whole-phrase matching; must be false to match multiple words
    caseSensitive: false, // Defaults to false; case is ignored during search
    highlightAll: true, // Set to true to highlight all keywords
    // findPrevious: true,
  };
  PDFViewerApplication.findController.executeCommand("find" + evt.type, {
    // Search execution function
    query: evt.query,
    phraseSearch: evt.phraseSearch,
    caseSensitive: evt.caseSensitive,
    highlightAll: evt.highlightAll,
    findPrevious: evt.findPrevious,
  });
}
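To try the call without a running viewer, the find controller can be passed in as an argument and replaced by a stub. This is a sketch under assumptions: highlightWords and stubController are hypothetical names, and the only thing taken from pdf.js is the executeCommand(cmd, state) method shown in this article.

```javascript
// Hypothetical variant of wordHighLight that receives the controller explicitly.
function highlightWords(findController, words) {
  if (!Array.isArray(words) || words.length === 0) {
    return false; // nothing to highlight, so no command is sent
  }
  findController.executeCommand("find", {
    query: words, // array of keywords, as in the modified controller
    phraseSearch: false,
    caseSensitive: false,
    highlightAll: true,
    findPrevious: false,
  });
  return true;
}

// Stub that records the commands it receives instead of searching a PDF.
const calls = [];
const stubController = {
  executeCommand(cmd, state) {
    calls.push({ cmd, state });
  },
};

highlightWords(stubController, ["knowledge", "map"]);
console.log(calls[0].cmd, calls[0].state.query);
```

In the viewer itself, PDFViewerApplication.findController would take the place of the stub.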

(4) Now we need to modify the code that executes the search. We need to highlight multiple keywords (after all, a search statement often contains more than one keyword), but the search function of pdf.js can only highlight a single keyword, so we need to modify the web/pdf_find_controller.js file a little.

The call chain is wordHighLight ==> executeCommand ==> _nextMatch ==> _calculateMatch ==> _calculateWordMatch (==> means "calls"). Let's modify _calculateMatch first. See the following code and comments for details.

_calculateMatch(pageIndex) {
    let pageContent = this._pageContents[pageIndex];
    const pageDiffs = this._pageDiffs[pageIndex];
    
    // We will modify the _query getter later.
    // Note that query used to be a string, but now it is an array
    const query = this._query;
    const { caseSensitive, entireWord, phraseSearch } = this._state;

    if (query.length === 0) {
      // Do nothing: the matches should be wiped out already.
      return;
    }

    if (!caseSensitive) {
      pageContent = pageContent.toLowerCase();
      // Modified here and added a loop, because the current query is not a string, but an array
      for (let i = 0; i < query.length; i++) {
        query[i] = query[i].toLowerCase();
      }
    }

    if (phraseSearch) {
      this._calculatePhraseMatch(
        query,
        pageIndex,
        pageContent,
        pageDiffs,
        entireWord
      );
    } else {
      this._calculateWordMatch(
        query,
        pageIndex,
        pageContent,
        pageDiffs,
        entireWord
      );
    }

    // When `highlightAll` is set, ensure that the matches on previously
    // rendered (and still active) pages are correctly highlighted.
    if (this._state.highlightAll) {
      this._updatePage(pageIndex);
    }
    if (this._resumePageIdx === pageIndex) {
      this._resumePageIdx = null;
      this._nextPageMatch();
    }

    // Update the match count.
    const pageMatchesCount = this._pageMatches[pageIndex].length;
    if (pageMatchesCount > 0) {
      this._matchesCountTotal += pageMatchesCount;
      this._updateUIResultsCount();
    }
  }
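One detail of the loop above: it lower-cases the query array in place, and that array is the one returned by the _query getter. If that side effect is unwanted, a copy can be lower-cased instead. A minimal sketch, with toLowerCaseAll as a hypothetical helper name:

```javascript
// Hypothetical helper: return a lower-cased copy and leave the input untouched.
function toLowerCaseAll(words) {
  return words.map(word => word.toLowerCase());
}

const original = ["Knowledge", "MAP"];
const lowered = toLowerCaseAll(original);
console.log(lowered, original);
```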

Then we modify the _query getter.

  get _query() {
    const query = this._state.query;
    // The previous query statement was a string, but now it has been changed to an array, so it needs to be processed circularly
    if (typeof query === "object" && query.length !== 0) {
      for (let i = 0; i < query.length; i++) {
        if (query[i] !== this._rawQuery[i]) {
          this._rawQuery[i] = query[i];
          [this._normalizedQuery[i]] = normalize(query[i]);
        }
      }
    } else {
      // This is the original code. This branch is effectively unused now, because query is always an array
      if (query !== this._rawQuery) {
        this._rawQuery = this._state.query;
        [this._normalizedQuery] = normalize(this._state.query);
      }
    }
    return this._normalizedQuery;
  }

Similarly, because this._rawQuery and this._normalizedQuery are now used as arrays, these two variables need to be defined before use, otherwise an error is raised. They can be defined in the executeCommand function.

  executeCommand(cmd, state) {
    if (!state) {
      return;
    }
    const pdfDocument = this._pdfDocument;

    if (this._state === null || this._shouldDirtyMatch(cmd, state)) {
      this._dirtyMatch = true;
    }
    this._state = state;
    if (cmd !== "findhighlightallchange") {
      this._updateUIState(FindState.PENDING);
    }
    
    // Added the following two lines
    this._rawQuery = new Array(this._state.query.length);
    this._normalizedQuery = new Array(this._state.query.length);

    this._firstPageCapability.promise.then(
    // Too much content, omitted here, not pasted
    );
  }

Then we modify the _calculateWordMatch function. The main change is an added outer loop.

  _calculateWordMatch(query, pageIndex, pageContent, pageDiffs, entireWord) {
    const matchesWithLength = [];

    // Divide the query into pieces and search for text in each piece.

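    // Wrapped in an outer loop over the keyword array
    for (let x = 0; x < query.length; x++) {
      // match() returns null when a keyword contains no non-whitespace characters
      const queryArray = query[x].match(/\S+/g) || [];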
      for (let i = 0, len = queryArray.length; i < len; i++) {
        const subquery = queryArray[i];
        const subqueryLen = subquery.length;

        let matchIdx = -subqueryLen;
        while (true) {
          matchIdx = pageContent.indexOf(subquery, matchIdx + subqueryLen);
          if (matchIdx === -1) {
            break;
          }
          if (
            entireWord &&
            !this._isEntireWord(pageContent, matchIdx, subqueryLen)
          ) {
            continue;
          }
          const originalMatchIdx = getOriginalIndex(matchIdx, pageDiffs),
            matchEnd = matchIdx + subqueryLen - 1,
            originalQueryLen =
              getOriginalIndex(matchEnd, pageDiffs) - originalMatchIdx + 1;

          // Other searches do not, so we store the length.
          matchesWithLength.push({
            match: originalMatchIdx,
            matchLength: originalQueryLen,
            skipped: false,
          });
        }
      }
    }

    // Prepare arrays for storing the matches.
    this._pageMatchesLength[pageIndex] = [];
    this._pageMatches[pageIndex] = [];

    // Sort `matchesWithLength`, remove intersecting terms and put the result
    // into the two arrays.
    this._prepareMatches(
      matchesWithLength,
      this._pageMatches[pageIndex],
      this._pageMatchesLength[pageIndex]
    );
  }
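The idea behind the modified function, stripped of the pdf.js internals, can be sketched as a standalone function: collect every occurrence of every keyword, sort the occurrences by position, and drop those that overlap one already kept. This is a simplified illustration, not the pdf.js implementation; findKeywordMatches is a hypothetical name, and the overlap rule (keep the earlier match, prefer the longer one on ties) is my own simplification of what _prepareMatches does.

```javascript
// Hypothetical helper: find all case-insensitive occurrences of several keywords.
// Returns objects shaped like the entries pushed in _calculateWordMatch.
function findKeywordMatches(pageContent, keywords) {
  const text = pageContent.toLowerCase();
  const found = [];
  for (const keyword of keywords) {
    const needle = keyword.toLowerCase();
    if (needle.length === 0) {
      continue; // an empty keyword would match everywhere
    }
    let idx = text.indexOf(needle);
    while (idx !== -1) {
      found.push({ match: idx, matchLength: needle.length });
      idx = text.indexOf(needle, idx + needle.length);
    }
  }
  // Sort by position; on ties the longer match comes first.
  found.sort((a, b) => a.match - b.match || b.matchLength - a.matchLength);
  // Keep a match only if it starts at or after the end of the last kept one.
  const result = [];
  let lastEnd = 0;
  for (const m of found) {
    if (m.match >= lastEnd) {
      result.push(m);
      lastEnd = m.match + m.matchLength;
    }
  }
  return result;
}

const matches = findKeywordMatches("Knowledge map: a map of knowledge", [
  "knowledge",
  "map",
]);
console.log(matches);
```

The first element of the sorted result is also the position the viewer would jump to first, which is why positioning comes for free once highlighting works.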

At this point, all the code modifications are done.

V. Packaging pdf.js

Now we package the modified pdf.js, put it into kkfileview, and let kkfileview call our modified version instead of the default one.

(1) Package pdf.js

The way to package pdf.js is described in its GitHub README. Just run gulp generic in the terminal.

 

Run gulp generic in terminal

Package succeeded

(2) Create a new folder in kkfileview

By default, kkfileview keeps its packaged pdf.js in the folder server/src/main/resources/static/pdfjs. I created a new folder named pdfExt under server/src/main/resources/static to hold the packaged files of the modified pdf.js.

(3) Copy the files

Then it's very simple. Copy the contents of the build/generic directory of the pdf.js project into the pdfExt folder created in kkfileview.

Results

My query statement is "knowledge map". The results returned by _analyze (that is, the query statement after word segmentation and other preprocessing) are "knowledge" and "map". These two words are my highlighted keywords. When you open the preview file, you can see that all occurrences of "knowledge" and "map" are highlighted, and the viewer automatically jumps to the first keyword occurrence when the file opens.

 

Epilogue

All the modifications are complete at this point. You can run it to try the effect. This is my first blog post, so please forgive the imperfections; I hope it helps you. What? You say I only explained how to achieve highlighting, not positioning? Once you run it you will find that the built-in search function of pdf.js already locates the keyword for us. As long as you pass the highlighted keywords through as described above, keyword positioning is solved as a side effect.

Finally, I would like to thank the Kaohsiung leaders in kk open source technology exchange group 2 for providing me with ideas for modification and helping me find a reference blog.

Topics: Javascript ElasticSearch