E-book analysis

Posted by mikegzarejoyce on Mon, 03 Jan 2022 15:25:01 +0100

Idea: e-book uploads are handled by the multer middleware. After the upload, multer attaches a file object to req (an array of file objects when several files are uploaded). Each file object contains the file name, file path, MIME type, and so on. From this information you can create a Book object, the so-called e-book object, and then carry out the parsing through it

models/Book.js

Book here represents an e-book. It has to offer a few capabilities. One is creating an object from an uploaded file. Another is that, when editing, it can be built from the form data

What is the benefit of having a Book object? Once it exists, you can write methods on it, such as a parse method that parses the e-book and extracts details such as language, title, and creator

It can also parse the e-book's directory (table of contents), convert itself to JSON (which the front end can use directly), map itself to database field names, and quickly generate some SQL statements. The Book object is therefore central to the whole e-book parsing feature, and a large part of e-book parsing is writing the Book class

E-book object development

Passing in a file means an e-book file has just been uploaded. Passing in data means e-book data is being updated or inserted. data is what gets written to the database, while file is mainly used to parse the e-book

router/book.js call

After you know the contents of the file object, you can parse it

mimetype can be given a default value

We need to rename the file because the returned file.path has no suffix, which makes the file hard to identify

The download URL of the generated file is built from UPLOAD_URL, which is defined in constant.js

UPLOAD_PATH has been changed here: it does not need two backslashes, a single '/' is enough

An old bug was found here: there should not be \book after UPLOAD_PATH

    // Generate a file download path through which you can quickly download to the e-book
    const url=`${UPLOAD_URL}/book/${filename}${suffix}`

The same applies to the unzipped folder

const {MIME_TYPE_EPUB,UPLOAD_URL,UPLOAD_PATH}=require('../utils/constant')
class Book{
  constructor(file,data){
    if(file){
      this.createBookFromFile(file)
    }else{
      this.createBookFromData(data)
    }
  }
  createBookFromFile(file){
    // console.log("createBookFromFile",file);
    const{
      destination,
      filename,
      mimetype=MIME_TYPE_EPUB,
      path
    }=file
    // E-book file suffix
    const suffix=mimetype===MIME_TYPE_EPUB?'.epub':''
    // Original path of e-book
    const oldBookPath=path    // Original path
    // E-book new path
    const bookPath=`${destination}\\${filename}${suffix}` //New path 
    // Generate a file download path through which you can quickly download to the e-book
    // Download URL of e-book
    const url=`${UPLOAD_URL}/book/${filename}${suffix}`
    // Generate an e-book decompression folder named after the file name
    // Folder path after e-book decompression
    const unzipPath=`${UPLOAD_PATH}\\unzip\\${filename}`
    // This url path will be used when reading the e-book
    // Folder URL after e-book decompression
    const unzipUrl=`${UPLOAD_URL}/unzip/${filename}`

  }
  createBookFromData(){

  }
}
module.exports=Book

Next, you can create the decompression folder of the e-book

    // fs must be required at the top of Book.js: const fs=require('fs')
    if(!fs.existsSync(unzipPath)){
      // Create the folder, including any missing parent folders
      fs.mkdirSync(unzipPath,{recursive:true})
    }

Next, the extracted files will be placed under this path

Rename file

    // If the uploaded file exists at its original path,
    // rename it to the new path (which carries the suffix) by passing oldBookPath and bookPath to renameSync
    if(fs.existsSync(oldBookPath)){
      fs.renameSync(oldBookPath,bookPath)
    }

Next, define some properties of the book object according to some fields required by the front end

    // Note: originalname must also be destructured from file in createBookFromFile
    this.filename=filename  // File name without suffix
    // Use relative paths so they work in different scenarios, because absolute paths differ between the server and the client
    this.path=`/book/${filename}${suffix}` // Relative path of the epub file
    this.filePath=this.path    // Alias of path
    this.unzipPath=`/unzip/${filename}`  // Relative path of the unzipped folder
    this.url=url     // Download link of the epub file
    this.title=''   // Book title, filled in after parsing
    this.author=''   // Author
    this.publisher=''    // Publisher
    this.contents=[]    // Directory (table of contents)
    this.cover=''     // Cover image URL
    this.category=-1    // Category id
    this.categoryText=''  // Category name
    this.language=''    // Language
    this.unzipUrl=unzipUrl   // Link to the unzipped folder
    this.originalname=originalname  // Original file name

Take a look at the result (here both backslashes should be changed to /)
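
For illustration, here is a minimal sketch of the values that createBookFromFile derives, computed without touching the file system. The constant values, the derivePaths helper, and the sample file object are all assumptions made up for this example; only the field names follow what multer puts on the file object.

```javascript
// Hypothetical constants; in the project they come from utils/constant.js
const MIME_TYPE_EPUB = 'application/epub+zip'
const UPLOAD_URL = 'http://localhost:8089/upload'   // assumed value
const UPLOAD_PATH = '/tmp/upload'                   // assumed value

// Hypothetical pure helper: derives the same values as createBookFromFile
function derivePaths(file) {
  const { destination, filename, mimetype = MIME_TYPE_EPUB } = file
  const suffix = mimetype === MIME_TYPE_EPUB ? '.epub' : ''
  return {
    bookPath: `${destination}/${filename}${suffix}`,   // new path with suffix
    url: `${UPLOAD_URL}/book/${filename}${suffix}`,    // download URL
    unzipPath: `${UPLOAD_PATH}/unzip/${filename}`,     // unzip folder path
    unzipUrl: `${UPLOAD_URL}/unzip/${filename}`        // unzip folder URL
  }
}

// Made-up upload, shaped like the object multer attaches to the request
const sampleFile = {
  destination: '/tmp/upload/book',
  filename: 'abc123',
  mimetype: 'application/epub+zip',
  path: '/tmp/upload/book/abc123'
}

const derived = derivePaths(sampleFile)
console.log(derived)
```

With forward slashes everywhere, the same template strings work for both file system paths and URLs.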

E-book parsing library: epub

The epubjs library is meant for the browser. It cannot work outside a browser, because its main job is rendering e-books there

The epub library here is used in the node environment

https://www.youbaobao.xyz/admin-docs/guide/extra/book.html#%E7%94%B5%E5%AD%90%E4%B9%A6%E8%A7%A3%E6%9E%90-2

https://github.com/julien-c/epub/blob/master/epub.js

Because its code needs to be modified, it is copied into the project rather than installed as an npm package

utils/epub.js

Install adm-zip and xml2js

The Epub class provides a parse method

The parse method is what does the actual parsing

Looking at the usage, it is event-based

After it is passed in, a callback is used

The following function is the callback invoked after parsing succeeds

When does parsing actually start? You need to call epub.parse() manually

Call the getChapter method on the epub instance and pass it a callback

Information such as the author can be obtained from metadata

After parsing succeeds, it is available through epub.metadata

flow is the order in which the entire e-book is rendered

getChapter takes a chapter id and returns the text of that chapter

getChapterRaw returns the original text, that is, the chapter as an HTML file

getImage takes an image id and returns the actual content of the image

getFile takes a resource id (for example a CSS id) and returns the file

Because there are a lot of callbacks, this will be refactored later
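
One possible direction for that refactoring is to wrap the callback-style methods in Promises. The sketch below uses Node's util.promisify on a stub object. fakeEpub is a hypothetical stand-in written for this example, not the real library; only the (err, result) callback shape mirrors what is described above.

```javascript
const { promisify } = require('util')

// Hypothetical stub standing in for an Epub instance
const fakeEpub = {
  chapters: { ch1: '<p>Hello</p>' },
  getChapter(id, callback) {
    const text = this.chapters[id]
    if (text === undefined) {
      callback(new Error('File not found'))
    } else {
      callback(null, text)
    }
  }
}

// bind keeps `this` pointing at the instance once promisify wraps the method
const getChapterAsync = promisify(fakeEpub.getChapter.bind(fakeEpub))

// Callers can now use async/await instead of nesting callbacks
async function printChapter(id) {
  try {
    const text = await getChapterAsync(id)
    console.log(text)
    return text
  } catch (err) {
    console.log('failed:', err.message)
    return null
  }
}

printChapter('ch1')
```

Note that util.promisify passes on only the first result argument. A callback that receives several results, such as (err, file, mimetype), would need a hand-written Promise wrapper instead.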

E-book analysis method

Import the epub library into models/Book.js

Add a parse method. We have added many attributes to Book, but most of them only hold default values. parse reads the real values from the e-book and fills them in

  parse(){
    return new Promise((resolve,reject)=>{
      const bookPath=`${UPLOAD_PATH}${this.filePath}1`
      // If there is no file path, an error is thrown
      if(!fs.existsSync(bookPath)){
        reject (new Error('E-book does not exist'))
      } 
    })
  }

Test it

router/book.js

To verify this, deliberately change the path to something wrong

The front end hangs here because nothing is returned. You can use boom to quickly generate an exception object

.catch(err => {
        console.log("upload", err);
        // Tell the front end that a parsing exception has occurred
        next(boom.badImplementation(err))
      })

In other words, the error passed to reject inside the Book object is handed to next in the route, picked up by the custom exception handler, and returned to the front end, which handles it accordingly. This way an exception raised on the server can be caught by the front end
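
The flow just described can be sketched without Express or boom. In this example parseBook is a simplified stand-in for Book.parse(), and next is a hypothetical stub that only records the error it receives; in the real route, next is supplied by Express and the error is wrapped with boom first.

```javascript
const fs = require('fs')

// Simplified stand-in for Book.parse(): rejects when the file is missing
function parseBook(bookPath) {
  return new Promise((resolve, reject) => {
    if (!fs.existsSync(bookPath)) {
      // return, so that nothing after the reject runs
      return reject(new Error('E-book does not exist'))
    }
    resolve({ path: bookPath })
  })
}

// Hypothetical stand-in for Express's next(): it just records the error
const forwarded = []
function next(err) {
  forwarded.push(err)
}

// The route's pattern: on failure, hand the error to next()
const done = parseBook('/no/such/file.epub')
  .then(book => console.log('parsed', book))
  .catch(err => next(err))
```

Because the rejection travels through the Promise chain, the route needs only one catch to forward every parsing failure.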

E-book analysis method

Looking at the source, once parse is called it goes on to call an open method

In models/Book.js:

Usage

After reject

Then execution goes to the custom exception handler

Test

Change the bookPath back

Print epub.metadata

The information in the metadata needs to be parsed

Print book

epub object:

containerFile: the first file that epub parses. content.opf is located from this file

rootFile: the location of content.opf. Reading an e-book really comes down to parsing content.opf

Once content.opf has been found, the rest of the process is easy

manifest: the resource files. The cover image can be found through them

toc: the directory (table of contents)

book object printing

Getting the cover: the epub library provides a getImage method

It takes two parameters: an id and a callback

id is the id of the cover image

All we want is to copy the image at that href into the img folder under the nginx directory. That gives us a URL that can be used as the link to the cover image

Analyze the source code of getImage method

    getImage(id, callback) {
        // Look the id up in the manifest
        if (this.manifest[id]) {
            // Take the first six characters of the media type (empty string if missing) and compare them with "image/".
            // If they differ, pass an error object to the callback
            if ((this.manifest[id]['media-type'] || "").toLowerCase().trim().substr(0, 6)  !=  "image/") {
                return callback(new Error("Invalid mime type for image"));
            }
            // It is an image, so call getFile with the id and callback
            this.getFile(id, callback);
        } else {
            callback(new Error("File not found"));
        }
    };

Take a look at how getFile is implemented

Now use the getImage method. Refactoring the library to use then or async/await would be better, but for now use the callback

At this point the data has been obtained (it has been read into memory but not yet written to disk)

The suffix is derived from the mimetype

Write the file

The Book.js code so far:

const { MIME_TYPE_EPUB, UPLOAD_URL, UPLOAD_PATH } = require('../utils/constant')
const fs = require('fs')
const Epub = require('../utils/epub')
class Book {
  constructor(file, data) {
    if (file) {
      this.createBookFromFile(file)
    } else {
      this.createBookFromData(data)
    }
  }
  createBookFromFile(file) {
    console.log("createBookFromFile", file);
    const {
      destination,
      filename,
      mimetype = MIME_TYPE_EPUB,
      path,
      originalname
    } = file
    // E-book file suffix
    const suffix = mimetype === MIME_TYPE_EPUB ? '.epub' : ''
    // Original path of e-book
    const oldBookPath = path    // Original path
    // E-book new path
    const bookPath = `${destination}/${filename}${suffix}` //New path 
    // Generate a file download path through which you can quickly download to the e-book
    // Download URL of e-book
    const url = `${UPLOAD_URL}/book/${filename}${suffix}`
    // Generate an e-book decompression folder named after the file name
    // Folder path after e-book decompression
    const unzipPath = `${UPLOAD_PATH}/unzip/${filename}`
    // This url path will be used when reading the e-book
    // Folder URL after e-book decompression
    const unzipUrl = `${UPLOAD_URL}/unzip/${filename}`
    // If unzipPath does not exist, create it
    if (!fs.existsSync(unzipPath)) {
      // Create the folder, including any missing parent folders
      fs.mkdirSync(unzipPath, { recursive: true })
    }
    // If the uploaded file exists at its original path,
    // rename it to the new path (which carries the suffix) by passing oldBookPath and bookPath to renameSync
    if (fs.existsSync(oldBookPath)) {
      fs.renameSync(oldBookPath, bookPath)
    }
    this.filename = filename  // File name without suffix
    // Use relative paths so they work in different scenarios, because absolute paths differ between the server and the client
    this.path = `/book/${filename}${suffix}` // Relative path of the epub file
    this.filePath = this.path    // Alias of path
    this.unzipPath = `/unzip/${filename}`  // Relative path of the unzipped folder
    this.url = url     // Download link of the epub file
    this.title = ''   // Book title, filled in after parsing
    this.author = ''   // Author
    this.publisher = ''    // Publisher
    this.contents = []    // Directory (table of contents)
    this.cover = ''     // Cover image URL
    this.coverPath = ''   // Relative path of the cover image
    this.category = -1    // Category id
    this.categoryText = ''  // Category name
    this.language = ''    // Language
    this.unzipUrl = unzipUrl   // Link to the unzipped folder
    this.originalname = originalname  // Original file name

  }
  createBookFromData() {
  }
  parse() {
    return new Promise((resolve, reject) => {
      const bookPath = `${UPLOAD_PATH}${this.filePath}`
      // If the file does not exist, reject and stop here
      if (!fs.existsSync(bookPath)) {
        return reject(new Error('E-book does not exist'))
      }
      // Create an instance
      const epub = new Epub(bookPath)
      // The error event fires if anything goes wrong during parsing
      epub.on('error', err => {
        reject(err)
      })
      // The end event indicates that the e-book was parsed successfully
      epub.on('end', err => {
        if (err) {
          reject(err)
        } else {
          // console.log("epub+ ", epub.manifest);
          const {
            language,
            creator,
            creatorFileAs,
            title,
            cover,
            publisher
          } = epub.metadata

          if (!title) {
            reject(new Error('Book title is empty'))
          } else {
            this.title = title
            this.language = language || 'en'   // Defaults to English if missing
            this.author = creator || creatorFileAs || 'unknown'
            this.publisher = publisher || 'unknown'
            this.rootFile = epub.rootFile

            const handleGetImage =  (err, file, mimetype) =>{
              console.log(err, file, mimetype);
              if (err) {
                reject(err)
              } else {
                // resolve must only be called once the whole e-book has been processed, not before getImage finishes,
                // because getImage may fail, and calling reject after resolve has no effect
                const suffix = mimetype.split('/')[1]
                const coverPath = `${UPLOAD_PATH}/img/${this.filename}.${suffix}`
                const coverUrl = `${UPLOAD_URL}/img/${this.filename}.${suffix}`
                // Write buffer to disk
                console.log(coverPath);
                fs.writeFileSync(coverPath,file,'binary')
                this.coverPath=`/img/${this.filename}.${suffix}`
                this.cover=coverUrl
                resolve(this)
              }
            }
            epub.getImage(cover, handleGetImage)
             // Do not call resolve(this) here: getImage has not finished yet
          }
        }
      })
      epub.parse()
    })
  }
}
module.exports = Book

Analysis and optimization of cover image

Some e-books can't get the cover picture in this way

See where this error occurs

Print cover

Extract it and analyze it

Open package.opf

The metadata contains no tag describing the cover, so there is no way to obtain the resource id of the cover image from it

Look in the manifest

You can see a resource file for the cover, but it is of xhtml type, which means it is chapter content rather than an image. An image's media type would start with image/

This is the cover image

It is the file beginning with 978 in the image path

So there is another way to find the cover: read the properties attribute of each item. If it is cover-image, the item is the cover image. You can then take its href, locate the resource file, extract it from the epub, and save it locally

Modifying the getImage method in epub.js

If the id passed in cannot be found in the manifest, the logic here needs to be improved

Rough frame

How to get coverId

   const coverId=Object.keys(this.manifest).find(key=>(  // Note: parentheses, not curly braces, so the comparison result is returned
        // console.log(key,this.manifest[key]);
        this.manifest[key].properties==='cover-image'
      ))

getImage(id, callback) {
    // Look the id up in the manifest
    if (this.manifest[id]) {
      // Take the first six characters of the media type (empty string if missing) and compare them with "image/".
      // If they differ, pass an error object to the callback
      if ((this.manifest[id]['media-type'] || "").toLowerCase().trim().substr(0, 6) != "image/") {
        return callback(new Error("Invalid mime type for image"));
      }
      // It is an image, so call getFile with the id and callback
      this.getFile(id, callback);
    } else {
      // The id passed in is not in the manifest, so look for an item whose properties is 'cover-image'
      // find returns the first key that matches
      const coverId = Object.keys(this.manifest).find(key => (
        // console.log(key,this.manifest[key]);
        this.manifest[key].properties === 'cover-image'
      ))
      console.log("coverId", coverId);
      if (coverId) {
        this.getFile(coverId, callback)
      } else {
        callback(new Error("File not found"));
      }

    }
  };
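
The lookup can be tried on its own with a made-up manifest. In this sketch findCoverId is a hypothetical helper, not part of the epub library, and the manifest entries are invented. Unlike the getImage change above, it also falls back when the id exists but points at a non-image resource such as an xhtml cover page, which is a variation rather than the code in epub.js.

```javascript
// Made-up manifest, shaped like the object the epub library exposes
const manifest = {
  cover: { id: 'cover', href: 'OEBPS/cover.xhtml', 'media-type': 'application/xhtml+xml' },
  img1: { id: 'img1', href: 'OEBPS/image/cover.jpg', 'media-type': 'image/jpeg', properties: 'cover-image' },
  ch1: { id: 'ch1', href: 'OEBPS/ch1.xhtml', 'media-type': 'application/xhtml+xml' }
}

// Hypothetical helper: decide which manifest id holds the cover image
function findCoverId(manifest, id) {
  const entry = manifest[id]
  const isImage = Boolean(entry) &&
    (entry['media-type'] || '').toLowerCase().trim().startsWith('image/')
  if (isImage) {
    return id
  }
  // Fallback: the first item explicitly marked as the cover image
  return Object.keys(manifest).find(
    key => manifest[key].properties === 'cover-image'
  )
}

console.log(findCoverId(manifest, undefined)) // metadata had no cover id
console.log(findCoverId(manifest, 'cover'))   // id points at an xhtml page
```

When no item carries the cover-image property, find returns undefined, and the caller can report that the file was not found.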

Next comes a harder part: parsing the e-book directory (table of contents)

The epub library does not provide a solution for this. The manifest lists many resource files, but it gives them no order. We need to determine the hierarchy of the directory ourselves

Directory analysis principle and e-book decompression

Directory parsing principle

First, get the toc attribute (the resource id of the directory) from the spine tag

Then find it in the manifest

Open toc.ncx

navMap: navigation

It contains all the directory entries, which may be nested

1. Extract the e-book file

Unzip it and put it in the unzip folder

We could read files directly from the e-book with the getFile method shown earlier, but we choose to unzip it first so that reading is more efficient

In our own Book class,

write an unzip method

 unzip(){
    const AdmZip=require('adm-zip')
    const zip=new AdmZip(Book.genPath(this.path))
    // extractAllTo() extracts everything in the archive into the target path.
    // The second parameter controls whether existing files are overwritten
    zip.extractAllTo(Book.genPath(this.unzipPath),true)
  }
  // Static helper that turns a relative path into an absolute path under UPLOAD_PATH
  static genPath(path){
    if(!path.startsWith('/')){
      path=`/${path}`
    }
    return `${UPLOAD_PATH}${path}`
  }
}

After extracting it, you can parse it

The unzip method is synchronous
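
As a quick check of the genPath helper used by unzip, the sketch below reproduces it as a plain function. The UPLOAD_PATH value is an assumption for this example; in the project it comes from utils/constant.js.

```javascript
const UPLOAD_PATH = '/tmp/upload' // assumed value

// Same logic as the static Book.genPath: make sure the relative path
// starts with '/', then prefix it with UPLOAD_PATH
function genPath(relativePath) {
  if (!relativePath.startsWith('/')) {
    relativePath = `/${relativePath}`
  }
  return `${UPLOAD_PATH}${relativePath}`
}

console.log(genPath('/book/abc123.epub')) // /tmp/upload/book/abc123.epub
console.log(genPath('unzip/abc123'))      // /tmp/upload/unzip/abc123
```

Normalizing the leading slash in one place means callers can pass either form of relative path.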

After unzip, you can define parseContents. It takes the epub object, because the toc attribute has to be looked up in the spine

  parseContents(epub){
    function getNcxFilePath(){
      const spine=epub&&epub.spine
      console.log("spine",spine);
    }
    getNcxFilePath()
  }

Print out the spine

You can see that there is a toc attribute under the spine

From toc you can take the id, or you can take the href directly

If there is no href, look the id up in the manifest

parseContents(epub){
    function getNcxFilePath(){
      const spine=epub&&epub.spine
      const manifest=epub&&epub.manifest
      // Guard every step so a missing spine or toc does not throw
      const ncx=spine&&spine.toc&&spine.toc.href
      const id=spine&&spine.toc&&spine.toc.id
      console.log("spine", spine&&spine.toc,ncx,id,manifest[id]&&manifest[id].href);
      if(ncx){
        return ncx
      }else{
        // Fall back to the manifest entry for the toc id.
        // It should exist, because it is the e-book's directory file
        return manifest[id].href
      }
    }
    getNcxFilePath()
  }

Both approaches yield the directory file
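
The two lookups can be compared on sample data. resolveNcxHref below is a hypothetical helper that mirrors getNcxFilePath but takes plain objects, and the spine and manifest values are invented for this example.

```javascript
// Hypothetical helper mirroring getNcxFilePath, with plain-object inputs
function resolveNcxHref(spine, manifest) {
  const toc = spine && spine.toc
  if (!toc) {
    return undefined
  }
  // Preferred: the href stored directly on spine.toc
  if (toc.href) {
    return toc.href
  }
  // Fallback: look the toc id up in the manifest
  const entry = manifest && manifest[toc.id]
  return entry ? entry.href : undefined
}

// Made-up manifest containing the directory file
const sampleManifest = {
  ncx: { id: 'ncx', href: 'OEBPS/toc.ncx', 'media-type': 'application/x-dtbncx+xml' }
}

console.log(resolveNcxHref({ toc: { id: 'ncx', href: 'OEBPS/toc.ncx' } }, sampleManifest))
console.log(resolveNcxHref({ toc: { id: 'ncx' } }, sampleManifest))
```

Returning undefined instead of throwing lets the caller decide how to report a book that has no directory file.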

Then get the path

 const ncxFilePath=getNcxFilePath()

This is a relative path. It needs to be joined into an absolute path

 const ncxFilePath=Book.genPath(getNcxFilePath())

This is still wrong. unzipPath needs to be added

 const ncxFilePath=Book.genPath(`${this.unzipPath}/${getNcxFilePath()}`)
    console.log(ncxFilePath);

One more thing: check whether the path exists. If it does not, throw an exception

    if(fs.existsSync(ncxFilePath)){

    }else{
      throw new Error('The resource file corresponding to the directory does not exist')
    }

The exception is caught here

Eventually, the front end receives an error message

Try it out

   const ncxFilePath=Book.genPath(`${this.unzipPath}/${getNcxFilePath()+1}`)

E-book standard catalog analysis

Open toc.ncx

There is a navMap under the ncx object

Each navPoint under navMap is a directory entry

navLabel: the text of the directory entry

content: its src is the path of the entry. playOrder: the order of the entry

Directories may be nested, and we also need to recognize second-level entries, so we need a recursive method to walk the directory (the hard part)

Book.js first requires the xml2js library

https://www.npmjs.com/package/xml2js

What we want to get is the navMap attribute under ncx

Print the navMap

Tip: to look at the details,

paste the string into JSON.cn

Directory structure

The returned result wraps values in arrays. If you do not want the array wrapping, you can pass an option,

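        // Assumption: xml2js here refers to the library's parseString function,
        // e.g. const xml2js=require('xml2js').parseString
        xml2js(xml,{
          explicitArray:false,  // Do not wrap single child nodes in arrays
          ignoreAttrs:false     // Keep XML attributes (exposed under '$')
        },function(err,result){
          if(err){
            reject(err)
          }else{
            console.log(result) 
            const navMap=result.ncx.navMap
            console.log(JSON.stringify(navMap));
          }
        })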

Current structure

A findParent method is added. Because this book has a single-level directory, it returns the array unchanged for now. It will be improved later

    function findParent(array){
      return array.map(item=>{
        return item
      })
    }

If there are subdirectories, the result is a tree, which is inconvenient for the front end to display

So we need to flatten the tree into a one-dimensional array. That scenario has not come up yet, but we build the method first

 navMap.navPoint=findParent(navMap.navPoint)
 const newNavMap=flatten(navMap.navPoint)

newNavMap is a shallow copy of navMap.navPoint

    function findParent(array){
      return array.map(item=>{
        return item
      })
    }
    function flatten(array){
      return [].concat(...array.map(item=>{
        return item
      }))
    }

newNavMap is a copied array

epub.flow: display order

              // (fragment: the enclosing if-branch, taken when the directory is not empty, is not shown)
              epub.flow.forEach((chapter,index)=>{
                if(index+1>newNavMap.length){
                  // flow has more entries than the directory, so skip this one
                  return 
                }else{
                  // Within range: get the matching directory entry
                  const nav=newNavMap[index]
                  // Add an attribute (the chapter URL)
                  chapter.text=`${UPLOAD_URL}/unzip/${fileName}/${chapter.href}`
                  console.log(chapter.text);
                }
              })
              console.log(epub.flow);
            }else{
              reject(new Error('Directory parsing failed: the directory has 0 entries'))
            }

There is a problem: epub.flow alone is not enough

In fact, epub.flow has some hidden pitfalls

For some e-books it carries no order or level information, so it is inaccurate

So take the more reliable directory information from navMap

Continue adding attributes to the chapter

                  if (nav && nav.navLabel) {
                    chapter.label = nav.navLabel.text || ''
                  } else {
                    chapter.label = ''
                  }
                  chapter.navId=nav['$'].id
                  chapter.fileName=fileName
                  chapter.order=index+1
                  chapters.push(chapter)
                  console.log(chapter.text);
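
Putting the loop and the extra attributes together, the merge of flow and the directory entries can be sketched as a pure function. buildChapters, the UPLOAD_URL value, and the sample data are assumptions for this example. The navPoint objects follow the shape xml2js produces with explicitArray set to false, where attributes live under '$'.

```javascript
const UPLOAD_URL = 'http://localhost:8089/upload' // assumed value

// Hypothetical helper: pair each flow entry with the directory entry
// at the same index and collect the enriched chapters
function buildChapters(flow, navPoints, fileName) {
  const chapters = []
  flow.forEach((chapter, index) => {
    const nav = navPoints[index]
    if (!nav) {
      return // flow has more entries than the directory
    }
    chapters.push({
      ...chapter,
      text: `${UPLOAD_URL}/unzip/${fileName}/${chapter.href}`, // chapter URL
      label: (nav.navLabel && nav.navLabel.text) || '',
      navId: nav.$.id,
      fileName,
      order: index + 1
    })
  })
  return chapters
}

// Made-up data: three flow entries but only two directory entries
const flow = [
  { id: 'ch1', href: 'ch1.xhtml' },
  { id: 'ch2', href: 'ch2.xhtml' },
  { id: 'ch3', href: 'ch3.xhtml' }
]
const navPoints = [
  { $: { id: 'nav1', playOrder: '1' }, navLabel: { text: 'Chapter 1' } },
  { $: { id: 'nav2', playOrder: '2' }, navLabel: { text: 'Chapter 2' } }
]

const chapters = buildChapters(flow, navPoints, 'abc123')
console.log(chapters)
```

The sketch copies each flow entry instead of mutating it, which keeps epub.flow unchanged. The code above mutates the entries in place.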

Nested directory resolution

The work is done in findParent

The top level is 0 and the next level is 1, so the front end can indent entries according to their level

It takes three parameters: array, level=0, pid=''

There is no level field in navPoint. You can add a level field to it

    function findParent(array, level = 0, pid = '') {
      // Three scenarios:
      // 1. No navPoint: just assign level and pid
      // 2. navPoint is an array: the entry has subdirectories, so recurse into them
      // 3. navPoint is an object rather than an array (a single subdirectory): assign level and pid to it directly
      return array.map(item => {
        item.level = level
        item.pid = pid
        // Indicates that a subdirectory exists
        if (item.navPoint && item.navPoint.length) {
          item.navPoint = findParent(item.navPoint, level + 1, item['$'].id)
        } else if (item.navPoint) {
          item.navPoint.level = level + 1
          item.navPoint.pid = item['$'].id
        }
        return item
      })
    }

flatten method: change the navPoint array into a flat state

The two methods work together here

Without flattening, newNavMap would have fewer entries than flow, so its length would be smaller than index+1

flatten method

    function flatten(array) {
      return [].concat(...array.map(item => {
        // If it contains an array
        if(item.navPoint&&item.navPoint.length>0){
          // merge
          return [].concat(item,...flatten(item.navPoint))
        }else if(item.navPoint){
          // If it's an object
          return [].concat(item,item.navPoint)
        }
        return item
      }))
    }
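
To see the two methods working together, here is a worked example on a made-up nested directory. The functions repeat the logic above, with Array.isArray used for the array check as a small variation. The navPoints data is invented and follows the shape xml2js produces with explicitArray set to false.

```javascript
// Annotate every entry with its level and the id of its parent
function findParent(array, level = 0, pid = '') {
  return array.map(item => {
    item.level = level
    item.pid = pid
    if (Array.isArray(item.navPoint)) {
      // Several subdirectories: recurse
      item.navPoint = findParent(item.navPoint, level + 1, item.$.id)
    } else if (item.navPoint) {
      // A single subdirectory arrives as an object
      item.navPoint.level = level + 1
      item.navPoint.pid = item.$.id
    }
    return item
  })
}

// Turn the tree into a one-dimensional array, parents before children
function flatten(array) {
  return [].concat(...array.map(item => {
    if (Array.isArray(item.navPoint)) {
      return [item, ...flatten(item.navPoint)]
    } else if (item.navPoint) {
      return [item, item.navPoint]
    }
    return item
  }))
}

// Made-up directory covering all three scenarios
const navPoints = [
  { $: { id: 'n1' }, navLabel: { text: 'Part 1' }, navPoint: [
    { $: { id: 'n1-1' }, navLabel: { text: 'Chapter 1' } },
    { $: { id: 'n1-2' }, navLabel: { text: 'Chapter 2' } }
  ] },
  { $: { id: 'n2' }, navLabel: { text: 'Part 2' },
    navPoint: { $: { id: 'n2-1' }, navLabel: { text: 'Chapter 3' } } },
  { $: { id: 'n3' }, navLabel: { text: 'Afterword' } }
]

const flat = flatten(findParent(navPoints))
flat.forEach(n => {
  console.log(`${'  '.repeat(n.level)}${n.navLabel.text} (pid: ${n.pid || 'none'})`)
})
```

The flattened array keeps the reading order, and the level field carries enough information for the front end to indent each entry.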

resolve reject

You can return the book to the front end

new Result(book,'upload succeeded').success(res)

Topics: Vue
