Resumable upload (breakpoint continuation) with the MinIO SDK on Node.js, via a self-built file transfer service

Posted by superdan_35 on Tue, 01 Mar 2022 09:05:07 +0100

Background

  • The company's business background needs no elaboration; in short, the requirement is a resumable upload (breakpoint continuation) feature
  • The company stores files on Amazon S3, a general-purpose object store similar to Alibaba Cloud OSS. Each file can be regarded as an object, and operations are performed on file objects. Amazon S3 lets you create buckets for storing files; a bucket has three permission levels: 1. public read, public write; 2. public read, private write; 3. private read, private write
  • Why is the front end building this? Because the back end only provides data interfaces and no web services; web services are developed and maintained by the front-end team itself
  • If you are unfamiliar with Amazon S3 or Alibaba Cloud OSS, parts of this article may be hard to follow

Simple file upload

Amazon S3 offers several schemes for uploading files, which divide into front-end direct upload and SDK upload

1. Front-end direct upload, split into the public-write and private-write cases

Public write: you can splice the URL yourself: http://<address>/<bucketName>/<objectName>, then send a PUT request for the file straight from the front end. Here address is the domain name provided by the service, bucketName is our bucket name, and objectName is the object name (i.e. the file name)
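The public-write splice above can be wrapped in a tiny helper (buildPublicPutUrl is an illustrative name, not part of any SDK):

```javascript
// Build the direct-PUT URL for a public-write bucket:
//   http://<address>/<bucketName>/<objectName>
// encodeURIComponent guards object names containing spaces or non-ASCII.
function buildPublicPutUrl(address, bucketName, objectName) {
  return `http://${address}/${bucketName}/${encodeURIComponent(objectName)}`;
}

// The front end can then PUT the file body straight to that URL, e.g.:
// fetch(buildPublicPutUrl('files.example.com', 'uploads', 'a.png'),
//       { method: 'PUT', body: file });
```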

Private write: MinIO's SDK must be used. Call the SDK's presignedPutObject method to generate an upload address; once the front end has the link, it can likewise issue a PUT request directly. Note that the SDK can only run on the server, i.e. you must have your own service
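A rough sketch of that private-write flow: the server, which holds the credentials, generates a presigned URL and hands it to the front end. presignedPutObject is the real minio-js method name; the stub client below only imitates its shape so the sketch runs without a live service:

```javascript
// The server-side step: ask the client for a presigned PUT URL.
// presignedPutObject(bucket, object, expirySeconds) resolves to a signed URL.
async function getUploadUrl(client, bucketName, objectName) {
  return client.presignedPutObject(bucketName, objectName, 24 * 60 * 60);
}

// Stand-in for `new Client({ endPoint, accessKey, secretKey })`,
// so the flow can be exercised offline:
const stubClient = {
  presignedPutObject: async (bucket, object, expiry) =>
    `http://files.example.com/${bucket}/${object}?X-Amz-Signature=stub&expires=${expiry}`,
};

// Usage: the front end fetches this URL from our server,
// then PUTs the file body directly to it.
// getUploadUrl(stubClient, 'uploads', 'report.pdf').then((url) => { /* ... */ });
```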

2. Upload via the SDK. Since the SDK can only run on the server, the flow is basically: the front end uploads the file to our server, and the server uses the SDK to upload it to the S3 file service

Of these schemes, everything except the public-write case requires a self-built server, and from a security standpoint public write is unreasonable anyway

Design of breakpoint continuation scheme

To implement resumable upload, I must have my own service. Like most people, I searched for how others had done it, picked what fit and, combined with our project, designed the following scheme:

  1. After the user selects a file, the front end splits it with file.slice
  2. Compute the file hash, i.e. the file's MD5; as long as the content is unchanged, the hash stays the same. Hashing is a synchronous task, and a large file will freeze the browser, so I solved it with spark-md5 plus the browser's requestIdleCallback API; depending on your project you could use a Web Worker instead
  3. Query by hash whether a corresponding file exists. You can keep your own database mapping each hash to a file link; or, if the hash is used as the file name, call the statObject method of MinIO's SDK to learn whether the file has already been uploaded. If the file information already exists, return the link directly without uploading: this is the instant-upload ("second transmission") feature
  4. If it has not been uploaded, the server reads which fragments already exist in the folder named after this hash and returns them to the front end, which then uploads only the missing fragments. This is the resumable-upload feature
  5. When the front end uploads a fragment, it passes the file hash, the fragment number and other information as parameters. The server names a folder after the hash, names each fragment after its number, and stores the fragments locally
  6. Once all fragments are uploaded, merge the file on the server, call MinIO's putObject method to upload it to the S3 file server, and then delete the server's local files
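Step 6's server-side merge can be sketched as a small helper. This is an illustrative sketch only: the { index, data } pairs stand in for the fragment files that a real server would read from the hash-named folder:

```javascript
// Merge uploaded fragments into one file body. `chunks` is an array of
// { index, data } pairs (index = fragment number, data = Buffer).
function mergeChunks(chunks) {
  return Buffer.concat(
    chunks
      .slice() // avoid mutating the caller's array
      .sort((a, b) => a.index - b.index) // numeric sort, so 2 comes before 10
      .map((c) => c.data)
  );
}

// mergeChunks([{ index: 2, data: ... }, { index: 1, data: ... }])
// concatenates the buffers in fragment order; the result is what
// putObject would then push to the S3 file server.
```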

With that, the resumable-upload feature was complete, but the scheme has one imperfection: fragments are stored and merged on our own server before being uploaded to the file service. As far as I know, standard S3 itself supports resumable uploads and merging files, so could each fragment be uploaded directly to the S3 file server and the files merged there once all fragments have arrived? The answer comes later; at the time there was genuinely no usable solution. The closest one was the API provided by Baidu's Intelligent Cloud: https://cloud.baidu.com/doc/B... but MinIO's SDK does not provide an uploadPart method, so that approach would not work, and we shipped the scheme above first

Problem

The scheme above overlooks one fatal problem: in production there are multiple machines, so fragments get uploaded to different machines and can never be merged, which defeats the design. We only noticed this when a merge failed because a fragment could not be found. So a new scheme had to be designed, focused on uploading fragments directly to the S3 file server and merging them there

To solve this problem, I read MinIO's source code, in particular putObject, and learned that its core flow is as follows:

  1. Split the file into blocks with block-stream2
  2. Call findUploadId with the objectName to look up an uploadId; if none exists, initiateNewMultipartUpload is called to initialize one. Some experimentation revealed:

    2.1 each call to initiateNewMultipartUpload returns a different uploadId

    2.2 findUploadId returns the latest uploadId

    2.3 according to other sources, an uploadId is valid for 7 days

  3. Call listParts to get the parts that have already been uploaded
  4. Assemble the parameters and call makeRequest to upload the fragment
  5. Call completeMultipartUpload to finish the multipart upload; this method merges all fragments and returns the merged file's etag
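The five steps can be strung together roughly as follows. The client methods mirror the minio-js internals named above; uploadPart here stands in for the makeRequest call in step 4, and the stub client lets the sketch run without a live S3 service:

```javascript
// Walk the multipart flow that putObject performs internally.
// `parts` is [{ partNumber, data }], as produced by block-stream2.
async function multipartUpload(client, bucketName, objectName, parts) {
  // Step 2: reuse the latest uploadId, or initialize a new one.
  let uploadId = await client.findUploadId(bucketName, objectName);
  if (!uploadId) {
    uploadId = await client.initiateNewMultipartUpload(bucketName, objectName);
  }

  // Step 3: find out which parts the server already has.
  const done = await client.listParts(bucketName, objectName, uploadId);
  const doneNumbers = new Set(done.map((p) => p.part));
  const etags = done.map((p) => ({ part: p.part, etag: p.etag }));

  // Step 4: upload only the missing parts, collecting their etags.
  for (const p of parts) {
    if (doneNumbers.has(p.partNumber)) continue;
    const etag = await client.uploadPart(bucketName, objectName, uploadId, p);
    etags.push({ part: p.partNumber, etag });
  }

  // Step 5: merge everything on the S3 side.
  etags.sort((a, b) => a.part - b.part);
  return client.completeMultipartUpload(bucketName, objectName, uploadId, etags);
}

// Stub client so the flow can be exercised offline:
const stub = {
  findUploadId: async () => null,
  initiateNewMultipartUpload: async () => 'uid-1',
  listParts: async () => [],
  uploadPart: async (bucket, object, uploadId, p) => `etag-${p.partNumber}`,
  completeMultipartUpload: async (bucket, object, uploadId, etags) => ({
    etag: etags.map((e) => e.etag).join('+'),
  }),
};
```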

New scheme

Having read the source of the putObject method, we can make some modifications to the existing scheme

  1. After the user selects the file, split it into fragments
  2. Compute the file hash as before, and use it as the new file name (or add a fixed prefix). Names must follow a fixed rule, and the same file should never end up with different names, because in MinIO the object name on the server is the unique key
  3. Use the file name to check whether the file already exists, mainly by calling MinIO's statObject method. If it does not exist, obtain an uploadId for the file name, then use the uploadId to fetch the fragments already uploaded. (This differs from before: the fragment files no longer exist locally on the server, so fragment information has to be stored in a database. In principle the SDK's listParts method could return the uploaded fragments, but in our case its response lacked the expected partNumber field, perhaps because of the company's self-built S3 service, so we store the fragment records ourselves.)
  4. Once the front end has the uploaded-fragment information, handling matches the previous scheme: if the file exists, skip the upload; otherwise work out which fragments still need uploading and upload them
  5. Implement an uploadPart method yourself: after receiving a fragment, the server reads the fragment file's ArrayBuffer, gets the uploadId, assembles the parameters, and calls the SDK's makeRequest method to upload the fragment to the S3 file server; after uploading, it deletes the local fragment and records the uploaded fragment's information
  6. When every fragment has been uploaded, the front end calls the merge interface; the server calls the SDK's completeMultipartUpload method, which merges all the fragments on the S3 file server
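For step 4, working out which fragments still need uploading is a small pure function (missingParts is an illustrative name, not part of any SDK):

```javascript
// Given the total fragment count and the part numbers already uploaded
// (from the database records in step 3), return the parts still missing.
function missingParts(totalChunks, uploadedParts) {
  const uploaded = new Set(uploadedParts);
  const missing = [];
  for (let part = 1; part <= totalChunks; part++) {
    if (!uploaded.has(part)) missing.push(part);
  }
  return missing;
}

// e.g. with 5 fragments total and parts 1 and 3 recorded as uploaded,
// the front end only needs to send parts 2, 4 and 5.
```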

The new scheme is now complete; some of the code is pasted below

Front end:

File fragmentation:

// SINGLECHUNKSIZE is the default chunk size, e.g. 5 MB
const SINGLECHUNKSIZE = 5 * 1024 * 1024;

function createChunks(file, size = SINGLECHUNKSIZE) {
    let cur = 0,
        index = 1;
    const chunks = [];
    while (cur < file.size) {
        chunks.push({
            start: cur, // byte offset of the chunk's start in the file
            file: file.slice(cur, cur + size), // the fragment itself (a Blob)
            hash: "", // file hash, filled in later
            progress: 0, // upload progress
            uploaded: false, // has this chunk been uploaded yet
            index: index, // chunk number (1-based)
        });
        index++;
        cur += size;
    }

    return chunks;
}
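The boundary math behind createChunks can be sanity-checked with plain numbers in place of a browser File (chunkRanges is an illustrative helper, not from the project code):

```javascript
// Compute each chunk's [start, end) byte range for a file of `fileSize`
// bytes split into `chunkSize`-byte pieces; the last chunk may be shorter.
function chunkRanges(fileSize, chunkSize) {
  const ranges = [];
  for (let start = 0; start < fileSize; start += chunkSize) {
    ranges.push({ start, end: Math.min(start + chunkSize, fileSize) });
  }
  return ranges;
}

// A 10-byte file with 4-byte chunks yields three ranges:
// [0,4), [4,8) and [8,10) — matching what createChunks produces via slice.
```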

Calculating the file hash:

const md5ByRequestIdle = (chunks, { onProgress }) => {
    return new Promise((resolve) => {
        const spark = new SparkMD5.ArrayBuffer();
        let count = 0;
        const workLoop = async () => {
            if (count < chunks.length) {
                // Read the next chunk, then append it to the hash during idle time
                const reader = new FileReader();
                reader.onload = e => {
                    const add = (deadline) => {
                        // Only work when the browser has idle time left in the frame
                        if (deadline.timeRemaining() > 1) {
                            spark.append(e.target.result);
                            count++;
                            const progress = parseInt((count / chunks.length) * 100) / 100;
                            if (count < chunks.length) {
                                onProgress && onProgress(progress);
                                window.requestIdleCallback(workLoop);
                            } else {
                                onProgress && onProgress(1);
                                resolve(spark.end()); // final MD5 hex digest
                            }
                        } else {
                            // Not enough idle time: retry on the next idle period
                            window.requestIdleCallback(add);
                        }
                    }
                    window.requestIdleCallback(add)
                }
                reader.readAsArrayBuffer(chunks[count].file);
            } else {
                resolve(spark.end());
            }
        }
        window.requestIdleCallback(workLoop);
    });
}

Server:

Uploading a file fragment:

// Dependencies assumed in scope: Client from 'minio', fse from 'fs-extra',
// querystring (Node built-in) and mime from 'mime-types'.
// The config key name (s3FileServerV3) is illustrative.
async uploadPart(file, index, filename, uploadId?) {
    const part = Number(index);
    if (!uploadId) {
      uploadId = await this.getUploadIdByFilename(filename)
    }
    // Skip parts already recorded for this uploadId
    const curList = await this.ctx.model.etagCenter.findBy({
      filename,
      part,
      uploadId,
    })
    if (curList.length > 0) {
      return true
    }
    const client = new Client({
      endPoint: this.config.s3FileServerV3.endPoint,
      accessKey: this.config.s3FileServerV3.accessKey,
      secretKey: this.config.s3FileServerV3.secretKey,
    })
    const chunk = await fse.readFile(file.filepath)
    const query = querystring.stringify({
      partNumber: part,
      uploadId,
    })
    const options = {
      method: 'PUT',
      query,
      headers: {
        'Content-Length': chunk.length,
        'Content-Type': mime.lookup(filename),
      },
      bucketName: this.config.s3FileServerV3.bucketName,
      objectName: filename,
    }
    // In order to assemble the parts later, we need to collect the etags
    const etag = await new Promise((resolve, reject) => {
      client.makeRequest(options, chunk, [200], '', true, function(
        err,
        response,
      ) {
        if (err) return reject(err)
        let etag = response.headers.etag
        if (etag) {
          etag = etag.replace(/^"/, '').replace(/"$/, '')
        }
        // Remove the local fragment once it has been pushed to S3
        fse.unlink(file.filepath)
        resolve(etag)
      })
    })
    const insertResult = await this.ctx.model.etagCenter.add({
      filename,
      etag,
      part,
      uploadId,
    })
    return insertResult.insertedCount > 0
}

There are still many areas for improvement in both the scheme and the code, and I am still thinking them over

Subsequent optimization

At present, fragment files must first be uploaded to our own service and only then to the S3 file server, which costs time and efficiency. Later I hope to upload fragments from the front end directly to the S3 file server, on the premise, of course, that security is guaranteed; there has been no time to look into this yet, but if it proves workable, subsequent optimization will move in that direction

Topics: Javascript node.js