[original] A handy Python tool for calculating Chinese text similarity

Posted by Major Tom on Wed, 08 Dec 2021 09:38:42 +0100

Introduction

Recently I needed a function at work for calculating Chinese text similarity, which is an application in the NLP field. I found a very good package for this and want to share it. The package is called sentence-transformers.

Here is how to use it to calculate the similarity of Chinese text (this is only a small part of what the package can do):

  1. The model used here is paraphrase-multilingual-MiniLM-L12-v2. The paraphrase-MiniLM-L6-v2 model performs very well, and paraphrase-multilingual-MiniLM-L12-v2 is its multilingual version: fast, effective, and it supports Chinese!

  2. The similarity measure used here is cosine similarity.
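
As a rough illustration of what cosine similarity measures, here is a minimal pure-Python sketch. The function name cosine_similarity and the toy vectors are assumptions made up for this example; they are not part of sentence-transformers, whose own cos_sim works on whole batches of embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors, made up for illustration (not real embeddings).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, regardless of their lengths. That is why the measure suits sentence embeddings, where direction matters more than magnitude.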

Usage steps

  1. Install the package. You can install it directly with pip:
pip install sentence-transformers
  2. Import the package
from sentence_transformers.util import cos_sim  
from sentence_transformers import SentenceTransformer as SBert
  3. Load the model
model = SBert('paraphrase-multilingual-MiniLM-L12-v2')

From within China, access to some model hosting sites may fail, producing an error like this:

HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1129)')))

In that case, use this approach instead: download the model first, unzip it into a folder, and then pass the folder path directly.

First, download the model from the model website: https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/

Find the model named paraphrase-multilingual-MiniLM-L12-v2 and click to download it.

Then unzip the model into a folder named paraphrase-multilingual-MiniLM-L12-v2 and pass that folder's path to the model, as follows:

# Raw string, so the backslashes in the Windows path are not read as escape sequences
model = SBert(r"C:\Users\xxxx\Downloads\paraphrase-multilingual-MiniLM-L12-v2")
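
If the same script has to run on machines with and without the downloaded folder, one possible approach is to check for the folder first. This is only a sketch: resolve_model_source is a hypothetical helper written for this post, not something sentence-transformers provides.

```python
from pathlib import Path


def resolve_model_source(local_dir, model_name):
    """Return the local folder if it exists, otherwise the model name.

    Hypothetical helper: the returned string is meant to be passed
    to SentenceTransformer(...).
    """
    folder = Path(local_dir)
    if folder.is_dir():
        return str(folder)
    # Falling back to the name means the model is fetched online,
    # which is the step that can fail with the error shown above.
    return model_name
```

It could then be used as model = SBert(resolve_model_source(folder_path, 'paraphrase-multilingual-MiniLM-L12-v2')), where folder_path is wherever you unzipped the model.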
  4. Calculate the results

The rest is simple: pass in two lists, encode the text in each list, compute the cosine similarity, and output the results.

# Two lists of sentences
sentences1 = ['How to change a bank card',
              'The cat sits outside',
              'A man is playing guitar',
              'The new movie is awesome']

sentences2 = ['Change the binding bank card',
              'The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']

# Compute embeddings for both lists
embeddings1 = model.encode(sentences1)
embeddings2 = model.encode(sentences2)

# Compute cosine similarities: row i, column j compares sentences1[i] with sentences2[j]
cosine_scores = cos_sim(embeddings1, embeddings2)
print(cosine_scores)
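
The result is a matrix with one row per sentence in the first list and one column per sentence in the second. As a sketch of how it might be read, the hypothetical helper below picks the best-scoring column for each row. The scores in toy_scores are placeholders invented for illustration, not output from the model.

```python
def best_matches(score_matrix):
    """For each row, return (column index, score) of the highest score."""
    matches = []
    for row in score_matrix:
        best_col = max(range(len(row)), key=lambda j: row[j])
        matches.append((best_col, row[best_col]))
    return matches


# Placeholder scores, invented for illustration only (not model output).
toy_scores = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
]
result = best_matches(toy_scores)
```

With the real output, cosine_scores.tolist() turns the tensor into nested lists that can be passed to this helper.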

My impressions

Actually, I had originally planned to use the text2vec package, which I think is very good and powerful, so I wanted to look at its source code. The moment I opened the source, I was stunned: I did not expect the code to be so clean and elegant.

But I later realized that I had jumped into the sentence-transformers package 😂. It turns out that much of the elegant code I had been admiring belongs to sentence-transformers 😂.

The text2vec package is also very good, and the sentence-transformers package is even better!!

I only show a simple use of sentence-transformers here. The package's source code is well worth reading carefully ~

Reference links

  1. https://github.com/UKPLab/sentence-transformers
  2. https://github.com/shibing624/text2vec

Topics: Python NLP