Django's open source full-text search framework - Haystack

Posted by samuraitux on Mon, 03 Jan 2022 01:02:45 +0100

Haystack

1. What is Haystack

Haystack is an open-source full-text search framework for Django. Full-text search differs from a fuzzy query against specific fields and is more efficient. The framework supports the Solr, Elasticsearch, Whoosh, and Xapian search engines. Its backend is pluggable (much like Django's database layer), so almost all the code you write can be moved between search engines with little change.

Full-text retrieval differs from a fuzzy query on specific fields: it is more efficient and, with a suitable tokenizer, can segment Chinese text into words.
haystack: a Django package that makes it easy to index and search the content of your models. It supports four full-text search engine backends: Whoosh, Solr, Xapian, and Elasticsearch. It is a full-text retrieval framework rather than an engine itself.
whoosh: a full-text search engine written in pure Python. Its performance is not as good as Sphinx, Xapian, or Elasticsearch, but it needs no binary packages, so the program will not crash inexplicably. Whoosh is enough for small sites.
jieba: a free Chinese word segmentation package. If it does not meet your needs, paid alternatives exist.

2. Installation

pip install django-haystack
pip install whoosh
pip install jieba

3. Configuration

###Add Haystack to INSTALLED_APPS

As with most Django applications, add Haystack to INSTALLED_APPS in your settings file (usually settings.py). Example:

INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.sites',

    # Added
    'haystack',

    # Your app
    'blog',
]

###Modify settings.py

In your settings.py, add a setting that tells Haystack which backend the site uses, along with any other backend settings. HAYSTACK_CONNECTIONS is a required setting and should contain at least one connection, such as one of the following:

Solr example

HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.solr_backend.SolrEngine',
        'URL': 'http://127.0.0.1:8983/solr'
        # ...or for multicore...
        # 'URL': 'http://127.0.0.1:8983/solr/mysite',
    },
}

Elasticsearch example

HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine',
        'URL': 'http://127.0.0.1:9200/',
        'INDEX_NAME': 'haystack',
    },
}

Whoosh example

#You need to set the PATH to the file system location of your Whoosh index
import os
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.whoosh_backend.WhooshEngine',
        'PATH': os.path.join(os.path.dirname(__file__), 'whoosh_index'),
    },
}

# Auto update index
HAYSTACK_SIGNAL_PROCESSOR = 'haystack.signals.RealtimeSignalProcessor'

Xapian example

#First install the Xapian backend( http://github.com/notanumber/xapian-haystack/tree/master )
#You need to set the PATH to the file system location of your Xapian index.
import os
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'xapian_backend.XapianEngine',
        'PATH': os.path.join(os.path.dirname(__file__), 'xapian_index'),
    },
}
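
Because only the connection settings differ between engines, the choice of backend can be made outside the code. The following is a minimal sketch of that idea, not part of Haystack itself: the `SEARCH_BACKEND` environment variable, the `BACKENDS` table, and the `haystack_connections` helper are all names invented for illustration, and `BASE_DIR` is assumed to be the project root.

```python
import os
from pathlib import Path

# Assumed project root; in a real settings.py this is usually
# Path(__file__).resolve().parent.parent.
BASE_DIR = Path.cwd()

# One entry per engine, reusing the ENGINE paths from the examples above.
BACKENDS = {
    "whoosh": {
        "ENGINE": "haystack.backends.whoosh_backend.WhooshEngine",
        "PATH": str(BASE_DIR / "whoosh_index"),
    },
    "solr": {
        "ENGINE": "haystack.backends.solr_backend.SolrEngine",
        "URL": "http://127.0.0.1:8983/solr",
    },
    "elasticsearch": {
        "ENGINE": "haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine",
        "URL": "http://127.0.0.1:9200/",
        "INDEX_NAME": "haystack",
    },
}


def haystack_connections(name=None):
    """Build a HAYSTACK_CONNECTIONS dict for the named backend."""
    # SEARCH_BACKEND is a made-up environment variable for this sketch.
    name = name or os.environ.get("SEARCH_BACKEND", "whoosh")
    if name not in BACKENDS:
        raise ValueError(f"unknown search backend: {name!r}")
    return {"default": dict(BACKENDS[name])}


HAYSTACK_CONNECTIONS = haystack_connections()
```

With this in place, switching from Whoosh to Solr would be a matter of setting the environment variable and rebuilding the index; the index and view code stays the same.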

4. Data processing

Create index

To add full-text search to an app such as blog, create a search_indexes.py file in the blog directory. The file name cannot be changed.

from haystack import indexes
from blog.models import Article

class ArticleIndex(indexes.SearchIndex, indexes.Indexable):
    #By convention the class is named after the model being indexed plus "Index"; we index Article here, so the class is ArticleIndex
    text = indexes.CharField(document=True, use_template=True)  #The primary search field
    #Other fields
    desc = indexes.CharField(model_attr='desc')
    content = indexes.CharField(model_attr='content')

    def get_model(self):  #Override get_model; this method is required
        return Article

    def index_queryset(self, using=None):
        #The objects to include when the whole index is updated
        return self.get_model().objects.all()

Why create an index? An index is like a book's table of contents: it gives readers faster navigation and lookup. The same applies here. When the amount of data is very large, scanning all of it for every record that meets the search conditions puts a heavy load on the server. We therefore add an index (a table of contents) for the specified data. Here we create an index for Article. We don't need to care about how the index is implemented; which fields to index and how to specify them is explained next.
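
To make the table-of-contents analogy concrete, here is a small pure-Python sketch that contrasts scanning every record with looking a word up in an inverted index. It is a toy model under simplifying assumptions (whitespace tokenization, made-up sample data), not how Haystack or any of its backends is implemented.

```python
from collections import defaultdict

# Made-up sample data standing in for Article rows (id -> text).
ARTICLES = {
    1: "Django search with Haystack",
    2: "Whoosh is a pure Python search engine",
    3: "Deploying Django behind nginx",
}


def scan(articles, word):
    """Scan approach: every row is inspected on every query."""
    word = word.lower()
    return sorted(pk for pk, text in articles.items() if word in text.lower().split())


def build_index(articles):
    """Inverted index: map each word to the set of ids that contain it."""
    index = defaultdict(set)
    for pk, text in articles.items():
        for token in text.lower().split():
            index[token].add(pk)
    return index


def lookup(index, word):
    """With the index built, a query is a single dictionary lookup."""
    return sorted(index.get(word.lower(), set()))


index = build_index(ARTICLES)
print(scan(ARTICLES, "django"))    # [1, 3]
print(lookup(index, "django"))     # [1, 3]
```

Both functions return the same ids, but the scan touches every article on each query, while the index pays that cost once at build time.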

Each index must have exactly one field with document=True. This tells Haystack and the search engine to use the content of that field as the primary field for retrieval. The other fields are auxiliary attributes: they are convenient to access on results but are not used as the primary retrieval data.

Note: by convention, the field set with document=True is named text. Using the same name in every SearchIndex class, as ArticleIndex does, prevents confusion in the backend. You can change the name, but this is not recommended.

In addition, we pass use_template=True on the text field. This lets us use a data template (rather than error-prone concatenation) to build the document that the search engine indexes. Create a new template, search/indexes/blog/article_text.txt, in your template directory and put the following in it.

#Create a "<model name>_text.txt" file in the "templates/search/indexes/<app name>/" directory
{{ object.title }}
{{ object.desc }}
{{ object.content }}

This data template indexes three fields: Article.title, Article.desc, and Article.content. A full-text search is matched against all three.
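
Conceptually, the template is rendered once per object, and the resulting string becomes the content of the text field. The sketch below imitates that step in plain Python. The `Article` dataclass is only a stand-in for the Django model, and `render_document` is a hypothetical helper, not a Haystack function.

```python
from dataclasses import dataclass


@dataclass
class Article:
    """Stand-in for the Django model; only the indexed attributes."""
    title: str
    desc: str
    content: str


def render_document(obj):
    """Mimic article_text.txt: one attribute per line, joined into one string."""
    return "\n".join([obj.title, obj.desc, obj.content])


article = Article(
    title="Haystack basics",
    desc="Full-text search for Django",
    content="Install django-haystack and pick a backend.",
)
document = render_document(article)
print(document)
```

A query for a word that appears in any of the three attributes can match this one document, which is why all three are placed in the template.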

5. Set view

Add SearchView to your URLconf

Add the following line to your URLconf:

url(r'^search/', include('haystack.urls')),

This pulls in Haystack's default URLconf, which consists of a single URL pattern pointing to a SearchView instance. You can change this class's behavior by passing it a few keyword arguments, or replace it entirely.
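
On Django 2.0 and later the same mapping can be written with path(), and django.conf.urls.url() was removed in Django 4.0. The fragment below is a sketch of a project-level urls.py, assuming the installed django-haystack release supports your Django version.

```python
# urls.py (project level)
from django.urls import include, path

urlpatterns = [
    # Same mapping as above, written with path() instead of a regex.
    path("search/", include("haystack.urls")),
]
```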

Search template

Your search template (search/search.html by default) will probably be very simple. The following is enough to get search running (your template and block names will likely differ):

<!DOCTYPE html>
<html>
<head>
   <title></title>
   <style>
       span.highlighted {
           color: red;
       }
   </style>
</head>
<body>
{% load highlight %}
{% if query %}
   <h3>The search results are as follows:</h3>
   {% for result in page.object_list %}
{#        <a href="/{{ result.object.id }}/">{{ result.object.title }}</a><br/>#}
       <a href="/{{ result.object.id }}/">{%   highlight result.object.title with query max_length 2%}</a><br/>
       <p>{{ result.object.content|safe }}</p>
       <p>{% highlight result.content with query %}</p>
   {% empty %}
       <p>Nothing</p>
   {% endfor %}

   {% if page.has_previous or page.has_next %}
       <div>
           {% if page.has_previous %}
               <a href="?q={{ query }}&amp;page={{ page.previous_page_number }}">{% endif %}&laquo; previous page
           {% if page.has_previous %}</a>{% endif %}
           |
           {% if page.has_next %}<a href="?q={{ query }}&amp;page={{ page.next_page_number }}">{% endif %}next page &raquo;
           {% if page.has_next %}</a>{% endif %}
       </div>
   {% endif %}
{% endif %}
</body>
</html>

Note that page.object_list is actually a list of SearchResult objects. These objects hold the data returned from the index, and the underlying model instance is available as {{ result.object }}. So {{ result.object.title }} actually reads the title field from the Article object in the database.

Rebuild index

Now that you've configured everything, it's time to put the data in the database into the index. Haystack comes with a command-line management tool that makes it easy.

Simply run ./manage.py rebuild_index. You will get a count of how many models were processed and placed in the index.

6. Use jieba word segmentation

#Create a ChineseAnalyzer.py file
#Save it in Haystack's backends folder, for example "D:\Python3\Lib\site-packages\haystack\backends"

import jieba
from whoosh.analysis import Tokenizer, Token

class ChineseTokenizer(Tokenizer):
    def __call__(self, value, positions=False, chars=False,
                 keeporiginal=False, removestops=True,
                 start_pos=0, start_char=0, mode='', **kwargs):
        t = Token(positions, chars, removestops=removestops, mode=mode,
                  **kwargs)
        seglist = jieba.cut(value, cut_all=True)
        for w in seglist:
            t.original = t.text = w
            t.boost = 1.0
            if positions:
                t.pos = start_pos + value.find(w)
            if chars:
                t.startchar = start_char + value.find(w)
                t.endchar = start_char + value.find(w) + len(w)
            yield t


def ChineseAnalyzer():
    return ChineseTokenizer()

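#Copy the whoosh_backend.py file and rename the copy whoosh_cn_backend.py
#Note: the copied file name may end with a space. Remember to delete it
#In whoosh_cn_backend.py, add this import:
from .ChineseAnalyzer import ChineseAnalyzer
#Then find
analyzer=StemmingAnalyzer()
#and change it to
analyzer=ChineseAnalyzer()
#Finally, point ENGINE in HAYSTACK_CONNECTIONS at haystack.backends.whoosh_cn_backend.WhooshEngine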
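
One caveat about the ChineseTokenizer above: value.find(w) always returns the first occurrence of w, so a word that appears more than once is given the same position every time. The sketch below shows the effect and one possible fix for a non-overlapping segmentation, which is to search from a moving cursor. The word list is hand-written rather than produced by jieba, and both helper functions are hypothetical. With cut_all=True the tokens can overlap, so this cursor approach would not apply directly; jieba.tokenize, which yields (word, start, end) tuples, may be a better fit there.

```python
def offsets_with_find(text, words):
    """Offsets as computed in the tokenizer above: always the first match."""
    return [(w, text.find(w)) for w in words]


def offsets_with_cursor(text, words):
    """Offsets for a non-overlapping segmentation, searching from a cursor."""
    result = []
    cursor = 0
    for w in words:
        start = text.find(w, cursor)
        if start == -1:
            raise ValueError(f"{w!r} not found after offset {cursor}")
        result.append((w, start))
        cursor = start + len(w)  # the next word must start after this one
    return result


# Hand-written segmentation standing in for jieba output (an assumption;
# real output depends on jieba's dictionary and on cut_all).
text = "北京欢迎你北京"
words = ["北京", "欢迎", "你", "北京"]

print(offsets_with_find(text, words))    # the second 北京 is reported at 0
print(offsets_with_cursor(text, words))  # the second 北京 is reported at 5
```

Wrong offsets do not stop indexing, but they can affect features that rely on positions, such as phrase queries and highlighting.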

7. Create a search bar in the template

<form method='get' action="/search/" target="_blank">
    <input type="text" name="q">
    <input type="submit" value="query">
</form>

8. Other configurations

Add more variables

from haystack.views import SearchView
from .models import Topic

class MySearchView(SearchView):
    def extra_context(self):  #Override extra_context to add extra variables to the template context
        context = super().extra_context()
        side_list = Topic.objects.filter(kind='major').order_by('add_date')[:8]
        context['side_list'] = side_list
        return context


#Route modification (search_views is the module that defines MySearchView)
url(r'^search/', search_views.MySearchView(), name='haystack_search'),

Highlight

{% highlight result.summary with query %}
#You can limit the length of the highlighted {{ result.summary }} with max_length
{% highlight result.summary with query max_length 40 %}

#html
    <style>
        span.highlighted {
            color: red;
        }
    </style>
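
To show what the CSS rule above acts on, here is a simplified pure-Python sketch of the kind of markup a highlighter produces: each query word found in the text is wrapped in a span with the highlighted class. This is an illustration only, not Haystack's implementation. The `highlight` function and its parameters are hypothetical, and the truncation here simply cuts from the start of the text.

```python
import html
import re


def highlight(text, query, max_length=None, css_class="highlighted"):
    """Wrap each query word found in text in a <span>, optionally truncating."""
    if max_length is not None:
        text = text[:max_length]
    words = [re.escape(w) for w in query.split() if w]
    if not words:
        return html.escape(text)
    # One capturing group, so split() alternates non-match, match, non-match...
    pattern = re.compile("(" + "|".join(words) + ")", re.IGNORECASE)
    out = []
    for i, part in enumerate(pattern.split(text)):
        escaped = html.escape(part)
        if i % 2 == 1:  # odd positions hold the matched words
            out.append(f'<span class="{css_class}">{escaped}</span>')
        else:
            out.append(escaped)
    return "".join(out)


print(highlight("Django search with Haystack", "search"))
# Django <span class="highlighted">search</span> with Haystack
```

The text is HTML-escaped before the spans are added, so markup in the indexed content cannot leak into the results page.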
