[news recommendation system] (task3) automatically build user and material portraits

Posted by Tjorriemorrie on Wed, 22 Dec 2021 05:37:25 +0100

***

(1) Learn how to automatically build the [material] and [user] portraits in the offline part of the news recommendation system;

(2) Material portrait processing: \fun-rec-master\codes\news_recsys\news_rec_server\materials\material_process

  • Log processing
  • News portrait processing
  • Saving news portraits to Redis

(3) User portrait processing
\fun-rec-master\codes\news_recsys\news_rec_server\materials\user_process

Note: after a Chinese value is stored in Redis, get displays it as the hexadecimal string "\xe4\xb8\xad\xe5\x9b\xbd". How do you fix this?
Start redis-cli with the --raw option and Chinese characters will display normally.
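The same issue appears when reading from Python with redis-py: by default, get returns raw bytes. Passing decode_responses=True makes values come back as str. A minimal sketch (the connection parameters are placeholders):

import redis

# decode_responses=True makes redis-py return str instead of bytes
r = redis.Redis(host="127.0.0.1", port=6379, db=0, decode_responses=True)
r.set("demo", "中国")
print(r.get("demo"))  # prints 中国 rather than b'\xe4\xb8\xad\xe5\x9b\xbd'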

(1) Construct the news portrait: obtain materials through the crawler and process the crawled data, i.e., build the news portrait.
(2) For the user-side portrait, we need to add newly registered users to the user portrait library every day; for users who have generated behavior in the system, we also need to update their portraits regularly (long- and short-term).

1, Construction of material side portrait

1.1 new material source

Material source: obtained from the Scrapy crawler of the previous task.

Note:
(1) The news crawl runs every morning and crawls the previous day's news; this is done so that more materials can be collected.
(2) The disadvantage is that the timeliness of the materials is delayed by one day. The newly crawled materials are stored in MongoDB.

1.2 update of material portrait

The update of the material portrait mainly includes the following aspects:

(1) Adding new materials to the material library
(2) Updating old material portraits through user interaction records

Logic for adding new materials to the material library:
(1) After the news crawler finishes a crawl, the new materials are added to the material library;
(2) The new materials go through simple portrait processing (and are stored in MongoDB); the currently defined portrait fields are as follows:

(1) Specific logic for adding new materials to the material library

(1) Traverse all the articles crawled today;
(2) Use the article title to judge whether the article already exists in the material library; if it does, skip it (i.e., titles are kept so that we can deduplicate against them);
(3) Initialize the portrait fields according to the defined fields;
(4) Store the result in the portrait material pool.

To update the old material portraits, you need to know which fields of the old materials are changed by user behavior. The following is the news list page; notice that the front end displays each item's read, like, and favorite counts. User interactions (reads, likes, and favorites) change these values.

In order to display the dynamic behavior information of news on the front end in real time, we store the dynamic information of news in Redis in advance; when online, news data is fetched directly from Redis. If users interact with the news, this dynamic information is updated, and we update the values in Redis directly as well. This is mainly so that the front end can obtain the latest dynamic portrait information of the news in real time.
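As a hedged sketch of what such an online update could look like (the handler name is made up; the key layout and the quote-swapping trick match the code later in this post, and the Redis client is assumed to return str):

import json

def on_user_like(dynamic_news_info_redis, news_id):
    """Hypothetical handler: increment the like count of one news item in Redis."""
    key = "dynamic_news_detail:" + str(news_id)
    # Values are stored via str(dict), so swap single quotes for double quotes before json.loads
    info = json.loads(dynamic_news_info_redis.get(key).replace("'", '"'))
    info["likes"] += 1
    dynamic_news_info_redis.set(key, str(info))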

The dynamic portraits of news are updated in Redis. Redis is an in-memory database and its resources are precious, so we cannot keep all news information in it forever; instead it is refreshed once a day, and only the news that may be displayed today is kept (some news has been published for too long and, given the timeliness of news, no longer needs to be shown). Therefore, in order to preserve the historical dynamic information of news, the system also needs to write the dynamic news information in Redis back to the news portrait library stored in MongoDB every day. This logic is triggered on a daily schedule and is placed together with the new-material update; the order of the two does not matter, you only need to make sure the dynamic material portraits are written back before the Redis data is cleared.

(2) The two news portrait libraries


There are actually two news portrait libraries:

  • One is the feature library FeatureProtrail, which stores all the fields of each material.
  • The other is RedisProtrail, a portrait library that stores the content displayed on the front end. It holds the same materials, but stores a different set of fields for each one; it is refreshed every day and serves as a backup of the news content kept in Redis.

Therefore, after updating the portraits of the new and old materials, we need to write a copy of the news information from the latest feature library into the RedisProtrail library, removing the fields that are unnecessary for front-end display.

(3) Core code of the material update

1. Material update

File path: \fun-rec-master\codes\news_recsys\news_rec_server\materials\material_process\news_protrait.py

# -*- coding: utf-8 -*-
import sys
import json
sys.path.append("../")
from material_process.utils import get_key_words
from dao.mongo_server import MongoServer
from dao.redis_server import RedisServer

"""
Fields included in the news portrait:
0. news_id news id
1. title title
2. raw_key_words (keywords that came with the crawled news; may be missing)
3. manual_key_words (keywords extracted from the content)
4. ctime creation time
5. content full news content
6. cate news category
7. likes number of likes
8. collections number of favorites
9. read_nums number of reads
10. url original news link
"""

class NewsProtraitServer:
    def __init__(self):
        """Initialization related parameters
        """
        self.mongo_server = MongoServer()   
        self.sina_collection = self.mongo_server.get_sina_news_collection()
        self.material_collection = self.mongo_server.get_feature_protrail_collection()
        self.redis_mongo_collection = self.mongo_server.get_redis_mongo_collection()
        self.news_dynamic_feature_redis = RedisServer().get_dynamic_news_info_redis()

    def _find_by_title(self, collection, title):
        """Search the database for news with the same title
        Returns True if the title already exists in the database, otherwise False
        """
        # The find method returns an iterator
        find_res = collection.find({"title": title})
        if len(list(find_res)) != 0:
            return True
        return False

    def _generate_feature_protrail_item(self, item):
        """Generate feature portrait data and return a new dictionary
        """
        news_item = dict()
        news_item['news_id'] = item['news_id']
        news_item['title'] = item['title']
        # Keywords extracted from the news content are less accurate than the ones that come
        # with the original crawled news, so the manually extracted keywords are only a
        # supplement, used when the original news has no keywords
        news_item['raw_key_words'] = item['raw_key_words']
        key_words_list = get_key_words(item['content'])
        news_item['manual_key_words'] = ",".join(key_words_list)
        news_item['ctime'] = item['ctime']
        news_item['content'] = item['content']
        news_item['cate'] = item['cate']
        news_item['url'] = item['url']
        news_item['likes'] = 0
        news_item['collections'] = 0
        news_item['read_num'] = 0
        # Initialize a relatively large hot value; it will decay over time
        news_item['hot_value'] = 1000 
        
        return news_item

    def update_new_items(self):
        """
        Store the data crawled today into the portrait database
        """
        # Traverse all the data crawled today
        for item in self.sina_collection.find():
            # Deduplicate by title
            if self._find_by_title(self.material_collection, item["title"]):
                continue
            news_item = self._generate_feature_protrail_item(item)
            # Insert into the material pool
            self.material_collection.insert_one(news_item)
        
        print("run update_new_items success.")

    def update_redis_mongo_protrail_data(self):
        """
        Rebuild the RedisProtrail backup collection with the news details to be shown in Redis,
        deleting the previous day's data first
        """
        # Delete the redis display data of the previous day every day, and then write it again
        self.redis_mongo_collection.drop()
        print("delete RedisProtrail ...")
        # Traverse the feature library
        for item in self.material_collection.find():
            news_item = dict()
            news_item['news_id'] = item['news_id']
            news_item['title'] = item['title']
            news_item['ctime'] = item['ctime']
            news_item['content'] = item['content']
            news_item['cate'] = item['cate']
            news_item['url'] = item['url']
            news_item['likes'] = 0
            news_item['collections'] = 0
            news_item['read_num'] = 0

            self.redis_mongo_collection.insert_one(news_item)
        print("run update_redis_mongo_protrail_data success.")

    def update_dynamic_feature_protrail(self):
        """
        use redis Dynamic portrait update mongodb Portrait of
        """
        # Traverse the dynamic portrait of redis and update the corresponding dynamic portrait in mongodb        
        news_list = self.news_dynamic_feature_redis.keys()
        for news_key in news_list:
            news_dynamic_info_str = self.news_dynamic_feature_redis.get(news_key)
            news_dynamic_info_str = news_dynamic_info_str.replace("'", '"' ) # Replace single quotation marks with double quotation marks
            news_dynamic_info_dict = json.loads(news_dynamic_info_str)
            
            # Query the corresponding data in mongodb and modify the corresponding image
            news_id = news_key.split(":")[1]
            mongo_info = self.material_collection.find_one({"news_id": news_id})
            new_mongo_info = mongo_info.copy()
            new_mongo_info['likes'] = news_dynamic_info_dict["likes"]
            new_mongo_info['collections'] = news_dynamic_info_dict["collections"]
            new_mongo_info['read_num'] = news_dynamic_info_dict["read_num"]

            self.material_collection.replace_one(mongo_info, new_mongo_info, upsert=True) # With upsert=True, the document is inserted if no match is found
        print("update_dynamic_feature_protrail success.")


# This script is not run by the scheduler; the code below is for testing
if __name__ == "__main__":
    news_protrait = NewsProtraitServer()
    # Update of new material
    news_protrait.update_new_items()
    # Update dynamic features
    news_protrait.update_dynamic_feature_protrail()
    # redis shows the backup of news content
    news_protrait.update_redis_mongo_protrail_data()
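The decay of hot_value is not implemented in this file; as a sketch of the general idea only (the exponential form and the rate are assumptions, not the project's actual scheme):

def decay_hot_value(hot_value, age_in_days, decay_rate=0.95):
    """Illustrative only: exponentially decay a news item's hot value with its age."""
    return hot_value * (decay_rate ** age_in_days)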

2. How to add materials to the redis database

After the material updates above are finished, let's introduce how the updated materials are written into the Redis database. When storing news content in Redis, we split the news information into two parts: one part is the attributes of the news that do not change (e.g., creation time, title, news content), and the other part is the dynamic attributes of the material. The Redis keys are static_news_detail:news_id and dynamic_news_detail:news_id respectively.
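Reconstructed from the write code below (the news_id value here is made up), the stored key-value pairs look roughly like this:

static_news_detail:96744  ->  "{'news_id': '96744', 'title': '...', 'ctime': '...', 'content': '...', 'cate': '...', 'url': '...'}"
dynamic_news_detail:96744 ->  "{'likes': 0, 'collections': 0, 'read_num': 0}"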

The purpose of this split is to make real-time online changes to the dynamic material information more efficient. When the detailed information of a news item is needed, both records are fetched and merged before being sent to the front end for display. The code of this part is as follows:

import sys
sys.path.append("../../")
from dao.mongo_server import MongoServer
from dao.redis_server import RedisServer


class NewsRedisServer(object):
    def __init__(self):
        self.rec_list_redis = RedisServer().get_reclist_redis()
        self.static_news_info_redis = RedisServer().get_static_news_info_redis()
        self.dynamic_news_info_redis = RedisServer().get_dynamic_news_info_redis()

        self.redis_mongo_collection = MongoServer().get_redis_mongo_collection()

        # Delete the content in redis of the previous day
        self._flush_redis_db()

    def _flush_redis_db(self):
        """Redis must be flushed every day before the day's new content is written
        """
        try:
            self.rec_list_redis.flushall()
        except Exception:
            print("flush redis fail ... ")

    def _get_news_id_list(self):
        """Get the ids of all the news in the material library
        """
        # Get the news_id of every item.
        # Brute force: traverse the entire collection to collect all news ids.
        # TODO: optimize by querying only the news_id field (a projection)
        news_id_list = []
        for item in self.redis_mongo_collection.find():
            news_id_list.append(item["news_id"])
        return news_id_list

    def _set_info_to_redis(self, redisdb, content):
        """take content Add to specified redis
        """
        try: 
            redisdb.set(*content)
        except Exception:
            print("set content fail".format(content))

    def news_detail_to_redis(self):
        """Store the portrait content to be displayed in redis
        Static invariant features are saved to static_news_info_db_num
        Dynamically changing features are stored in the dynamic_news_info_db_num
        """ 
        news_id_list = self._get_news_id_list()

        for news_id in news_id_list:
            news_item_dict = self.redis_mongo_collection.find_one({"news_id": news_id}) # find_one returns a single dict (or None)
            news_item_dict.pop("_id")

            # Separating dynamic and static attributes
            static_news_info_dict = dict()
            static_news_info_dict['news_id'] = news_item_dict['news_id']
            static_news_info_dict['title'] = news_item_dict['title']
            static_news_info_dict['ctime'] = news_item_dict['ctime']
            static_news_info_dict['content'] = news_item_dict['content']
            static_news_info_dict['cate'] = news_item_dict['cate']
            static_news_info_dict['url'] = news_item_dict['url']
            static_content_tuple = "static_news_detail:" + str(news_id), str(static_news_info_dict)
            self._set_info_to_redis(self.static_news_info_redis, static_content_tuple)

            dynamic_news_info_dict = dict()
            dynamic_news_info_dict['likes'] = news_item_dict['likes']
            dynamic_news_info_dict['collections'] = news_item_dict['collections']
            dynamic_news_info_dict['read_num'] = news_item_dict['read_num']
            dynamic_content_tuple = "dynamic_news_detail:" + str(news_id), str(dynamic_news_info_dict)
            self._set_info_to_redis(self.dynamic_news_info_redis, dynamic_content_tuple)

        print("news detail info are saved in redis db.")


if __name__ == "__main__":
    # Every time this object is created, the previous contents in the database will be deleted
    news_redis_server = NewsRedisServer()
    # Send the latest front-end portrait to redis
    news_redis_server.news_detail_to_redis()
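The class above only writes the two halves; the read-side merge described earlier is not shown in this file. A minimal sketch of fetching and combining them (the function name is made up; the key names and the quote-swapping trick follow the code above, and the Redis clients are assumed to return str):

import json

def get_news_detail(static_redis, dynamic_redis, news_id):
    """Hypothetical read side: merge the static and dynamic halves of one news item."""
    static_str = static_redis.get("static_news_detail:" + str(news_id))
    dynamic_str = dynamic_redis.get("dynamic_news_detail:" + str(news_id))
    if static_str is None or dynamic_str is None:
        return None
    news_info = json.loads(static_str.replace("'", '"'))
    news_info.update(json.loads(dynamic_str.replace("'", '"')))
    return news_info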

At this point, the update logic of the offline material portrait has been covered. Finally, all of the logic above is strung together in code; running the following code on a daily schedule makes the material-side portrait construction fully automatic:

from material_process.news_protrait import NewsProtraitServer
from material_process.news_to_redis import NewsRedisServer

def process_material():
    """Material handling function
    """
    # Portrait processing
    protrail_server = NewsProtraitServer()
    # Process the portraits of the latest crawling news and store them in the feature library
    protrail_server.update_new_items()
    # Updating the news dynamic portrait needs to be performed before the content of redis database is cleared
    protrail_server.update_dynamic_feature_protrail()
    # Generate a news portrait displayed on the front end and backup it in mongodb
    protrail_server.update_redis_mongo_protrail_data()

    # The news data is written to redis. Note that when redis data is processed here, all the data of the previous day will be cleared
    news_redis_server = NewsRedisServer()
    # Send the latest front-end portrait to redis
    news_redis_server.news_detail_to_redis()


if __name__ == "__main__":
    process_material() 

2, Construction of user side portrait

2.1 classification of user portrait updates

The update of user portrait is mainly divided into two aspects:

  1. Update of newly registered user profile
  2. Update of old user portrait

2.2 fields of user portrait

Since our system keeps all registered users (new and old) in one table, each portrait update only needs to traverse all the users in the registration table. Before discussing the concrete portrait construction logic, we must first understand the fields contained in the user portrait. The following was taken directly from mongo:
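The field list below is reconstructed from the _user_info_to_dict code further down:

"""
Fields included in the user portrait:
0. userid / username / passwd / gender / age / city   basic registration attributes
1. like_15_intr_cate        top-3 categories of news liked in the last 15 days
2. like_15_intr_key_words   top-3 keywords of news liked in the last 15 days
3. like_15_avg_hot_value    average hot value of news liked in the last 15 days
4. like_15_news_num         number of news items liked in the last 15 days
5. collection_15_*          the same four statistics for collections (favorites)
"""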


As can be seen above, the portrait mainly consists of tags related to the user's basic information and the user's history. The basic attributes can be taken directly from the registration table. The history-related tags require aggregating the details of all the news the user has read, liked, and collected, so we need to keep the user's reading, like, and collection histories. This information can all be obtained from the log data, but there is an engineering detail to explain first. Look at the figure below, which shows the news detail page a user clicks into.

At the bottom of that page there are like and favorite buttons. What the front end displays there is data obtained from the back end, which means the back end needs to maintain, for every user, the lists of articles they have clicked, liked, and collected. We store these in MySQL, mainly out of concern that Redis memory would not suffice. These tables serve not only the front-end display but also the user portrait analysis, since they already record each user's like and collection history for us.
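The behavior tables are mapped by the entity classes imported in the code below (dao.entity.user_read and friends). As a hedged sketch only, with column types assumed but attribute names (userid, newid, curtime) matching how the code queries them, UserRead might look like:

from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class UserRead(Base):
    """Sketch of the reading-history table; column types are assumptions."""
    __tablename__ = "user_read"
    id = Column(Integer, primary_key=True, autoincrement=True)
    userid = Column(String(100))   # filtered as table_obj.userid below
    newid = Column(String(100))    # read as h.newid below
    curtime = Column(DateTime)     # filtered against the 15-day time window below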

2.3 MySQL table of users' reading history

In addition, as mentioned earlier, to make the articles a user has read easier to process and understand, we also maintain a MySQL table of all the articles each user has ever read (the core logic for maintaining this table is to run over the user logs every day and update the users' reading records). At this point we actually have three user behavior tables: reads, likes, and collections. With these three tables we can directly build the interest-related part of the user portrait. The specific logic is as follows:

import sys
import datetime
from collections import Counter, defaultdict

sys.path.append("../../")
from dao.mongo_server import MongoServer
from dao.mysql_server import MysqlServer
from dao.entity.register_user import RegisterUser
from dao.entity.user_read import UserRead
from dao.entity.user_likes import UserLikes
from dao.entity.user_collections import UserCollections


class UserProtrail(object):
    def __init__(self):
        self.user_protrail_collection = MongoServer().get_user_protrail_collection()
        self.material_collection = MongoServer().get_feature_protrail_collection()
        self.register_user_sess = MysqlServer().get_register_user_session()
        self.user_collection_sess = MysqlServer().get_user_collection_session()
        self.user_like_sess = MysqlServer().get_user_like_session()
        self.user_read_sess = MysqlServer().get_user_read_session()

    def _user_info_to_dict(self, user):
        """
        take mysql The query results are converted into dictionary storage
        """
        info_dict = dict()
        
        # Basic attribute characteristics
        info_dict["userid"] = user.userid
        info_dict["username"] = user.username
        info_dict["passwd"] = user.passwd
        info_dict["gender"] = user.gender
        info_dict["age"] = user.age
        info_dict["city"] = user.city

        # Interest-related features
        behaviors=["like","collection"]
        time_range = 15
        _, feature_dict = self.get_statistical_feature_from_history_behavior(user.userid,time_range,behavior_types=behaviors)
        for type in feature_dict.keys():
            if feature_dict[type]:
                info_dict["{}_{}_intr_cate".format(type,time_range)] = feature_dict[type]["intr_cate"]  # top-3 news categories in the user's history
                info_dict["{}_{}_intr_key_words".format(type,time_range)] = feature_dict[type]["intr_key_words"] # top-3 news keywords in the user's history
                info_dict["{}_{}_avg_hot_value".format(type,time_range)] = feature_dict[type]["avg_hot_value"] # average hot value of the news
                info_dict["{}_{}_news_num".format(type,time_range)] = feature_dict[type]["news_num"] # number of news items within time_range days
            else:
                # default values when the user has no history for this behavior type
                info_dict["{}_{}_intr_cate".format(type,time_range)] = ""
                info_dict["{}_{}_intr_key_words".format(type,time_range)] = ""
                info_dict["{}_{}_avg_hot_value".format(type,time_range)] = 0
                info_dict["{}_{}_news_num".format(type,time_range)] = 0

        return info_dict

    def update_user_protrail_from_register_table(self):
        """
        Newly registered users must be added to the user portrait pool every day
        """
        # Traverse the registered user table
        for user in self.register_user_sess.query(RegisterUser).all():
            user_info_dict = self._user_info_to_dict(user)
            old_user_protrail_dict = self.user_protrail_collection.find_one({"username": user.username})
            if old_user_protrail_dict is None:
                self.user_protrail_collection.insert_one(user_info_dict)
            else:
                # With upsert=True, the document is inserted when no match exists;
                # replace_one replaces the matched document (same _id) with the new data
                self.user_protrail_collection.replace_one(old_user_protrail_dict, user_info_dict, upsert=True)
            

    def get_statistical_feature_from_history_behavior(self, user_id, time_range, behavior_types):
        """Get the statistical characteristics of user's historical behavior ["read","like","collection"] """
        fail_type = []
        sess, table_obj, history = None, None, None
        feature_dict = defaultdict(dict)

        end = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        start = (datetime.datetime.now()+datetime.timedelta(days=-time_range)).strftime("%Y-%m-%d %H:%M:%S")

        for type in behavior_types:
            if type == "read":
                sess = getattr(self,"user_{}_sess".format(type))
                table_obj = UserRead
            elif type == "like":
                sess = getattr(self,"user_{}_sess".format(type))
                table_obj = UserLikes
            elif type == "collection":
                sess = getattr(self,"user_{}_sess".format(type))
                table_obj = UserCollections
            try:
                history = sess.query(table_obj).filter(table_obj.userid==user_id).filter(table_obj.curtime>=start).filter(table_obj.curtime<=end).all()
            except Exception as e:
                print(str(e))
                fail_type.append(type)
                continue
            
            feature_dict[type] = self._gen_statistical_feature(history)
            
        return fail_type, feature_dict
          
    def _gen_statistical_feature(self,history):
        """"""
        # Get features for history
        if not len(history): return None
        history_new_id = []
        history_hot_value = []
        history_new_cate = []
        history_key_word = []
        for h in history:
            news_id = h.newid 
            newsquery = {"news_id":news_id}
            result = self.material_collection.find_one(newsquery)
            history_new_id.append(result["news_id"])
            history_hot_value.append(result["hot_value"])
            history_new_cate.append(result["cate"])
            history_key_word += result["manual_key_words"].split(",")
        
        feature_dict = dict()
        # Calculate the average hot value
        feature_dict["avg_hot_value"] = 0 if sum(history_hot_value) < 0.001 else sum(history_hot_value) / len(history_hot_value)

        # Calculate the top-3 categories
        cate_dict = Counter(history_new_cate)
        cate_list = sorted(cate_dict.items(), key=lambda d: d[1], reverse=True)
        cate_str = ",".join([item[0] for item in cate_list[:3]])  # the slice already handles fewer than 3 items
        feature_dict["intr_cate"] = cate_str

        # Calculate the top-3 keywords
        word_dict = Counter(history_key_word)
        word_list = sorted(word_dict.items(), key=lambda d: d[1], reverse=True)
        # TODO: keywords are long-tailed; if every keyword appears only once the top 3 is arbitrary
        word_str = ",".join([item[0] for item in word_list[:3]])  # bug fix: the original checked len(cate_list) here
        feature_dict["intr_key_words"] = word_str
        # Number of news items
        feature_dict["news_num"] = len(history_new_id)

        return feature_dict


if __name__ == "__main__":
    user_protrail = UserProtrail().update_user_protrail_from_register_table()

That is the basic logic of the user portrait. The following is the overall code of the user portrait update:

from user_process.user_to_mysql import UserMysqlServer
from user_process.user_protrail import UserProtrail

"""
1. Convert the user's exposure data from redis Fall mysql Yes.
2. Update user portrait
"""
    
def process_users():
    """Drop user data Mysql
    """
    # User mysql storage
    user_mysql_server = UserMysqlServer()
    # User exposure data falls into mysql
    user_mysql_server.user_exposure_to_mysql()

    # Update user portrait
    user_protrail = UserProtrail()
    user_protrail.update_user_protrail_from_register_table()


if __name__ == "__main__":
    process_users() 

3, Automatic construction of portrait

Sections 1 and 2 above covered the construction of the material-side and user-side portraits and gave a basic understanding of the materials folder; combine them with the following mind map of the automatic portrait construction:

The above described the user-side and material-side portrait construction separately. The next step is to automate all of these processes as scheduled tasks. The most important point is to finish building the user and material portraits before the Redis data is cleared. The whole automated process follows.

3.1 material update script: process_material.py

from material_process.news_protrait import NewsProtraitServer
from material_process.news_to_redis import NewsRedisServer


def process_material():
    """Material handling function
    """
    # Portrait processing
    protrail_server = NewsProtraitServer()
    # Process the portraits of the latest crawling news and store them in the feature library
    protrail_server.update_new_items()
    # Updating the news dynamic portrait needs to be performed before the content of redis database is cleared
    protrail_server.update_dynamic_feature_protrail()
    # Generate a news portrait displayed on the front end and backup it in mongodb
    protrail_server.update_redis_mongo_protrail_data()


if __name__ == "__main__":
    process_material() 

Output of a successful material portrait update:

run update_new_items success.
update_dynamic_feature_protrail success.
delete RedisProtrail ...
run update_redis_mongo_protrail_data success.

[Reminder]
When running \fun-rec-master\codes\news_recsys\news_rec_server\materials\process_material.py, if Redis has a password set, you need to add a password parameter to __init__ and _redis_db of the RedisServer class in \fun-rec-master\codes\news_recsys\news_rec_server\dao\redis_server.py.

Correspondingly, the password should also be added to the Redis section of dao_config.py.
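A minimal sketch of that change, assuming redis-py (the exact structure of RedisServer in the repo may differ):

import redis

class RedisServer(object):
    def __init__(self, host="127.0.0.1", port=6379, password=""):
        # Assumption: thread the password through to every db handle
        self.host = host
        self.port = port
        self.password = password

    def _redis_db(self, db_num):
        # The password is passed on to redis.Redis here
        return redis.Redis(host=self.host, port=self.port, db=db_num,
                           password=self.password, decode_responses=True)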

3.2 user portrait update script: process_user.py

from user_process.user_to_mysql import UserMysqlServer
from user_process.user_protrail import UserProtrail

"""
1. Convert the user's exposure data from redis Fall mysql Yes.
2. Update user portrait
"""

    
def process_users():
    """Drop user data Mysql
    """
    # User mysql storage
    user_mysql_server = UserMysqlServer()
    # User exposure data falls into mysql
    user_mysql_server.user_exposure_to_mysql()

    # Update user portrait
    user_protrail = UserProtrail()
    user_protrail.update_user_protrail_from_register_table()


if __name__ == "__main__":
    process_users() 

Note: the userinfo database needs to be created manually, but the tables do not (create_all in the code builds the table structures automatically).

PS: if the error No module named 'sqlalchemy' is reported, it is because news_rec_server\dao\mysql_server.py needs sqlalchemy (flask_sqlalchemy is the Flask extension that wraps it), so you need to install flask-sqlalchemy. You can first test whether sqlalchemy can create a table and insert a row of data normally in your environment (the basic usage of sqlalchemy will be introduced in the [front-end and back-end interaction] part of the next task).
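A minimal, self-contained smoke test for that (it uses an in-memory SQLite database so it runs without MySQL; the table and column names are made up):

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Demo(Base):
    __tablename__ = "demo"
    id = Column(Integer, primary_key=True)
    name = Column(String(20))

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)          # the same create_all mechanism mentioned above

session = sessionmaker(bind=engine)()
session.add(Demo(name="test"))
session.commit()
print(session.query(Demo).count())        # expect 1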

3.3 redis data update script: update_redis.py

from material_process.news_protrait import NewsProtraitServer
from material_process.news_to_redis import NewsRedisServer


def update():
    """
    Write the news data into Redis
    """
    # The news data is written to redis. Note that when redis data is processed here, all the data of the previous day will be cleared
    news_redis_server = NewsRedisServer()
    # Send the latest front-end portrait to redis
    news_redis_server.news_detail_to_redis()


if __name__ == "__main__":
    update() 

3.4 shell script: connect the above three scripts in series

Finally, the shell script offline_material_and_user_process.sh strings the three scripts above together:

#!/bin/bash

python=/home/recsys/miniconda3/envs/news_rec_py3/bin/python
news_recsys_path="/home/recsys/news_rec_server"

echo "$(date -d today +%Y-%m-%d-%H-%M-%S)"

# In order to handle the path problem more conveniently, you can directly cd to the directory where we want to run
cd ${news_recsys_path}/materials

# Update material portrait
${python} process_material.py
if [ $? -eq 0 ]; then
    echo "process_material success."
else   
    echo "process_material fail."
fi 

# Update user portrait
${python} process_user.py
if [ $? -eq 0 ]; then
    echo "process_user.py success."
else   
    echo "process_user.py fail."
fi

# Clear the data in redis of the previous day and update the latest data of today
${python} update_redis.py
if [ $? -eq 0 ]; then
    echo "update_redis success."
else   
    echo "update_redis fail."
fi


echo " "

crontab scheduled task:

Breaking the scheduled task apart (it is a single crontab entry, split across lines here for readability):

0 0 * * * /home/recsys/news_rec_server/scheduler/crawl_news.sh >> /home/recsys/news_rec_server/logs/offline_material_process.log && 

/home/recsys/news_rec_server/scheduler/offline_material_and_user_process.sh >> /home/recsys/news_rec_server/logs/material_and_user_process.log && 

/home/recsys/news_rec_server/scheduler/run_offline.sh >> /home/recsys/news_rec_server/logs/offline_rec_list_to_redis.log

The crontab expression above means the scripts run at 00:00 every day. The && in the command means the part before the symbol must finish before the part after it runs, so && is what chains the three tasks in series. The overall execution logic is as follows:

  • First, crawl the news data. Note that although the crawl runs at midnight today, what it fetches is actually the previous day's news.
  • After the data is crawled, the user portraits are updated offline, and the material portraits and the online display portraits are written into Redis.
  • Finally, the offline recommendation process runs: each user's ranked recommendation list is generated offline, saved into Redis, and fetched directly online.

