Meaning of microblog data fields

Posted by allyse on Mon, 22 Nov 2021 18:55:34 +0100

Recently the overall framework of my microblog (Weibo) crawler has become basically stable, but I was stuck for several days analyzing what each field means. The API comments on the official website look somewhat outdated and many fields have no comments at all, so I could only work them out bit by bit.

The mobile endpoint returns microblog data as JSON. After fetching one page and decoding it into data, data['cards'][0]['card_group'] gives an array in which each element is one microblog row, containing the publish time, microblog text, publishing user, reposted content, and so on. The specific fields include:
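For instance, the navigation just described can be sketched like this; the page below is a hand-made stand-in with the same shape, not real Weibo data:

```python
import json

# Hand-made stand-in for one page of mobile-API JSON (same shape as described above).
raw = '''
{"cards": [{"mod_type": "mod/pagelist",
            "card_group": [{"mblog": {"id": "1", "text": "hello", "reposts_count": 0}},
                           {"mblog": {"id": "2", "text": "world", "reposts_count": 3}}]}]}
'''

data = json.loads(raw)
rows = data['cards'][0]['card_group']   # one entry per microblog on the page
for row in rows:
    blog = row['mblog']                 # the actual status fields live under 'mblog'
    print(blog['id'], blog['reposts_count'])
```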

'idstr',                    #The id as a string
'id',                       #Information id
'created_timestamp',        #Create timestamp ex:1448617509
'created_at',               #Creation time. Note that data from before the current year
                            #is shown in 'year-month-day hour:min:sec' format, while data
                            #from the current year is shown in 'month-day hour:min:sec' format
'attitudes_count',          #Number of likes
'reposts_count',            #Number of reposts
'comments_count',           #Number of comments
'isLongText',               #Whether it is a long microblog (always False so far)
'source',                   #User client (iphone, etc.)
'pid',                      #Meaning unknown, but worth saving; may have no value
'bid',                      #Meaning unknown, but worth saving; may have no value

# Picture information--------------------------------------------------------------
'original_pic',             #Address of the original-size picture
'bmiddle_pic',              #Picture address; same as original_pic with 'large' replaced by 'bmiddle'
'thumbnail_pic',            #Picture address; appears equal to the URL in pics, with size 'thumbnail'
'pic_ids',                  #The picture ids, as an array
'pics',                     #If a graph is included, it is an array with an embedded dictionary,
                            # Including size,pid,geo,url, etc

# The following fields contain array or dictionary format and need further processing----------------------------------
'retweeted_status',         #Repost information. Present if this microblog reposts another; the reposted item's fields match this microblog's
'user',                     #User information, a dictionary. ['uid'] and ['name'] hold the user's id and name respectively
'page_info',                #Information about links embedded in the page: external links, articles, videos, geographic information, topics, etc.
'topic_struct',             #Topic information, an array of dictionaries, each containing a 'topic_title' item
'text',                     #The text of the microblog. Contains emoticons, external links, replies, images, topics and other information
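As an aside, the created_at quirk noted above can be smoothed over by falling back to created_timestamp when the short in-year format shows up. A minimal sketch, where CURRENT_YEAR stands in for a value from the crawler's own config:

```python
import time

CURRENT_YEAR = '2015'   # stand-in for the crawler's config value

def normalize_created_at(blog):
    """Return a 'year-month-day hour:min' string for one microblog dict."""
    created = blog.get('created_at', '')
    if len(created) > 14:                       # long format already includes the year
        return created
    if 'created_timestamp' in blog:             # prefer the unix timestamp when present
        return time.strftime('%Y-%m-%d %H:%M',
                             time.localtime(blog['created_timestamp']))
    return CURRENT_YEAR + '-' + created         # short in-year format lacks the year

print(normalize_created_at({'created_at': '11-27 17:45'}))  # → 2015-11-27 17:45
```

Note that the timestamp branch depends on the local timezone of the machine running the crawler.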

A few words on processing the text item. Since text is actually a fragment of HTML, it could be parsed with a web-page parsing package (Python's BeautifulSoup, Java's jsoup, etc.), but on the one hand that is unnecessary and slow, and on the other hand depending on extra packages is troublesome when configuring clients on new virtual machines, so regular expressions are used for the analysis instead.

Here is a text field picked at random. It looks like this:

#Music afternoon tea#
I don't feel the dream of spring grass in the pond, and the autumn sound of Indus leaves in front of the steps. The rain has damaged the urn, the new moss is green, and the leaves are red in autumn. Late at night, the wind and bamboo knock on the rhyme of autumn, thousands of leaves and thousands of voices are hate. Everyone explains the sad autumn events, unlike a poet who knows them thoroughly...video

In the text above there is a topic link (#music afternoon tea#), an external link (video), and the plain text of the poem in between.
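As a quick illustration of the regex approach, here is how a topic and an emoticon can be pulled out of such a text. The HTML snippet below is a hand-made approximation of the mobile page's markup, not a captured sample; the real HTML may differ in detail:

```python
import re

# Hand-made approximation of a text field's markup (illustrative only).
text = ('<a class="k" href="/k/tea">#music afternoon tea#</a>'
        'plain text here [smile] <a data-url="http://t.cn/x">video</a>')

topic = re.findall(r'#.+?#', text)[0]     # topic between '#' marks
faces = re.findall(r'\[.+?\]', text)      # emoticons such as [smile]
plain = re.sub(r'<.+?>', '', text)        # strip every tag to leave the plain text
print(topic)   # → #music afternoon tea#
print(faces)   # → ['[smile]']
```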

The regular expressions for each item are listed below. The patterns containing HTML tags were stripped when this post was rendered, so they are reconstructed here from the way the code below uses them:

<i ...>...</i>              #Emoticon wrapper (may also contain an image)
<a ...>#...#</a>            #Topic
<a href="/n/...">@...</a>   #User link
回复.+?//                   #Reply ('回复' means 'reply')
\[.+?\]                     #Emoticon
<a data-url="...">...</a>   #Generally an external link: video, web page, etc.
<img ...>                   #Image

The corresponding Python code is below, contained in the parseMicroblogPage class. After fetching a page's data, call its parse_blog_page function, which returns an array of processed microblog records.

import re
import json
import time
# save_page() and config (which supplies CURRENT_YEAR) come from elsewhere in the crawler's own code.

class parseMicroblogPage():

    def __init__(self):
        # The HTML-tag patterns were stripped when this post was rendered; the ones
        # below are plausible reconstructions based on how each pattern is used later.
        self.p_face=re.compile(r'\[.+?\]')              # emoticon text, e.g. [smile]
        self.p_face_i=re.compile(r'<i\b.*?</i>')        # <i> wrapper around an emoticon image
        self.p_user=re.compile(r'<a href="/n/.+?</a>')  # user link; anchor text starts with @
        self.p_topic=re.compile(r'<a [^>]*>#.+?#</a>')  # topic link wrapping #...#
        self.p_reply=re.compile(r'回复.+?//')           # reply prefix ('回复' = 'reply')
        self.p_link=re.compile(r'<a data-url=".+?</a>') # external link / video / web page
        self.p_img=re.compile(r'<img.+?>')              # embedded image tag
        self.p_span=re.compile(r'<span.+?</span>')      # <span> holding a link title
        self.p_http_png=re.compile(r'http://.+?png')    # png URL inside an image tag

    def parse_blog_page(self,data):
        try:        # check if the page is json type
            data=json.loads(data)
        except ValueError:      # json.JSONDecodeError is a subclass of ValueError
            save_page(data)
            raise ValueError('Unable to parse page')

        try:        # check if the page is empty
            mod_type=data['cards'][0]['mod_type']
        except (KeyError, IndexError, TypeError):
            save_page(json.dumps(data))
            raise ValueError('The type of this page is incorrect')

        if 'empty' in mod_type:
            raise ValueError('This page is empty')

        try:        # get card group as new data
            data=data['cards'][0]['card_group']
        except (KeyError, IndexError, TypeError):
            save_page(json.dumps(data))
            raise ValueError('The type of this page is incorrect')

        data_list=[]
        for block in data:
            res=self.parse_card_group(block)
            data_list.append(res)

        return data_list

    def parse_card_group(self,data):
        data=data['mblog']
        msg=self.parse_card_inner(data)
        return msg

    def parse_card_inner(self,data):
        msg={}
        keys=list(data.keys())

        key_array=[
            # Basic information--------------------------------------------------------------
            'idstr',                      #The id as a string
            'id',                         #Information id
            'created_timestamp',          #Creation timestamp ex:1448617509
            'attitudes_count',            #Number of likes
            'reposts_count',              #Number of reposts
            'comments_count',             #Number of comments
            'isLongText',                 #Whether it is a long microblog (always False so far)
            'source',                     #User client (iPhone, etc.)
            'pid',                        #Meaning unknown, but worth saving; may have no value
            'bid',                        #Meaning unknown, but worth saving; may have no value
            # Picture information--------------------------------------------------------------
            'original_pic',               #Address of the original-size picture
            'bmiddle_pic',                #Same as original_pic with 'large' replaced by 'bmiddle'
            'thumbnail_pic',              #Appears equal to the URL in pics, with size 'thumbnail'
            'pic_ids',                    #The picture ids, as an array
            'pics',                       #If pictures are included, an array of dictionaries
                                          # with size, pid, geo, url, etc.
        ]

        for item in keys:
            if item in key_array:
                msg[item]=data[item]

        # Unify id / mid / msg_id
        if 'id' not in keys:
            if 'mid' in keys:
                msg['id']=data['mid']
            elif 'msg_id' in keys:
                msg['id']=data['msg_id']

        if 'attitudes_count' not in keys and 'like_count' in keys:
            msg['attitudes_count']=data['like_count']

        # created_at
        if 'created_at' in keys:
            if len(data['created_at'])>14:      # long format already includes the year
                msg['created_at']=data['created_at']
            else:
                if 'created_timestamp' in keys:
                    stamp=data['created_timestamp']
                    x=time.localtime(stamp)
                    str_time=time.strftime('%Y-%m-%d %H:%M',x)
                    msg['created_at']=str_time
                else:
                    msg['created_at']=config.CURRENT_YEAR+'-'+data['created_at']

        # retweeted_status
        if 'retweeted_status' in keys:
            msg['retweeted_status']=self.parse_card_inner(data['retweeted_status'])
            msg['is_retweeted']=True
        else:
            msg['is_retweeted']=False

        # user
        if 'user' in keys:
            msg['user']=self.parse_user_info(data['user'])
            msg['user_id']=msg['user']['uid']
            msg['user_name']=msg['user']['name']

        # url_struct
        # msg['url_struct']=self.parse_url_struct(data['url_struct'])

        # page_info
        if 'page_info' in keys:
            msg['page_info']=self.parse_page_info(data['page_info'])

        # topic_struct
        if 'topic_struct' in keys:
            msg['topic_struct']=self.parse_topic_struct(data['topic_struct'])

        # text
        if 'text' in keys:
            msg['ori_text']=data['text']
            msg['dealed_text']=self.parse_text(data['text'])


        return msg

    def parse_user_info(self,user_data):
        keys=user_data.keys()
        user={}
        if 'id' in keys:
            user['uid']=str(user_data['id'])
        if 'screen_name' in keys:
            user['name']=user_data['screen_name']
        if 'description' in keys:
            user['description']=user_data['description']
        if 'fansNum' in keys:
            temp=user_data['fansNum']
            if isinstance(temp,str):    # e.g. '35万' -> 350000 ('万' means ten thousand)
                temp=int(temp.replace('万','0000'))
            user['fans_num']=temp
        if 'gender' in keys:
            if user_data['gender']=='m':
                user['gender']='male'
            elif user_data['gender']=='f':
                user['gender']='female'
        if 'profile_url' in keys:
            user['basic_page']='http://m.weibo.cn'+user_data['profile_url']
        if 'verified' in keys:
            user['verified']=user_data['verified']
        if 'verified_reason' in keys:
            user['verified_reason']=user_data['verified_reason']
        if 'statuses_count' in keys:
            temp=user_data['statuses_count']
            if isinstance(temp,str):    # same '万' (ten thousand) handling as fansNum
                temp=int(temp.replace('万','0000'))
            user['blog_num']=temp
        return user

    def parse_text(self,text):
        msg={}

        # data-url
        data_url=re.findall(self.p_link,text)
        if len(data_url)>0:
            data_url_list=[]
            for block in data_url:
                temp=self.parse_text_data_url(block)
                data_url_list.append(temp)
            msg['data_url']=data_url_list
        text=re.sub(self.p_link,'',text)

        # topic
        topic=re.findall(self.p_topic,text)
        if len(topic)>0:
            topic_list=[]
            for block in topic:
                temp=self.parse_text_topic(block)
                topic_list.append(temp)
            msg['topic']=topic_list
        text=re.sub(self.p_topic,'',text)

        # emoticons
        motion=[]
        res1=re.findall(self.p_face_i,text)
        for item in res1:
            temp=re.findall(self.p_face,item)[0]
            motion.append(temp)
        text=re.sub(self.p_face_i,'',text)

        res2=re.findall(self.p_face,text)
        motion=motion+res2
        if len(motion)>0:
            msg['motion']=motion
        text=re.sub(self.p_face,'',text)

        # user
        user=[]
        user_res=re.findall(self.p_user,text)
        if len(user_res)>0:
            for item in user_res:
                temp=self.parse_text_user(item)
                user.append(temp)
            msg['user']=user
        text=re.sub(self.p_user,'@',text)

        msg['left_content']=text.split('//')
        return msg

    def parse_text_data_url(self,text):
        link_data={}
        link_data['type']='data_url'

        try:
            res_face=re.findall(self.p_face_i,text)[0]
            res_img=re.findall(self.p_img,res_face)[0]
            res_http=re.findall(self.p_http_png,res_img)[0]
            link_data['img']=res_http

            res_class=re.findall(r'class=".+?"',text)[0]    # pattern reconstructed (stripped in the original post)
            link_data['class']=res_class

            text=re.sub(self.p_face_i,'',text)
        except:
            pass

        try:
            res_span=re.findall(self.p_span,text)[0]
            title=re.findall(r'>.+?<',res_span)[0][1:-1]
            link_data['title']=title
            text=re.sub(self.p_span,'',text)
        except:
            pass

        try:
            data_url=re.findall(r'data-url=".+?"',text)[0]
            data_url=re.findall(r'".+?"',data_url)[0][1:-1]
            link_data['short_url']=data_url

            url=re.findall(r'href=".+?"',text)[0][6:-1]
            link_data['url']=url
        except:
            pass

        # print(text)
        # print(json.dumps(link_data,indent=4))
        return link_data

    def parse_text_topic(self,text):
        data={}

        try:
            data['type']='topic'
            data['class']=re.findall(r'class=".+?"',text)[0][7:-1]
            data['title']=re.findall(r'>.+?<',text)[0][1:-1]
            data['url']='http://m.weibo.cn'+re.findall(r'href=".+?"',text)[0][6:-1]
        except:
            pass

        return data

    def parse_text_user(self,text):
        data={}
        data['type']='user'

        try:
            data['title']=re.findall(r'>.+?<',text)[0][2:-1]
            data['url']= 'http://m.weibo.cn'+re.findall(r'href=".+?"',text)[0][6:-1]
        except:
            pass

        return data

    def parse_url_struct(self,data):
        # currently passes each block through unchanged (the call above is commented out)
        url_struct=[]
        for block in data:
            url_struct.append(block)
        return url_struct

    def parse_page_info(self,data):
        keys=data.keys()
        key_array=[
            'page_url',
            'page_id',
            'content2',
            'tips',
            'page_pic',
            'page_desc',
            'object_type',
            'page_title',
            'content1',
            'type',
            'object_id'
        ]
        msg={}
        for item in keys:
            if item in key_array:
                msg[item]=data[item]
        return msg

    def parse_topic_struct(self,data):
        msg=[]
        for block in data:
            keys=block.keys()
            temp=block
            if 'topic_title' in keys:
                temp['topic_url']='http://m.weibo.cn/k/{topic}?from=feed'\
                    .format(topic=block['topic_title'])
            msg.append(temp)
        return msg
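One last note: count fields such as fansNum and statuses_count sometimes arrive as strings ending in the character 万 (ten thousand). The conversion used in parse_user_info boils down to this small helper:

```python
def to_count(value):
    """Convert Weibo-style counts such as '35万' (35 × 10000) to an int."""
    if isinstance(value, str):
        value = int(value.replace('万', '0000'))
    return value

print(to_count('35万'))  # → 350000
print(to_count(1234))    # → 1234
```

As in the original code, fractional values such as '3.5万' are not handled and would raise a ValueError.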