The framework of my Weibo crawler has been basically stable for a while now, but I have been stuck for several days on working out what each field means. The API comments on the official site seem somewhat outdated and many fields are not documented at all, so I have had to analyze them bit by bit.
The microblog data served to the mobile client is in JSON format. After fetching one page, call it data; then data['cards'][0]['card_group'] gives you an array in which each element is one microblog post, containing the publishing time, the post content, the publishing user, any reposted content, and so on. The specific fields are listed below; a minimal access sketch follows the list.
Basic information:
    'idstr'              # the id as a string, i.e. str of 'id'
    'id'                 # message id
    'created_timestamp'  # creation timestamp, e.g. 1448617509
    'created_at'         # creation time; note that posts from before this year
                         # are formatted 'year-month-day hour:min:sec', while
                         # this year's posts are formatted 'month-day hour:min:sec'
    'attitudes_count'    # number of likes
    'reposts_count'      # number of reposts
    'comments_count'     # number of comments
    'isLongText'         # whether this is a long post (always False so far)
    'source'             # client the user posted from (iPhone etc.)
    'pid'                # meaning unknown, but worth saving; may have no value
    'bid'                # meaning unknown, but worth saving; may have no value

Picture information:
    'original_pic'       # URL of the original image
    'bmiddle_pic'        # image URL; same as original_pic except that 'large'
                         # in the path is swapped for 'bmiddle'
    'thumbnail_pic'      # seems to equal the URL inside 'pics', with size 'thumb'
    'pic_ids'            # picture ids, an array
    'pics'               # present when pictures are attached: an array of dicts
                         # holding size, pid, geo, url, etc.

Fields that hold arrays or dicts and need further processing:
    'retweeted_status'   # repost information; present when the post reposts another;
                         # its fields are the same as those of the post itself
    'user'               # user information, a dict; after parsing, ['uid'] and
                         # ['name'] hold the user's id and name respectively
    'page_info'          # information about links embedded in the page, such as
                         # external links, articles, videos, geographic topics, etc.
    'topic_struct'       # topic information; an array of dicts, each containing
                         # a 'topic_title' item
    'text'               # the text itself; contains emoticons, external links,
                         # replies, images, topics, etc.
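For concreteness, here is a minimal sketch of pulling the post array out of one page. The file name 'page.json' is a hypothetical stand-in for one fetched page; the fetching itself is outside this snippet:

import json

raw_page = open('page.json', encoding='utf-8').read()  # hypothetical: one saved page
data = json.loads(raw_page)
card_group = data['cards'][0]['card_group']            # one element per microblog post
for card in card_group:
    mblog = card['mblog']                              # the post itself
    print(mblog['id'], mblog.get('created_at'))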
A few words on processing the text item. Since text is actually a snippet of HTML, it could be parsed with an HTML parsing package (Python's BeautifulSoup, Java's jsoup, etc.), but on the one hand that is unnecessary and slow, and on the other hand the extra dependency is a nuisance when setting up the crawler on new virtual machines, so I use regular expressions instead.
I pulled out one text field at random; with the HTML tags removed it reads like this:
#Music afternoon tea# I don't feel the dream of spring grass in the pond, and the autumn sound of Indus leaves in front of the steps. The rain has damaged the urn, the new moss is green, and the leaves are red in autumn. Late at night, the wind and bamboo knock on the rhyme of autumn, thousands of leaves and thousands of voices are hate. Everyone explains the sad autumn events, unlike a poet who knows them thoroughly...video
In the text above there is a topic link (#Music afternoon tea#), an external link (the trailing "video"), and plain text (the poem itself).
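After parsing, these three kinds of pieces end up separated. For the sample above, the result of the parse_text method shown later looks roughly like this; the URLs are placeholders, not real values:

dealed_text = {
    'topic': [{'type': 'topic',
               'title': '#Music afternoon tea#',
               'url': 'http://m.weibo.cn/k/...'}],       # placeholder URL
    'data_url': [{'type': 'data_url',
                  'title': 'video',
                  'short_url': 'http://t.cn/...',        # placeholder URL
                  'url': 'http://...'}],
    'left_content': ["I don't feel the dream of spring grass in the pond, ..."],
}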
The regular expression for each kind of item is roughly as follows. The tag-matching patterns assume the general shape of m.weibo.cn's markup:

\[.+?\]               # emoticon text, e.g. [smile]
<i\s.+?</i>           # <i> tag that may also wrap an emoticon
<a class="k".+?</a>   # topic link
<a href="/n/.+?</a>   # user link
回复.+?//              # reply marker ('回复' means 'reply')
<a data-url=.+?</a>   # external link; generally a video, web page, etc.
<img.+?>              # image
<span.+?</span>       # span wrapper around link titles
http://.+?png         # png URL of an emoticon image
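As a quick sanity check, here is how the topic and user patterns behave on a hand-made snippet. The HTML layout here is my assumption about the markup, for illustration only:

import re

p_topic = re.compile(r'<a class="k".+?</a>')
p_user = re.compile(r'<a href="/n/.+?</a>')

snippet = ('<a class="k" href="/k/music">#Music afternoon tea#</a> '
           'hello <a href="/n/somebody">@somebody</a>')
print(re.findall(p_topic, snippet))  # the whole topic anchor tag
print(re.findall(p_user, snippet))   # the whole user anchor tag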
import json
import re
import time

# save_page() (which dumps an unparseable page to disk) and config.CURRENT_YEAR
# are defined elsewhere in the crawler project.

class parseMicroblogPage():
    def __init__(self):
        # NOTE: the tag-matching patterns below assume the general shape of
        # m.weibo.cn's markup.
        self.p_face = re.compile(r'\[.+?\]')               # emoticon text, e.g. [smile]
        self.p_face_i = re.compile(r'<i\s.+?</i>')         # emoticon wrapped in an <i> tag
        self.p_user = re.compile(r'<a href="/n/.+?</a>')   # @user link
        self.p_topic = re.compile(r'<a class="k".+?</a>')  # #topic# link
        self.p_reply = re.compile(r'回复.+?//')             # reply marker ('回复' = 'reply')
        self.p_link = re.compile(r'<a data-url=.+?</a>')   # external link (video, page, ...)
        self.p_img = re.compile(r'<img.+?>')               # image tag
        self.p_span = re.compile(r'<span.+?</span>')       # span wrapper
        self.p_http_png = re.compile(r'http://.+?png')     # png URL

    def parse_blog_page(self, data):
        try:    # check that the page is valid json
            data = json.loads(data)
        except:
            save_page(data)
            raise ValueError('Unable to parse page')
        try:    # check whether the page is empty
            mod_type = data['cards'][0]['mod_type']
        except:
            save_page(json.dumps(data))
            raise ValueError('The type of this page is incorrect')
        if 'empty' in mod_type:
            raise ValueError('This page is empty')
        try:    # use the card group as the new data
            data = data['cards'][0]['card_group']
        except:
            save_page(json.dumps(data))
            raise ValueError('The type of this page is incorrect')
        data_list = []
        for block in data:
            res = self.parse_card_group(block)
            data_list.append(res)
        return data_list

    def parse_card_group(self, data):
        data = data['mblog']
        msg = self.parse_card_inner(data)
        return msg

    def parse_card_inner(self, data):
        msg = {}
        keys = list(data.keys())
        key_array = [
            # Basic information ------------------------------------------
            'idstr',              # the id as a string
            'id',                 # message id
            'created_timestamp',  # creation timestamp, e.g. 1448617509
            'attitudes_count',    # number of likes
            'reposts_count',      # number of reposts
            'comments_count',     # number of comments
            'isLongText',         # whether this is a long post (always False so far)
            'source',             # client the user posted from (iPhone etc.)
            'pid',                # meaning unknown, but worth saving; may be absent
            'bid',                # meaning unknown, but worth saving; may be absent
            # Picture information ----------------------------------------
            'original_pic',       # URL of the original image
            'bmiddle_pic',        # like original_pic with 'large' swapped for 'bmiddle'
            'thumbnail_pic',      # seems to equal the URL inside 'pics', size 'thumb'
            'pic_ids',            # picture ids, an array
            'pics',               # present when pictures are attached: an array of
                                  # dicts holding size, pid, geo, url, etc.
        ]
        for item in keys:
            if item in key_array:
                msg[item] = data[item]
        # merge id / mid / msg_id
        if 'id' not in keys:
            if 'mid' in keys:
                msg['id'] = data['mid']
            elif 'msg_id' in keys:
                msg['id'] = data['msg_id']
        if 'attitudes_count' not in keys and 'like_count' in keys:
            msg['attitudes_count'] = data['like_count']
        # created_at: normalize short (current-year) dates to a full date
        if 'created_at' in keys:
            if len(data['created_at']) > 14:
                msg['created_at'] = data['created_at']
            else:
                if 'created_timestamp' in keys:
                    stamp = data['created_timestamp']
                    x = time.localtime(stamp)
                    str_time = time.strftime('%Y-%m-%d %H:%M', x)
                    msg['created_at'] = str_time
                else:
                    msg['created_at'] = config.CURRENT_YEAR + '-' + data['created_at']
        # retweeted_status: a repost has the same fields, so recurse
        if 'retweeted_status' in keys:
            msg['retweeted_status'] = self.parse_card_inner(data['retweeted_status'])
            msg['is_retweeted'] = True
        else:
            msg['is_retweeted'] = False
        # user
        if 'user' in keys:
            msg['user'] = self.parse_user_info(data['user'])
            msg['user_id'] = msg['user']['uid']
            msg['user_name'] = msg['user']['name']
        # url_struct
        # msg['url_struct'] = self.parse_url_struct(data['url_struct'])
        # page_info
        if 'page_info' in keys:
            msg['page_info'] = self.parse_page_info(data['page_info'])
        # topic_struct
        if 'topic_struct' in keys:
            msg['topic_struct'] = self.parse_topic_struct(data['topic_struct'])
        # text
        if 'text' in keys:
            msg['ori_text'] = data['text']
            msg['dealed_text'] = self.parse_text(data['text'])
        return msg

    def parse_user_info(self, user_data):
        keys = user_data.keys()
        user = {}
        if 'id' in keys:
            user['uid'] = str(user_data['id'])
        if 'screen_name' in keys:
            user['name'] = user_data['screen_name']
        if 'description' in keys:
            user['description'] = user_data['description']
        if 'fansNum' in keys:
            temp = user_data['fansNum']
            if isinstance(temp, str):
                # counts like '3万' use the Chinese character for ten thousand
                temp = int(temp.replace('万', '0000'))
            user['fans_num'] = temp
        if 'gender' in keys:
            if user_data['gender'] == 'm':
                user['gender'] = 'male'
            if user_data['gender'] == 'f':
                user['gender'] = 'female'
        if 'profile_url' in keys:
            user['basic_page'] = 'http://m.weibo.cn' + user_data['profile_url']
        if 'verified' in keys:
            user['verified'] = user_data['verified']
        if 'verified_reason' in keys:
            user['verified_reason'] = user_data['verified_reason']
        if 'statuses_count' in keys:
            temp = user_data['statuses_count']
            if isinstance(temp, str):
                temp = int(temp.replace('万', '0000'))
            user['blog_num'] = temp
        return user

    def parse_text(self, text):
        msg = {}
        # data-url (external links)
        data_url = re.findall(self.p_link, text)
        if len(data_url) > 0:
            data_url_list = []
            for block in data_url:
                temp = self.parse_text_data_url(block)
                data_url_list.append(temp)
            msg['data_url'] = data_url_list
            text = re.sub(self.p_link, '', text)
        # topic
        topic = re.findall(self.p_topic, text)
        if len(topic) > 0:
            topic_list = []
            for block in topic:
                temp = self.parse_text_topic(block)
                topic_list.append(temp)
            msg['topic'] = topic_list
            text = re.sub(self.p_topic, '', text)
        # motion (emoticons)
        motion = []
        res1 = re.findall(self.p_face_i, text)
        for item in res1:
            temp = re.findall(self.p_face, item)[0]
            motion.append(temp)
        text = re.sub(self.p_face_i, '', text)
        res2 = re.findall(self.p_face, text)
        motion = motion + res2
        if len(motion) > 0:
            msg['motion'] = motion
            text = re.sub(self.p_face, '', text)
        # user
        user = []
        user_res = re.findall(self.p_user, text)
        if len(user_res) > 0:
            for item in user_res:
                temp = self.parse_text_user(item)
                user.append(temp)
            msg['user'] = user
            text = re.sub(self.p_user, '@', text)
        msg['left_content'] = text.split('//')
        return msg

    def parse_text_data_url(self, text):
        link_data = {}
        link_data['type'] = 'data_url'
        try:
            res_face = re.findall(self.p_face_i, text)[0]
            res_img = re.findall(self.p_img, res_face)[0]
            res_http = re.findall(self.p_http_png, res_img)[0]
            link_data['img'] = res_http
            res_class = re.findall(r'class=".+?"', text)[0]
            link_data['class'] = res_class
            text = re.sub(self.p_face_i, '', text)
        except:
            pass
        try:
            res_span = re.findall(self.p_span, text)[0]
            title = re.findall(r'>.+?<', res_span)[0][1:-1]
            link_data['title'] = title
            text = re.sub(self.p_span, '', text)
        except:
            pass
        try:
            data_url = re.findall(r'data-url=".+?"', text)[0]
            data_url = re.findall(r'".+?"', data_url)[0][1:-1]
            link_data['short_url'] = data_url
            url = re.findall(r'href=".+?"', text)[0][6:-1]
            link_data['url'] = url
        except:
            pass
        # print(text)
        # print(json.dumps(link_data, indent=4))
        return link_data

    def parse_text_topic(self, text):
        data = {}
        try:
            data['type'] = 'topic'
            data['class'] = re.findall(r'class=".+?"', text)[0][7:-1]
            data['title'] = re.findall(r'>.+?<', text)[0][1:-1]
            data['url'] = 'http://m.weibo.cn' + re.findall(r'href=".+?"', text)[0][6:-1]
        except:
            pass
        return data

    def parse_text_user(self, text):
        data = {}
        data['type'] = 'user'
        try:
            # [2:-1] skips the leading '>@' and the trailing '<'
            data['title'] = re.findall(r'>.+?<', text)[0][2:-1]
            data['url'] = 'http://m.weibo.cn' + re.findall(r'href=".+?"', text)[0][6:-1]
        except:
            pass
        return data

    def parse_url_struct(self, data):
        url_struct = []
        for block in data:
            url_struct.append(block)
        return url_struct

    def parse_page_info(self, data):
        keys = data.keys()
        key_array = [
            'page_url', 'page_id', 'content2', 'tips', 'page_pic',
            'page_desc', 'object_type', 'page_title', 'content1',
            'type', 'object_id',
        ]
        msg = {}
        for item in keys:
            if item in key_array:
                msg[item] = data[item]
        return msg

    def parse_topic_struct(self, data):
        msg = []
        for block in data:
            keys = block.keys()
            temp = block
            if 'topic_title' in keys:
                temp['topic_url'] = 'http://m.weibo.cn/k/{topic}?from=feed'\
                    .format(topic=block['topic_title'])
            msg.append(temp)
        return msg
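Putting it all together, usage looks like this; 'page.json' again stands in for one saved page:

parser = parseMicroblogPage()
raw_page = open('page.json', encoding='utf-8').read()  # hypothetical: one saved page
try:
    posts = parser.parse_blog_page(raw_page)
except ValueError as err:                              # bad or empty pages raise ValueError
    posts = []
    print(err)
for post in posts:
    print(post.get('id'), post.get('created_at'),
          post.get('dealed_text', {}).get('topic'))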