Crawl at will! A super open-source Python crawler toolbox

Posted by phpwizard01 on Thu, 27 Jan 2022 15:42:42 +0100

Recently, a Chinese developer open-sourced a crawler toolbox covering many data sources on GitHub - InfoSpider - and it unexpectedly took off!

How popular is it? Within a few days of being open-sourced, it ranked fourth on GitHub's weekly trending list, with 1.3K stars and 172 forks. The author has also published all of the project's code and documentation, along with a video explanation on Bilibili.

Project code: https://github.com/kangvcar/InfoSpider
Project documentation: https://infospider.vercel.app
Project video demo: https://www.bilibili.com/video/BV14f4y1R7oF/

In this era of information explosion, everyone has many accounts. With so many accounts, personal data ends up scattered across different companies, forming data silos, and multidimensional data cannot be brought together. This project helps you integrate and analyze that multidimensional personal data, so you can understand yourself more intuitively and in more depth.

InfoSpider is a crawler toolbox that integrates many data sources. It aims to help users get their data back safely and quickly. The code is open source and the process is transparent. It also provides data analysis features and generates charts from the user's data, giving users a more intuitive and in-depth view of their own information.

At present, supported data sources include GitHub, QQ Mail, NetEase Mail, Ali Mail, Sina Mail, Hotmail, Outlook, JD.com, Taobao, Alipay, China Mobile, China Unicom, China Telecom, Zhihu, Bilibili, QQ friends, QQ groups, friend-circle photo albums, browser history, 12306, Cnblogs, CSDN blog, OSChina blog, and Jianshu.

According to the creator, InfoSpider has the following features:

  • Safe and reliable: the project is open source, the code is simple, all source code is visible, and it runs locally, so it is safe and reliable.
  • Easy to use: a GUI is provided; just click the data source you want and follow the prompts.
  • Clear structure: all data sources are independent of each other and highly portable, and all crawler scripts live under the project's Spiders directory.
  • Rich data sources: the project currently supports 24+ data sources, with more being added continuously.
  • Unified data format: all crawled data is stored in JSON format to make later analysis easier (see the sketch after this list).
  • Rich personal data: the project crawls as much of your personal data as possible, which you can prune as needed during later processing.
  • Data analysis: the project provides visual analysis of personal data; this is only partially supported at present.
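
Since every data source writes plain JSON, the output files can be loaded directly for your own analysis. Below is a minimal sketch (not part of InfoSpider) that reads the Taobao favorites file, shoucang_item.json, produced by the Taobao crawler shown later in this post:

import json

# Load the favorites written by the Taobao spider: a JSON list of dicts
# with 'title', 'url' and 'price' keys
with open('shoucang_item.json', 'r', encoding='utf-8') as f:
    items = json.load(f)

print(len(items), 'favorited items')
for item in items[:5]:
    print(item['title'], item['price'])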

InfoSpider is also very simple to use. You only need to install Python 3 and the Chrome browser, then run python3 main.py. In the window that opens, click the button for the data source you want, choose a save path as prompted, and enter your account and password; the data is then crawled automatically and can be viewed in the directory you selected.

Of course, if you want to practice and learn crawling yourself, the author has also open-sourced all of the crawler code, which makes it very suitable for hands-on practice.

For example, crawling Taobao:

import json
import random
import time
import sys
import os
import requests
import numpy as np
import math
from lxml import etree
from pyquery import PyQuery as pq
from selenium import webdriver
from selenium.webdriver import ChromeOptions, ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from tkinter.filedialog import askdirectory
from tqdm import trange


# Easing functions used to shape the simulated slider drag (see get_tracks below)
def ease_out_quad(x):
    return 1 - (1 - x) * (1 - x)

def ease_out_quart(x):
    return 1 - pow(1 - x, 4)

def ease_out_expo(x):
    if x == 1:
        return 1
    else:
        return 1 - pow(2, -10 * x)

def get_tracks(distance, seconds, ease_func):
    # Split the total drag distance into small steps over the given duration,
    # using the named easing function so the movement decelerates near the end
    tracks = [0]
    offsets = [0]
    for t in np.arange(0.0, seconds, 0.1):
        ease = globals()[ease_func]
        offset = round(ease(t / seconds) * distance)
        tracks.append(offset - offsets[-1])  # per-step movement
        offsets.append(offset)               # cumulative offset so far
    return offsets, tracks

def drag_and_drop(browser, offset=26.5):
    knob = browser.find_element_by_id('nc_1_n1z')
    offsets, tracks = get_tracks(offset, 12, 'ease_out_expo')
    ActionChains(browser).click_and_hold(knob).perform()
    for x in tracks:
        ActionChains(browser).move_by_offset(x, 0).perform()
    ActionChains(browser).pause(0.5).release().perform()
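
# Note: drag_and_drop() is not called elsewhere in this listing. It is a
# helper for Taobao's slider verification: if the 'nc_1_n1z' slider appears
# during login, drag_and_drop(driver) drags the knob along the gradually
# decelerating track produced by get_tracks(), which is intended to look more
# like a human hand than a single instantaneous move.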

def gen_session(cookie):
    # Build a requests session from a raw "name=value; name=value" cookie string
    session = requests.session()
    cookie_dict = {}
    pairs = cookie.split(';')
    for pair in pairs:
        pair = pair.strip()
        try:
            cookie_dict[pair.split('=')[0]] = pair.split('=')[1]
        except IndexError:
            cookie_dict[''] = pair
    requests.utils.add_dict_to_cookiejar(session.cookies, cookie_dict)
    return session
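
# gen_session() is likewise unused in this listing; it converts a cookie
# string (e.g. copied from the browser's developer tools) into a requests
# session for endpoints that do not need Selenium. A hypothetical use
# (the cookie values are made up):
#     session = gen_session('cna=abc123; t=def456')
#     resp = session.get('https://i.taobao.com/my_taobao.htm')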

class TaobaoSpider(object):
    def __init__(self, cookies_list):
        self.path = askdirectory(title='Select the information save folder')
        if str(self.path) == "":
            sys.exit(1)
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
        }
        option = ChromeOptions()
        option.add_experimental_option('excludeSwitches', ['enable-automation'])
        option.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})  # Don't load pictures to speed up access
        option.add_argument('--headless')
        self.driver = webdriver.Chrome(options=option)
        self.driver.get('https://i.taobao.com/my_taobao.htm')
        for i in cookies_list:
            self.driver.add_cookie(cookie_dict=i)
        self.driver.get('https://i.taobao.com/my_taobao.htm')
        self.wait = WebDriverWait(self.driver, 20)  # The timeout duration is 20s

    # Simulate sliding down browsing
    def swipe_down(self, second):
        for i in range(int(second / 0.1)):
            # Simulate sliding up and down according to the value of i
            if (i % 2 == 0):
                js = "var q=document.documentElement.scrollTop=" + str(300 + 400 * i)
            else:
                js = "var q=document.documentElement.scrollTop=" + str(200 * i)
            self.driver.execute_script(js)
            time.sleep(0.1)

        js = "var q=document.documentElement.scrollTop=100000"
        self.driver.execute_script(js)
        time.sleep(0.1)

    # Crawl the data of items I have bought on Taobao. pn defines how many pages to crawl
    def crawl_good_buy_data(self, pn=3):

        # Open the list of items I have bought
        self.driver.get("https://buyertrade.taobao.com/trade/itemlist/list_bought_items.htm")

        # Traverse all pages
        for page in trange(1, pn + 1):
            data_list = []

            # Wait until the purchased-item data on this page has loaded
            good_total = self.wait.until(
                EC.presence_of_element_located((By.CSS_SELECTOR, '#tp-bought-root > div.js-order-container')))

            # Get the source code of this page
            html = self.driver.page_source

            # pq module parsing web page source code
            doc = pq(html)

            # Items bought on this page
            good_items = doc('#tp-bought-root .js-order-container').items()

            # Traverse all items on this page
            for item in good_items:
                # Purchase time and order number
                good_time_and_id = item.find('.bought-wrapper-mod__head-info-cell___29cDO').text().replace('\n', "").replace('\r', "")
                # Merchant name
                # good_merchant = item.find('.seller-mod__container___1w0Cx').text().replace('\n', "").replace('\r', "")
                good_merchant = item.find('.bought-wrapper-mod__seller-container___3dAK3').text().replace('\n', "").replace('\r', "")
                # Item name
                # good_name = item.find('.sol-mod__no-br___1PwLO').text().replace('\n', "").replace('\r', "")
                good_name = item.find('.sol-mod__no-br___3Ev-2').text().replace('\n', "").replace('\r', "")
                # Item price
                good_price = item.find('.price-mod__price___cYafX').text().replace('\n', "").replace('\r', "")
                # Only the purchase time, order number, merchant name and price are collected here
                # Please get the rest yourself
                data_list.append(good_time_and_id)
                data_list.append(good_merchant)
                data_list.append(good_name)
                data_list.append(good_price)
                #print(good_time_and_id, good_merchant, good_name)
                #file_path = os.path.join(os.path.dirname(__file__) + '/user_orders.json')
                # file_path = "../Spiders/taobao/user_orders.json"

            # Append this page's data to the output file once per page,
            # rather than re-writing the growing list for every item
            json_str = json.dumps(data_list)
            with open(self.path + os.sep + 'user_orders.json', 'a') as f:
                f.write(json_str)

            # print('\n\n')

            # Simulate manually scrolling through the goods, i.e. a sliding
            # action, to reduce the chance of being detected as a robot
            # Random scroll duration
            swipe_time = random.randint(1, 3)
            self.swipe_down(swipe_time)

            # Wait for the next page button to appear
            good_total = self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.pagination-next')))
            good_total.click()
            time.sleep(2)
            # while 1:
            #     time.sleep(0.2)
            #     try:
            #         good_total = self.driver.find_element_by_xpath('//li[@title = "next"]')
            #         break
            #     except:
            #         continue
            # # Click the next button
            # while 1:
            #     time.sleep(2)
            #     try:
            #         good_total.click()
            #         break
            #     except Exception:
            #         pass

    # Favorited ("shoucang") items. page defines how many pages to collect (default 3); data endpoint: https://shoucang.taobao.com/nodejs/item_collect_chunk.htm?ifAllTag=0&tab=0&tagId=&categoryCount=0&type=0&tagName=&categoryName=&needNav=false&startRow=60
    def get_choucang_item(self, page=3):
        url = 'https://shoucang.taobao.com/nodejs/item_collect_chunk.htm?ifAllTag=0&tab=0&tagId=&categoryCount=0&type=0&tagName=&categoryName=&needNav=false&startRow={}'
        pn = 0
        json_list = []
        for i in trange(page):
            self.driver.get(url.format(pn))
            pn += 30
            html_str = self.driver.page_source
            if html_str == '':
                break
            if 'Sign in' in html_str:
                raise Exception('Sign in')
            obj_list = etree.HTML(html_str).xpath('//li')
            for obj in obj_list:
                item = {}
                item['title'] = ''.join([i.strip() for i in obj.xpath('./div[@class="img-item-title"]//text()')])
                item['url'] = ''.join([i.strip() for i in obj.xpath('./div[@class="img-item-title"]/a/@href')])
                item['price'] = ''.join([i.strip() for i in obj.xpath('./div[@class="price-container"]//text()')])
                if item['price'] == '':
                    item['price'] = 'invalid'
                json_list.append(item)
        # file_path = os.path.join(os.path.dirname(__file__) + '/shoucang_item.json')
        json_str = json.dumps(json_list)
        with open(self.path + os.sep + 'shoucang_item.json', 'w') as f:
            f.write(json_str)

    # Browsing-history ("footmark") items. page defines how many pages to collect (default 3)
    def get_footmark_item(self, page=3):
        url = 'https://www.taobao.com/markets/footmark/tbfoot'
        self.driver.get(url)
        pn = 0
        item_num = 0
        json_list = []
        for i in trange(page):
            html_str = self.driver.page_source
            obj_list = etree.HTML(html_str).xpath('//div[@class="item-list J_redsList"]/div')[item_num:]
            for obj in obj_list:
                item_num += 1
                item = {}
                item['date'] = ''.join([i.strip() for i in obj.xpath('./@data-date')])
                item['url'] = ''.join([i.strip() for i in obj.xpath('./a/@href')])
                item['name'] = ''.join([i.strip() for i in obj.xpath('.//div[@class="title"]//text()')])
                item['price'] = ''.join([i.strip() for i in obj.xpath('.//div[@class="price-box"]//text()')])
                json_list.append(item)
            self.driver.execute_script('window.scrollTo(0,1000000)')
        # file_path = os.path.join(os.path.dirname(__file__) + '/footmark_item.json')
        json_str = json.dumps(json_list)
        with open(self.path + os.sep + 'footmark_item.json', 'w') as f:
            f.write(json_str)

    # Delivery addresses
    def get_addr(self):
        url = 'https://member1.taobao.com/member/fresh/deliver_address.htm'
        self.driver.get(url)
        html_str = self.driver.page_source
        obj_list = etree.HTML(html_str).xpath('//tbody[@class="next-table-body"]/tr')
        data_list = []
        for obj in obj_list:
            item = {}
            item['name'] = obj.xpath('.//td[1]//text()')
            item['area'] = obj.xpath('.//td[2]//text()')
            item['detail_area'] = obj.xpath('.//td[3]//text()')
            item['youbian'] = obj.xpath('.//td[4]//text()')
            item['mobile'] = obj.xpath('.//td[5]//text()')
            data_list.append(item)
        # file_path = os.path.join(os.path.dirname(__file__) + '/addr.json')
        json_str = json.dumps(data_list)
        with open(self.path + os.sep + 'address.json', 'w') as f:
            f.write(json_str)


if __name__ == '__main__':
    with open('taobao_cookies.json', 'r') as f:
        cookie_list = json.loads(f.read())
    t = TaobaoSpider(cookie_list)
    t.crawl_good_buy_data()
    # t.get_addr()
    # t.get_choucang_item()
    # t.get_footmark_item()
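
# Note: taobao_cookies.json is assumed to contain a JSON list of
# Selenium-style cookie dicts captured from a logged-in session, since each
# entry is passed straight to driver.add_cookie(). A made-up example of the
# expected shape:
# [
#     {"name": "cookie2", "value": "xxxx", "domain": ".taobao.com", "path": "/"},
#     {"name": "unb", "value": "yyyy", "domain": ".taobao.com", "path": "/"}
# ]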

GitHub address:

https://github.com/kangvcar/InfoSpider

Video explanation on Bilibili:

https://www.bilibili.com/video/BV14f4y1R7oF/

If you're interested, download it and learn from it~

PS: if you found this share useful, you are welcome to like and share it.

END