Reverse crawler 14 Mongo introduction

Posted by DanDaBeginner on Wed, 02 Feb 2022 12:23:15 +0100

1, The difference between MongoDB and MySQL

MongoDB is a non-relational database that stores data as JSON-format documents of any shape; MySQL is a relational database that can only store tabular data whose fields are defined in advance. The following table compares the terminology used by MongoDB and MySQL.

MySQL | MongoDB
Database (folder) | Database (folder)
Table (file; different tables can be related) | Collection (file; no relationships between collections)
Record (one row of a table) | Document (one piece of JSON-format data)

2, When to use relational? When to use non-relational?

A relational database suits scenarios in which business logic is built around the data, and it is used throughout web development. Users submit information to the back end through the front end; the back end processes it with business logic and saves the result in the relational database. When a user asks to view information, the back end reads the data from the relational database, processes it with business logic, and sends it to the front end, which renders it for the user. Throughout this process, what is stored in the database is strictly controlled by the business logic. The field names of each table and the relationships between tables therefore have to be decided from the business logic before the database is created, and once decided they rarely change unless the business logic requires it.

A non-relational database, by comparison, is more convenient: plug and play. There is no need to define fields, tables, or table relationships before use; any data that is valid JSON can be stored. This is very convenient for storing large amounts of irregular data. The data a crawler collects from different websites is not standardized. With a relational database, you would have to create a new table for the particular information crawled from each website before storing anything. A non-relational database skips this step: once the data has been collected and arranged as JSON, it can be inserted directly.

Therefore, when data only needs to be saved and no business logic is applied to it, a non-relational database is the better fit, which is why we need to learn MongoDB.
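
To make the contrast concrete, here is a minimal Python sketch. The two records and the column list are made up for illustration. Records crawled from different sites carry different fields: as JSON documents they can be stored unchanged, while a fixed-column table has to null out or drop whatever does not fit.

```python
import json

# Two hypothetical crawled records with different fields (illustrative data).
records = [
    {"site": "a.example", "title": "Flat for sale", "price": 320},
    {"site": "b.example", "headline": "Used car", "mileage_km": 52000, "tags": ["car"]},
]

# Document style: each record is kept as-is, as one JSON document.
documents = [json.dumps(r) for r in records]

# Table style: columns are fixed in advance, so missing fields become None
# and fields without a matching column are lost.
COLUMNS = ("site", "title", "price")

def to_row(record, columns=COLUMNS):
    """Project a record onto a fixed column list, as a table would require."""
    return tuple(record.get(c) for c in columns)

rows = [to_row(r) for r in records]
print(rows)
```

The second record survives intact as a document, but as a row it keeps only its site name.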

3, Simple use of MongoDB (understand)

Installing MongoDB is well documented elsewhere and is not covered here. Let's get straight to the point.

show dbs			Show all databases (short form)
show databases		Show all databases (long form)
db					Show the database currently in use
use xxx				Switch to database xxx
db.dropDatabase()	Delete the current database
show collections	Show all collections in the current database
db.collection_name.insert({})	Insert a document into the collection; put JSON-format data inside {}. If the collection does not exist, it is created automatically, without a size cap
db.createCollection(name, {options})	Create a collection explicitly; options can be used to constrain it, for example:
db.createCollection("xxx_log", {capped:true, size:255})		capped: whether the collection has a fixed size, size: maximum size in bytes
This kind of collection is well suited to log data. Once the stored data exceeds size, the oldest documents are removed automatically to make room for the newest ones
db.collection_name.isCapped()	Check whether the collection is capped
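
The "oldest data out, newest data in" behavior of a capped collection can be pictured with a bounded queue. The sketch below is only an analogy in plain Python: a deque limits the number of entries, whereas a capped collection is limited by size in bytes, so the numbers here are illustrative.

```python
from collections import deque

# A deque with maxlen drops its oldest entry when full, which mimics how a
# capped collection discards its oldest documents. It bounds the entry count,
# not bytes, so this is an analogy rather than MongoDB's actual behavior.
log = deque(maxlen=3)

for i in range(1, 6):
    log.append({"seq": i, "msg": f"event {i}"})

kept = [entry["seq"] for entry in log]
print(kept)  # [3, 4, 5]
```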

4, Addition, deletion, modification and query of MongoDB

1. Common data types in mongodb (understand)

ObjectId: primary key ID (the _id field)
String: character string
Boolean: true or false
Integer: whole number
Double: floating-point number
Array: array (list of values)
Object: embedded document (a document nested inside another), e.g. {sname: 'Li Jiacheng', sage: 18, class: {cccc}}
Null: null value
Timestamp: time stamp
Date: date and time

2. Add data to mongodb

db.collection_name.insert({Field 1:Value 1, Field 2:Value 2})
db.student.insert({name: 'Jay Chou', age: 18, hobby: ['Oh, good!', 'Be cool']})

Note: if the collection does not exist, it will be created automatically

3. Modify data by mongodb

3.1 update

db.collection_name.update({query criteria}, {Content to be modified}, {multi: false, upsert: false})
multi defaults to false: only the first matching document is modified
upsert defaults to false: when set to true, a new document is inserted if no document matches the query criteria
db.student.update({name: 'Jay Chou'}, {$set: {title: 'Chinese pop king', age: 16}})

The difference between using $set and not using $set:

With $set, only the given fields are modified; everything else in the document is kept.

Without $set, the document is replaced: only the given fields (and _id) remain, and everything else is deleted.

If multi is true, $set (or another update operator) must be used; otherwise an error is reported.
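
The two behaviors are easy to mix up, so here is a small simulation on a plain Python dict. It does not talk to MongoDB; `apply_update` is a hypothetical helper written for this illustration, and it only understands `$set`.

```python
def apply_update(doc, update):
    """Simulate the two update styles on a plain dict (not a real MongoDB call)."""
    if "$set" in update:
        # Operator update: only the listed fields change, the rest is kept.
        return {**doc, **update["$set"]}
    # Replacement-style update: the document becomes `update`, keeping only _id.
    return {"_id": doc["_id"], **update}

student = {"_id": 1, "name": "Jay Chou", "age": 18, "hobby": ["Be cool"]}

with_set = apply_update(student, {"$set": {"title": "Chinese pop king", "age": 16}})
without_set = apply_update(student, {"title": "Chinese pop king", "age": 16})

print(with_set)     # name and hobby are still present
print(without_set)  # only _id, title and age remain
```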

3.2 save (understand)

db.collection_name.save({Data to be saved})
db.student.save({name: 'Wang Lihong', age: 188})

This adds a new document for Wang Lihong. If the data to be saved contains the _id of an existing document, save() behaves like an update instead.

4. mongodb delete data

4.1 remove()

db.collection_name.remove({condition}, {justOne: true|false})
db.student.remove({name: 'Wang Lihong'}, {justOne: true})

4.2 deleteOne()

db.collection_name.deleteOne({condition})
db.student.deleteOne({name: 'Cai Yilin'})

4.3 deleteMany()

db.collection_name.deleteMany({condition})
db.student.deleteMany({type: 'singer'})

5. mongodb query data

Prepare data:

db.student.insert([
	{name: "Zhu Yuanzhang", age:800, address:'Fengyang, Anhui Province', score: 160},
	{name: "Zhu Di", age:750, address:'Nanjing City, Jiangsu Province', score: 120},
	{name: "Zhu gaochi", age:700, address:'The Forbidden City in Beijing', score: 90},
	{name: "Li Jiacheng", age:38, address:'Hong Kong xxx street', score: 70},
	{name: "Cauliflower vine", age:28, address:'Guangdong Province xxx city', score: 80},
	{name: "Big Lao Wang", age:33, address:'Mars first satellite', score: -60},
	{name: "Baa Baa", age:33, address:'The black hole next to Kepler 225', score: -160}
])

5.1 general query

db.student.find()		Query all
db.student.findOne()	Query one
db.student.find({condition})  Condition query

5.2 comparison operation

Equal to: the default comparison; can also be written explicitly as $eq
Less than: $lt (less than)
Less than or equal to: $lte (less than or equal)
Greater than: $gt (greater than)
Greater than or equal to: $gte (greater than or equal)
Not equal to: $ne (not equal)
db.student.find({age:28})			// Query students aged 28
db.student.find({age:{$eq:28}})		// Query students aged 28
db.student.find({age:{$gt:28}})		// Query students older than 28
db.student.find({age:{$gte:28}})	// Query students aged 28 or older
db.student.find({age:{$lt:38}})		// Query students younger than 38
db.student.find({age:{$lte:38}})	// Query students aged 38 or younger
db.student.find({age:{$ne:38}})		// Query students whose age is not equal to 38
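
As a rough mental model, each comparison operator corresponds to an ordinary comparison. The sketch below evaluates such queries against plain Python dicts; `matches` is a hypothetical helper for illustration, it handles only flat queries, and it assumes every document has the queried field.

```python
import operator

# Map each MongoDB comparison operator to the matching Python comparison.
OPS = {
    "$eq": operator.eq, "$ne": operator.ne,
    "$lt": operator.lt, "$lte": operator.le,
    "$gt": operator.gt, "$gte": operator.ge,
}

def matches(doc, query):
    """Return True if doc satisfies a flat query such as {"age": {"$gt": 28}}."""
    for field, cond in query.items():
        if not isinstance(cond, dict):
            cond = {"$eq": cond}  # {age: 28} is shorthand for {age: {$eq: 28}}
        value = doc.get(field)
        if not all(OPS[op](value, arg) for op, arg in cond.items()):
            return False
    return True

students = [
    {"name": "Li Jiacheng", "age": 38},
    {"name": "Cauliflower vine", "age": 28},
    {"name": "Big Lao Wang", "age": 33},
]

older_than_28 = [s["name"] for s in students if matches(s, {"age": {"$gt": 28}})]
print(older_than_28)  # ['Li Jiacheng', 'Big Lao Wang']
```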

5.3 logical operators

  1. and

    $and: [condition 1, condition 2, condition 3]

Query students whose age is 33 and whose name is 'Big Lao Wang'
db.student.find({$and: [{age: {$eq:33}}, {name: 'Big Lao Wang'}]})

  2. or

    $or: [condition 1, condition 2, condition 3]

Query students named 'Li Jiacheng' or older than 100
db.student.find({$or: [{name: 'Li Jiacheng'}, {age: {$gt: 100}}]})

  3. nor

    $nor: [condition 1, condition 2, condition 3]

Query students who are not younger than 38 and are not named 'Zhu Yuanzhang'
db.student.find({$nor: [{age: {$lt:38}}, {name: 'Zhu Yuanzhang'}]})
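
One way to keep the three operators apart is to read $and as "all conditions hold", $or as "at least one holds", and $nor as "none holds". The following sketch expresses that with plain Python predicates on a shortened copy of the sample data; it is an illustration of the logic, not MongoDB code.

```python
students = [
    {"name": "Zhu Yuanzhang", "age": 800},
    {"name": "Zhu Di", "age": 750},
    {"name": "Li Jiacheng", "age": 38},
    {"name": "Big Lao Wang", "age": 33},
]

def younger_than_38(s):
    return s["age"] < 38

def is_zhu_yuanzhang(s):
    return s["name"] == "Zhu Yuanzhang"

conditions = [younger_than_38, is_zhu_yuanzhang]

# $and: every condition holds; $or: at least one holds; $nor: none holds.
and_result = [s["name"] for s in students if all(c(s) for c in conditions)]
or_result = [s["name"] for s in students if any(c(s) for c in conditions)]
nor_result = [s["name"] for s in students if not any(c(s) for c in conditions)]

print(and_result)  # []
print(or_result)   # ['Zhu Yuanzhang', 'Big Lao Wang']
print(nor_result)  # ['Zhu Di', 'Li Jiacheng']
```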

5.4 range operator

Use $in and $nin to test whether a value is, or is not, in an array

Query students aged 28 or 38
db.student.find({age: {$in: [28, 38]}})

5.5 regular expressions

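Use $regex for regular-expression matching

Query students whose address starts with 'Beijing' (the ^ anchors the match to the start of the string; drop it to match 'Beijing' anywhere in the address)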
db.student.find({address: {$regex: '^Beijing'}})
db.student.find({address: /^Beijing/})
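
The effect of the ^ anchor can be checked without a database. The sketch below uses Python's re module on three of the sample addresses; MongoDB uses its own regex engine, but for simple patterns like these the behavior is the same.

```python
import re

addresses = [
    "Fengyang, Anhui Province",
    "Nanjing City, Jiangsu Province",
    "The Forbidden City in Beijing",
]

# "^Beijing" only matches strings that begin with "Beijing".
anchored = [a for a in addresses if re.search(r"^Beijing", a)]
# "Beijing" without the anchor matches the word anywhere in the string.
unanchored = [a for a in addresses if re.search(r"Beijing", a)]

print(anchored)    # []
print(unanchored)  # ['The Forbidden City in Beijing']
```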

5.6 user defined query (understand)

The mongo shell is a JavaScript execution environment

Use $where with a function; the documents for which the function returns true are returned

Query students older than 38
db.student.find({$where: function(){return this.age > 38}})

5.7 skip and limit

db.student.find().skip(3).limit(3)

This skips the first 3 documents and returns the next 3, similar to LIMIT 3, 3 in MySQL. It can be used for paging.
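
For paging, the skip value follows from the page number. Here is a minimal sketch; `page_window` is a hypothetical helper, and pages are assumed to be numbered from 1. With pymongo, the two values could then be passed to a cursor's skip() and limit() methods.

```python
def page_window(page, page_size):
    """Return (skip, limit) for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return (page - 1) * page_size, page_size

skip, limit = page_window(page=2, page_size=3)
print(skip, limit)  # 3 3

# The same window applied to a plain list, for comparison.
items = list(range(10))
print(items[skip:skip + limit])  # [3, 4, 5]
```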

5.8 projection

A projection controls which fields appear in the query results

Query the name, age and score of every student, without showing _id
db.student.find({}, {_id: 0, name: 1, age: 1, score: 1})

Set the fields you want to see to 1

Note that 0 and 1 cannot be mixed in the same projection, except for _id
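
The projection rules can be simulated on a plain dict. `project` below is a hypothetical helper written for this illustration: it supports flat field names only and rejects a mix of 0 and 1, with _id as the one exception.

```python
def project(doc, projection):
    """Apply an inclusion or exclusion projection to a dict (simulation only)."""
    flags = {k: v for k, v in projection.items() if k != "_id"}
    if len(set(flags.values())) > 1:
        raise ValueError("cannot mix 0 and 1 in a projection (except for _id)")
    keep_id = projection.get("_id", 1) == 1  # _id is shown unless set to 0
    if flags and next(iter(flags.values())) == 1:
        # Inclusion: keep only the listed fields.
        result = {k: doc[k] for k in flags if k in doc}
    else:
        # Exclusion: keep everything except the listed fields.
        result = {k: v for k, v in doc.items() if k not in flags and k != "_id"}
    if keep_id and "_id" in doc:
        result = {"_id": doc["_id"], **result}
    return result

doc = {"_id": 7, "name": "Zhu Di", "age": 750, "address": "Nanjing", "score": 120}
print(project(doc, {"_id": 0, "name": 1, "age": 1, "score": 1}))
print(project(doc, {"address": 0}))
```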

5.9 sorting

sort({field1: 1, field2: -1})

1 means ascending order

-1 means descending order

Sort all students by score in descending order
db.student.find().sort({score:-1})
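
Sorting on two fields works like a compound key: the second field only decides the order among documents that tie on the first. The sketch below reproduces the idea of sort({age: 1, score: -1}) with Python's sorted() on a few of the sample records. In pymongo, the same ordering would be requested with a list of pairs such as [("age", 1), ("score", -1)].

```python
students = [
    {"name": "Li Jiacheng", "age": 38, "score": 70},
    {"name": "Big Lao Wang", "age": 33, "score": -60},
    {"name": "Baa Baa", "age": 33, "score": -160},
    {"name": "Cauliflower vine", "age": 28, "score": 80},
]

# Age ascending, then score descending among students of the same age.
# Negating the numeric score turns the ascending sort into a descending one.
ordered = sorted(students, key=lambda s: (s["age"], -s["score"]))
names = [s["name"] for s in ordered]
print(names)
# ['Cauliflower vine', 'Big Lao Wang', 'Baa Baa', 'Li Jiacheng']
```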

5.10 statistical quantity

count() returns the number of documents that match a query

Count the students aged 33
db.student.count({age: 33})

5, Use of pymongo

1. Add, delete, modify and query

from pymongo import MongoClient

def get_db(database, user=None, pwd=None):
    # Default port number: 27017
    # If the server requires a login, pass the credentials to the client
    # (Database.authenticate() was removed in PyMongo 4.0)
    if user:
        client = MongoClient(host='localhost', port=27017,
                             username=user, password=pwd, authSource='admin')
    else:
        client = MongoClient(host='localhost', port=27017)
    db = client[database]     # Equivalent to `use haha` in the shell
    return db

def add_one(database, table, data):
    db = get_db(database)
    result = db[table].insert_one(data)
    return result

def add_many(database, table, data_list):
    db = get_db(database)
    result = db[table].insert_many(data_list)
    return result

def upd(database, table, condition, data):
    db = get_db(database)
    result = db[table].update_many(condition, {'$set': data})
    return result

def delete(database, table, condition):
    db = get_db(database)
    result = db[table].delete_many(condition)
    return result

def query(database, table, condition):
    db = get_db(database)
    result = db[table].find(condition)
    return list(result)

if __name__ == '__main__':
    # Add a student information
    ret = add_one('haha', 'student', {'name': 'Jay Chou', 'age': 18, 'address': 'Yuan universe', 'score': 'Infinity'})
    print(ret)

    # Add multiple student information
    data_list = [
        {'name': 'Cai Yilin', 'age': 17, 'address': 'Yuan universe', 'score': 'Infinity'},
        {'name': 'Eason Chan', 'age': 19, 'address': 'Yuan universe', 'score': 'Infinitesimal'},
        {'name': 'Xue You Zhang', 'age': 20, 'address': 'Yuan universe', 'score': 'Infinitesimal'}
    ]
    ret = add_many('haha', 'student', data_list)
    print(ret)

    # Modify student information
    ret = upd('haha', 'student', {"address": "Yuan universe"}, {"score": 100})
    print(ret)

    # Delete student information
    ret = delete('haha', 'student', {"address": "Yuan universe"})
    print(ret)

    # Query the information of students over 33 years old
    ret = query('haha', 'student', {"age": {"$gt": 33}})
    for r in ret:
        print(r)
        
    # Query students whose address starts with "Beijing"
    ret = query('haha', 'student', {"address": {"$regex": "^Beijing"}})
    for r in ret:
        print(r)
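
If the server requires a login, one common approach is to put the credentials into the connection string. The sketch below only builds that string, so it runs without a server; `build_mongo_uri` is a hypothetical helper, and the user name and password are placeholders. Special characters in the credentials need to be percent-escaped, which quote_plus takes care of.

```python
from urllib.parse import quote_plus

def build_mongo_uri(user, pwd, host="localhost", port=27017, auth_db="admin"):
    """Build a mongodb:// connection string with percent-escaped credentials."""
    return (
        f"mongodb://{quote_plus(user)}:{quote_plus(pwd)}"
        f"@{host}:{port}/?authSource={auth_db}"
    )

uri = build_mongo_uri("crawler", "p@ss:word")
print(uri)
# mongodb://crawler:p%40ss%3Aword@localhost:27017/?authSource=admin
```

With pymongo installed, the resulting string could be passed to MongoClient(uri) in place of the separate host and port arguments.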

2. Scrape second-hand housing listings

import requests
from lxml import etree
from mangodb import add_many  # local module holding the helpers from the previous section
import pymysql

def get_page_source(url):
    resp = requests.get(url)
    page_source = resp.text
    return page_source

def parse(html):
    tree = etree.HTML(html)
    li_list = tree.xpath('//ul[@class="sellListContent"]/li')
    result = []
    for li in li_list:
        title = li.xpath('./div[1]/div[1]/a/text()')[0]
        address = ' '.join(li.xpath('./div[1]/div[2]/div/a/text()'))
        houseInfo = li.xpath('./div[1]/div[3]/div/text()')[0]
        starInfo = li.xpath('./div[1]/div[4]/text()')[0]
        tag = ' '.join(li.xpath('./div[1]/div[5]/span/text()'))
        total_price = li.xpath('./div[1]/div[6]/div[1]/span/text()')[0] + 'Ten thousand yuan'
        per_price = li.xpath('./div[1]/div[6]/div[2]/span/text()')[0]
        
        dic = {
            "title": title,
            "address": address,
            "houseInfo": houseInfo,
            "starInfo": starInfo,
            "tag": tag,
            "total_price": total_price,
            "per_price": per_price
        }
        result.append(dic)
    return result

def save_to_mongo(data_list):
    add_many('ershoufang', 'ershoufang', data_list)
    print("One page saved!")

def save_to_mysql(data_list):
    conn = None
    cursor = None
    try:
        conn = pymysql.connect(
            host='localhost',
            port=3306,
            user='root',
            password="xxxxxx",
            database='spider'
        )
        cursor = conn.cursor()
        sql = """
        insert into ershoufang(title, address, houseInfo, startInfo, tag, total_price, per_price) values
        (%s, %s, %s, %s, %s, %s, %s)
        """
        # Relies on the dict key order in parse() matching the column order above
        lst = [tuple(dic.values()) for dic in data_list]
        cursor.executemany(sql, lst)
        conn.commit()
        print("One page saved!")
    except Exception as e:
        # conn is still None if connect() itself failed
        if conn:
            conn.rollback()
        print("Failed to save page:", e)
    finally:
        if cursor:
            cursor.close()
        if conn:
            conn.close()

if __name__ == '__main__':
    for i in range(1, 31):
        url = f"https://bj.lianjia.com/ershoufang/pg{i}/"
        page_source = get_page_source(url)
        data_list = parse(page_source)
        # save_to_mongo(data_list)
        save_to_mysql(data_list)
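
The MySQL insert depends on the values of each dict coming out in the same order as the columns in the SQL statement. A slightly more defensive variant, sketched below, builds each row from an explicit list of keys. `to_rows` and the sample record are made up for illustration; the key names are the ones produced by parse().

```python
# Keys in the order the INSERT statement expects its values.
COLUMNS = ("title", "address", "houseInfo", "starInfo", "tag", "total_price", "per_price")

def to_rows(data_list, columns=COLUMNS):
    """Turn parsed dicts into tuples whose order follows `columns`."""
    return [tuple(dic.get(col) for col in columns) for dic in data_list]

# A made-up record whose keys are deliberately out of order.
sample = [{
    "per_price": "50,000", "title": "Two-bedroom flat", "address": "xxx street",
    "houseInfo": "2 rooms", "starInfo": "10 followers", "tag": "near subway",
    "total_price": "300",
}]

rows = to_rows(sample)
print(rows[0])
```

The resulting list could be passed to cursor.executemany() in place of the tuples built from dic.values().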

Topics: Database MongoDB crawler