Reverse crawler 14 Mongo introduction
1, The difference between MongoDB and MySQL
MongoDB is a non-relational (NoSQL) database that stores data as JSON-style documents of any shape; MySQL is a relational database that stores rows in tables whose fields are defined in advance. The following table compares the corresponding terms in MySQL and MongoDB.
MySQL | MongoDB |
---|---|
Database (folder) | Database (folder) |
Table (file, different tables can have relationships) | Collection (file, no relationships between different collections) |
Record (one row of a table) | Document (one piece of JSON-format data) |
2, When to use relational? When to use non relational?
A relational database suits scenarios where business logic is built on top of the data, and it is used throughout web development. Users submit information through the front end; the back end processes it with business logic and saves the result in the relational database. When a user asks to view information, the back end reads the data from the database, processes it with business logic, and sends it to the front end to be rendered. Throughout this process the business logic strictly controls what is stored. The field names of each table and the relationships between tables must therefore be decided from the business logic before the database is created, and once decided they rarely change unless the business logic requires it.
Compared with a relational database, a non-relational database is more convenient to start using: there is no need to define fields, tables or relationships between tables beforehand. Anything that can be expressed as JSON can be stored, which is very convenient for large amounts of irregular data. The data a crawler collects from different websites is not standardized. With a relational database, you would need to create a new table for the information crawled from each website before storing it. A non-relational database skips this step: once the data is obtained and arranged as JSON, it can be inserted directly.
Therefore, when data only needs to be saved and is not processed by business logic, a non-relational database is the better fit, and that is why we need to learn MongoDB.
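To make the point about irregular data concrete, here is a minimal sketch in plain Python. The two records and all of their field names are made up for illustration; the only claim is that differently shaped records both serialize to JSON, which is all a document store asks for.

```python
import json

# Two hypothetical crawl results; sites and field names are invented.
news_item = {"site": "news-site", "title": "Some headline", "tags": ["tech", "ai"]}
shop_item = {"site": "shop-site", "name": "Keyboard", "price": 199.0,
             "stock": {"bj": 3, "sh": 0}}

# Both serialize to JSON without sharing a schema, so both could be stored in
# the same collection. A relational table would need its columns agreed first.
documents = [json.dumps(item) for item in (news_item, shop_item)]

# The union of the field names shows how wide one shared table would have to be.
all_fields = sorted(set(news_item) | set(shop_item))
print(all_fields)
```

Only "site" is common to both records; every other column of a shared table would be empty for one of them.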
3, Simple use of MongoDB (understand)
The installation of MongoDB is not covered here, since guides are easy to find online. Let's get straight to the point.
show dbs // Show all databases (short form)
show databases // Show all databases (full form)
db // Show the database currently in use
use xxx // Switch to database xxx
db.dropDatabase() // Delete the current database
show collections // Show all collections in the current database
db.collection_name.insert({}) // Insert one piece of data into the collection; put JSON-format data inside {}. If the collection does not exist it is created, with no size limit on the collection
db.createCollection(name, {options}) // Create a collection manually; options can be used to limit the collection
For example:
db.createCollection("xxx_log", {capped: true, size: 255}) // capped: whether the collection is fixed-size and overwrites its oldest data; size: maximum size in bytes
This way of creating a collection is particularly suitable for log data. Once the log data exceeds size, the oldest data is cleared automatically to make room for the newest.
db.collection_name.isCapped() // Check whether the collection has a capacity limit
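The rolling behavior of a capped collection can be pictured with a bounded queue. The sketch below uses Python's collections.deque as a stand-in; it is an analogy only. It caps the number of items, whereas the size option above caps bytes, which this sketch does not model.

```python
from collections import deque

# A deque with maxlen drops its oldest item when a new one is appended to a
# full queue, much as a capped collection discards its oldest log entries.
log = deque(maxlen=3)
for i in range(1, 6):
    log.append({"line": i})

print(list(log))  # only the three newest entries remain
```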
4, Addition, deletion, modification and query of MongoDB
1. Common data types in mongodb (understand)
Object ID: primary key ID
String: character string
Boolean: Boolean value
Integer: whole number
Double: decimal number
Array: array
Object: a document nested inside another document, for example {sname: 'Li Jiacheng', sage: 18, class: {cccc}}
Null: null value
Timestamp: timestamp
Date: date and time
2. Add data to mongodb
db.collection_name.insert({Field 1:Value 1, Field 2:Value 2})
db.student.insert({name: 'Jay Chou', age: 18, hobby: ['Oh, good!', 'Be cool']})
Note: if the collection does not exist, it will be created automatically
3. Modify data in mongodb
3.1 update
db.collection_name.update({query criteria}, {content to be modified}, {multi: false, upsert: false})
multi defaults to false: only the first matching piece of data is modified
upsert defaults to false: if it is set to true and no data matches the query criteria, a new piece of data is inserted
db.student.update({name: 'Jay Chou'}, {$set: {title: 'Chinese pop king', age: 16}})
The difference between using $set and not using $set:
$set will only modify the fields currently given, and other contents will be retained
Without $set, only the currently given fields will be retained, and other contents will be deleted
If multi is true, $set must be used, otherwise an error will be reported.
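The two update behaviors can be sketched on a plain Python dict. The apply_update helper below is hypothetical and only imitates the rules just described (it handles $set and whole-document replacement, nothing else); it is not how MongoDB implements updates.

```python
def apply_update(doc, update):
    """Return an updated copy of doc, imitating update() with and without $set."""
    if "$set" in update:
        # $set: change or add only the given fields, keep everything else.
        return {**doc, **update["$set"]}
    # No operator: the whole document is replaced; only _id survives.
    replaced = {"_id": doc["_id"]} if "_id" in doc else {}
    replaced.update(update)
    return replaced


jay = {"_id": 1, "name": "Jay Chou", "age": 18, "hobby": ["Oh, good!", "Be cool"]}
with_set = apply_update(jay, {"$set": {"title": "Chinese pop king", "age": 16}})
without_set = apply_update(jay, {"title": "Chinese pop king", "age": 16})

print(with_set)     # name and hobby are still there
print(without_set)  # name and hobby are gone
```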
3.2 save (understand)
db.collection_name.save({Data to be saved})
db.student.save({name: 'Wang Lihong', age: 188})
This is equivalent to adding a new piece of data for Wang Lihong. If the data to be saved contains the _id of existing data, save() is equivalent to an update.
4. Delete data from mongodb
4.1 remove()
db.collection_name.remove({condition}, {justOne: true|false})
db.student.remove({name: 'Wang Lihong'}, {justOne: true})
4.2 deleteOne()
db.collection_name.deleteOne({condition})
db.student.deleteOne({name: 'Cai Yilin'})
4.3 deleteMany()
db.collection_name.deleteMany({condition})
db.student.deleteMany({type: 'singer'})
5. Query data in mongodb
Prepare data:
db.student.insert([
    {name: "Zhu Yuanzhang", age: 800, address: 'Fengyang, Anhui Province', score: 160},
    {name: "Zhu Di", age: 750, address: 'Nanjing City, Jiangsu Province', score: 120},
    {name: "Zhu gaochi", age: 700, address: 'The Forbidden City in Beijing', score: 90},
    {name: "Li Jiacheng", age: 38, address: 'Hong Kong xxx street', score: 70},
    {name: "Cauliflower vine", age: 28, address: 'Guangdong Province xxx city', score: 80},
    {name: "Big Lao Wang", age: 33, address: 'Mars first satellite', score: -60},
    {name: "Baa Baa", age: 33, address: 'The black hole next to Kepler 225', score: -160}
])
5.1 general query
db.student.find() // Query all
db.student.findOne() // Query one
db.student.find({condition}) // Query by condition
5.2 comparison operation
Equal to: equality is the default comparison, or use $eq
Less than: $lt (less than)
Less than or equal to: $lte (less than equal)
Greater than: $gt (greater than)
Greater than or equal to: $gte (greater than equal)
Not equal to: $ne (not equal)
db.student.find({age: 28}) // Query students aged 28
db.student.find({age: {$eq: 28}}) // Query students aged 28
db.student.find({age: {$gt: 28}}) // Query students older than 28
db.student.find({age: {$gte: 28}}) // Query students aged 28 or older
db.student.find({age: {$lt: 38}}) // Query students younger than 38
db.student.find({age: {$lte: 38}}) // Query students aged 38 or younger
db.student.find({age: {$ne: 38}}) // Query students whose age is not equal to 38
5.3 logical operators
- and
$and: [condition 1, condition 2, condition 3]
Query students whose age is 33 and whose name is 'Big Lao Wang'
db.student.find({$and: [{age: {$eq: 33}}, {name: 'Big Lao Wang'}]})
- or
$or: [condition 1, condition 2, condition 3]
Query students whose name is 'Li Jiacheng' or who are over 100 years old
db.student.find({$or: [{name: 'Li Jiacheng'}, {age: {$gt: 100}}]})
- nor
$nor: [condition 1, condition 2, condition 3]
Query students whose age is not less than 38 and whose name is not 'Zhu Yuanzhang'
db.student.find({$nor: [{age: {$lt: 38}}, {name: 'Zhu Yuanzhang'}]})
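To check how the comparison and logical operators combine, the sketch below evaluates the same three queries in plain Python against a cut-down copy of the sample data. The matches and find helpers are hypothetical; they support only the operators listed in this section and assume every queried field exists in every document.

```python
import operator

COMPARE = {"$eq": operator.eq, "$ne": operator.ne, "$gt": operator.gt,
           "$gte": operator.ge, "$lt": operator.lt, "$lte": operator.le}


def matches(doc, query):
    """Return True if doc satisfies query (a tiny subset of the query language)."""
    for key, cond in query.items():
        if key == "$and":
            ok = all(matches(doc, q) for q in cond)
        elif key == "$or":
            ok = any(matches(doc, q) for q in cond)
        elif key == "$nor":
            ok = not any(matches(doc, q) for q in cond)
        elif isinstance(cond, dict):
            # e.g. {"age": {"$gt": 28}}: every operator must hold
            ok = all(COMPARE[op](doc[key], val) for op, val in cond.items())
        else:
            # a bare value means equality
            ok = doc[key] == cond
        if not ok:
            return False
    return True


def find(docs, query):
    return [d["name"] for d in docs if matches(d, query)]


students = [
    {"name": "Zhu Yuanzhang", "age": 800},
    {"name": "Li Jiacheng", "age": 38},
    {"name": "Cauliflower vine", "age": 28},
    {"name": "Big Lao Wang", "age": 33},
    {"name": "Baa Baa", "age": 33},
]

and_result = find(students, {"$and": [{"age": {"$eq": 33}}, {"name": "Big Lao Wang"}]})
or_result = find(students, {"$or": [{"name": "Li Jiacheng"}, {"age": {"$gt": 100}}]})
nor_result = find(students, {"$nor": [{"age": {"$lt": 38}}, {"name": "Zhu Yuanzhang"}]})
print(and_result, or_result, nor_result)
```

Note that $nor keeps a document only when every listed condition fails, which is why Li Jiacheng (age 38, so not less than 38) is the only match.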
5.4 range operator
Use $in and $nin to check whether a value is, or is not, in an array
Query students aged 28 or 38
db.student.find({age: {$in: [28, 38]}})
5.5 regular expressions
Use $regex for regular expression matching
Query students whose address starts with 'Beijing'
db.student.find({address: {$regex: '^Beijing'}})
db.student.find({address: /^Beijing/})
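The ^ anchor matters here. As a rough check, the sketch below applies the same pattern with Python's re module to three of the sample addresses; for a pattern this simple, Python's regex engine is assumed to behave the same way as MongoDB's.

```python
import re

addresses = [
    "Fengyang, Anhui Province",
    "Nanjing City, Jiangsu Province",
    "The Forbidden City in Beijing",
]

# ^Beijing only matches text that starts with "Beijing".
starts_with = [a for a in addresses if re.search(r"^Beijing", a)]
# Without the anchor, "Beijing" may appear anywhere in the text.
contains = [a for a in addresses if re.search(r"Beijing", a)]

print(starts_with)
print(contains)
```

With the sample data as written, no address starts with "Beijing", so the anchored pattern finds nothing, while the unanchored pattern finds the Forbidden City entry.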
5.6 user defined query (understand)
The mongo shell is a JavaScript execution environment
Use $where with a function that returns true for the data that meets the conditions
Query students older than 38
db.student.find({$where: function(){return this.age > 38}})
5.7 skip and limit
db.student.find().skip(3).limit(3)
Skip the first 3 and return the next 3. This is similar to limit 3, 3 in MySQL and can be used for paging
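For paging, the skip value is usually derived from a page number. The page_window helper below is a hypothetical sketch of that arithmetic, checked against a plain list standing in for the seven sample students.

```python
def page_window(page, page_size):
    """Return (skip, limit) for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return (page - 1) * page_size, page_size


rows = list(range(1, 8))  # stand-in for 7 documents in query order
skip, limit = page_window(2, 3)
second_page = rows[skip:skip + limit]
print(skip, limit, second_page)
```

Page 2 with a page size of 3 gives skip 3 and limit 3, the same values as the find().skip(3).limit(3) call above.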
5.8 projection
A projection controls which fields appear in the query results
Query the names, ages and scores of all students, but do not display _id
db.student.find({}, {_id: 0, name: 1, age: 1, score: 1})
Set the fields you want to see to 1
Note that 0 and 1 cannot be mixed in the same projection, except for _id
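The projection rules can be imitated on a Python dict. The project helper below is a hypothetical sketch for flat documents only; it mirrors the rules stated above (fields set to 1 are kept, fields set to 0 are hidden, _id is shown unless set to 0, and mixing is rejected) and is not MongoDB's own code.

```python
def project(doc, projection):
    """Apply a find()-style projection to a flat document."""
    show_id = bool(projection.get("_id", 1))
    flags = {k: bool(v) for k, v in projection.items() if k != "_id"}
    if any(flags.values()) and not all(flags.values()):
        raise ValueError("0 and 1 cannot be mixed, except for _id")
    if any(flags.values()) or (not flags and projection.get("_id")):
        # Inclusion mode: keep only the listed fields.
        out = {k: doc[k] for k in flags if k in doc}
    else:
        # Exclusion mode: keep everything that is not listed.
        out = {k: v for k, v in doc.items() if k != "_id" and k not in flags}
    if show_id and "_id" in doc:
        out = {"_id": doc["_id"], **out}
    return out


doc = {"_id": 7, "name": "Zhu Di", "age": 750, "address": "Nanjing", "score": 120}
included = project(doc, {"_id": 0, "name": 1, "age": 1, "score": 1})
excluded = project(doc, {"address": 0})
print(included)
print(excluded)
```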
5.9 sorting
sort({field1: 1, field2: -1})
1 indicates ascending order
-1 indicates descending order
Rank all students by score in descending order
db.student.find().sort({score: -1})
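When several fields are given, the first field is the primary sort key and later fields break ties. The sort_docs helper below is a hypothetical sketch of that behavior in plain Python, relying on the fact that Python's sort is stable.

```python
def sort_docs(docs, spec):
    """Sort by a list of (field, direction) pairs; direction is 1 or -1."""
    result = list(docs)
    # Sort by the least significant key first; later, more significant
    # sorts keep the earlier order among equal keys.
    for field, direction in reversed(spec):
        result.sort(key=lambda d: d[field], reverse=(direction == -1))
    return result


students = [
    {"name": "Big Lao Wang", "age": 33, "score": -60},
    {"name": "Baa Baa", "age": 33, "score": -160},
    {"name": "Cauliflower vine", "age": 28, "score": 80},
]

by_score_desc = [d["name"] for d in sort_docs(students, [("score", -1)])]
by_age_desc_score_asc = [d["name"] for d in sort_docs(students, [("age", -1), ("score", 1)])]
print(by_score_desc)
print(by_age_desc_score_asc)
```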
5.10 statistical quantity
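Count the number of query results
Count the number of students aged 33
db.student.count({age: 33})
In newer versions count() is deprecated; db.student.countDocuments({age: 33}) does the same job.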
5, Use of pymongo
1. Add, delete, modify and query
from pymongo import MongoClient


def get_db(database, user=None, pwd=None):
    client = MongoClient(host='localhost', port=27017)  # Default port number: 27017
    # If you have an account, you need to log in
    # (authenticate() exists in PyMongo 3 only; with PyMongo 4, pass
    # username=user, password=pwd to MongoClient instead)
    # admin = client['admin']
    # admin.authenticate(user, pwd)
    # If the user name and password are not set, you can switch directly
    db = client[database]  # use haha
    return db


def add_one(database, table, data):
    db = get_db(database)
    result = db[table].insert_one(data)
    return result


def add_many(database, table, data_list):
    db = get_db(database)
    result = db[table].insert_many(data_list)
    return result


def upd(database, table, condition, data):
    db = get_db(database)
    result = db[table].update_many(condition, {'$set': data})
    return result


def delete(database, table, condition):
    db = get_db(database)
    result = db[table].delete_many(condition)
    return result


def query(database, table, condition):
    db = get_db(database)
    result = db[table].find(condition)
    return list(result)


if __name__ == '__main__':
    # Add one student
    ret = add_one('haha', 'student', {'name': 'Jay Chou', 'age': 18, 'address': 'Yuan universe', 'score': 'Infinity'})
    print(ret)

    # Add several students
    data_list = [
        {'name': 'Cai Yilin', 'age': 17, 'address': 'Yuan universe', 'score': 'Infinity'},
        {'name': 'Eason Chan', 'age': 19, 'address': 'Yuan universe', 'score': 'Infinitesimal'},
        {'name': 'Xue You Zhang', 'age': 20, 'address': 'Yuan universe', 'score': 'Infinitesimal'}
    ]
    ret = add_many('haha', 'student', data_list)
    print(ret)

    # Modify student information
    ret = upd('haha', 'student', {"address": "Yuan universe"}, {"score": 100})
    print(ret)

    # Delete student information
    ret = delete('haha', 'student', {"address": "Yuan universe"})
    print(ret)

    # Query students over 33 years old
    ret = query('haha', 'student', {"age": {"$gt": 33}})
    for r in ret:
        print(r)

    # Query students whose address starts with 'Beijing'
    ret = query('haha', 'student', {"address": {"$regex": "^Beijing"}})
    for r in ret:
        print(r)
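If the server requires a login, one common alternative to a separate authenticate step is to put the credentials into the connection string. The build_mongo_uri helper below is a hypothetical sketch using only the standard library; the user name, password and the choice of admin as the authentication database are assumptions made for illustration.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host="localhost", port=27017, user=None, pwd=None, auth_db="admin"):
    """Build a mongodb:// connection string, escaping the credentials."""
    if user is None:
        return f"mongodb://{host}:{port}/"
    # Characters such as @ : / in the user name or password must be
    # percent-encoded, otherwise the URI cannot be parsed correctly.
    return (f"mongodb://{quote_plus(user)}:{quote_plus(pwd)}"
            f"@{host}:{port}/?authSource={auth_db}")


uri = build_mongo_uri(user="spider", pwd="p@ss:word/1")
print(uri)
# With pymongo installed, the string can be passed on: MongoClient(uri)
```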
2. Capture second-hand housing information
import requests
from lxml import etree
import pymysql

# add_many comes from the script in the previous section,
# assumed here to be saved as mangodb.py
from mangodb import add_many


def get_page_source(url):
    resp = requests.get(url)
    page_source = resp.text
    return page_source


def parse(html):
    tree = etree.HTML(html)
    li_list = tree.xpath('//ul[@class="sellListContent"]/li')
    result = []
    for li in li_list:
        title = li.xpath('./div[1]/div[1]/a/text()')[0]
        address = ' '.join(li.xpath('./div[1]/div[2]/div/a/text()'))
        houseInfo = li.xpath('./div[1]/div[3]/div/text()')[0]
        starInfo = li.xpath('./div[1]/div[4]/text()')[0]
        tag = ' '.join(li.xpath('./div[1]/div[5]/span/text()'))
        total_price = li.xpath('./div[1]/div[6]/div[1]/span/text()')[0] + 'Ten thousand yuan'
        per_price = li.xpath('./div[1]/div[6]/div[2]/span/text()')[0]
        dic = {
            "title": title,
            "address": address,
            "houseInfo": houseInfo,
            "starInfo": starInfo,
            "tag": tag,
            "total_price": total_price,
            "per_price": per_price
        }
        result.append(dic)
    return result


def save_to_mongo(data_list):
    add_many('ershoufang', 'ershoufang', data_list)
    print("One page saved!")


def save_to_mysql(data_list):
    # Defined up front so that except/finally can test them
    # even when the connection itself fails
    conn = None
    cursor = None
    try:
        conn = pymysql.connect(
            host='localhost',
            port=3306,
            user='root',
            password="xxxxxx",
            database='spider'
        )
        cursor = conn.cursor()
        sql = """
            insert into ershoufang(title, address, houseInfo, startInfo, tag, total_price, per_price)
            values (%s, %s, %s, %s, %s, %s, %s)
        """
        # Relies on each dic keeping the key order used in parse()
        lst = [tuple(dic.values()) for dic in data_list]
        cursor.executemany(sql, lst)
        conn.commit()
        print("One page saved!")
    except Exception as e:
        # Undo the partial insert and report the problem
        if conn:
            conn.rollback()
        print("Save failed:", e)
    finally:
        if cursor:
            cursor.close()
        if conn:
            conn.close()


if __name__ == '__main__':
    for i in range(1, 31):
        # f-string, so that {i} is replaced by the page number
        url = f"https://bj.lianjia.com/ershoufang/pg{i}/"
        page_source = get_page_source(url)
        data_list = parse(page_source)
        # save_to_mongo(data_list)
        save_to_mysql(data_list)
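Two small pieces of this script can be checked without touching the network or a database: building the page URLs, and turning each dictionary into a tuple for the SQL placeholders. The sketch below is one possible approach. The page_urls and to_rows helpers and the COLUMNS tuple are hypothetical names, and the column order is assumed to follow the dictionary built in parse().

```python
# Dictionary keys in the order the SQL placeholders expect them (assumed).
COLUMNS = ("title", "address", "houseInfo", "starInfo",
           "tag", "total_price", "per_price")


def page_urls(first=1, last=30):
    """Build the listing URLs; the page number is substituted into each one."""
    return [f"https://bj.lianjia.com/ershoufang/pg{i}/" for i in range(first, last + 1)]


def to_rows(data_list, columns=COLUMNS):
    """Pick values by key, so the result does not depend on dict key order."""
    return [tuple(dic[col] for col in columns) for dic in data_list]


# A made-up listing whose keys are deliberately out of order.
sample = [{"per_price": "p", "title": "t", "tag": "g", "address": "a",
           "houseInfo": "h", "starInfo": "s", "total_price": "tp"}]
rows = to_rows(sample)
urls = page_urls(1, 2)
print(rows)
print(urls)
```

Selecting values by key is a little more verbose than tuple(dic.values()), but it keeps the insert correct even if the dictionary is later built in a different order.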