Learn to crawl hands-on, Part 4: data storage

Posted by fgm on Mon, 13 Dec 2021 06:13:54 +0100

What is data storage

This hardly needs explaining: data storage means saving the data we crawl. Stored data takes many forms, but it falls into two main categories. One is plain text files, such as txt, json, and csv; the other is databases, such as MySQL, MongoDB, and Redis. Let's work through these methods one by one~

Preparation

Before learning data storage, we need some data to store. I won't repeat the crawling code here; I will reuse the Douban crawler demonstrated earlier. If anything is unclear, go back to my previous blog posts.

Save as text

File opening method

Tip: the second argument to open() is the mode, which controls how the file is opened. The common modes are:

Parameter   Meaning
r           Open read-only (default)
rb          Open read-only in binary mode (for audio, images, video)
r+          Open for reading and writing (the file must already exist)
rb+         Open for reading and writing in binary mode
w           Open for writing (overwrite if the file exists, create if it does not)
wb          Open for writing in binary mode (overwrite if the file exists, create if it does not)
w+          Open for reading and writing (overwrite if the file exists, create if it does not)
wb+         Open for reading and writing in binary mode (overwrite if the file exists, create if it does not)
a           Open for appending (create if the file does not exist)
ab          Open for appending in binary mode
a+          Open for reading and appending (create if the file does not exist)
ab+         Open for reading and appending in binary mode
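For crawler work the distinction that matters most is 'w' versus 'a': 'w' truncates the file on every run, while 'a' keeps appending. A tiny demonstration (demo.txt is just a throwaway name for this example):

# 'w' truncates: after this block demo.txt contains exactly one line
with open('demo.txt', 'w', encoding='utf-8') as f:
    f.write('first run\n')

# 'a' appends: every run adds a line to the end instead
with open('demo.txt', 'a', encoding='utf-8') as f:
    f.write('second run\n')

with open('demo.txt', 'r', encoding='utf-8') as f:
    print(f.read())  # first run / second run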

Save as txt text

This should be the simplest way to save crawler data, so let's look at the code directly. For convenience, each storage method in this post is defined as a function, and we just call the corresponding function to save the data.

# The first way to write it:
# Here you must keep a reference to the file object and
# remember to close it yourself when you are done
def save_txt_2(mystr):
    file = open('../Include/film-1.txt', 'a', encoding='utf-8')
    file.write(mystr + '\n')
    file.close()

# The second way to write it (short form, generally preferred):
# the with statement closes the file automatically,
# even if an exception is raised
def save_txt_1(mystr):
    with open('../Include/film.txt', 'a', encoding='utf-8') as file:
        file.write(f'{mystr}\n')
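In the crawler loop you would call one of these once per scraped record, for example (the sample strings here are invented, not real crawl output):

# Stand-in strings for real crawl results
for record in ['Movie A | 9.0 | 100000 ratings', 'Movie B | 8.5 | 50000 ratings']:
    save_txt_1(record)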

Save as JSON

JSON is a common data storage format: an object is a set of key-value pairs wrapped in curly braces, like the example at the top of the code below. Note that strictly valid JSON uses double quotes around strings, which is why the function serializes with json.dumps instead of writing the dict's Python repr.

import json

# A Python dict holding the data to be serialized, e.g.:
# {'name': 'Zhang San', 'age': 18, 'sex': 'male'}

# Store the data as JSON (one object per line)
def save_json(data):
    with open('../Include/film.json', 'a', encoding='utf-8') as file:
        # json.dumps emits valid JSON (double-quoted strings);
        # ensure_ascii=False keeps non-ASCII text such as Chinese readable
        file.write(json.dumps(data, ensure_ascii=False) + '\n')
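Since each line of the file is now an independent JSON object, reading the data back is just json.loads in a loop. A minimal sketch (read_json is a helper I'm adding for illustration):

import json

# Read the JSON-lines file back into dicts
def read_json():
    with open('../Include/film.json', 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if line:                     # skip blank lines
                print(json.loads(line))  # each line parses back to a dict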

Save as CSV

Comma-separated values (CSV, sometimes called character-separated values, because the separator need not be a comma) is a plain-text format for table data (numbers and text). Plain text means the file is a sequence of characters and contains nothing that must be interpreted as binary. A CSV file consists of any number of records separated by line breaks; each record consists of fields separated by a delimiter character or string, most commonly a comma or a tab.
Because a crawler usually collects each record as a dictionary, I will demonstrate how to store dictionary data as a csv file.

import csv

# Store the data as csv
def save_csv(data):
    # newline='' stops the csv module from writing blank rows on Windows
    with open('../Include/film.csv', 'a', encoding='utf-8', newline='') as file:
        # The fieldnames must match the keys of the dictionary being
        # stored, otherwise DictWriter raises an error
        fieldnames = ['Movie Title:', 'Director and actor:', 'Movie rating:', 'Number of evaluators:', 'Film summary:']
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writerow(data)
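Note that the function above never writes a header row. A common pattern is to write it only when the file is new or empty; here is a minimal sketch of that variant (save_csv_with_header is my name, not from the original code):

import csv
import os

def save_csv_with_header(data):
    path = '../Include/film.csv'
    fieldnames = ['Movie Title:', 'Director and actor:', 'Movie rating:', 'Number of evaluators:', 'Film summary:']
    # Write the header only if the file does not exist yet or is empty
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', encoding='utf-8', newline='') as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        if need_header:
            writer.writeheader()
        writer.writerow(data)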

Save to database

Save to MySQL

As an open-source relational database, MySQL is much loved by developers, and we often use it to store crawler data.
Unlike the text storage above, we normally need to connect to the database before we can store anything. Here I have created a database named movies and a table named movie in MySQL in advance.

import pymysql

# Store the data in MySQL
def save_mysql(data):
    # Establish the connection
    db = pymysql.connect(host='127.0.0.1', user='root', password='123456', port=3306, db='movies')

    # Create a cursor
    cursor = db.cursor()

    # SQL statement with placeholders; pymysql fills them in safely
    sql = "insert into movie(mname,director,score,allcount,summary) values (%s,%s,%s,%s,%s)"
    # Pull the values out of the dictionary in column order
    values = (data['Movie Title:'], data['Director and actor:'], data['Movie rating:'],
              data['Number of evaluators:'], data['Film summary:'])
    try:
        cursor.execute(sql, values)
        db.commit()
    except Exception as e:
        print('Failed to insert data', e)
        db.rollback()  # roll back the transaction on failure
    # Close the cursor
    cursor.close()
    # Close the connection
    db.close()
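For reference, a table matching the insert above could be created like this; the column types and lengths are my assumption, since the post never shows the actual schema:

import pymysql

# Assumed schema for the movie table; adjust types and lengths as needed
db = pymysql.connect(host='127.0.0.1', user='root', password='123456', port=3306, db='movies')
cursor = db.cursor()
cursor.execute("""
    create table if not exists movie (
        id int primary key auto_increment,
        mname varchar(100),
        director varchar(255),
        score varchar(10),
        allcount varchar(50),
        summary text
    )
""")
cursor.close()
db.close()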

Save to MongoDB

MongoDB is a database based on distributed file storage, written in C++. It aims to provide a scalable, high-performance data storage solution for web applications.
MongoDB sits between relational and non-relational databases, and among non-relational databases it is the most feature-rich and the closest to a relational one. The data structures it supports are very loose: data is stored in BSON, a JSON-like binary format, so it can hold fairly complex data types. Mongo's biggest strength is its very powerful query language, whose syntax is somewhat like an object-oriented query language; it can express most of what a single-table query can do in a relational database, and it also supports indexes on the data.
As with the database connection above, we need to connect to the database first, then execute the corresponding statements to store the data.

import pymongo

# Store the data in MongoDB
def save_mongo(data):
    # Create the database connection
    client = pymongo.MongoClient(host='localhost', port=27017)
    # Specify the database
    db = client.movies
    # Specify the collection
    collection = db.movie
    # Insert one document; note that insert_one also adds an
    # '_id' field to the data dict in place
    result = collection.insert_one(data)
    print(result.inserted_id)
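Reading documents back is just as direct. A quick sketch that looks up one film by title and then iterates over everything stored (read_mongo and the placeholder title are mine):

import pymongo

def read_mongo():
    client = pymongo.MongoClient(host='localhost', port=27017)
    collection = client.movies.movie
    # Look up a single document by exact field match
    print(collection.find_one({'Movie Title:': 'some title'}))
    # Iterate over every stored document
    for doc in collection.find():
        print(doc)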


Save to Redis

Redis (Remote Dictionary Server) is an open-source, log-structured key-value database written in ANSI C; it supports networking, can run purely in memory or with persistence, and provides APIs in many languages. Redis is a high-performance key-value store: its appearance largely made up for the shortcomings of key/value stores such as memcached, and in some scenarios it is a good complement to a relational database.

import json
import redis

# Store the data in Redis
def save_redis(data):
    r = redis.Redis(host='localhost', port=6379, db=1, decode_responses=True, password='123456')
    # Use a hash named 'movies': field = movie title, value = the record as JSON
    r.hset('movies', data['Movie Title:'], json.dumps(data))

# Read the data back from Redis
def read_redis():
    r = redis.Redis(host='localhost', port=6379, db=1, decode_responses=True, password='123456')
    for k in r.hkeys('movies'):
        d = r.hget('movies', k)
        print(json.loads(d))
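A quick round trip with a made-up record shows the hash layout (the record below is invented; save_redis only requires the 'Movie Title:' key):

# A made-up record for illustration
sample = {'Movie Title:': 'Example Film', 'Movie rating:': '9.0'}
save_redis(sample)   # stored in hash 'movies' under field 'Example Film'
read_redis()         # prints every stored record, including the sample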


Summary

With the introduction above, we have now covered most of the common methods of crawler data storage, all of which come up regularly in real projects. Some will point out that there are many other file types, such as the xlsx spreadsheet format and so on... but anyone who has used them knows they are stored in essentially the same way as csv, and only small changes are needed.
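For instance, a minimal xlsx counterpart to save_csv might look like the sketch below, assuming the openpyxl package is installed; reloading and re-saving the workbook on every call keeps it simple rather than fast:

import os
from openpyxl import Workbook, load_workbook

def save_xlsx(data):
    path = '../Include/film.xlsx'
    fieldnames = ['Movie Title:', 'Director and actor:', 'Movie rating:', 'Number of evaluators:', 'Film summary:']
    if os.path.exists(path):
        wb = load_workbook(path)   # reopen the existing workbook
        ws = wb.active
    else:
        wb = Workbook()            # first run: new workbook plus header row
        ws = wb.active
        ws.append(fieldnames)
    ws.append([data[name] for name in fieldnames])
    wb.save(path)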
That's all for today. In the next issue, I will walk you through simple data analysis and visualization using the data we have stored.
Rush!!!!

author: KK
time: December 13, 2021 12:05:45
flag: 10/30

Topics: Python crawler