Crawler series: using MySQL to store data

Posted by Terminator on Thu, 09 Dec 2021 18:04:49 +0100

In the previous article, How do crawlers store CSV files, we explained how to save collected data to a CSV file. In this article, we explain how to save the collected data to a MySQL database.

MySQL is the most popular open source relational database management system. It is remarkable for an open source project to be so competitive: its popularity keeps closing in on the two closed source commercial database systems, Microsoft's SQL Server and Oracle Database (MySQL itself came under Oracle's ownership when Oracle acquired Sun Microsystems in 2010).

Its popularity is well deserved. MySQL is the best choice for most applications: a very flexible, stable, and full-featured DBMS used by many top websites, including YouTube, Twitter, and Facebook.

Because it is widely used, free, and ready to use out of the box, it is a common database for web data collection projects. In this article, we introduce how to store collected data in MySQL.

Install MySQL

If this is your first time working with MySQL, it may feel a little intimidating. In fact, installing it is as simple as installing any other software. Ultimately, MySQL is just a set of data files, stored on your remote server or local computer, that contain all the information stored in the database.

Whether you need to install MySQL on Windows, Ubuntu, or macOS, the specific steps are covered here: Full platform installation of MySQL

There is not much to explain here; just follow along with the video.

Basic commands

After the MySQL server is started, there are many ways to interact with it. Because many tools have graphical interfaces, you can manage a database without the MySQL command line (or while rarely using it). Tools like phpMyAdmin and MySQL Workbench make it easy to view, sort, and create databases. Even so, it is important to know how to operate a database from the command line.

MySQL is case insensitive except for user-defined identifiers such as database and table names (whose case sensitivity depends on the operating system). For example, SELECT is the same as select, but the convention is to write all MySQL keywords in uppercase, and most developers prefer lowercase for the names of databases and tables.

When you first log in to MySQL, there is no database ready to store your data, so we need to create one:

CREATE DATABASE scraping_article DEFAULT CHARACTER SET UTF8 COLLATE UTF8_GENERAL_CI;
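
You can confirm that the new database exists by listing all the databases on the server:

SHOW DATABASES;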

Because each MySQL instance can contain multiple databases, you need to specify which database to use before working with it:

USE scraping_article;

From now on (until you close the MySQL connection or switch to another database), all commands will run against the new scraping_article database.

That all looks very simple, so creating a new table in the database should work the same way, right? Let's try to create a table to store the collected article data:

CREATE TABLE articles;

The result shows an error:

ERROR 1113 (42000): A table must have at least 1 column

Unlike a database, a MySQL table cannot be created without at least one column. To define fields (data columns) in MySQL, we must put the field definitions in a comma-separated list inside parentheses after the CREATE TABLE <tablename> statement:

create table articles
(
    id                   int auto_increment
        primary key,
    title                varchar(64)                        null,
    body                 text                               null,
    summary              varchar(256)                       null,
    body_html            text                               null,
    create_time          datetime default CURRENT_TIMESTAMP null,
    time_updated         datetime                           null,
    link_text            varchar(128)                       null
);

Each field definition consists of three parts:

  • Name (id, title, body, etc.)
  • Data type (INT, VARCHAR, TEXT)
  • Other optional attributes (NOT NULL, AUTO_INCREMENT)

At the end of the field definition list, you also need to define a "key". MySQL uses this primary key to organize the contents of the table, which makes later queries fast. In a future article I will introduce how to use keys like this to improve query speed, but for now we simply use the table's id column as the primary key.

After the statement is executed, we can use DESCRIBE to view the structure of the data table:
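
DESCRIBE articles;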

+--------------+--------------+------+-----+-------------------+----------------+
| Field        | Type         | Null | Key | Default           | Extra          |
+--------------+--------------+------+-----+-------------------+----------------+
| id           | int(11)      | NO   | PRI | NULL              | auto_increment |
| title        | varchar(64)  | YES  |     | NULL              |                |
| body         | text         | YES  |     | NULL              |                |
| summary      | varchar(256) | YES  |     | NULL              |                |
| body_html    | text         | YES  |     | NULL              |                |
| create_time  | datetime     | YES  |     | CURRENT_TIMESTAMP |                |
| time_updated | datetime     | YES  |     | NULL              |                |
| link_text    | varchar(128) | YES  |     | NULL              |                |
+--------------+--------------+------+-----+-------------------+----------------+
8 rows in set (0.03 sec)

For now the table is empty. Let's insert some data:

INSERT INTO articles(title,body,summary,body_html,link_text) VALUES ("Test page title","Test page body.","Test page summary.","<p>Test page body.</p>","test-page");

Note that although the articles table has 8 fields (id, title, body, summary, body_html, create_time, time_updated, link_text), we only insert data for 5 of them (title, body, summary, body_html, link_text). The id field is auto-incremented (by default MySQL adds 1 every time a row is inserted), so it usually needs no handling. In addition, the create_time field has CURRENT_TIMESTAMP as its default, so the timestamp is filled in automatically.
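
You can verify that MySQL filled in both fields automatically:

SELECT id, create_time FROM articles WHERE link_text="test-page";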

Of course, we can also set these fields explicitly when inserting data:

INSERT INTO articles(id,title,body,summary,body_html,create_time,link_text) VALUES (4,"Test page title","Test page body.","Test page summary.","<p>Test page body.</p>","2021-11-20 15:51:45","test-page");

As long as the integer you supply does not already exist in the table's id field, the row can be inserted. However, this is bad practice; unless absolutely necessary (for example, to restore a row of data that went missing), let MySQL handle the id and timestamp fields itself.

Now that there is some data in the table, we can query it in many ways. Here are some examples of SELECT statements:

SELECT * FROM articles WHERE id=1;

This statement tells MySQL to "select all data from the articles table where id equals 1". The asterisk (*) is a wildcard that means all fields, so this statement displays every field of each row that satisfies the condition (WHERE id=1). If no row has an id equal to 1, an empty set is returned. For example, the following case insensitive query returns all fields of every row whose title contains "test" (the % symbol is the MySQL string wildcard):

SELECT * FROM articles WHERE title LIKE "%test%"; 

But what if your table has many fields and you only want some of them returned? Instead of the asterisk, list the fields you want:

SELECT title, body FROM articles WHERE body LIKE "%test%";

This returns only the title and body fields of the rows whose body contains "test".

The syntax of the DELETE statement is similar to that of the SELECT statement:

DELETE FROM articles WHERE id=1;

Since deleted data cannot be recovered, it is a good habit to run a SELECT first to confirm exactly what will be deleted (for the statement above: SELECT * FROM articles WHERE id=1) and only then replace SELECT * with DELETE. Many programmers have painful memories of a DELETE gone wrong, and there are horror stories of someone forgetting the WHERE clause in a hurry and wiping out all the customer data. Don't let this happen to you!
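
In practice, the two-step habit looks like this:

SELECT * FROM articles WHERE id=1;
-- check the output, then delete the same rows
DELETE FROM articles WHERE id=1;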

Another thing to introduce is the UPDATE statement:

UPDATE articles SET title="A new title", body="Some new body." WHERE id=4;

So far we have used only the most basic MySQL statements to query, insert, update, and delete data.

Integration with Python

Python does not have built-in MySQL support. However, there are many open source libraries for interacting with MySQL, supporting both Python 2.x and Python 3.x. The most famous is PyMySQL.

We can install it with pip by executing the following command:

python3 -m pip install PyMySQL

After installation, we can use the PyMySQL package. If your MySQL server is running, you should be able to execute the following script successfully:

import os

import pymysql
from dotenv import load_dotenv

# Load the MySQL credentials from a .env file in the working directory
dotenv_path = os.path.join(os.getcwd(), '.env')
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path)

conn = pymysql.connect(host=os.environ.get('MYSQL_HOST'),
                       # the port must be an integer; environment variables are strings
                       port=int(os.environ.get('MYSQL_PORT')),
                       user=os.environ.get('MYSQL_USER'),
                       password=os.environ.get('MYSQL_PASSWORD'),
                       db=os.environ.get('MYSQL_DATABASES'))
cur = conn.cursor()
cur.execute("SELECT * FROM articles WHERE id=4;")
print(cur.fetchone())
cur.close()
conn.close()

This script uses two objects: a connection object (conn) and a cursor object (cur).

The connection/cursor pattern is common in database programming, although when you first encounter databases it can be hard to tell the two apart. Besides connecting to the database, the connection object sends database information, handles rollbacks (when a query or a group of queries is interrupted, the database needs to return to its initial state, which is generally implemented with transactions), and creates new cursor objects.
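
Here is a minimal sketch of that rollback behavior, assuming the articles table from earlier (the credentials are placeholders):

import pymysql

# placeholder credentials; substitute your own
conn = pymysql.connect(host='localhost', user='root', password='', db='scraping_article')
try:
    with conn.cursor() as cur:
        cur.execute("UPDATE articles SET title=%s WHERE id=%s", ("A new title", 4))
    conn.commit()       # make the change permanent
except pymysql.MySQLError:
    conn.rollback()     # the transaction was interrupted; return to the initial state
    raise
finally:
    conn.close()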

A connection can have many cursors. A cursor tracks certain state information, such as which database it is using. If you have multiple databases and need to write to all of them, you may need multiple cursors. A cursor also holds the results of the last query it executed; by calling its methods, such as cur.fetchone(), you can retrieve those results.

Make sure you close connections and cursors once you have finished with them. Failing to do so causes connection leaks: the connection is no longer used, but the server cannot close it because it cannot tell whether you still intend to use it. Leaked connections consume database resources indefinitely, so remember to close them when you are done!

What we set out to do is save the collected data to the database. Continuing with the blog post collection example, here is how to implement the storage step:

import pymysql
import os
from dotenv import load_dotenv

from config import logger_config
from utils import connection_util


class DataSaveToMySQL(object):
    def __init__(self):
        # loading env config file
        dotenv_path = os.path.join(os.getcwd(), '.env')
        if os.path.exists(dotenv_path):
            load_dotenv(dotenv_path)
        # MySQL config
        self._host = os.environ.get('MYSQL_HOST')
        self._port = int(os.environ.get('MYSQL_PORT'))
        self._user = os.environ.get('MYSQL_USER')
        self._password = os.environ.get('MYSQL_PASSWORD')
        self._db = os.environ.get('MYSQL_DATABASES')

        self._target_url = 'https://www.scrapingbee.com/blog/'
        self._baseUrl = 'https://www.scrapingbee.com'
        self._init_connection = connection_util.ProcessConnection()
        logging_name = 'store_mysql'
        init_logging = logger_config.LoggingConfig()
        self._logging = init_logging.init_logging(logging_name)

    def scrape_data(self):
        get_content = self._init_connection.init_connection(self._target_url)
        if get_content:
            parent = get_content.findAll("section", {"class": "section-sm"})[0]
            get_row = parent.findAll("div", {"class": "col-lg-12 mb-5 mb-lg-0"})[0]
            get_child_item = get_row.findAll("div", {"class": "col-md-4 mb-4"})
            for item in get_child_item:
                # Get title text
                get_title = item.find("a", {"class": "h5 d-block mb-3 post-title"}).get_text()
                # Get publishing time
                get_release_date = item.find("div", {"class": "mb-3 mt-2"}).findAll("span")[1].get_text()
                # Get article description
                get_description = item.find("p", {"class": "card-text post-description"}).get_text()
                self.article_save_mysql(title=get_title, description=get_description, release_date=get_release_date)
        else:
            self._logging.warning('No content of the article was obtained, please check!')


    def article_save_mysql(self, title, description, release_date):
        # PyMySQL uses MySQL's charset names, so it is 'utf8', not 'utf-8'
        connection = pymysql.connect(host=self._host, port=self._port, user=self._user, password=self._password,
                                     db=self._db, charset='utf8')
        try:
            with connection.cursor() as cursor:
                # Create a new record
                sql = "INSERT INTO articles (title,summary,create_time) VALUES (%s,%s,%s);"
                cursor.execute(sql, (title, description, release_date))
            # connection is not autocommit by default. So you must commit to save
            # your changes.
            connection.commit()
        finally:
            connection.close()
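
A minimal way to run the whole collection, assuming the class above is saved as a script, would be:

if __name__ == '__main__':
    DataSaveToMySQL().scrape_data()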

Here are a few points to note. First, charset='utf8' should be added to the connection settings (PyMySQL expects MySQL's charset names, so 'utf8' rather than 'utf-8'). This makes the connection treat all information sent to the database as UTF-8, provided the database's default encoding is also set to UTF-8.

Then pay attention to the article_save_mysql function. It takes three parameters, title, description, and release_date, adds them to an INSERT statement, and executes the statement with a cursor; the connection then commits the change. This is a good example of the separation between cursor and connection: the cursor holds the database context and the statement that was executed, while the connection confirms the change by committing it, which is what actually writes the row to the database.

In the code above, the with statement closes the cursor automatically, and a finally block makes sure the connection itself is closed. In the previous article, we used with in the same way to close the CSV file.

Although PyMySQL is not a large package, it contains some very practical functions that are not demonstrated in this article. For details, refer to the Python DBAPI standard documentation.

That covers saving collected content to MySQL. All the code for this example is hosted on GitHub.

github: https://github.com/sycct/Scra...

If you have any questions, feel free to open a GitHub issue.

Topics: Python MySQL github crawler