Crawler series: collecting data through web forms and login windows

Posted by ashmo on Wed, 12 Jan 2022 03:01:18 +0100

In the last installment we covered data standardization: first sorting words by frequency, then normalizing case to reduce duplicate entries in the 2-gram sequence.

Once we move beyond the basics of web data collection, the first question we run into is likely to be: "How can I get the information behind the login window?" Today the web is moving towards page interaction, social media and user-generated content, and forms and login windows are integral parts of many websites. Fortunately, they are relatively easy to handle.

So far, the web crawlers in our previous examples have used the HTTP GET method to request information from web servers. In this article we focus on the POST method, which pushes information to the web server for storage and analysis.

A page form is basically a way for users to submit a POST request in a shape that the server can understand and use. Just as a link URL helps users send GET requests, an HTML form helps users send POST requests. Of course, with a little extra effort we can also build these requests ourselves and submit them to the server from a web crawler.
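As a rough illustration of the difference, the following sketch builds (but does not send) one GET request and one POST request with the standard library. The host www.example.com, the paths and the field values are placeholders chosen for this example, not taken from a real site.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical field names and values, used only for illustration.
search_fields = {"Keyword": "zabbix"}
login_fields = {"username": "admin", "passwd": "secret"}

# GET: the fields travel in the URL's query string.
get_req = Request("http://www.example.com/search?" + urlencode(search_fields))

# POST: the same encoding is used, but the fields travel in the request body.
post_req = Request(
    "http://www.example.com/login",
    data=urlencode(login_fields).encode("ascii"),
)

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

Nothing is sent over the network here; the two Request objects only show where the form fields end up in each case.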

Python Requests Library

Although the Python standard library can also work with web forms, a little syntactic sugar can sometimes make life sweeter. When you want to do more than the basic GET requests that urllib handles, it is worth looking at third-party libraries outside the Python standard library.

The Requests library is one such third-party Python library. It is good at handling complex HTTP requests, cookies and headers (both response headers and request headers).

Here is what Kenneth Reitz, the creator of Requests, had to say about the tools in the Python standard library:

Python's standard library module urllib2 provides most of the HTTP functionality you need, but its API is very poor. It was built up step by step over many years, for a different time and a different web. As a result, even the simplest task takes a great deal of work (even overriding methods).

Things shouldn't be so complicated, let alone in Python.

Like any third-party Python library, Requests can be installed with a Python package manager such as pip, or by downloading the Requests library source code and installing from it.

Submit a basic form

Most web forms consist of HTML fields, a submit button, and an "execution result" page (the value of the form's action attribute) that the browser jumps to once the form has been processed. Although these HTML fields usually hold text, they can also be used to upload files or other non-text content.

Note that most mainstream websites state in their robots.txt file that crawlers are not allowed to access the login form. For background, see this article: Crawler series: moral hazard and legal responsibility brought by crawlers.
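Before submitting a form, a crawler can check the site's rules programmatically. The sketch below uses the standard library's urllib.robotparser on a made-up robots.txt; the rule and the two URLs are assumptions for illustration, and a real crawler would download the target site's own file instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

login_url = "http://www.test.com/admin/index.php?c=session&a=login"
public_url = "http://www.test.com/blog/"

# can_fetch() answers whether the given user agent may request the URL.
login_allowed = parser.can_fetch("*", login_url)
public_allowed = parser.can_fetch("*", public_url)
print(login_allowed, public_allowed)
```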

For example, the following is the source code of a form:

<form action="index.php?c=session&a=login" method=post name="form1">
    <div class="input_title">user name</div>
    <div class="input_box" style="margin-bottom: 10px;">
        <input id='username' name="username" />
    </div>
    <div class="input_title">password</div>
    <div class="input_box">
        <input type="password" name="passwd" />
    </div>
    <div class="login_button">
        <input type="submit" value="Sign in" />
    </div>
</form>

There are a few points to note here. First, and most importantly, the names of the two input fields are username and passwd. A field's name determines the variable name that is sent to the server when the form is submitted. If you want to simulate submitting the form, your variable names must match the field names exactly, one for one.

We also need to note that the form's real action takes place at index.php?c=session&a=login. Every POST request from the form goes to this page, not to the page where the form itself is displayed. Remember: the purpose of an HTML form is only to help site visitors send properly formatted requests to the page that does the real work. Unless you want to study how the request itself is put together, there is no need to spend much time on the page that displays the form.
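The two points above (field names and the action URL) can be read out of the form source mechanically. The following is a minimal sketch using the standard library's html.parser on a trimmed copy of the login form; the FormInspector class is a helper written for this example, and page_url is an assumed address for the page that displays the form.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Trimmed copy of the login form shown above.
LOGIN_FORM = """
<form action="index.php?c=session&a=login" method=post name="form1">
    <input id='username' name="username" />
    <input type="password" name="passwd" />
    <input type="submit" value="Sign in" />
</form>
"""


class FormInspector(HTMLParser):
    """Collects the form's action, method and the names of its input fields."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = "get"  # HTML default when no method is given
        self.field_names = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action")
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and attrs.get("name"):
            # Inputs without a name (such as the submit button here) are not sent.
            self.field_names.append(attrs["name"])


inspector = FormInspector()
inspector.feed(LOGIN_FORM)

# Assumed address of the page that displays the form.
page_url = "http://www.test.com/admin/"
# The action is relative, so resolve it against the page URL.
target_url = urljoin(page_url, inspector.action)

print(inspector.method, target_url, inspector.field_names)
```

The resolved target_url is the address a POST request should be sent to, and field_names lists the keys the request data needs.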

Submitting a form with the Requests library takes only a few lines of code, including the import statement and the statement that prints the result:

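import requests

# The keys must match the name attributes of the form's input fields
params = {'username': 'admin', 'passwd': '5e_KR&pXJ9=J(c7d9P-twt9:'}

# POST to the form's action URL, not to the page that displays the form
r = requests.post("http://www.test.com/admin/index.php?c=session&a=login", data=params)
print(r.text)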

After the form is submitted, the program prints the source code of the result page, which looks like this:

<frameset name="right" rows="64,9,*" cols="*" frameborder="no" border="0" framespacing="0">
    <frame src="?c=index&a=top" name="top" noresize="noresize" scrolling="no">
    <frame src="?c=index&a=controltop" name="controltop" noresize="noresize" scrolling="no">
    <frameset cols="170,*" frameborder="no" border="0" framespacing="0" name="main1">
        <frame src="?c=index&a=left" name="left" scrolling="auto" noresize>
        <frame src="?c=index&a=main" name="main" scrolling="yes">
    </frameset>
</frameset>
<noframes>
    <body bgcolor='#FFFFFF' text='#000000'>
    Your browser does not support frames!
    </body>
</noframes>

Because we submitted the form through Requests rather than in a browser, what we get back is the raw frameset source, including the "Your browser does not support frames!" fallback text, but the login itself succeeded. We will come back to this in detail later, when we need a browser to collect content.
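To see what Requests actually puts on the wire for this login, the request can be built without being sent. The sketch below uses requests.Request and its prepare() method, then decodes the body again with the standard library; it is only an inspection aid, not part of the original example.

```python
import requests
from urllib.parse import parse_qs

params = {'username': 'admin', 'passwd': '5e_KR&pXJ9=J(c7d9P-twt9:'}

# Build the request without sending it, so no network access is needed.
request = requests.Request(
    "POST",
    "http://www.test.com/admin/index.php?c=session&a=login",
    data=params,
)
prepared = request.prepare()

print(prepared.method)
print(prepared.headers["Content-Type"])
print(prepared.body)  # URL-encoded form body; special characters are percent-escaped

# Decoding the body recovers the original field values.
decoded = parse_qs(prepared.body)
print(decoded)
```

The special characters in the password (&, =, ( and :) are escaped in the body, so they cannot be mistaken for field separators on the server side.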

The same code can handle many simple forms. The following is the form code of an email subscription:

<form data-formid="4ea5f424-2d32-44c7-b603-887d06d6bde0" novalidate>
    <input type="hidden" name="lifecyclestage" value="subscriber">
    <input type="hidden" name="content_language" value="English">
    <ul class="blog-subscribe-form__blog-list">
        <li style="display: none;">
            <input class="frequency" name="blog_default_blog_subscription" data-frequency="daily">
        </li>                           
        <li style="">
            <input type="checkbox" id="hs-marketing-blog-sticky" name="marketing_blog_subscribe_check_box_via_comments" data-frequency="daily" data-frequency-field="blog_default_blog_subscription" data-subscription-type-id="98684">
                <label for="hs-marketing-blog-sticky" class="blog-subscribe-form__checkbox">Marketing</label>
        </li>
    </ul>
    <ul class="hs-error-msgs inputs-list" role="alert">
        <li data-reactid=".hbspt-forms-0.1:$1.1:$email.3.$0">
            <label class="validation blog-list-validation"></label>
        </li>
    </ul>
    <div class="blog-subscribe-form__email-input-container">
        <div class="blog-subscribe-form__email-input">
            <label for="email-address">Email Address</label>
            <input type="email" placeholder="" id="email-address" name="email" value="" required>
        </div>    
    </div>
    <ul class="hs-error-msgs inputs-list" role="alert">
        <li data-reactid=".hbspt-forms-0.1:$1.1:$email.3.$0">
            <label class="validation email-validation"></label>
        </li>
    </ul>
    <input type="submit" value="Subscribe" class="blog-subscribe-form__button cta cta--primary-light cta--medium">
</form>

Although it looks a little scary at first, in most cases we only need to focus on two things:

  • The name of the data field you want to submit, such as email in the form above

  • The action attribute of the form, that is, the page the form data is sent to for processing when the form is submitted
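Besides the visible email field, the subscription form also carries hidden inputs that a browser would submit unchanged. The sketch below collects them from a trimmed copy of the form and adds the email field to build a payload. HiddenFieldCollector is a helper written for this example, the email address is a placeholder, and whether the server actually requires the hidden fields is an assumption.

```python
from html.parser import HTMLParser

# Trimmed copy of the subscription form above, keeping only some of its inputs.
SUBSCRIBE_FORM = """
<form data-formid="4ea5f424-2d32-44c7-b603-887d06d6bde0" novalidate>
    <input type="hidden" name="lifecyclestage" value="subscriber">
    <input type="hidden" name="content_language" value="English">
    <input type="email" placeholder="" id="email-address" name="email" value="" required>
    <input type="submit" value="Subscribe">
</form>
"""


class HiddenFieldCollector(HTMLParser):
    """Collects name/value pairs of hidden inputs."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and attrs.get("name"):
            self.hidden[attrs["name"]] = attrs.get("value") or ""


collector = HiddenFieldCollector()
collector.feed(SUBSCRIBE_FORM)

# Start from the hidden defaults, then add the field the user would type in.
payload = dict(collector.hidden)
payload["email"] = "reader@example.com"  # placeholder address
print(payload)
```

A dictionary like payload is what would be passed as data= to requests.post, once the address that processes the form is known.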

Radio buttons, checkboxes, and other inputs

Obviously, not every page is just a handful of text fields and a submit button. The HTML standard provides a large number of form fields: radio buttons, checkboxes, drop-down lists and so on. HTML5 adds further controls, such as sliders (range input fields), email fields and date fields. Custom JavaScript fields can do almost anything, including color pickers, calendars and whatever else developers can think of.

No matter how complex a form's fields look, there are still only two things to focus on: field names and values. A field name is easy to find by viewing the source code and looking for the name attribute. Field values are sometimes more complicated and may be generated by JavaScript just before the form is submitted. A color picker is one example of such an unusual form field; it may submit a value such as #f5c26b.
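Whatever the control looks like on the page, what reaches the server is still a list of name/value pairs. The sketch below shows how a few kinds of field might be encoded with the standard library's urlencode. All field names here are hypothetical; real ones come from the name attributes in the page source.

```python
from urllib.parse import urlencode

# Hypothetical field names and values, for illustration only.
form_state = {
    "plan": "monthly",            # radio group: only the selected option's value is sent
    "topics": ["python", "web"],  # two checked checkboxes sharing one name
    "newsletter": None,           # unchecked checkbox: a browser sends nothing for it
    "favorite_color": "#f5c26b",  # color picker value
}

# Drop unchecked fields, then let doseq=True repeat the key for multi-valued fields.
submitted = {name: value for name, value in form_state.items() if value is not None}
body = urlencode(submitted, doseq=True)
print(body)
```

Note that the # in the color value is percent-escaped, and that the shared checkbox name simply appears once per checked box.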

If you are not sure about the format of an input field's value, there are tools that track the contents of the GET and POST requests the browser sends to and receives from a website. As mentioned earlier, the simplest and most direct way to inspect a GET request is to look at the URL. Suppose the URL looks like this:

https://pdf-lib.org/Home/SearchResult?Keyword=zabbix

The form that produced this request might look like this:

<form action="/Home/SearchResult" method="get">
    <input type="text" value="" id="Keyword" name="Keyword">
    <input type="submit" value="" id="Search">
</form>
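Going from the URL back to the form's fields can also be done in code. The following sketch splits the URL above with urllib.parse, recovers the field name and value, and rebuilds the URL with a different search term; the replacement term "python" is just an example value.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

url = "https://pdf-lib.org/Home/SearchResult?Keyword=zabbix"

parts = urlsplit(url)
# The query string holds the form's field names and the submitted values.
fields = parse_qs(parts.query)

# Reuse the same field name with a different search term.
new_query = urlencode({"Keyword": "python"})
new_url = urlunsplit((parts.scheme, parts.netloc, parts.path, new_query, ""))

print(parts.path, fields, new_url)
```

Here parts.path corresponds to the form's action and the key in fields corresponds to the input's name attribute.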

Form submissions can be more complex than this. If you encounter a POST form that looks complicated and you want to see exactly which parameters the browser passes to the server, the easiest way is to use the Chrome browser's Inspect Element feature (the inspector) or its developer tools.

Summary

For reasons of space, today we only covered basic forms, radio buttons, checkboxes and other form inputs, and how to submit them to the server with Requests.

In the next article we will cover submitting files and images, handling logins and cookies, HTTP basic access authentication, and other form-related issues.

The source code is hosted on GitHub: https://github.com/sycct/Scrape_1_1.git

If you have any questions, please open an issue.

 

Topics: Python github crawler