SVG Mapping Reverse Crawling Example Exercise to extract text directly from SVG text pictures

Posted by rcity on Sat, 19 Feb 2022 23:34:48 +0100

If you are Xiao Bai, this data set can help you become a bull. If you have rich development experience, this data set can help you break through the bottleneck.
2022 Web Full Video Tutorial Front End Architecture H5 vue node Applet Video+Materials+Code+Interview Questions.

I've already described parsing data for CSS image offset and font backcrawl, and the links are as follows:

The last one focuses on the random order of font backcrawling and upgrade to outline, or even basic font shape, a few years later. In recent years, I don't need to understand so deeply that I can't read the direct collection, because at present no website's fonts are as complex as I imagine.

SVG Mapping Reverse Crawl Example

Today, I'm going to share a relatively simple example, SVG Text Picture Offset Reverse Crawl Data Extraction.

This anti-crawling mechanism is also much weaker than the first two, so few websites are still using this anti-crawling mechanism.

A practice site for SVG mapping backcrawling has been found here for you to play with. The website is: http://www.porters.vip/confusion/food.html

Analyse the DOM structure:

You can clearly see that the CSS picture background offset is used to locate and display the corresponding data, but this picture is a text picture in svg format.

View the svg as follows:

This means that the actual storage of this svg is still plain text.

Then we can completely parse the svg and locate the matched actual text in the svg according to the offset value of the background directly.

To make extracting css data easier, I use the selenium operation directly here, first open the destination web address:

from selenium import webdriver
browser = webdriver.Chrome()
url = 'http://www.porters.vip/confusion/food.html#'
browser.get(url)

Through food. As you can see from the code block at the beginning of line 4 of css, the CSS style for all labels positioned using svg text pictures is d[class^="vhk"]

So we take out the address of the svg by any element with that style:

d_tag = browser.find_element_by_css_selector('d[class^="vhk"]')
background_image_url = d_tag.value_of_css_property("background-image")
svg_url = background_image_url[5:-2]
svg_url


'http://www.porters.vip/confusion/font/food.svg'

Today, a new library, requests-html, is introduced for parsing purposes. The framework author is the author of requests.

You can install it using pip install requests-html. What are the advanced features of PIP install requests compared to previous requests? You can refer to the official manual for the address: https://requests-html.kennethreitz.org/

For me personally, the easiest thing to do now is to have built-in CSS and xpath selectors, and no additional packages to parse the data.

Today we will use the library's xpath selector directly to extract data from the svg file above. Looking at the xml structure of the svg content above, you can see that the x-attributes of each text node are consistent and increase in order.

Then I use the requests-html library to download and parse the code as follows:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get(svg_url)
xs = []
ys = []
data = []
for text_tag in r.html.xpath(r"//text"):
    if not xs:
        xs.extend(map(int, text_tag.xpath(".//@x")[0].split()))
    ys.append(int(text_tag.xpath(".//@y")[0]))
    data.append(list(text_tag.xpath(".//text()")[0]))
print(xs)
print(ys)
print(data)


[14, 28, 42, 56, 70, 84, 98, 112, 126, 140, 154, 168, 182, 196, 210, 224, 238, 252, 266, 280, 294, 308, 322, 336, 350, 364, 378, 392, 406, 420, 434, 448, 462, 476, 490, 504, 518, 532, 546, 560, 574, 588, 602, 616, 630, 644, 658, 672, 686, 700, 714, 728, 742, 756, 770, 784, 798, 812, 826, 840, 854, 868, 882, 896, 910, 924, 938, 952, 966, 980, 994, 1008, 1022, 1036, 1050, 1064, 1078, 1092, 1106, 1120, 1134, 1148, 1162, 1176, 1190, 1204, 1218, 1232, 1246, 1260, 1274, 1288, 1302, 1316, 1330, 1344, 1358, 1372, 1386, 1400, 1414, 1428, 1442, 1456, 1470, 1484, 1498, 1512, 1526, 1540, 1554, 1568, 1582, 1596, 1610, 1624, 1638, 1652, 1666, 1680, 1694, 1708, 1722, 1736, 1750, 1764, 1778, 1792, 1806, 1820, 1834, 1848, 1862, 1876, 1890, 1904, 1918, 1932, 1946, 1960, 1974, 1988, 2002, 2016, 2030, 2044, 2058, 2072, 2086, 2100]
[38, 83, 120, 164]
[['1', '5', '4', '6', '6', '9', '1', '3', '6', '4', '9', '7', '9', '7', '5', '1', '6', '7', '4', '7', '9', '8', '2', '5', '3', '8', '3', '9', '9', '6', '3', '1', '3', '9', '2', '5', '7', '2', '0', '5', '7', '3'], ['5', '6', '0', '8', '6', '2', '4', '6', '2', '8', '0', '5', '2', '0', '4', '7', '5', '5', '4', '3', '7', '5', '7', '1', '1', '2', '1', '4', '3', '7', '4', '5', '8', '5', '2', '4', '9', '8', '5', '0', '1', '7'], ['6', '7', '1', '2', '6', '0', '7', '8', '1', '1', '0', '4', '0', '9', '6', '6', '6', '3', '0', '0', '0', '8', '9', '2', '3', '2', '8', '4', '4', '0', '4', '8', '9', '2', '3', '9', '1', '8', '5', '9', '2', '3'], ['6', '8', '4', '4', '3', '1', '0', '8', '1', '1', '3', '9', '5', '0', '2', '7', '9', '6', '8', '0', '7', '3', '8', '2']]

You can see that you have successfully extracted the data you need.

Let's start by demonstrating how to get an offset value using a phone number as an example:

import re
d_tags = browser.find_elements_by_css_selector('.more d[class^="vhk"]')
for d_tag in d_tags:
    position = d_tag.value_of_css_property("background-position")
    x, y = map(int, re.findall("d+", position))
    print(position, x, y)


-8px -15px 8 15
-274px -141px 274 141
-274px -141px 274 141
-176px -141px 176 141
-7px -15px 7 15
-288px -141px 288 141
-288px -141px 288 141
-7px -15px 7 15

This resolves the svg image offset value for each numeric position.

What if you understand these numbers? I will manually change the first digit to -8px -15px, as shown below:

The red part is the whole picture canvas, -8px means 8 pixels to the left, -15px means 15 pixels to the up, so the upper left part is clipped out, and then just get the picture of the specified length and width range, and the blue box part of the picture is displayed on the interface.

So how do we match to get the data we need?

The insertion point can then be obtained by a 2-point lookup, for example, for the number 1 in the upper right corner, which can be obtained as follows:

from bisect import bisect

data[bisect(ys, 15)][bisect(xs, 8)]


'1'

After refreshing the page, try again in batches:

import re

d_tags = browser.find_elements_by_css_selector('.more d[class^="vhk"]')
for d_tag in d_tags:
    position = d_tag.value_of_css_property("background-position")
    x, y = map(int, re.findall("d+", position))
    print(x, y, data[bisect(ys, y)][bisect(xs, x)])

From the result, each node matches the corresponding text number accurately:

Now consider replacing all the above svg nodes with corresponding text, then you can directly extract the text data as a whole (call a JavaScript script for node replacement):

import re

def parseAndReplaceSvgNode(d_tags):
    for d_tag in d_tags:
        position = d_tag.value_of_css_property("background-position")
        x, y = map(int, re.findall("d+", position))
        num = data[bisect(ys, y)][bisect(xs, x)]
        # Replace node with plain text
        browser.execute_script(f"""
            var element = arguments[0];
            element.parentNode.replaceChild(document.createTextNode("{num}"), element);
        """, d_tag)


d_tags = browser.find_elements_by_css_selector('.more d[class^="vhk"]')
parseAndReplaceSvgNode(d_tags)
phone = browser.find_element_by_css_selector('.more').text
phone


'Telephone: 400-51771'

After execution, the corresponding svg nodes are converted to corresponding text nodes:

This allows us to easily parse all the data we need.

Based on this, I have parsed the basic knowledge completely, and publish the complete processing code below:

import re
from requests_html import HTMLSession
from selenium import webdriver
from bisect import bisect


def parseAndReplaceSvgNode(d_tags):
    for d_tag in d_tags:
        position = d_tag.value_of_css_property("background-position")
        x, y = map(int, re.findall("d+", position))
        num = data[bisect(ys, y)][bisect(xs, x)]
        # Replace node with plain text
        browser.execute_script(f"""
            var element = arguments[0];
            element.parentNode.replaceChild(document.createTextNode("{num}"), element);
        """, d_tag)


browser = webdriver.Chrome()
url = 'http://www.porters.vip/confusion/food.html#'
browser.get(url)
d_tag = browser.find_element_by_css_selector('d[class^="vhk"]')
background_image_url = d_tag.value_of_css_property("background-image")
svg_url = background_image_url[5:-2]

session = HTMLSession()
html_session = session.get(svg_url)
xs = []
ys = []
data = []
for text_tag in html_session.html.xpath(r"//text"):
    if not xs:
        xs.extend(map(int, text_tag.xpath(".//@x")[0].split()))
    ys.append(int(text_tag.xpath(".//@y")[0]))
    data.append(list(text_tag.xpath(".//text()")[0]))
    
# Replace all svg nodes in the entire DOM as corresponding text at once
parseAndReplaceSvgNode(
    browser.find_elements_by_css_selector('d[class^="vhk"]'))
# Remove label a
element = browser.find_element_by_css_selector('.title a')
browser.execute_script("""
var element = arguments[0];
element.parentNode.removeChild(element);
""", element)
# Get Title
title = browser.find_element_by_class_name("title").text
# Get comments
comment = browser.find_element_by_class_name("comments").text
# per capita
avgPrice = browser.find_element_by_class_name('avgPriceTitle').text
# Taste, environment, service
comment_score_tags = browser.find_elements_by_css_selector(
    ".comment_score .item")
taste = comment_score_tags[0].text
environment = comment_score_tags[1].text
service = comment_score_tags[2].text
# address
address = browser.find_element_by_css_selector('.address .address_detail').text
# characteristic
characteristic = browser.find_element_by_css_selector(
    '.characteristic .info-name').text
# Telephone
phone = browser.find_element_by_class_name("more").text

print(title, comment, avgPrice, taste, environment,
      service, address, characteristic, phone)


Liuzhou spiral jelly powder 100 reviewers per capita:12 Flavor:8.7 Environmental Science:7.4 service:7.6 Shop characteristics of No. 28 Puxi Road, Zhongshan Avenue: crisp and cool bamboo shoots, hot and spicy red oil, chives and radishes, want to have a call after eating: 400-51771

You can see that each text is extracted perfectly and accurately.

Topics: Javascript Front-end html