Using Selenium with PHP to crawl data

Posted by Zhadus on Sat, 05 Oct 2019 23:59:45 +0200

	## This article is an example for reference only. If you reproduce it, please credit the source.
	## For more information, see the translated documentation: [php selenium document](https://www.kancloud.cn/wangking/selenium/234534)

1. Download Selenium Server:
Download address: http://www.seleniumhq.org/download
After downloading, put the selenium-server-standalone .jar file into the project directory.

2. Install the PHP WebDriver client through Composer (you may need a proxy or VPN if the package repository is unreachable from your network):
composer require facebook/webdriver

3. Open CMD, change to the folder that contains the .jar file, and start Selenium Server (this requires Java; install it first if it is not already installed):
java -jar selenium-server-standalone-2.42.2.jar
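Before running the examples, it can help to confirm that the server from step 3 is actually listening. The following is a minimal sketch, assuming the server exposes its status endpoint at /wd/hub/status on the default port; seleniumIsReachable() is a hypothetical helper written for this article, not part of any library.

```php
<?php
// Hypothetical helper (not part of php-webdriver): returns true when the
// Selenium Server status endpoint answers with a JSON document.
function seleniumIsReachable(string $hubUrl, float $timeoutSeconds = 3.0): bool
{
    $context = stream_context_create([
        'http' => ['timeout' => $timeoutSeconds, 'ignore_errors' => true],
    ]);
    // Suppress the warning PHP raises when the connection is refused.
    $body = @file_get_contents(rtrim($hubUrl, '/') . '/status', false, $context);
    if ($body === false) {
        return false;
    }
    // Any decodable JSON object counts as "the server answered".
    return is_array(json_decode($body, true));
}

if (seleniumIsReachable('http://localhost:4444/wd/hub')) {
    echo "Selenium Server is running\n";
} else {
    echo "Selenium Server is not reachable; start it with java -jar first\n";
}
```

The check only tells you that something answered with JSON at that address. It does not verify that Chrome can be launched.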

4. Code example (open a page and print its title):

<?php

    // Declare the namespace, import the classes, and load Composer's autoloader
    namespace Facebook\WebDriver;
    use Facebook\WebDriver\Remote\DesiredCapabilities;
    use Facebook\WebDriver\Remote\RemoteWebDriver;
    require_once('vendor/autoload.php');

    // Send the output as UTF-8 encoded HTML
    header("Content-Type: text/html; charset=UTF-8");

    $waitSeconds = 15;  // Maximum number of seconds to wait for elements to load (not used in this minimal example)
    $host = 'http://localhost:4444/wd/hub'; // default address of Selenium Server

    // Start a Chrome session; 5000 is the connection timeout in milliseconds.
    // Chrome and a matching chromedriver must be available to Selenium Server.
    $capabilities = DesiredCapabilities::chrome();
    $driver = RemoteWebDriver::create($host, $capabilities, 5000);
    // Navigate to the stock quote page
    $driver->get('http://quote.eastmoney.com/sz000001.html');

    // Get the title of the website
    echo $driver->getTitle();
    
    //Close the browser
    $driver->quit();

?>
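The $waitSeconds value describes a "wait up to 15 seconds, then fail" policy. As a rough illustration of that idea, here is a minimal sketch of a polling wait in plain PHP. waitUntil() is a hypothetical helper, not a php-webdriver function, and the condition used here is a stand-in so the snippet runs without a browser.

```php
<?php
// Hypothetical helper: poll $condition until it returns a truthy value,
// or throw once the timeout has expired.
function waitUntil(callable $condition, float $timeoutSeconds = 15.0, int $intervalMs = 250)
{
    $deadline = microtime(true) + $timeoutSeconds;
    while (true) {
        $result = $condition();
        if ($result) {
            return $result;
        }
        if (microtime(true) >= $deadline) {
            throw new RuntimeException("Condition not met within {$timeoutSeconds} seconds");
        }
        // Sleep between checks so the loop does not spin the CPU.
        usleep($intervalMs * 1000);
    }
}

// Usage with a stand-in condition that succeeds on the third check.
$attempts = 0;
$value = waitUntil(function () use (&$attempts) {
    $attempts++;
    return $attempts >= 3 ? 'loaded' : false;
}, 2.0, 10);
echo $value, " after ", $attempts, " checks\n"; // prints "loaded after 3 checks"
```

With a real session, the condition could be a closure that checks whether `$driver->findElements(...)` returns a non-empty array for the element you are waiting on.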

5. Code example (crawling the data in a table):
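The listing in this step calls isElementExsit(), which is not defined anywhere in this article. The following is one possible implementation, offered as an assumption about what the helper does: it reports whether at least one element matches the locator. It relies only on findElements() returning an empty array when nothing matches.

```php
<?php
// Assumed implementation of the isElementExsit() helper used in this step.
// $driver is any object with a findElements() method (such as RemoteWebDriver);
// $locator is the WebDriverBy locator to look for.
function isElementExsit($driver, $locator): bool
{
    // findElements() returns an empty array when nothing matches,
    // so no exception handling is needed here.
    return count($driver->findElements($locator)) > 0;
}
```

Define it in the same file as index(), or in a file that is loaded before index() is called.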

    function index(){

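        // Remove PHP's execution time limit, since crawling can take a while
        set_time_limit(0);

        // Maximum number of seconds to wait for elements to appear before an error is raised
        $waitSeconds = 15;
        $host = 'http://localhost:4444/wd/hub'; // default address of Selenium Server

        // Open a browser session (5000 ms connection timeout) and apply the wait time
        $capabilities = DesiredCapabilities::chrome();
        $driver = RemoteWebDriver::create($host, $capabilities, 5000);
        $driver->manage()->timeouts()->implicitlyWait($waitSeconds);
        $driver->get('http://quote.eastmoney.com/changes/stocks/sz000001.html');

        // Stop if the price element is missing; close the browser first so the session is not left open.
        // isElementExsit() is a user-defined helper that is not part of this listing.
        if (!isElementExsit($driver, WebDriverBy::className('xp1'))) {
            $driver->quit();
            return;
        }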

        // Get the current share price
        $element = $driver->findElement(WebDriverBy::cssSelector("strong.xp1"));
        $xp1 = $element->getText();

        // Get the total market value
        $element = $driver->findElement(WebDriverBy::cssSelector("td#zsz"));
        $zsz = $element->getText();

        // Load the report page and collect every cell of the report table
        $driver->get("http://data.eastmoney.com/bbsj/stock000001/yjbb.html");
        $cells = $driver->findElements(WebDriverBy::cssSelector("#dt_1>tbody>tr>td"));

        // findElements() returns all cells as one flat list, so they are regrouped into rows here.
        $line = 0;
        $sharesInfo = array();
        $allSharesInfo = array();
        foreach ($cells as $k => $cell) {

            // Collect the cell texts; every 17 cells make up one table row
            $sharesInfo[] = $cell->getText();
            if (($k + 1) % 17 == 0) {
                // Replace line breaks inside cells with '-'; "\r\n" is listed first so it becomes a single '-'
                $allSharesInfo['list'][] = str_replace(array("\r\n", "\n", "\r"), '-', $sharesInfo);

                // Start a new row
                $sharesInfo = array();

                // Keep only the first 12 rows (one page): write them to the file and discard the rest.
                // Note that nothing is written if the table has fewer than 12 rows.
                $line++;
                if ($line >= 12) {
                    $allSharesInfo['zsz'] = $zsz;
                    $allSharesInfo['xp1'] = $xp1;
                    file_put_contents("000001.txt", json_encode($allSharesInfo)."\r\n");
                    $allSharesInfo = array();
                    break;
                }
            }
        }

        // Close the browser and end the session (file_put_contents() closes the file itself, so no fclose() is needed)
        $driver->quit();
    }
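The row-building loop above can also be written with PHP's array functions. The following is a minimal sketch on stand-in data, assuming the cell texts have already been read into a plain array; groupCellsIntoRows() is a hypothetical helper name. Unlike the loop above, it returns whatever complete rows exist even when there are fewer than the maximum.

```php
<?php
// Hypothetical helper: regroup a flat list of cell texts into complete rows.
function groupCellsIntoRows(array $cellTexts, int $columns, int $maxRows): array
{
    // Normalise line breaks inside cells; "\r\n" is listed first so it becomes one '-'.
    $clean = str_replace(["\r\n", "\n", "\r"], '-', $cellTexts);
    // Drop a trailing partial row, then split the rest into rows of $columns cells.
    $complete = array_slice($clean, 0, intdiv(count($clean), $columns) * $columns);
    $rows = array_chunk($complete, $columns);
    // Keep at most $maxRows rows.
    return array_slice($rows, 0, $maxRows);
}

// Stand-in data: 7 cells with 3 columns give 2 complete rows; the 7th cell is dropped.
$rows = groupCellsIntoRows(['a', "b\r\nc", 'd', 'e', 'f', 'g', 'h'], 3, 12);
echo json_encode($rows), "\n"; // prints [["a","b-c","d"],["e","f","g"]]
```

For the table in this step, the call would use 17 columns and 12 rows, with the cell texts collected from getText() on each element.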

Topics: Selenium PHP Java REST