Web Scraping With PHP

Thanks to its extensive libraries and tools, PHP is a great language for building web scrapers. Designed specifically for web development, PHP handles web scraping tasks with ease and reliability.

There are many ways to scrape websites with PHP, and this article explores a few of them. Specifically, you’ll learn how to scrape websites using curl, file_get_contents, Symfony BrowserKit, and Symfony’s Panther component. You’ll also learn about some common challenges you may face during web scraping and how to avoid them.

wp:paragraph

In this section, you’ll learn a few commonly used methods for scraping both basic static sites and complex, dynamic ones.

/wp:paragraph

wp:quote

wp:paragraph

Please note: While we cover various methods in this tutorial, this is by no means an exhaustive list.

/wp:paragraph

/wp:quote

wp:heading {"level":3}

Prerequisites

/wp:heading

wp:paragraph

To follow along with this tutorial, you need the latest version of PHP and Composer, a dependency manager for PHP. This article was tested using PHP 8.1.18 and Composer 2.5.5.

/wp:paragraph

wp:paragraph

Once PHP and Composer are set up, create a directory named php-web-scraping and cd into it:

/wp:paragraph

wp:code

mkdir php-web-scraping
cd php-web-scraping

/wp:code

wp:paragraph

You’ll work in this directory for the rest of the tutorial.

/wp:paragraph

wp:heading {"level":3}

curl

/wp:heading

wp:paragraph

curl is a near-ubiquitous low-level library and CLI tool written in C. It can be used to fetch the contents of a web page using HTTP or HTTPS. On almost all platforms, PHP comes with curl support enabled out of the box.

/wp:paragraph

wp:paragraph

In this section, you’ll scrape a very basic web page that lists countries by population based on estimates by the United Nations. You’ll extract the links in the menu along with the link texts.

/wp:paragraph

wp:paragraph

To start, create a file called curl.php and then initialize curl in that file with the curl_init function:

/wp:paragraph

wp:code

<?php
$ch = curl_init();

/wp:code

wp:paragraph

Then set the options for fetching the web page. This includes setting the URL and the HTTP method (GET, POST, etc.) using the function curl_setopt:

/wp:paragraph

wp:code

curl_setopt($ch, CURLOPT_URL, 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)');

curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

/wp:code

wp:paragraph

In this code, you set the target URL to the web page and the method to GET. The CURLOPT_RETURNTRANSFER option tells curl_exec to return the response as a string instead of printing it directly.

/wp:paragraph

wp:paragraph

Once curl is ready, you can make the request using curl_exec:

/wp:paragraph

wp:code

$response = curl_exec($ch);

/wp:code
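
wp:paragraph

A request can fail, in which case curl_exec returns false rather than HTML. The following is a minimal sketch of how the fetch could be wrapped with basic error handling. The fetchHtml helper, its timeout parameter, and the decision to treat HTTP status codes of 400 and above as failures are assumptions made for illustration and are not part of the tutorial’s scripts:

/wp:paragraph

wp:code

```php
<?php
// Hypothetical helper: returns the HTML of $url, or null when the request fails.
function fetchHtml(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeoutSeconds);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);

    $response = curl_exec($ch);

    if ($response === false) {
        // curl_error() describes transport-level failures such as DNS errors,
        // refused connections, and timeouts.
        error_log('Request failed: ' . curl_error($ch));
        return null;
    }

    // An HTTP error status (404, 500, ...) still counts as a completed
    // transfer, so the status code is checked separately.
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($status >= 400) {
        error_log("Server responded with HTTP {$status}");
        return null;
    }

    // curl_close() has had no effect since PHP 8.0; the handle is released
    // when $ch goes out of scope.
    return $response;
}
```

/wp:code

wp:paragraph

With a helper like this, the rest of the script only needs to check whether the returned value is null before it starts parsing.

/wp:paragraph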

wp:paragraph

Fetching the HTML data is only the first step in web scraping. Next, you need to extract data from the HTML response, and there are several techniques for doing that. The simplest is to use regular expressions. Keep in mind that regex can’t parse arbitrary HTML, but it’s enough for very simple extraction.

/wp:paragraph

wp:paragraph

For example, extract the <a> tags, which have href and title attributes and contain a <span>:

/wp:paragraph

wp:code

if ($response !== false) {
    // Capture the href (1), title (2), and link text (3) of each menu link.
    preg_match_all(
        '/<a href="([^"]*)" title="([^"]*)"><span>([^<]*)<\/span><\/a>/',
        $response, $matches, PREG_SET_ORDER
    );
    foreach ($matches as $link) {
        echo $link[1] . " => " . $link[3] . "\n";
    }
}

/wp:code

wp:paragraph

Then release the resources by using the curl_close function:

/wp:paragraph

wp:code

curl_close($ch);

/wp:code

wp:paragraph

Run the code with the following:

/wp:paragraph

wp:code

php curl.php

/wp:code

wp:image {"id":176475,"sizeSlug":"large","linkDestination":"none"}

/wp:image
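
wp:paragraph

As noted earlier, regex only suits very simple extraction. The sketch below illustrates why by running the same kind of pattern against a small inline HTML string. The markup is made up for illustration and is not taken from the Wikipedia page. The second link is identical in structure to the first, except that its title attribute comes before href, and that is enough for the pattern to miss it:

/wp:paragraph

wp:code

```php
<?php
// Made-up markup for illustration only.
$html = <<<HTML
<ul>
  <li><a href="/wiki/Main_Page" title="Visit the main page"><span>Main page</span></a></li>
  <li><a title="Find background information" href="/wiki/About"><span>About</span></a></li>
</ul>
HTML;

// '#' is used as the delimiter so the slashes in the closing tags need no escaping.
$pattern = '#<a href="([^"]*)" title="([^"]*)"><span>([^<]*)</span></a>#';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

// Only the first link is printed; the second has its attributes in a
// different order and does not match.
foreach ($matches as $link) {
    echo $link[1] . ' => ' . $link[3] . "\n";
}
```

/wp:code

wp:paragraph

An HTML parser doesn’t depend on attribute order, whitespace, or quoting style, which is why it’s the safer choice for anything beyond trivial markup.

/wp:paragraph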

wp:paragraph

curl gives you very low-level control over how a web page is fetched over HTTP/HTTPS. You can fine-tune the different connection properties and even add additional measures, such as proxy servers (more on this later), user agents, and timeouts.

/wp:paragraph
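
wp:paragraph

As a rough sketch of that fine-tuning, the options can be collected in an array and applied in one call with curl_setopt_array. The buildCurlOptions helper, the user agent string, and the timeout values below are illustrative assumptions rather than recommended settings:

/wp:paragraph

wp:code

```php
<?php
// Hypothetical helper: gathers the connection settings in one place.
function buildCurlOptions(string $url): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,    // but give up after five hops
        CURLOPT_CONNECTTIMEOUT => 10,   // seconds allowed for connecting
        CURLOPT_TIMEOUT        => 30,   // seconds allowed for the whole transfer
        CURLOPT_USERAGENT      => 'MyScraper/1.0 (+https://example.com/bot)',
    ];
}

$ch = curl_init();
// Apply every option at once instead of calling curl_setopt repeatedly.
curl_setopt_array($ch, buildCurlOptions('https://example.com'));
```

/wp:code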

wp:paragraph

Additionally, curl is installed by default on most operating systems, which makes it a great choice for writing a cross-platform web scraper.

/wp:paragraph

wp:paragraph

However, as you saw, curl is not enough on its own, and you need an HTML parser to properly scrape data. curl also can’t execute JavaScript on a web page, which means you can’t scrape dynamic web pages and single-page applications (SPAs) with curl.

/wp:paragraph

wp:heading {"level":3}

file_get_contents

/wp:heading

wp:paragraph

The file_get_contents function is primarily used for reading the contents of a file. However, by passing an HTTP URL, you can fetch HTML data from a web page. This means file_get_contents can replace the usage of curl in the previous code.

/wp:paragraph

wp:paragraph

In this section, you’ll scrape the same page as before, but this time, the scraper will be more advanced, and you’ll be able to extract the names of all the countries from the table.

/wp:paragraph

wp:paragraph

Create a file named file_get_contents.php and start by passing a URL to file_get_contents:

/wp:paragraph

wp:code

<?php

$html = file_get_contents('https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)');

/wp:code

wp:paragraph

The $html variable now holds the HTML code of the web page.

/wp:paragraph

wp:paragraph

Similar to the previous example, fetching the HTML data is just the first step. To spice things up, use libxml to select elements using XPath selectors. To do that, you first need to initialize a DOMDocument and load the HTML into it:

/wp:paragraph

wp:code

$doc = new DOMDocument;
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

/wp:code

wp:paragraph

You’ll select the countries with an XPath expression that follows this path: the first tbody element, a tr element inside the tbody, the first td in the tr element, and an a with a title attribute inside the td element.

/wp:paragraph

wp:paragraph

The following code creates a DOMXPath object and uses its evaluate method to select the elements with the XPath expression:

/wp:paragraph

wp:code

$xpath = new DOMXPath($doc);

// [@title] keeps only the anchors that have a title attribute.
$countries = $xpath->evaluate('(//tbody)[1]/tr/td[1]//a[@title]');

/wp:code

wp:paragraph

All that is left is to loop over the elements and print the text:

/wp:paragraph

wp:code

foreach ($countries as $country) {
    echo $country->textContent . "\n";
}

/wp:code

wp:paragraph

Run the code with the following:

/wp:paragraph

wp:code

php file_get_contents.php

/wp:code

wp:image {"id":176486,"sizeSlug":"large","linkDestination":"none"}

/wp:image
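
wp:paragraph

To experiment with the XPath expression without making a network request, you can run it against a small inline table. The markup below is a made-up stand-in for the real page, which is considerably more complex. The third row has a link without a title attribute, so it’s expected to be skipped:

/wp:paragraph

wp:code

```php
<?php
// Made-up table for illustration only.
$html = <<<HTML
<table>
  <tbody>
    <tr><td><a href="/wiki/A" title="Country A">Country A</a></td><td>100</td></tr>
    <tr><td><a href="/wiki/B" title="Country B">Country B</a></td><td>200</td></tr>
    <tr><td><a href="#note">[a]</a></td><td>n/a</td></tr>
  </tbody>
</table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// First tbody -> its rows -> first cell -> any anchor that has a title attribute.
$names = [];
foreach ($xpath->query('(//tbody)[1]/tr/td[1]//a[@title]') as $anchor) {
    $names[] = trim($anchor->textContent);
}

print_r($names);
```

/wp:code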

wp:paragraph

As you can see, file_get_contents is simpler to use than curl and is often used to quickly fetch the HTML code of a web page. However, it suffers the same drawbacks as curl: you need an additional HTML parser, and you can’t scrape dynamic web pages and SPAs. Additionally, you lose the fine-tuned controls provided by curl. Still, its simplicity makes it a good choice for scraping basic static sites.

/wp:paragraph
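
wp:paragraph

A limited amount of control does remain available through PHP’s stream contexts, which file_get_contents accepts as its third argument. The following sketch sets a user agent and a timeout this way. The createScraperContext helper and the values passed to it are assumptions for illustration:

/wp:paragraph

wp:code

```php
<?php
// Hypothetical helper: builds an HTTP stream context for file_get_contents.
function createScraperContext(string $userAgent, float $timeoutSeconds)
{
    return stream_context_create([
        'http' => [
            'method'  => 'GET',
            'header'  => "User-Agent: {$userAgent}\r\n",
            'timeout' => $timeoutSeconds, // read timeout in seconds
        ],
    ]);
}

$context = createScraperContext('MyScraper/1.0', 10.0);

// Pass the context as the third argument when fetching a page, for example:
// $html = file_get_contents('https://example.com', false, $context);
```

/wp:code

wp:paragraph

This covers basic needs such as headers and timeouts, but curl still offers a much wider range of options.

/wp:paragraph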

wp:heading {"level":3}

Symfony BrowserKit

/wp:heading

wp:paragraph

Symfony BrowserKit is a component of the Symfony framework that simulates the behavior of a real browser. This means you can interact with the web page like in an actual browser; for example, clicking on buttons/links, submitting forms, and going back and forward in history.

/wp:paragraph

wp:paragraph

In this section, you’ll visit the Bright Data blog, enter PHP in the search box, and submit the search form. Then you’ll scrape the article names from the result.

/wp:paragraph

wp:image {"id":176496,"sizeSlug":"large","linkDestination":"none"}

/wp:image

wp:paragraph

To use Symfony BrowserKit, you must install the BrowserKit component with Composer:

/wp:paragraph

wp:code

composer require symfony/browser-kit

/wp:code

wp:paragraph

You also need to install the HttpClient component to make HTTP requests over the internet:

/wp:paragraph

wp:code

composer require symfony/http-client

/wp:code

wp:paragraph

BrowserKit supports selecting elements using XPath selectors by default. In this example, you use CSS selectors. For that, you need to install the CssSelector component as well:

/wp:paragraph

wp:code

composer require symfony/css-selector

/wp:code

wp:paragraph

Create a file named symfony-browserkit.php. In this file, initialize HttpBrowser:

/wp:paragraph

wp:code

<?php
require "vendor/autoload.php";

use Symfony\Component\BrowserKit\HttpBrowser;

$client = new HttpBrowser();

/wp:code

wp:paragraph

Use the request function to make a GET request:

/wp:paragraph

wp:code

$crawler = $client->request('GET', 'https://brightdata.com/blog');

/wp:code

wp:paragraph

To select the form that contains the search button, you need to select the button itself and use the form function to get the enclosing form. The button can be selected with the filter function by passing its ID. Once the form is selected, you can submit it using the submit function of the HttpBrowser class.

/wp:paragraph

wp:paragraph

By passing an associative array of input values, the submit function can fill in the form before it’s submitted. In the following code, the input with the name q has been given the value PHP, which is the same as typing PHP into the search box:

/wp:paragraph

wp:code

$form = $crawler->filter('#blog_search')->form();

$crawler = $client->submit($form, ['q' => 'PHP']);

/wp:code

wp:paragraph

The submit function returns the resulting page. From there, you can extract the article names using the CSS selector .col-md-4.mb-4 h5:

/wp:paragraph

wp:code

$crawler->filter(".col-md-4.mb-4 h5")->each(function ($node) {
    echo $node->text() . "\n";
});

/wp:code

wp:paragraph

Run the code with the following:

/wp:paragraph

wp:code

php symfony-browserkit.php

/wp:code

wp:image {"id":176505,"sizeSlug":"large","linkDestination":"none"}

/wp:image

wp:paragraph

While Symfony BrowserKit is a step up from the previous two methods in terms of interacting with web pages, it’s still limited because it can’t execute JavaScript. This means you can’t scrape dynamic websites and SPAs using BrowserKit.

/wp:paragraph

wp:heading {"level":3}

Symfony Panther

/wp:heading

wp:paragraph

Symfony Panther is another Symfony component that wraps around the BrowserKit component. However, Symfony Panther offers one major advantage: instead of simulating a browser, it uses the WebDriver protocol to remotely control a real browser. This means you can scrape any website, including dynamic websites and SPAs.

/wp:paragraph

wp:paragraph

In this section, you’ll load the OpenWeather home page, type the name of your city in the search box, perform the search, and scrape the current weather of your city.

/wp:paragraph

wp:image {"id":176513,"sizeSlug":"large","linkDestination":"none"}

/wp:image

wp:paragraph

To get started, install Symfony Panther with Composer:

/wp:paragraph

wp:code

composer require symfony/panther

/wp:code

wp:paragraph

You also need to install dbrekelmans/browser-driver-installer, which can automatically detect the installed browser on your system and install the correct driver for it. Make sure you have either a Firefox- or a Chromium-based browser installed on your system:

/wp:paragraph

wp:code

composer require dbrekelmans/bdi

/wp:code

wp:paragraph

To install the appropriate driver in the drivers directory, run the bdi tool:

/wp:paragraph

wp:code

vendor/bin/bdi detect drivers

/wp:code

wp:paragraph

Create a file named symfony-panther.php and start by initializing a Panther client:

/wp:paragraph

wp:code

<?php
require 'vendor/autoload.php';

use Symfony\Component\Panther\Client;

$client = Client::createFirefoxClient();

/wp:code

wp:quote

wp:paragraph

Note: Depending on your browser, you may need to use createChromeClient or createSeleniumClient instead of createFirefoxClient.

/wp:paragraph

/wp:quote

wp:paragraph

Because Panther uses Symfony BrowserKit behind the scenes, the code that follows is very similar to the code in the Symfony BrowserKit section.

/wp:paragraph

wp:paragraph

You start by loading the web page using the request function. When the page loads, it’s initially covered by a div with the owm-loader class, which shows the loading progress bar. You need to wait for this div to disappear before you start interacting with the page. This can be done using the waitForStaleness function, which takes a CSS selector and waits for it to be removed from the DOM.

/wp:paragraph

wp:paragraph

After the loading bar is removed, you need to accept the cookies so that the cookies banner is closed. For that, the selectButton function comes in handy, as it can find a button by its text. Once you have the button, the click function performs a click on it:

/wp:paragraph

wp:code

$crawler = $client->request('GET', 'https://openweathermap.org/');
try {
    $crawler = $client->waitForStaleness(".owm-loader");
} catch (\Facebook\WebDriver\Exception\NoSuchElementException $e) {
    // The loader was already gone, so there is nothing to wait for.
}
$crawler->selectButton('Allow all')->click();

/wp:code

wp:quote

wp:paragraph

Note: Depending on how fast the page loads, the loading bar may disappear before the waitForStaleness function runs. This throws an exception. That’s why that line has been wrapped in a try-catch block.

/wp:paragraph

wp:image {"id":176525,"sizeSlug":"large","linkDestination":"none"}

/wp:image

/wp:quote

wp:paragraph

Now it’s time to type Kolkata into the search bar. Select the search bar with the filter function and use the sendKeys function to provide input to the search bar. Then click on the Search button:

/wp:paragraph

wp:code

$crawler->filter('input[placeholder="Search city"]')->sendKeys('Kolkata');
$crawler->selectButton('Search')->click();

/wp:code

wp:paragraph

Once the button is clicked, an autocomplete suggestion box pops up. You can use the waitForVisibility function to wait until the list is visible and then click on the first item using the combination of filter and click as before:

/wp:paragraph

wp:code

$crawler = $client->waitForVisibility(".search-dropdown-menu li");
$crawler->filter(".search-dropdown-menu li")->first()->click();

/wp:code

wp:image {"id":176535,"sizeSlug":"full","linkDestination":"none"}

/wp:image

wp:paragraph

Finally, use waitForElementToContain to wait for the results to load, and extract the current temperature using filter:

/wp:paragraph

wp:code

$crawler = $client->waitForElementToContain(".orange-text+h2", "Kolkata");
$temp = $crawler->filter(".owm-weather-icon+span.orange-text+h2")->text();

echo $temp;

/wp:code

wp:paragraph

Here, you’re waiting for the element with selector .orange-text+h2 to contain Kolkata. This indicates that the results have been loaded.

/wp:paragraph

wp:paragraph

Run the code with the following:

/wp:paragraph

wp:code

php symfony-panther.php

/wp:code

wp:paragraph

Your output looks like this:

/wp:paragraph

wp:image {"id":176544,"sizeSlug":"large","linkDestination":"none"}

/wp:image

wp:heading

Web Scraping Challenges and Possible Solutions

/wp:heading

wp:paragraph

Even though PHP makes it easy to write web scrapers, real-life scraping projects can be complex, and numerous situations can arise that need to be addressed. These challenges may stem from the structure of the data (e.g., pagination) or from antibot measures taken by the owners of the website (e.g., honeypot traps).

/wp:paragraph

wp:paragraph

In this section, you’ll learn about some common challenges and how to combat them.

/wp:paragraph

wp:heading {"level":3}

Navigating through Paginated Websites

/wp:heading

wp:paragraph

When scraping almost any real-life website, you’ll likely come across a situation where all the data isn’t loaded at once. In other words, the data is paginated. There are two common types of pagination:

/wp:paragraph

wp:list {"ordered":true}

wp:list-item

All the pages are located at separate URLs. The page number is passed through a query parameter or a path parameter, for example, example.com?page=3 or example.com/page/3.

/wp:list-item

wp:list-item

The new pages are loaded using JavaScript when the Next button is selected.

/wp:list-item

/wp:list

wp:paragraph

In the first scenario, you can load the pages in a loop and scrape them as separate web pages. For instance, using file_get_contents, the following code scrapes the first ten pages of an example site:

/wp:paragraph

wp:code

for ($page = 1; $page <= 10; $page++) {
    // Double quotes are required for PHP to interpolate {$page}.
    $html = file_get_contents("https://example.com/page/{$page}");
    // Do the scraping
}

/wp:code
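
wp:paragraph

When the page number is passed as a query parameter instead of a path segment, the URLs can be built with http_build_query. The following is a small sketch of that variant. The pageUrls helper, the base URL, and the parameter name page are assumptions, and the fetching itself is left as a comment:

/wp:paragraph

wp:code

```php
<?php
// Hypothetical helper: builds the URL of every page in the given range for a
// site that expects the page number in a "page" query parameter.
function pageUrls(string $baseUrl, int $firstPage, int $lastPage): array
{
    $urls = [];
    for ($page = $firstPage; $page <= $lastPage; $page++) {
        $urls[] = $baseUrl . '?' . http_build_query(['page' => $page]);
    }
    return $urls;
}

foreach (pageUrls('https://example.com/articles', 1, 3) as $url) {
    echo $url . "\n";
    // $html = file_get_contents($url); // fetch and scrape each page here
    // sleep(1);                        // pause between requests
}
```

/wp:code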

wp:paragraph

In the second scenario, you need to use a solution that can execute JavaScript, like Symfony Panther. In this example, you need to click on the appropriate button that loads the next page. Don’t forget to wait a little while for the new page to load:

/wp:paragraph

wp:code

for ($page = 1; $page <= 10; $page++) {
    // Do the scraping

    // Load the next page
    $crawler->selectButton("Next")->click();
    $client->waitForElementToContain(".current-page", (string) ($page + 1));
}

/wp:code

wp:quote

wp:paragraph

Note: You should substitute appropriate waiting logic that makes sense for the particular website that you’re scraping.

/wp:paragraph

/wp:quote

wp:heading {"level":3}

Rotating Proxies

/wp:heading

wp:paragraph

A proxy server acts as an intermediary between your computer and the target web server. It prevents the web server from seeing your IP address, thus preserving your anonymity.

/wp:paragraph

wp:paragraph

However, you shouldn’t rely on one single proxy server since it can be banned. Instead, you need to use multiple proxy servers and rotate through them. The following code provides a very basic solution where an array of proxies is used and one of them is chosen at random:

/wp:paragraph

wp:code

$proxy      =   array();
$proxy[]    =   '1.2.3.4';
$proxy[]    =   '5.6.7.8';

// Add more proxies

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://example.com");
curl_setopt($ch, CURLOPT_PROXY, $proxy[array_rand($proxy)]);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 15);


$result =   curl_exec($ch);
curl_close($ch);

/wp:code
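
wp:paragraph

Random selection can pick the same proxy several times in a row. As an alternative, here is a minimal sketch of round-robin rotation, where each request uses the next proxy in the list. The ProxyRotator class is hypothetical, and the addresses and ports are placeholders:

/wp:paragraph

wp:code

```php
<?php
// Hypothetical helper class: hands out proxies one after another.
final class ProxyRotator
{
    private array $proxies;
    private int $position = 0;

    public function __construct(array $proxies)
    {
        if ($proxies === []) {
            throw new InvalidArgumentException('At least one proxy is required.');
        }
        $this->proxies = array_values($proxies);
    }

    // Returns the next proxy and wraps around at the end of the list.
    public function next(): string
    {
        $proxy = $this->proxies[$this->position];
        $this->position = ($this->position + 1) % count($this->proxies);
        return $proxy;
    }
}

$rotator = new ProxyRotator(['1.2.3.4:8080', '5.6.7.8:8080']); // placeholder addresses

// Each request would pass the next proxy to curl, for example:
// curl_setopt($ch, CURLOPT_PROXY, $rotator->next());
$first  = $rotator->next();
$second = $rotator->next();
$third  = $rotator->next();

echo "{$first}, {$second}, {$third}\n";
```

/wp:code

wp:paragraph

Round-robin spreads requests evenly across the pool, while random selection is less predictable. Which one fits better depends on the site you’re scraping.

/wp:paragraph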

wp:heading {"level":3}

Handling CAPTCHAs

/wp:heading

wp:paragraph

CAPTCHAs are used by many websites to ensure the user is a human and not a bot. Unfortunately, this means your web scraper can get caught.

/wp:paragraph

wp:paragraph

CAPTCHAs can be very primitive, like a simple checkbox asking, “Are you human?” Or they can use a more advanced algorithm, like Google’s reCAPTCHA or hCaptcha. You can probably get past primitive CAPTCHAs using basic web page manipulation (e.g., checking a checkbox), but to battle advanced CAPTCHAs, you need a dedicated tool like 2Captcha. 2Captcha uses humans to solve CAPTCHAs. You simply need to pass the required details to the 2Captcha API, and it returns the solved CAPTCHA.

/wp:paragraph

wp:paragraph

To get started with 2Captcha, you need to create an account and get an API key.

/wp:paragraph

wp:paragraph

Install 2Captcha with Composer:

/wp:paragraph

wp:code

composer require 2captcha/2captcha

/wp:code

wp:paragraph

In your code, create an instance of TwoCaptcha:

/wp:paragraph

wp:code

$solver = new \TwoCaptcha\TwoCaptcha('YOUR_API_KEY');

/wp:code

wp:paragraph

Then use 2Captcha to solve CAPTCHAs:

/wp:paragraph

wp:code

// Normal captcha
$result = $solver->normal('path/to/captcha.jpg');

// ReCaptcha
$result = $solver->recaptcha([
    'sitekey' => '6Le-wvkSVVABCPBMRTvw0Q4Muexq1bi0DJwx_mJ-',
    'url'   => 'https://mysite.com/page/with/recaptcha',
    'version' => 'v3',
]);

// hCaptcha

$result = $solver->hcaptcha([
    'sitekey'   => '10000000-ffff-ffff-ffff-000000000001',
    'url'       => 'https://www.site.com/page/',
]);

/wp:code

wp:paragraph

Alternatively, take a look at Bright Data’s CAPTCHA solving tool.

/wp:paragraph

wp:heading {"level":3}

Avoiding Honeypot Traps

/wp:heading

wp:paragraph

Honeypot traps are an antibot measure that mimics a service or network to lure in scrapers and crawlers and divert them from the actual target. Although honeypots are useful for preventing bot attacks, they can be problematic for web scraping. You don’t want your scraper to get stuck in a honeypot.

/wp:paragraph

wp:paragraph

There are all kinds of measures you can take to avoid being lured into a honeypot trap. For instance, honeypot links are often hidden so that a real user doesn’t see them, but a bot can pick them up. To avoid the trap, you can try to avoid following hidden links (links hidden with the display: none or visibility: hidden CSS properties).

/wp:paragraph
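
wp:paragraph

As a rough sketch of that idea, the following code collects only the links that aren’t hidden through an inline style attribute. The markup is made up for illustration. Note the limitation: this check only sees inline styles, so links hidden through a stylesheet, a class, or a hidden parent element would slip through, and detecting those would require a real browser:

/wp:paragraph

wp:code

```php
<?php
// Made-up markup for illustration only.
$html = <<<HTML
<div>
  <a href="/products">Products</a>
  <a href="/trap" style="display: none">Special offers</a>
  <a href="/trap-2" style="color: red; visibility:hidden;">More</a>
  <a href="/about" style="color: blue">About</a>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$visibleLinks = [];
foreach ($doc->getElementsByTagName('a') as $anchor) {
    $style = strtolower($anchor->getAttribute('style'));

    // Skip anchors hidden through an inline display:none or visibility:hidden rule.
    if (preg_match('/display\s*:\s*none|visibility\s*:\s*hidden/', $style)) {
        continue;
    }

    $visibleLinks[] = $anchor->getAttribute('href');
}

print_r($visibleLinks);
```

/wp:code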

wp:paragraph

Another option is to rotate proxies so that if one of the proxy server IP addresses is caught in the honeypot and banned, you can still connect through other proxies.

/wp:paragraph

wp:heading

Conclusion

/wp:heading

wp:paragraph

Thanks to PHP’s rich libraries and frameworks, making a web scraper is easy. In this article, you learned how to do the following:

/wp:paragraph

wp:list

wp:list-item

Scrape a static website using curl and regex

/wp:list-item

wp:list-item

Scrape a static website using file_get_contents and libxml

/wp:list-item

wp:list-item

Scrape a static site using Symfony BrowserKit and submit forms

/wp:list-item

wp:list-item

Scrape a complex dynamic site using Symfony Panther

/wp:list-item

/wp:list

wp:paragraph

Unfortunately, as you also learned, real-life scraping with PHP comes with added complexities. For instance, you may need to arrange for multiple proxies and carefully construct your scraper to avoid honeypots.

/wp:paragraph

wp:paragraph

And this is where Bright Data comes in…

/wp:paragraph

wp:paragraph

About Bright Data proxies:

/wp:paragraph

wp:paragraph

Residential proxies: With over 150 million real IPs from 195 countries, Bright Data’s residential proxies enable you to access any website content regardless of location, while avoiding IP bans and CAPTCHAs.

/wp:paragraph

wp:paragraph

ISP proxies: With over 700,000 ISP IPs, leverage real static IPs from any city in the world, assigned by ISPs and leased to Bright Data for your exclusive use, for as long as you require.

/wp:paragraph

wp:paragraph

Datacenter proxies: With over 770,000 datacenter IPs, Bright Data’s datacenter proxy network is built of multiple IP types across the world, in a shared IP pool or for individual purchase.

/wp:paragraph

wp:paragraph

Mobile proxies: With over 7 million mobile IPs, Bright Data’s advanced Mobile IP Network offers the fastest and largest real-peer 3G/4G/5G IPs network in the world.

/wp:paragraph

wp:paragraph

Join the largest proxy network and get a free proxy trial.

/wp:paragraph
