Web Scraping With PHP - Complete Guide
Thanks to its extensive libraries and tools, PHP is a great language for building web scrapers. Designed specifically for web development, PHP handles web scraping tasks with ease and reliability.
There are many ways to scrape websites with PHP, and you’ll explore a few of them in this article. Specifically, you’ll learn how to scrape websites using curl,
file_get_contents
, Symfony BrowserKit, and Symfony’s Panther component. Additionally, you’ll learn about some common challenges you may face during web scraping and how to avoid them.
wp:heading
Web Scraping with PHP
/wp:heading
wp:paragraph
In this section, you’ll learn a few commonly used methods for scraping both basic static sites and complex dynamic ones.
/wp:paragraph
wp:quote
wp:paragraph
Please note: While we cover various methods in this tutorial, this is by no means an exhaustive list.
/wp:paragraph
/wp:quote
wp:heading {"level":3}
Prerequisites
/wp:heading
wp:paragraph
To follow along with this tutorial, you need a recent version of PHP and Composer, a dependency manager for PHP. This article was tested using PHP 8.1.18 and Composer 2.5.5.
/wp:paragraph
wp:paragraph
Once PHP and Composer are set up, create a directory named
php-web-scraping
and
cd
into it:
/wp:paragraph
wp:code
mkdir php-web-scraping
cd $_
/wp:code
wp:paragraph
You’ll work in this directory for the rest of the tutorial.
/wp:paragraph
wp:heading {"level":3}
curl
/wp:heading
wp:paragraph
curl is a near-ubiquitous low-level library and CLI tool written in C. It can be used to fetch the contents of a web page using HTTP or HTTPS. On almost all platforms, PHP comes with curl support enabled out of the box.
/wp:paragraph
wp:paragraph
In this section, you’ll scrape a very basic web page that lists countries by population based on estimates by the United Nations. You’ll extract the links in the menu along with the link texts.
/wp:paragraph
wp:paragraph
To start, create a file called
curl.php
and then initialize curl in that file with the
curl_init
function:
/wp:paragraph
wp:code
<?php
$ch = curl_init();
/wp:code
wp:paragraph
Then set the options for fetching the web page. This includes setting the URL and the HTTP method (GET, POST, etc.) using the function
curl_setopt
:
/wp:paragraph
wp:code
curl_setopt($ch, CURLOPT_URL, 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)');
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
/wp:code
wp:paragraph
In this code, you set the target URL to the web page and the method to
GET
. The
CURLOPT_RETURNTRANSFER
option tells curl to return the response as a string instead of printing it directly.
/wp:paragraph
wp:paragraph
Once curl is ready, you can make the request using
curl_exec
:
/wp:paragraph
wp:code
$response = curl_exec($ch);
/wp:code
wp:paragraph
Fetching the HTML data is only the first step in web scraping. To extract data from the HTML response, you can choose from several techniques. The simplest is to use regular expressions for very basic HTML extraction. Keep in mind that regex can’t parse arbitrary HTML, but for very simple extraction it’s enough.
/wp:paragraph
wp:paragraph
For example, extract the
<a>
tags, which have
href
and
title
attributes and contain a
<span>
:
/wp:paragraph
wp:code
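// curl_exec returns false on failure, so check the response before parsing it
if ($response !== false) {
    preg_match_all(
        '/<a href="([^"]*)" title="([^"]*)"><span>([^<]*)<\/span><\/a>/',
        $response, $matches, PREG_SET_ORDER
    );
    foreach ($matches as $link) {
        // $link[1] is the href, $link[3] is the link text
        echo $link[1] . " => " . $link[3] . "\n";
    }
}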
/wp:code
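wp:paragraph
To see why regex only suits very simple cases, here is a small sketch that runs the same pattern against two made-up links instead of the live page. The sample HTML is an assumption for illustration; the second link lists its attributes in a different order, which is enough to make the pattern miss it.
/wp:paragraph

```php
<?php
// Two hypothetical menu links; the second lists its attributes in a different order.
$html = <<<HTML
<a href="/wiki/Asia" title="Asia"><span>Asia</span></a>
<a title="Europe" href="/wiki/Europe"><span>Europe</span></a>
HTML;

// Same pattern as in the tutorial, written with "#" delimiters so "/" needs no escaping.
$pattern = '#<a href="([^"]*)" title="([^"]*)"><span>([^<]*)</span></a>#';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

// Collect href => link text for every match.
$links = [];
foreach ($matches as $link) {
    $links[$link[1]] = $link[3];
}

print_r($links); // only the first link is captured; the reordered one is missed
```

wp:paragraph
A real HTML parser does not care about attribute order, which is one reason the later sections move away from regex.
/wp:paragraph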
wp:paragraph
Then release the resources by using the
curl_close
function:
/wp:paragraph
wp:code
curl_close($ch);
/wp:code
wp:paragraph
Run the code with the following:
/wp:paragraph
wp:code
php curl.php
/wp:code
wp:image {"id":176475,"sizeSlug":"large","linkDestination":"none"}
/wp:image
wp:paragraph
curl gives you very low-level control over how a web page is fetched over HTTP/HTTPS. You can fine-tune the different connection properties and even add additional measures, such as proxy servers (more on this later), user agents, and timeouts.
/wp:paragraph
wp:paragraph
Additionally, curl is installed by default in most operating systems, which makes it a great choice for writing a cross-platform web scraper.
/wp:paragraph
wp:paragraph
However, as you saw, curl is not enough on its own, and you need an HTML parser to properly scrape data. curl also can’t execute JavaScript on a web page, which means you can’t scrape dynamic web pages and single-page applications (SPAs) with curl.
/wp:paragraph
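wp:paragraph
The fine-tuning mentioned above (user agents, timeouts) can be sketched as a small helper. This is one possible arrangement, not the only one: the function name fetchPage, the user agent string, and the timeout values are all assumptions chosen for illustration.
/wp:paragraph

```php
<?php
/**
 * Fetch a URL with curl and return the body, or throw on failure.
 * The function name, user agent string, and timeout values are illustrative.
 */
function fetchPage(string $url, int $timeoutSeconds = 15): string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => 5,               // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeoutSeconds, // seconds allowed for the whole request
        CURLOPT_USERAGENT      => 'MyScraper/1.0 (+https://example.com/bot)',
    ]);

    $body = curl_exec($ch);

    if ($body === false) {
        // curl_error() describes transport-level failures (DNS, timeout, TLS, ...).
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }

    // Treat HTTP error statuses as failures too.
    $status = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
    if ($status >= 400) {
        throw new RuntimeException("Unexpected HTTP status: {$status}");
    }

    return $body;
}
```

wp:paragraph
You could call it as fetchPage('https://example.com') and wrap the call in a try-catch block. On PHP 8 and later, the curl handle is released automatically when it goes out of scope, so the sketch does not call curl_close.
/wp:paragraph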
wp:heading {"level":3}
file_get_contents
/wp:heading
wp:paragraph
The
file_get_contents
function is primarily used for reading the contents of a file. However, by passing an HTTP URL, you can fetch HTML data from a web page. This means
file_get_contents
can replace the usage of curl in the previous code.
/wp:paragraph
wp:paragraph
In this section, you’ll scrape the same page as before, but this time, the scraper will be more advanced, and you’ll be able to extract the names of all the countries from the table.
/wp:paragraph
wp:paragraph
Create a file named
file_get_contents.php
and start by passing a URL to
file_get_contents
:
/wp:paragraph
wp:code
<?php
$html = file_get_contents('https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)');
/wp:code
wp:paragraph
The
$html
variable now holds the HTML code of the web page.
/wp:paragraph
wp:paragraph
Similar to the previous example, fetching the HTML data is just the first step. To spice things up, use
libxml
to select elements using
XPath
selectors. To do that, you first need to initialize a
DOMDocument
and load the HTML into it:
/wp:paragraph
wp:code
$doc = new DOMDocument;
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
/wp:code
wp:paragraph
The XPath selector used in this example picks out the countries in the following order: the first
tbody
element, a
tr
element inside the
tbody
, the first
td
in the
tr
element, and an
a
with a
title
attribute inside the
td
element.
/wp:paragraph
wp:paragraph
The following code initializes a
DOMXPath
class and uses
evaluate
to select the element using the XPath selector:
/wp:paragraph
wp:code
$xpath = new DOMXPath($doc);
// a[@title] matches every <a> element that has a title attribute
$countries = $xpath->evaluate('(//tbody)[1]/tr/td[1]//a[@title]');
/wp:code
wp:paragraph
All that is left is to loop over the elements and print the text:
/wp:paragraph
wp:code
foreach ($countries as $country) {
    echo $country->textContent . "\n";
}
/wp:code
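wp:paragraph
If you want to try the selector without hitting the network, the following sketch runs the same steps against a tiny inline table. The sample markup is an assumption that only mimics the shape described above; the real page is far larger.
/wp:paragraph

```php
<?php
// A tiny stand-in for the real page: the first cell of each row holds a link.
$html = <<<HTML
<table>
  <tbody>
    <tr><td><a href="/wiki/India" title="India">India</a></td><td>population figure</td></tr>
    <tr><td><a href="/wiki/China" title="China">China</a></td><td>population figure</td></tr>
    <tr><td><a href="/wiki/Note">footnote link without a title</a></td><td>n/a</td></tr>
  </tbody>
</table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// First tbody, every row, first cell, any descendant <a> that has a title attribute.
$names = [];
foreach ($xpath->query('(//tbody)[1]/tr/td[1]//a[@title]') as $anchor) {
    $names[] = trim($anchor->textContent);
}

print_r($names); // the link without a title attribute is skipped
```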
wp:paragraph
Run the code with the following:
/wp:paragraph
wp:code
php file_get_contents.php
/wp:code
wp:image {"id":176486,"sizeSlug":"large","linkDestination":"none"}
/wp:image
wp:paragraph
As you can see,
file_get_contents
is simpler to use than curl and is often used to quickly fetch the HTML code of a web page. However, it suffers from the same drawbacks as curl: you need an additional HTML parser, and you can’t scrape dynamic web pages and SPAs. You also lose the fine-tuned controls provided by curl. Still, its simplicity makes it a good choice for scraping basic static sites.
/wp:paragraph
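wp:paragraph
Although file_get_contents offers far fewer knobs than curl, PHP’s stream contexts let you set a handful of HTTP options. The sketch below is a minimal example under stated assumptions: the helper name fetchWithContext, the user agent string, and the 10-second timeout are illustrative.
/wp:paragraph

```php
<?php
// Build a stream context so file_get_contents sends a user agent and gives up after a timeout.
// The user agent string and the 10-second timeout are illustrative values.
$context = stream_context_create([
    'http' => [
        'method'        => 'GET',
        'header'        => "Accept: text/html\r\n",
        'user_agent'    => 'MyScraper/1.0',
        'timeout'       => 10,    // seconds
        'ignore_errors' => false, // treat 4xx/5xx responses as failures
    ],
]);

/**
 * Hypothetical helper: fetch a URL with the given context or throw on failure.
 */
function fetchWithContext(string $url, $context): string
{
    // The @ operator silences the warning; the false return value is checked instead.
    $html = @file_get_contents($url, false, $context);
    if ($html === false) {
        throw new RuntimeException("Could not fetch {$url}");
    }
    return $html;
}
```

wp:paragraph
The "http" options also apply to HTTPS URLs. Anything beyond these basics, such as detailed connection tuning, is still easier with curl.
/wp:paragraph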
wp:heading {"level":3}
Symfony BrowserKit
/wp:heading
wp:paragraph
Symfony BrowserKit is a component of the Symfony framework that simulates the behavior of a real browser. This means you can interact with the web page like in an actual browser; for example, clicking on buttons/links, submitting forms, and going back and forward in history.
/wp:paragraph
wp:paragraph
In this section, you’ll visit the demlon blog, enter PHP in the search box, and submit the search form. Then you’ll scrape the article names from the result.
/wp:paragraph
wp:image {"id":176496,"sizeSlug":"large","linkDestination":"none"}
/wp:image
wp:paragraph
To use Symfony BrowserKit, you must install the BrowserKit component with Composer:
/wp:paragraph
wp:code
composer require symfony/browser-kit
/wp:code
wp:paragraph
You also need to install the
HttpClient
component to make HTTP requests over the internet:
/wp:paragraph
wp:code
composer require symfony/http-client
/wp:code
wp:paragraph
BrowserKit supports selecting elements using XPath selectors by default. In this example, you use CSS selectors. For that, you need to install the
CssSelector
component as well:
/wp:paragraph
wp:code
composer require symfony/css-selector
/wp:code
wp:paragraph
Create a file named
symfony-browserkit.php
. In this file, initialize
HttpBrowser
:
/wp:paragraph
wp:code
<?php
require "vendor/autoload.php";
use Symfony\Component\BrowserKit\HttpBrowser;
$client = new HttpBrowser();
/wp:code
wp:paragraph
Use the
request
function to make a
GET
request:
/wp:paragraph
wp:code
$crawler = $client->request('GET', 'https://demlon.com/blog');
/wp:code
wp:paragraph
To select the form where the search button is, you need to select the button itself and use the
form
function to get the enclosing form. The button can be selected with the
filter
function by passing its ID. Once the form is selected, you can submit it using the
submit
function of the
HttpBrowser
class.
/wp:paragraph
wp:paragraph
By passing an associative array of input values, the
submit
function can fill in the form before it’s submitted. In the following code, the input with the name
q
has been given the value
PHP
, which is the same as typing
PHP
into the search box:
/wp:paragraph
wp:code
$form = $crawler->filter('#blog_search')->form();
$crawler = $client->submit($form, ['q' => 'PHP']);
/wp:code
wp:paragraph
The
submit
function returns the resulting page. From there, you can extract the article names using the CSS selector
.col-md-4.mb-4 h5
:
/wp:paragraph
wp:code
$crawler->filter(".col-md-4.mb-4 h5")->each(function ($node) {
    echo $node->text() . "\n";
});
/wp:code
wp:paragraph
Run the code with the following:
/wp:paragraph
wp:code
php symfony-browserkit.php
/wp:code
wp:image {"id":176505,"sizeSlug":"large","linkDestination":"none"}
/wp:image
wp:paragraph
While Symfony BrowserKit is a step up from the previous two methods in terms of interacting with web pages, it’s still limited because it can’t execute JavaScript. This means you can’t scrape dynamic websites and SPAs using BrowserKit.
/wp:paragraph
wp:heading {"level":3}
Symfony Panther
/wp:heading
wp:paragraph
Symfony Panther is another Symfony component that wraps around the BrowserKit component. However, Symfony Panther offers one major advantage: instead of simulating a browser, it executes the code in an actual browser using the WebDriver protocol to remotely control a real browser. This means you can scrape any website, including dynamic websites and SPAs.
/wp:paragraph
wp:paragraph
In this section, you’ll load the OpenWeather home page, type the name of your city in the search box, perform the search, and scrape the current weather of your city.
/wp:paragraph
wp:image {"id":176513,"sizeSlug":"large","linkDestination":"none"}
/wp:image
wp:paragraph
To get started, install Symfony Panther with Composer:
/wp:paragraph
wp:code
composer require symfony/panther
/wp:code
wp:paragraph
You also need to install
dbrekelmans/browser-driver-installer
(published on Packagist as dbrekelmans/bdi), which can automatically detect the browser installed on your system and install the correct driver for it. Make sure you have either a Firefox- or a Chromium-based browser installed on your system:
/wp:paragraph
wp:code
composer require dbrekelmans/bdi
/wp:code
wp:paragraph
To install the appropriate driver in the
drivers
directory, run the
bdi
tool:
/wp:paragraph
wp:code
vendor/bin/bdi detect drivers
/wp:code
wp:paragraph
Create a file named
symfony-panther.php
and start by initializing a Panther client:
/wp:paragraph
wp:code
<?php
require 'vendor/autoload.php';
use Symfony\Component\Panther\Client;
$client = Client::createFirefoxClient();
/wp:code
wp:quote
wp:paragraph
Note:
Depending on your browser, you may need to use
createChromeClient
or
createSeleniumClient
instead of
createFirefoxClient
.
/wp:paragraph
/wp:quote
wp:paragraph
Because Panther uses Symfony BrowserKit behind the scenes, the code that follows is very similar to the code in the Symfony BrowserKit section.
/wp:paragraph
wp:paragraph
You start by loading the web page using the
request
function. When the page loads, it’s initially covered by a
div
with the
owm-loader
class, which shows the loading progress bar. You need to wait for this
div
to disappear before you start interacting with the page. This can be done using the
waitForStaleness
function, which takes a CSS selector and waits for it to be removed from the DOM.
/wp:paragraph
wp:paragraph
After the loading bar is removed, you need to accept the cookies so that the cookies banner is closed. For that, the
selectButton
function comes in handy, as it can find a button by its text. Once you have the button, the
click
function performs a click on it:
/wp:paragraph
wp:code
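$crawler = $client->request('GET', 'https://openweathermap.org/');
try {
    // Wait until the loading overlay is removed from the DOM
    $crawler = $client->waitForStaleness(".owm-loader");
} catch (\Facebook\WebDriver\Exception\NoSuchElementException $e) {
    // The loader was already gone, so keep the crawler returned by request()
}
$crawler->selectButton('Allow all')->click();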
/wp:code
wp:quote
wp:paragraph
Note:
Depending on how fast the page loads, the loading bar may disappear before the
waitForStaleness
function runs. This throws an exception. That’s why that line has been wrapped in a try-catch block.
/wp:paragraph
wp:image {"id":176525,"sizeSlug":"large","linkDestination":"none"}
/wp:image
/wp:quote
wp:paragraph
Now it’s time to type
Kolkata
into the search bar. Select the search bar with the
filter
function and use the
sendKeys
function to provide input to the search bar. Then click on the
Search
button:
/wp:paragraph
wp:code
$crawler->filter('input[placeholder="Search city"]')->sendKeys('Kolkata');
$crawler->selectButton('Search')->click();
/wp:code
wp:paragraph
Once the button is clicked, an autocomplete suggestion box pops up. You can use the
waitForVisibility
function to wait until the list is visible and then click on the first item using the combination of
filter
and
click
as before:
/wp:paragraph
wp:code
$crawler = $client->waitForVisibility(".search-dropdown-menu li");
$crawler->filter(".search-dropdown-menu li")->first()->click();
/wp:code
wp:image {"id":176535,"sizeSlug":"full","linkDestination":"none"}

/wp:image
wp:paragraph
Finally, use
waitForElementToContain
to wait for the results to load, and extract the current temperature using
filter
:
/wp:paragraph
wp:code
$crawler = $client->waitForElementToContain(".orange-text+h2", "Kolkata");
$temp = $crawler->filter(".owm-weather-icon+span.orange-text+h2")->text();
echo $temp;
/wp:code
wp:paragraph
Here, you’re waiting for the element with selector
.orange-text+h2
to contain
Kolkata
. This indicates that the results have been loaded.
/wp:paragraph
wp:paragraph
Run the code with the following:
/wp:paragraph
wp:code
php symfony-panther.php
/wp:code
wp:paragraph
Your output looks like this:
/wp:paragraph
wp:image {"id":176544,"sizeSlug":"large","linkDestination":"none"}
/wp:image
wp:heading
Web Scraping Challenges and Possible Solutions
/wp:heading
wp:paragraph
Even though PHP makes it easy to write web scrapers, real-life scraping projects can be complex, and many situations present challenges that need to be addressed. These challenges may stem from the structure of the data (e.g., pagination) or from antibot measures taken by the owners of the website (e.g., honeypot traps).
/wp:paragraph
wp:paragraph
In this section, you’ll learn about some common challenges and how to combat them.
/wp:paragraph
wp:heading {"level":3}
Navigating through Paginated Websites
/wp:heading
wp:paragraph
When scraping almost any real-life website, you’ll likely come across a situation where the data isn’t all loaded at once. In other words, the data is paginated. Pagination generally comes in two forms:
/wp:paragraph
wp:list {"ordered":true}
wp:list-item
All the pages are located at separate URLs. The page number is passed through a query parameter or a path parameter. For example,
example.com?page=3
or
example.com/page/3
.
/wp:list-item
wp:list-item
The new pages are loaded using JavaScript when the Next button is clicked.
/wp:list-item
/wp:list
wp:paragraph
In the first scenario, you can load the pages in a loop and scrape them as separate web pages. For instance, using
file_get_contents
, the following code scrapes the first ten pages of an example site:
/wp:paragraph
wp:code
for ($page = 1; $page <= 10; $page++) {
    // Double quotes are required so that {$page} is interpolated
    $html = file_get_contents("https://example.com/page/{$page}");
    // Do the scraping
}
/wp:code
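wp:paragraph
In practice you rarely know the number of pages in advance. The following sketch is one way to stop at the first page that cannot be fetched and to pause between requests. The function name scrapePages, the URL template, and the stand-in fetcher are assumptions for illustration.
/wp:paragraph

```php
<?php
/**
 * Walk numbered pages until the page limit is hit or a page cannot be fetched.
 * $fetch is any callable that returns the HTML for a URL, or false on failure.
 */
function scrapePages(string $urlTemplate, int $maxPages, callable $fetch, int $delayMs = 0): array
{
    $pages = [];
    for ($page = 1; $page <= $maxPages; $page++) {
        $url  = sprintf($urlTemplate, $page); // %d in the template becomes the page number
        $html = $fetch($url);
        if ($html === false || $html === '') {
            break; // assume we ran past the last page
        }
        $pages[$url] = $html;
        usleep($delayMs * 1000); // small pause between requests
    }
    return $pages;
}

// Usage with a stand-in fetcher that pretends page 3 does not exist.
$fake = fn (string $url) => str_ends_with($url, '/3') ? false : "<html>{$url}</html>";
$result = scrapePages('https://example.com/page/%d', 10, $fake);

print_r(array_keys($result));
```

wp:paragraph
Against a real site, you might pass 'file_get_contents' as the fetcher and a delay of a few hundred milliseconds.
/wp:paragraph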
wp:paragraph
In the second scenario, you need to use a solution that can execute JavaScript, like Symfony Panther. In this case, you need to click on the button that loads the next page. Don’t forget to wait for the new page to load:
/wp:paragraph
wp:code
for ($page = 1; $page <= 10; $page++) {
    // Do the scraping
    // Load the next page
    $crawler->selectButton("Next")->click();
    // Wait until the page indicator shows the next page number
    $client->waitForElementToContain(".current-page", (string) ($page + 1));
}
/wp:code
wp:quote
wp:paragraph
Note: You should substitute appropriate waiting logic that makes sense for the particular website that you’re scraping.
/wp:paragraph
/wp:quote
wp:heading {"level":3}
Rotating Proxies
/wp:heading
wp:paragraph
A proxy server acts as an intermediary between your computer and the target web server. It prevents the web server from seeing your IP address, thus preserving your anonymity.
/wp:paragraph
wp:paragraph
However, you shouldn’t rely on a single proxy server since it can be banned. Instead, you need to use multiple proxy servers and rotate through them. The following code provides a very basic solution where an array of proxies is used and one of them is chosen at random:
/wp:paragraph
wp:code
$proxy = array();
$proxy[] = '1.2.3.4';
$proxy[] = '5.6.7.8';
// Add more proxies
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://example.com");
curl_setopt($ch, CURLOPT_PROXY, $proxy[array_rand($proxy)]);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 15);
$result = curl_exec($ch);
curl_close($ch);
/wp:code
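wp:paragraph
Random selection can pick the same proxy several times in a row and keeps using proxies that have already been banned. As an alternative to the approach above, the sketch below cycles through the pool in order and lets you drop a proxy once it fails. The ProxyRotator class and its method names are made up for this example, and the addresses are placeholders.
/wp:paragraph

```php
<?php
/**
 * Round-robin proxy pool that can drop proxies that stop working.
 * The class and its method names are made up for this sketch.
 */
final class ProxyRotator
{
    private array $proxies;
    private int $cursor = 0;

    public function __construct(array $proxies)
    {
        $this->proxies = array_values($proxies);
    }

    /** Return the next proxy in the cycle. */
    public function next(): string
    {
        if ($this->proxies === []) {
            throw new RuntimeException('No working proxies left');
        }
        $proxy = $this->proxies[$this->cursor % count($this->proxies)];
        $this->cursor++;
        return $proxy;
    }

    /** Remove a proxy from the pool, for example after it was banned. */
    public function markFailed(string $proxy): void
    {
        $this->proxies = array_values(array_diff($this->proxies, [$proxy]));
    }
}

$rotator = new ProxyRotator(['1.2.3.4', '5.6.7.8', '9.10.11.12']);
$first  = $rotator->next(); // 1.2.3.4
$second = $rotator->next(); // 5.6.7.8
$rotator->markFailed($second);
$third  = $rotator->next(); // never 5.6.7.8 again
```

wp:paragraph
With curl, you would pass the result to the proxy option, as in curl_setopt($ch, CURLOPT_PROXY, $rotator->next()), and call markFailed when a request through that proxy fails.
/wp:paragraph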
wp:heading {"level":3}
Handling CAPTCHAs
/wp:heading
wp:paragraph
CAPTCHAs are used by many websites to ensure the user is a human and not a bot. Unfortunately, this means your web scraper can get caught.
/wp:paragraph
wp:paragraph
CAPTCHAs can be very primitive, like a simple checkbox asking, “Are you human?” Or they can use a more advanced algorithm, like Google’s reCAPTCHA or hCaptcha. You can probably get past primitive CAPTCHAs using basic web page manipulation (e.g., checking a checkbox), but to battle advanced CAPTCHAs, you need a dedicated tool like 2Captcha. 2Captcha uses humans to solve CAPTCHAs. You simply need to pass the required details to the 2Captcha API, and it returns the solved CAPTCHA.
/wp:paragraph
wp:paragraph
To get started with 2Captcha, you need to create an account and get an API key.
/wp:paragraph
wp:paragraph
Install 2Captcha with Composer:
/wp:paragraph
wp:code
composer require 2captcha/2captcha
/wp:code
wp:paragraph
In your code, create an instance of
TwoCaptcha
:
/wp:paragraph
wp:code
$solver = new \TwoCaptcha\TwoCaptcha('YOUR_API_KEY');
/wp:code
wp:paragraph
Then use 2Captcha to solve CAPTCHAs:
/wp:paragraph
wp:code
// Normal captcha
$result = $solver->normal('path/to/captcha.jpg');
// ReCaptcha
$result = $solver->recaptcha([
'sitekey' => '6Le-wvkSVVABCPBMRTvw0Q4Muexq1bi0DJwx_mJ-',
'url' => 'https://mysite.com/page/with/recaptcha',
'version' => 'v3',
]);
// hCaptcha
$result = $solver->hcaptcha([
'sitekey' => '10000000-ffff-ffff-ffff-000000000001',
'url' => 'https://www.site.com/page/',
]);
/wp:code
wp:paragraph
Alternatively, you can see demlon’s CAPTCHA solving tool.
/wp:paragraph
wp:heading {"level":3}
Avoiding Honeypot Traps
/wp:heading
wp:paragraph
Honeypot traps are an antibot measure that mimics a service or network to lure in scrapers and crawlers to divert them from the actual target. Although honeypots are useful for prevention against bot attacks, they can be problematic for web scraping. You don’t want your scraper to get stuck in a honeypot.
/wp:paragraph
wp:paragraph
There are all kinds of measures you can take to avoid being lured into a honeypot trap. For instance, honeypot links are often hidden so that a real user doesn’t see them, but a bot can pick them up. To avoid the trap, you can try to avoid clicking on hidden links (links with
display: none
or
visibility: hidden
CSS properties).
/wp:paragraph
wp:paragraph
Another option is to rotate proxies so that if one of the proxy server IP addresses is caught in the honeypot and banned, you can still connect through other proxies.
/wp:paragraph
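wp:paragraph
The hidden-link check described above can be sketched with DOMDocument and XPath. This is a minimal example under stated assumptions: the function name visibleLinks and the sample markup are hypothetical, and only inline style attributes are inspected. Links hidden through a stylesheet or JavaScript would need a real browser, such as the one Symfony Panther drives.
/wp:paragraph

```php
<?php
/**
 * Return the href of every link that is not hidden through an inline style.
 * Only inline "style" attributes are inspected; rules from stylesheets are not.
 */
function visibleLinks(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Lowercase the style and strip spaces before looking for the hiding rules.
    $style  = "translate(@style, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ ', 'abcdefghijklmnopqrstuvwxyz')";
    $hidden = "contains({$style}, 'display:none') or contains({$style}, 'visibility:hidden')";

    // Keep links only when neither the link nor any ancestor is hidden.
    $query = "//a[@href][not(ancestor-or-self::*[{$hidden}])]";

    $links = [];
    foreach ((new DOMXPath($doc))->query($query) as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }
    return $links;
}

$html = <<<HTML
<a href="/products">Products</a>
<a href="/trap-1" style="display: none">Hidden link</a>
<div style="VISIBILITY:hidden"><a href="/trap-2">Hidden by parent</a></div>
HTML;

print_r(visibleLinks($html)); // only /products remains
```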
wp:heading
Conclusion
/wp:heading
wp:paragraph
Thanks to PHP’s rich libraries and frameworks, building a web scraper is easy. In this article, you learned how to do the following:
/wp:paragraph
wp:list
wp:list-item
Scrape a static website using curl and regex
/wp:list-item
wp:list-item
Scrape a static website using
file_get_contents
and
libxml
/wp:list-item
wp:list-item
Scrape a static site using Symfony BrowserKit and submit forms
/wp:list-item
wp:list-item
Scrape a complex dynamic site using Symfony Panther
/wp:list-item
/wp:list
wp:paragraph
Unfortunately, while scraping using these methods, you learned that scraping with PHP comes with added complexities. For instance, you may need to arrange for multiple proxies and carefully construct your scraper to avoid honeypots.
/wp:paragraph
wp:paragraph
And this is where demlon comes in…
/wp:paragraph
wp:paragraph
About demlon proxies:
/wp:paragraph
wp:paragraph
Residential proxies: With over 150 million real IPs from 195 countries, demlon’s residential proxies enable you to access any website content regardless of location, while avoiding IP bans and CAPTCHAs.
/wp:paragraph
wp:paragraph
ISP proxies: With over 700,000 ISP IPs, leverage real static IPs from any city in the world, assigned by ISPs and leased to demlon for your exclusive use, for as long as you require.
/wp:paragraph
wp:paragraph
Datacenter proxies: With over 770,000 datacenter IPs, demlon’s datacenter proxy network is built of multiple IP types across the world, in a shared IP pool or for individual purchase.
/wp:paragraph
wp:paragraph
Mobile proxies: With over 7 million mobile IPs, demlon’s advanced Mobile IP Network offers the fastest and largest real-peer 3G/4G/5G IPs network in the world.
/wp:paragraph
wp:paragraph
Join the largest proxy network and get a free proxies trial.
/wp:paragraph