
Scrape with PHP and CURL: Part 1 Login & Cookies
Part 1: Login and Set Cookies
I wrote this script for a client who needed to access leads they had purchased the rights to and then use those leads across their various sales channels. It can often be difficult to get proper access from other systems or developers, and it is sometimes easier to create your own back door.
This sample script will need to be modified to fit your own needs and environment, but it should at least give you an idea of what the code and the process look like.
Comments and explanations have been bolded and italicized. If you understand PHP, you know that text following # or // is a comment: PHP does not execute it, and it is only there to be read.
<?php
###################
# Current working script
# This script logs in to set the session and cookies.
# After it has run, call GetLeads.php to process the leads.
###################
$email = 'johndoe@myappbuilder.com';
$password = 'Builder';
// Initial login page, which redirects to the correct sign-in page and sets some cookies
$URL = 'https://login.examp.le/';
// Initialize a session with PHP's built-in cURL extension. http://php.net/manual/en/book.curl.php
$ch = curl_init();
// cURL is a command-line tool for transferring data over HTTP and other protocols, and it is widely used on Unix/Linux. Think of it as a very simple version of Google Chrome: it does not render pages or run JavaScript, so what it receives is essentially the HTML source you would see with "view source". cURL can accept cookies and maintain sessions, which lets us simulate what a browser does while controlling it from code.
// Below we set the cURL options. The critical ones have a comment explaining what they do.
curl_setopt($ch, CURLOPT_URL, $URL); // The URL to request
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookiesale.txt'); // File where cURL saves cookies, here in the current working directory. You may have to set a full path and make sure the PHP process can write to the file, e.g. chmod 0644 cookiesale.txt
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookiesale.txt'); // File to read previously saved cookies from
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0'); // This part can be critical: we tell cURL to send a browser User-Agent string, so the server sees the request as coming from Firefox 35 on Windows.
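curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // Return the response as a string instead of printing it
curl_setopt($ch, CURLOPT_HEADER, 1); // Include the response headers in the returned string
//curl_setopt($ch, CURLOPT_VERBOSE, true); // Un-comment to log request and response details while debugging
curl_setopt($ch, CURLOPT_STDERR, fopen('php://stdout', 'w')); // Send that verbose log to standard output
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // Follow the redirect to the real sign-in page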
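$page = curl_exec($ch); // Execute the request with the options set above
if ($page === false) {
die('cURL request failed: ' . curl_error($ch)); // curl_exec() returns false on failure
}
//var_dump($page);exit; // Un-comment the beginning of this line to see what cURL returned.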
// Try to find the actual login form.
if (!preg_match('/<form name="login".*?<\/form>/is', $page, $form)) {
die('Failed to find log in form!');
}
$form = $form[0];
// Find the action of the login form. The action is the URL the form submits to.
if (!preg_match('/action=(?:\'|")?([^\s\'">]+)/i', $form, $action)) {
die('Failed to find login form url');
}
$URL2 = 'https://sales.examp.le/'.$action[1]; // This is our new POST URL. It assumes the action is a relative path; adjust this line if the form uses an absolute URL.
// Find all hidden fields, such as security tokens, which we need to send with our login. This ensures that all expected and required variables are posted, so the login is simulated correctly.
$count = preg_match_all('/<input type="hidden"\s*name="([^"]*)"\s*value="([^"]*)"/i', $form, $hiddenFields);
$postFields = array();
// Turn the hidden fields into a name => value array
for ($i = 0; $i < $count; ++$i) {
$postFields[$hiddenFields[1][$i]] = $hiddenFields[2][$i];
}
// Add our login values
$postFields['username'] = $email;
$postFields['password'] = $password;
$post = '';
// Convert the array to a string. Passing an array to CURLOPT_POSTFIELDS makes cURL send multipart/form-data, which this form will not accept; a string is sent as application/x-www-form-urlencoded.
foreach($postFields as $key => $value) {
$post .= urlencode($key) . '=' . urlencode($value) . '&';
}
$post = substr($post, 0, -1); // Drop the trailing &
// Set additional cURL options on the same handle; the options set earlier still apply
curl_setopt($ch, CURLOPT_URL, $URL2);
curl_setopt($ch, CURLOPT_REFERER, $URL);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
$page = curl_exec($ch); // Execute cURL and make the login request
var_dump($page); // We should be logged in now. The output is raw HTML, including headers; read through it to confirm it contains details that only a logged-in user would see.
?>
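The regular expressions in the script only match hidden inputs whose attributes appear in the exact order type, name, value and use double quotes. As a rough sketch of a more forgiving approach, the snippet below collects the hidden fields with PHP's DOMDocument and builds the POST string with http_build_query(). The helper name extractHiddenFields and the sample form are made up for illustration and are not part of the original script; the real form on your target site will look different.

```php
<?php
// Hypothetical helper (not part of the original script): collect the hidden
// inputs of a form with DOMDocument instead of a regular expression.
function extractHiddenFields(string $formHtml): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($formHtml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = array();
    foreach ($doc->getElementsByTagName('input') as $input) {
        $isHidden = strtolower($input->getAttribute('type')) === 'hidden';
        if ($isHidden && $input->getAttribute('name') !== '') {
            // getAttribute() returns the value with HTML entities decoded.
            $fields[$input->getAttribute('name')] = $input->getAttribute('value');
        }
    }
    return $fields;
}

// Sample form, invented for illustration. Note the different attribute
// order and the single quotes, which the regex approach would miss.
$sampleForm = '<form name="login" action="/signin">'
    . '<input name="csrf" value="a b&amp;c" type="hidden">'
    . "<input type='hidden' name='next' value='/leads'>"
    . '<input type="text" name="username">'
    . '</form>';

$postFields = extractHiddenFields($sampleForm);
$postFields['username'] = 'johndoe@myappbuilder.com';
$postFields['password'] = 'Builder';

// http_build_query() URL-encodes both keys and values and joins the
// pairs with "&", producing an application/x-www-form-urlencoded string.
$post = http_build_query($postFields);
echo $post, "\n";
```

The resulting $post string can be passed to CURLOPT_POSTFIELDS exactly as in the script above. This assumes the DOM extension is enabled, which it is in most PHP installations.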