A journey into php-cli and scraping

Published: 01/01/2009

Programming, Code

I recently had a couple of days to myself and wanted to experiment more with php-cli, which I’d been thinking about for a while. To help the process (and feed my guitar addiction; I have a serious problem), I decided to write a script that checks the Stupid Deal of the Day page at Musician’s Friend and sends me an email if the deal matches a given list of terms.

Prep

I’m pretty sure all Windows installs of PHP include php-cli, but to check, run this at the command prompt:

php -v

You should see something like the below; note (cli):

PHP 5.2.6 (cli) (built: May  2 2008 18:02:07)
Copyright (c) 1997-2008 The PHP Group
Zend Engine v2.2.0, Copyright (c) 1998-2008 Zend Technologies
with Xdebug v2.0.3, Copyright (c) 2002-2007, by Derick Rethans

Assuming that all worked out, here are the requirements for the script:
1. Must work like a *nix CLI program; it’s just going to make things easier for me. For example, the program should be executed like:

C:\ProjectFiles\php_cli>php check_for_guitars.php --search="guitar,amp,tablature" --email="foo@bar.com"

2. Must have error checking and validation.
3. Must prevent duplicate notifications.
4. Must provide a “help” mode (--help, -help, -h, -?).
5. Must be able to run as a scheduled task (the Windows equivalent of cron).

Argument Handling

To begin, I needed to change the way passed parameters are interpreted. Before version 5.3, PHP’s built-in handling of parameters passed to scripts was pretty limited (on Windows in particular), but there’s a function in the user notes of the PHP manual that helps a lot.
inc.php

function arguments($argv) {
   $_ARG = array();
   foreach ($argv as $arg) {
       if (preg_match('#^-{1,2}([a-zA-Z0-9]*)=?(.*)$#', $arg, $matches)) {
           $key = $matches[1];
           switch ($matches[2]) {
               case '':
               case 'true':
               $arg = true;
               break;
               case 'false':
               $arg = false;
               break;
               default:
               $arg = $matches[2];
           }
 
           /* make unix like -afd == -a -f -d */
           if(preg_match("/^-([a-zA-Z0-9]+)/", $matches[0], $match)) {
               $string = $match[1];
               for($i=0; strlen($string) > $i; $i++) {
                $_ARG[$string[$i]] = true;
               }
           } else {
               $_ARG[$key] = $arg;
           }
       } else {
           /* anything that is not an option, such as the script name */
           $_ARG['input'][] = $arg;
       }
   }
   return $_ARG;
}

$input = arguments($argv);

/*
print_r($input);
Array
(
    [input] => Array
        (
            [0] => get_music.php
        )
 
    [search] => guitar,amp,tablature
    [email] => foo@bar.com
)
*/
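
On PHP 5.3 and later, the built-in getopt() also accepts long options on Windows, so a hand-rolled parser may not be needed there. Here is a minimal sketch, assuming the same --search and --email options; the helper name read_options() is made up for this example and is not part of the script:

```php
<?php
// Sketch only: read_options() is a hypothetical helper, not part of the script above.
// getopt() reads the real command line, so this returns an empty array when
// the script is run without any options.
function read_options() {
    // 'h' is a short flag; a trailing ':' means the long option requires a value.
    $options = getopt('h', array('search:', 'email:', 'help'));
    // getopt() returns false on failure, so normalise that to an empty array.
    return is_array($options) ? $options : array();
}

$options = read_options();
if (isset($options['search'])) {
    echo "Searching for: " . $options['search'] . "\n";
}
```

One limitation: getopt() only accepts letters and digits as short options, so a flag like -? would still need a manual check against $argv.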

Now that we can access the passed variables, we need to validate them like in any other script. The code below checks whether a key is present in the $input array and holds a usable value. If not, it loops, prompting on STDIN and validating each reply, and breaks out of the loop as soon as the value is valid.

//make sure we have a value for "search"
$validate_search = FALSE;
if(!array_key_exists('search',$input)){
	$validate_search = TRUE;
} else {
	if(strlen($input['search']) < 2){ //too short to be useful
		$validate_search = TRUE;
	}
}
 
if($validate_search){
	echo "Please enter what to search for:\n";
	while(1){
 
		$input['search'] = trim(fgets(STDIN)); // reads one line from STDIN
		if(strlen($input['search']) >= 2){//it's a valid string
			break;
		}
		echo "Please enter something to search for ";
		echo "(at least 2 characters):\n";
		echo "Example: \"guitar,bass,dvd\"\n";
	}
}

//make sure we have a valid email address
$validate_email = FALSE;
if(!array_key_exists('email',$input)){
	$validate_email = TRUE;
} else {
	if(!checkEmail_basic($input['email'])){
		$validate_email = TRUE;
	}
}
 
if($validate_email){
	echo "Please enter an email to send the alert to:\n";
	while(1){
 
		$input['email'] = trim(fgets(STDIN)); // reads one line from STDIN
		if(checkEmail_basic($input['email'])){//it's a valid email
			break;
		}
		echo "Please enter a valid email address:\n";
	}
}
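
The checkEmail_basic() helper is not shown in this post. Here is a minimal sketch of what such a check might look like, assuming PHP 5.2 or later so that filter_var() is available; the real helper in the download may work differently:

```php
<?php
// Hypothetical stand-in for checkEmail_basic(); the original implementation is not shown.
function checkEmail_basic($email) {
    // filter_var() returns the address on success and false on failure.
    return filter_var(trim($email), FILTER_VALIDATE_EMAIL) !== false;
}

var_dump(checkEmail_basic('foo@bar.com'));  // bool(true)
var_dump(checkEmail_basic('not-an-email')); // bool(false)
```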

Help

To access the help mode, there’s an example in the PHP manual that maintains the *nix tradition of “--help”, “-h” or “-?”; the result looks like the below:

C:\ProjectFiles\php_cli>php check_for_guitars.php --help
 
Takes a given string (--search) and searches the
Stupid Deal of the Day for a match. If a match is
found an email is sent to (--email)
 
 Usage:
 check_for_guitars.php <option>
 
 <option> With the --help, -help, -h,
 or -? options, you can get this help.
 
 Example:
 check_for_guitars.php --search="term1" --email="foo@bar.com"

The accompanying php code works like the below:

<?php
/**
 * Show the help text when the first parameter is a help flag
 */
if(isset($argv[1]) && in_array($argv[1], array('--help', '-help', '-h', '-?'))) {
?>
Takes a given string (--search) and searches the
Stupid Deal of the Day for a match. If a match is
found an email is sent to (--email)
 
 Usage:
 <?php echo $argv[0]; ?> <option>
 
 <option> With the --help, -help, -h,
 or -? options, you can get this help.
 
 Example:
 <?php echo $argv[0]; ?> --search="term1" --email="foo@bar.com"
<?php } ?>

Now that the above is done, things are starting to work just like a traditional web app.

Grab and Parse Page

The first thing we need to do is get the actual page. To do this I used Snoopy.

$uri_to_check = 'http://www.musiciansfriend.com/stupid';
$snoopy = new Snoopy;
$snoopy->agent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)";
$snoopy->referer = "http://www.yahoo.com/";
$snoopy->fetch($uri_to_check);
$results = $snoopy->results;

The above fetches the page at $uri_to_check and stores its entire contents as a string in $results. Now we need to parse $results and find all the values we need. Here’s how to get the page title, which is the text of the first h1 element:

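//capture whatever sits between the opening and closing h1 tags
$pattern = "'<[^>]*h1[^>]*>(.*?)<[^>]*/h1[^>]*>'";
preg_match($pattern, $results, $match);
$page_title = $match[1]; //index 1 holds the captured text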
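
To see the title extraction in isolation, here is a small sketch that runs a similar pattern against a made-up HTML string. The function name extract_title() and the sample markup are assumptions for illustration and do not reflect the real page:

```php
<?php
// Sketch: extract_title() and the sample markup are made up for illustration.
function extract_title($html) {
    // 'i' ignores tag case, 's' lets '.' span line breaks inside the heading.
    if (preg_match('#<h1[^>]*>(.*?)</h1>#is', $html, $match)) {
        // Drop any inline tags inside the heading and surrounding whitespace.
        return trim(strip_tags($match[1]));
    }
    return ''; // no <h1> found
}

$sample = '<html><body><h1 class="deal">Electric <b>Guitar</b> Pack</h1></body></html>';
echo extract_title($sample), "\n"; // Electric Guitar Pack
```

Returning an empty string when nothing matches keeps later checks simple, since an empty title can never match a search term.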

Next, find out if there is a match in $input and create an array of the values:

//check each search term against the page title
$terms = explode(',', $input['search']); //assumes --search is a comma separated list
$total = count($terms);
$match_for = array();
$FOUND = FALSE;
for($i=0;$i<$total;$i++){
	if(stristr($page_title, trim($terms[$i])) !== FALSE) {
		$match_for[] = trim($terms[$i]);
		$FOUND = TRUE;
	}
}
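
The same matching step can be sketched as a small standalone function, which makes it easy to try against sample titles. The name find_matching_terms() and the sample title are made up for this example:

```php
<?php
// Sketch: find_matching_terms() is a hypothetical helper name.
function find_matching_terms($search, $page_title) {
    $matches = array();
    foreach (explode(',', $search) as $term) {
        $term = trim($term);
        // stripos() is a case-insensitive substring search; skip empty terms.
        if ($term !== '' && stripos($page_title, $term) !== false) {
            $matches[] = $term;
        }
    }
    return $matches;
}

$found = find_matching_terms('guitar, amp,tablature', 'Electric Guitar Pack');
print_r($found); // contains only "guitar"
```

Because this is a plain substring match, a short term such as "amp" would also match a title containing "Example"; keep that in mind when choosing terms.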

Basically, if $FOUND is TRUE, then check whether an alert has already been sent today and send a new one if not:

$htmlmessage = <<<HTML
Match found for <a href="$uri_to_check">%%search%%</a><br>
Title: %%title%% <br>
Sale Price: %%sale_price%%<br>
Original Price: %%og_price%%<br>
HTML;
if($FOUND){
 
	//check if the search was done today...
	$sql = "SELECT * FROM mf_checks WHERE title = '".$DB->es($page_title)."' AND DATE_FORMAT(`date_checked`,'%m') = '".date('m')."' AND DATE_FORMAT(`date_checked`,'%d') = '".date('d')."' AND DATE_FORMAT(`date_checked`,'%Y') = '".date('Y')."' LIMIT 1";
	$DB->query($sql);
	if($DB->getNumRows() == '1'){ //alert has already been sent so break out...
		echo "Already sent today... exiting...";
		exit;
	}
 
	//match was found so get the price now
	//(the array indexes below assume the sale price is on the first line after
	//the marker div and the regular price is on the line after that)
	$price_arr = explode('<div style="font-size:3em;color:#FF0000;font-weight:normal;padding:20px 0;">',$results);
	$price_arr = explode("\n",$price_arr[1]);
	$sale_price = strip_tags($price_arr[0]);
	$og_price = str_replace('Reg ','',strip_tags($price_arr[1]));
 
	$htmlmessage = str_replace(array('%%search%%','%%title%%','%%sale_price%%','%%og_price%%'),array('"'.implode(', ',$match_for).'"',$page_title,$sale_price,$og_price),$htmlmessage);
 
	//send the alert from, and to, the address passed in --email
	$mail = new Mailer();
	$mail->From = $input['email'];
	$mail->FromName = $input['email'];
	$mail->Subject = 'Found: '.$page_title;
	$mail->AltBody = strip_tags($htmlmessage);
	$mail->MsgHTML($htmlmessage);
	$mail->AddAddress($input['email']);
	if($mail->Send()){
		echo "Mail Sent";
	} else {
		echo "Mail Not Sent";
	}
 
	//add to the db
	$sql = "INSERT INTO mf_checks SET term = '".$DB->es(implode(', ',$match_for))."', title = '".$DB->es($page_title)."', sale_price = '".$DB->es($sale_price)."', og_price = '".$DB->es($og_price)."', date_checked = now(), alert_sent = '1'";
	$DB->query($sql);
}
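
The duplicate check can also be written with prepared statements, which avoids building SQL by string concatenation. Here is a rough sketch using PDO with an in-memory SQLite database so that it runs on its own. The script itself uses MySQL through its own $DB wrapper, so SQLite is a stand-in here, and the column types, function names, and sample values are all assumptions; only the column names come from the INSERT statement above:

```php
<?php
// Sketch only: SQLite stands in for the MySQL database used by the script.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// Assumed column types; only the column names come from the INSERT statement.
$pdo->exec('CREATE TABLE mf_checks (term TEXT, title TEXT, sale_price TEXT,
            og_price TEXT, date_checked TEXT, alert_sent INTEGER)');

// True if an alert for this title has already been recorded on the given day.
function already_alerted($pdo, $title, $day) {
    $stmt = $pdo->prepare('SELECT COUNT(*) FROM mf_checks WHERE title = ? AND date_checked = ?');
    $stmt->execute(array($title, $day));
    return $stmt->fetchColumn() > 0;
}

// Record that an alert was sent for this title on the given day.
function record_alert($pdo, $term, $title, $sale_price, $og_price, $day) {
    $stmt = $pdo->prepare('INSERT INTO mf_checks
        (term, title, sale_price, og_price, date_checked, alert_sent)
        VALUES (?, ?, ?, ?, ?, 1)');
    $stmt->execute(array($term, $title, $sale_price, $og_price, $day));
}

// Made-up sample values, purely for illustration.
$today = date('Y-m-d');
if (!already_alerted($pdo, 'Electric Guitar Pack', $today)) {
    record_alert($pdo, 'guitar', 'Electric Guitar Pack', '$99.99', '$199.99', $today);
}
var_dump(already_alerted($pdo, 'Electric Guitar Pack', $today)); // bool(true)
```

Storing the day as a plain Y-m-d string keeps the "already sent today" test to a single equality comparison.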

Automating

To have the script run automatically at a regular interval, set up a scheduled task in Start->Programs->Accessories->System Tools->Task Scheduler. Put the schedule on the Triggers tab of a new task, and add something like the below as the program to start on the Actions tab:

C:\php\php-win.exe C:\ProjectFiles\php_cli\check_for_guitars.php --search="guitar,amp,tablature" --email="foo@bar.com"

Note the use of php-win.exe with its full path. If you use plain “php” (php.exe), you’ll get an annoying DOS box popping up every time the script executes; php-win.exe runs the script without opening a console window.

Code

Download Check Guitar