A journey into php-cli and scraping
Published: 01/01/2009
Programming, Code
I recently had a couple of days to myself and wanted to experiment more with this php-cli thing I’d been thinking about. To help the process (and feed my guitar addiction; I have a serious problem), I decided to write a script to hit up the Stupid Deal page at Musician’s Friend and send me an email if the deal of the day matched a given term list.
Prep
I’m pretty sure all Windows installs of PHP include php-cli, but to check, run this at the command prompt:
php -v
You should see something like the below; note (cli):
PHP 5.2.6 (cli) (built: May 2 2008 18:02:07)
Copyright (c) 1997-2008 The PHP Group
Zend Engine v2.2.0, Copyright (c) 1998-2008 Zend Technologies
    with Xdebug v2.0.3, Copyright (c) 2002-2007, by Derick Rethans
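A script can also confirm this for itself at run time. As a small sketch (the function name running_from_cli() is made up for illustration), php_sapi_name() returns "cli" when the script is run through the command-line binary:

```php
<?php
// php_sapi_name() reports which interface PHP is running under;
// the command-line binary reports "cli".
function running_from_cli()
{
    return php_sapi_name() === 'cli';
}

if (!running_from_cli()) {
    echo "This script is meant to be run from the command line.\n";
}
```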
Assuming it’s all worked out, here are some additional requirements:
1. Must work like a *nix CLI program; it’s just going to make things easier for me. For example, the program should be executed like:
C:\ProjectFiles\php_cli>php check_for_guitars.php --search="guitar,amp,tablature" --email="foo@bar.com"
2. Must have error checking and validation.
3. Must prevent duplicate notifications.
4. Provide a “help” mode (--help, -help, -h, -?).
5. Ability to be set as an Automated Task (the Windows cron equivalent).
Argument Handling
To begin, I needed to change the way passed parameters are interpreted. Before version 5.3, php handled parameters passed to scripts in a pretty messed up way; but there’s a function available in the notes of the php manual that helps a lot.
inc.php
    function arguments($argv) {
        $_ARG = array();
        foreach ($argv as $arg) {
            // matches -key, --key, -key=value and --key=value
            if (preg_match('#^-{1,2}([a-zA-Z0-9]*)=?(.*)$#', $arg, $matches)) {
                $key = $matches[1];
                switch ($matches[2]) {
                    case '':
                    case 'true':
                        $arg = true;
                        break;
                    case 'false':
                        $arg = false;
                        break;
                    default:
                        $arg = $matches[2];
                }

                /* make unix like -afd == -a -f -d */
                if(preg_match("/^-([a-zA-Z0-9]+)/", $matches[0], $match)) {
                    $string = $match[1];
                    for($i=0; strlen($string) > $i; $i++) {
                        $_ARG[$string[$i]] = true;
                    }
                } else {
                    $_ARG[$key] = $arg;
                }
            } else {
                // anything that is not an option (e.g. the script name)
                $_ARG['input'][] = $arg;
            }
        }
        return $_ARG;
    }

    $input = arguments($argv);
    /*
    print_r($input) gives something like:
    Array
    (
        [input] => Array
            (
                [0] => get_music.php
            )
        [search] => guitar,amp,tablature
        [email] => foo@bar.com
    )
    */
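On PHP 5.3 and later, the built-in getopt() also understands long options such as --search=..., so a rough equivalent could be sketched as below. This is an alternative to the arguments() helper, not what the script in this post uses, and the variable names are only illustrative:

```php
<?php
// A long option followed by ":" requires a value (--search=..., --email=...).
// "h" and "help" take no value.
$options = getopt('h', array('search:', 'email:', 'help'));
if ($options === false) {
    $options = array(); // getopt() returns FALSE on failure
}

$search     = isset($options['search']) ? $options['search'] : null;
$email      = isset($options['email']) ? $options['email'] : null;
$wants_help = isset($options['h']) || isset($options['help']);
```

Unlike arguments(), getopt() only returns the options it was told about, so a mistyped option is silently ignored rather than stored under its own key.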
Now that we can access the passed variables, we need to validate them like in any other script. The code below checks whether a key is present in the $input array; if it isn’t (or the value is too short), it loops, prompting on STDIN and validating each reply, and breaks out of the loop once a valid value is entered.
    //make sure we have a value for "search"
    $validate_search = FALSE;
    if(!array_key_exists('search',$input)){
        $validate_search = TRUE;
    } else {
        if(strlen($input['search']) < 2){
            $validate_search = TRUE;
        }
    }
    if($validate_search){
        echo "Please enter what to search for:\n";
        while(1){
            $input['search'] = trim(fgets(STDIN)); // reads one line from STDIN
            if(strlen($input['search']) >= 2){//it's a valid string
                break;
            }
            echo "Please enter something to search for ";
            echo "(at least 2 characters):\n";
            echo "Example: \"guitar,bass,dvd\"\n";
        }
    }
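The same ask-until-valid loop is needed again for the email address, so it could be factored into one function. The sketch below is only an illustration: prompt_until_valid() and is_long_enough() are hypothetical names, and it assumes “at least 2 characters” is the intended rule for a search string. The stream is passed in as a parameter so that the function can read from STDIN in the real script or from any other readable stream.

```php
<?php
// Hypothetical helper: keeps asking until $is_valid accepts the answer.
// Returns the accepted value, or null if the input ends first.
function prompt_until_valid($question, $retry_message, $is_valid, $stream)
{
    echo $question . "\n";
    while (($line = fgets($stream)) !== false) {
        $value = trim($line);
        if (call_user_func($is_valid, $value)) {
            return $value;
        }
        echo $retry_message . "\n";
    }
    return null; // end of input reached before a valid value was entered
}

// Assumed rule for the search string: at least 2 characters.
function is_long_enough($value)
{
    return strlen($value) >= 2;
}

// In the CLI script this would be called with STDIN, for example:
// $input['search'] = prompt_until_valid(
//     'Please enter what to search for:',
//     'Please enter something to search for (at least 2 characters):',
//     'is_long_enough',
//     STDIN
// );
```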
    //make sure we have a valid email address
    $validate_email = FALSE;
    if(!array_key_exists('email',$input)){
        $validate_email = TRUE;
    } else {
        if(!checkEmail_basic($input['email'])){
            $validate_email = TRUE;
        }
    }
    if($validate_email){
        echo "Please enter an email to send the alert to:\n";
        while(1){
            $input['email'] = trim(fgets(STDIN)); // reads one line from STDIN
            if(checkEmail_basic($input['email'])){//it's a valid email
                break;
            }
            echo "Please enter a valid email address:\n";
        }
    }
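The checkEmail_basic() helper isn’t shown in this post. A minimal stand-in, assuming it only needs to return TRUE or FALSE for a single address, could lean on PHP’s filter_var() (available from PHP 5.2):

```php
<?php
// Hypothetical stand-in for checkEmail_basic(); the original helper is not shown.
function checkEmail_basic($email)
{
    // filter_var() returns the address on success and FALSE on failure.
    return filter_var(trim($email), FILTER_VALIDATE_EMAIL) !== false;
}
```

This only checks that the address is well formed; it says nothing about whether the mailbox exists.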
Help
For the help mode, there’s an example in the PHP manual that maintains the *nix tradition of “--help”, “-h” or “-?”, like the below:
    C:\ProjectFiles\php_cli>php check_for_guitars.php --help

    Takes a given string (--search) and searches the Stupid Deal of the Day for a match.
    If a match is found an email is sent to (--email)

      Usage:
      check_for_guitars.php <option> <option>

      With the --help, -help, -h, or -? options, you can get this help.

      Example:
      check_for_guitars.php --search="term1" --email="foo@bar.com"
The accompanying php code works like the below:
    <?php
    /**
     * Check if help was requested
     */
    if(isset($argv[1]) && in_array($argv[1], array('--help', '-help', '-h', '-?'))) {
    ?>

    Takes a given string (--search) and searches the Stupid Deal of the Day for a match.
    If a match is found an email is sent to (--email)

      Usage:
      <?php echo $argv[0]; ?> <option> <option>

      With the --help, -help, -h, or -? options, you can get this help.

      Example:
      <?php echo $argv[0]; ?> --search="term1" --email="foo@bar.com"

    <?php
    }
    ?>
Now that the above is done things are starting to work just like a traditional web app.
Grab and Parse Page
The first thing we need to do is get the actual page. To do this I used Snoopy.
    // assumes the Snoopy class has already been loaded, e.g. via require_once
    $uri_to_check = 'http://www.musiciansfriend.com/stupid';
    $snoopy = new Snoopy;
    $snoopy->agent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)";
    $snoopy->referer = "http://www.yahoo.com/";
    $snoopy->fetch($uri_to_check);
    $results = $snoopy->results;
The above returns the entire contents of $uri_to_check into a string in $results. Now we need to parse $results and find all the values we need. Here’s how to get the page title:
    $pattern = "'<[^>]*h1[^>]*>(.*?)<[^>]*/h1[^>]*>'";
    preg_match($pattern, $results, $match);
    $page_title = $match[1]; // the text captured between the h1 tags
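To see the title extraction in isolation, here is a small self-contained sketch. The extract_first_h1() function and the sample HTML are made up for illustration; the real page’s markup may well differ, and the pattern is a slightly stricter variant rather than the one used in the script.

```php
<?php
// Returns the text of the first <h1> in $html, or null if there is none.
function extract_first_h1($html)
{
    // "s" lets "." span newlines and "i" ignores tag case;
    // the lazy (.*?) stops at the first closing tag.
    if (preg_match('#<h1[^>]*>(.*?)</h1>#is', $html, $match)) {
        return trim(strip_tags($match[1]));
    }
    return null;
}

// Made-up sample markup, not the real page.
$sample = '<html><body><h1 class="deal">Acme <em>Strat-style</em> Guitar</h1></body></html>';
echo extract_first_h1($sample), "\n"; // prints "Acme Strat-style Guitar"
```

Returning null when nothing matches makes it easy to bail out early if the page layout changes.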
Next, find out if there is a match in $input and create an array of the values:
    //check if there's a match in the passed $input array
    //(assumes $input['search'] has already been split into an array of terms)
    $total = count($input['search']);
    $match_for = array();
    $FOUND = FALSE;
    for($i=0;$i<$total;$i++){
        if(stristr($page_title, trim($input['search'][$i])) !== FALSE) {
            $match_for[] = trim($input['search'][$i]);
            $FOUND = TRUE;
        }
    }
Basically, if $FOUND is TRUE, then check whether an alert has already been sent and, if not, send a new one:
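The step that turns the comma-separated --search string into individual terms isn’t shown here. As a sketch under that assumption, the splitting and matching could look like this; split_terms() and find_matching_terms() are hypothetical names, not functions from the original script:

```php
<?php
// Splits "guitar, amp,tablature" into trimmed, non-empty terms.
function split_terms($csv)
{
    $terms = array_map('trim', explode(',', $csv));
    // 'strlen' as the callback drops empty strings; array_values() reindexes.
    return array_values(array_filter($terms, 'strlen'));
}

// Returns the terms that occur anywhere in $title, ignoring case.
function find_matching_terms($title, array $terms)
{
    $matches = array();
    foreach ($terms as $term) {
        if (stripos($title, $term) !== false) {
            $matches[] = $term;
        }
    }
    return $matches;
}
```

An empty result from find_matching_terms() plays the role of $FOUND being FALSE. Note that this is a plain substring match, so “amp” also matches “Amplifier”.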
    $htmlmessage = <<<HTML
    Match found for <a href="$uri_to_check">%%search%%</a><br>
    Title: %%title%% <br>
    Sale Price: %%sale_price%%<br>
    Original Price: %%og_price%%<br>
    HTML;

    if($FOUND){
        //check if the search was done today…
        $sql = "SELECT * FROM mf_checks
                WHERE title = '".$DB->es($page_title)."'
                AND DATE_FORMAT(`date_checked`,'%m') = '".date('m')."'
                AND DATE_FORMAT(`date_checked`,'%d') = '".date('d')."'
                AND DATE_FORMAT(`date_checked`,'%Y') = '".date('Y')."'
                LIMIT 1";
        $DB->query($sql);
        if($DB->getNumRows() == '1'){
            //alert has already been sent so break out…
            echo "Already sent today… exiting…";
            exit;
        }

        //match was found so get the price now
        //(the array indexes below are assumed; they depend on the page's markup)
        $price_arr = explode(' <div style="font-size:3em;color:#FF0000;font-weight:normal;padding:20px 0;">',$results);
        $price_arr = explode("\n",$price_arr[1]);
        $sale_price = strip_tags($price_arr[0]);
        $og_price = str_replace('Reg ','',strip_tags($price_arr[1]));

        //fill in the %%placeholders%% in the message template
        $htmlmessage = str_replace(
            array('%%search%%','%%title%%','%%sale_price%%','%%og_price%%'),
            array('"'.implode(', ',$match_for).'"',$page_title,$sale_price,$og_price),
            $htmlmessage
        );

        $mail = new Mailer();
        $mail->From = $input['email'];
        $mail->FromName = $input['email'];
        $mail->Subject = 'Found: '.$page_title;
        $mail->AltBody = strip_tags($htmlmessage);
        $mail->MsgHTML($htmlmessage);
        $mail->AddAddress($input['email']);
        if($mail->Send()){
            echo "Mail Sent";
        } else {
            echo "Mail Not Sent";
        }

        //add to the db
        $sql = "INSERT INTO mf_checks SET
                term = '".$DB->es(implode(', ',$match_for))."',
                title = '".$DB->es($page_title)."',
                sale_price = '".$DB->es($sale_price)."',
                og_price = '".$DB->es($og_price)."',
                date_checked = now(),
                alert_sent = '1'";
        $DB->query($sql);
    }
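The %%name%% placeholders in the message are filled with str_replace() and two parallel arrays, which have to be kept in the same order. As a small sketch of the same idea, a hypothetical fill_template() helper (not part of the original script) could take a single name => value array and use strtr() instead:

```php
<?php
// Hypothetical helper: replaces each %%name%% in $template with $values['name'].
function fill_template($template, array $values)
{
    $pairs = array();
    foreach ($values as $name => $value) {
        $pairs['%%' . $name . '%%'] = $value;
    }
    // strtr() with an array replaces all keys in one pass, so a replaced
    // value is never scanned again for further placeholders.
    return strtr($template, $pairs);
}

echo fill_template(
    "Title: %%title%% (%%sale_price%%)\n",
    array('title' => 'Acme Guitar', 'sale_price' => '$99.00')
);
```

Values are inserted as-is, so anything that isn’t already HTML would need htmlspecialchars() before going into an HTML message.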
Automating
To have the script check automatically at a regular interval, you have to set up a task in Start->Programs->Accessories->System Tools->Task Scheduler and add something like the below as the program to run (on the Actions tab of a new task; the schedule itself goes on the Triggers tab):
C:\php\php-win.exe C:\ProjectFiles\php_cli\check_for_guitars.php --search="guitar,amp,tablature" --email="foo@bar.com"
Note the full path to php-win.exe. If you use “php” by itself you’ll get an annoying dos box popping up every time the script executes; use the full path to your php-win.exe program.