Scraping
I lifted this code from Google Hacks about six months ago. It hasn't failed yet. Until today. As was pounded into my head over and over, scraping web pages is unreliable.
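$ cat /home/galoot/bin/calc
#!/usr/bin/php5
<?php
// Search Google for the expression and grab the bolded "... = answer" text.
preg_match_all('{<b>.+= (.+?)</b>}',
    file_get_contents('http://www.google.com/search?q=' .
        urlencode(join(' ', array_splice($argv, 1)))), $matches);
// Replace the small-font space markup in the result with a comma.
print str_replace('<font size=-2> </font>', ',',
    "\n{$matches[1][0]}\n\n");
?>
Okay.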
$ calc 2000/364
5.49450549
So far so good.
$ calc 2000/366
5.46448087
That's right.
$ calc 2000/365
{Carlo <b>...
Heh. That's from the 8th hit for those search terms.
Back to the drawing board.
3 Comments:
Try changing the first regex to
.+?= (.+?)
Sorry -- Blogger interpreted the bold tags as, well, bold tags. Wrap what I previously posted with bold tags.
The problem is that you want the shortest possible stretch of content bracketed by bold tags that contains an equal sign, and you were matching the longest possible stretch. On most pages that wasn't a problem, because there weren't any equal signs within bold tags besides the one you were looking for.
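The greedy-versus-lazy difference described above can be seen in a minimal sketch. The one-line $html sample here is made up for illustration and is not Google's actual markup; only the two patterns come from the post and the comment.

```php
<?php
// Hypothetical page fragment: the calculator answer in bold, followed on the
// same line by a later search hit that also has "= " before a bold tag.
$html = '<b>2000 / 365 = 5.47945205</b> ... <i>x = {Carlo <b>...</b></i>';

// Greedy: .+ runs as far right as it can, then backtracks, so the capture
// is taken from the LAST "= " on the line.
preg_match('{<b>.+= (.+?)</b>}', $html, $greedy);

// Lazy: .+? grows one character at a time, so the capture is taken from the
// FIRST "= " after the opening <b>.
preg_match('{<b>.+?= (.+?)</b>}', $html, $lazy);

echo "greedy: {$greedy[1]}\n";   // greedy: {Carlo <b>...
echo "lazy:   {$lazy[1]}\n";     // lazy:   5.47945205
```

Note that the lazy version still starts at the first bold tag on the page, so it is a narrower bet rather than a guarantee; any earlier bolded text with an equal sign after it would still win.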
I used to scrape web pages for a living. Unreliable? Yeah. But sometimes it's the only way to get the job done.
This is why Larry's employed and I'm not.