Galootix: Scraping

2007-06-05

Scraping

I lifted this code from Google Hacks about six months ago. It hadn't failed once. Until today. As was pounded into my head over and over, scraping web pages is unreliable.

$ cat /home/galoot/bin/calc 
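#!/usr/bin/php5
<?php
// Send the command-line expression to Google as a search query.
$query = urlencode(join(' ', array_slice($argv, 1)));
$html = file_get_contents('http://www.google.com/search?q=' . $query);
// Capture whatever follows "= " inside a bold tag (note the greedy .+).
preg_match_all('{<b>.+= (.+?)</b>}', $html, $matches);
// Google pads digit groups with a font-tagged space; print a comma instead.
print str_replace('<font size=-2> </font>', ',',
  "\n{$matches[1][0]}\n\n");
?>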

Okay.

$ calc 2000/364

5.49450549

So far so good.

$ calc 2000/366

5.46448087

That's right.

$ calc 2000/365

{Carlo <b>...

Heh. That's from the 8th hit for those search terms.

Back to the drawing board.

3 Comments:

At 5:49 PM, MeanMrMustard said...: Try changing the first regex to

.+?= (.+?)

At 6:11 PM, MeanMrMustard said...: Sorry -- Blogger interpreted the bold tags as, well, bold tags. Wrap what I previously posted with bold tags.

The problem is that you want the shortest possible stretch of content bracketed by bold tags that contains an equal sign, and you were matching the longest possible stretch. On most pages that wasn't a problem, because there weren't any equal signs within bold tags besides the one you were looking for.

I used to scrape web pages for a living. Unreliable? Yeah. But sometimes it's the only way to get the job done.

At 1:15 AM, Galoot said...: This is why Larry's employed and I'm not.
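
For the record, here is a minimal sketch of the greedy versus lazy difference described in the comments. The sample markup in $html is a made-up stand-in, not real Google output, and the variable names are only for illustration. It runs the original pattern and the suggested one (with the bold tags put back) against the same string.

```php
<?php
// Hypothetical stand-in for a results page (not real Google markup): the
// calculator answer in bold, then a later hit with "= " near bold tags.
$html = '<b>2000 / 365 = 5.47945205</b><p><b>x</b> = {Carlo <b>...</b></p>';

// Greedy .+ runs to the end of the line, then backtracks to the LAST "= ".
preg_match_all('{<b>.+= (.+?)</b>}', $html, $greedy);

// Lazy .+? stops at the FIRST "= " it can reach after the first <b>.
preg_match_all('{<b>.+?= (.+?)</b>}', $html, $lazy);

print "greedy: {$greedy[1][0]}\n";  // greedy: {Carlo <b>...
print "lazy:   {$lazy[1][0]}\n";    // lazy:   5.47945205
```

The lazy version is still not bulletproof. The .+? is free to run across tag boundaries, so it only helps as long as the answer is the first bold stretch on the page with "= " in it.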
