I lifted this code from Google Hacks about six months ago. It hadn't failed once. Until today. As was pounded into my head over and over, scraping web pages is unreliable.

$ cat /home/galoot/bin/calc 
preg_match_all('{<b>.+= (.+?)</b>}',
file_get_contents('http://www.google.com/search?q=' .
urlencode(join(' ', array_splice($argv, 1)))), $matches);
print str_replace('<font size=-2> </font>', ',',

$ calc 2000/364

So far so good.

$ calc 2000/366

That's right.

$ calc 2000/365

{Carlo <b>...

Heh. That's from the 8th hit for those search terms.

Back to the drawing board.


At 5:49 PM, Blogger MeanMrMustard said...

Try changing the first regex to

.+?= (.+?)

At 6:11 PM, Blogger MeanMrMustard said...

Sorry -- Blogger interpreted the bold tags as, well, bold tags. Wrap what I previously posted with bold tags.

The problem is that you want the shortest possible stretch of content bracketed by bold tags that contains an equal sign, and you were matching the longest possible stretch. On most pages that wasn't a problem, because there weren't any equal signs within bold tags besides the one you were looking for.
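To make the greedy-versus-lazy point concrete, here is a minimal PHP sketch. The $html string is made up for illustration (it is not Google's real markup, and the calculator answer in it is a stand-in), and first_capture is a hypothetical helper, not part of the original script. The only difference between the two patterns is `.+` versus `.+?`.

```php
<?php
// Hypothetical one-line result page: the calculator answer comes first,
// and a later hit happens to contain "= " ahead of a closing </b>.
$html = '<b>2000 / 365 = 5.47945205</b> ... x = {Carlo <b>...</b>';

// Hypothetical helper: return the first capture group, or null on no match.
function first_capture(string $pattern, string $html): ?string
{
    return preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}

// Greedy .+ grabs as much as it can, so it backs off only to the LAST "= "
// on the line that is still followed by a </b>.
$greedy = first_capture('{<b>.+= (.+?)</b>}', $html);

// Lazy .+? grabs as little as it can, so it stops at the FIRST "= "
// after the opening <b>.
$lazy = first_capture('{<b>.+?= (.+?)</b>}', $html);

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

On this sample the greedy pattern skips past the answer and captures text from the later hit, while the lazy one captures the number right after the first equal sign. Note that `.` does not match newlines by default, so this only bites when the stray equal sign sits on the same line as the answer.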

I used to scrape web pages for a living. Unreliable? Yeah. But sometimes it's the only way to get the job done.

At 1:15 AM, Blogger Galoot said...

This is why Larry's employed and I'm not.

