excessive traffic from facebookexternalhit bot

Per other answers, the semi-official word from Facebook is “suck it”. It boggles me they cannot follow Crawl-delay (yes, I know it’s not a “crawler”, however GET’ing 100 pages in a few seconds is a crawl, whatever you want to call it).

Since one cannot appeal to their hubris, and DROP’ing their IP block is pretty draconian, here is my technical solution.

In PHP, execute the following code as quickly as possible for every request.

define( 'FACEBOOK_REQUEST_THROTTLE', 2.0 ); // Number of seconds permitted between each hit from facebookexternalhit

if( !empty( $_SERVER['HTTP_USER_AGENT'] ) && strpos(  $_SERVER['HTTP_USER_AGENT'], 'facebookexternalhit' ) === 0 ) {
    $fbTmpFile = sys_get_temp_dir().'/facebookexternalhit.txt';
    if( $fh = fopen( $fbTmpFile, 'c+' ) ) {
        $lastTime = fread( $fh, 100 );
        $microTime = microtime( TRUE );
        // check current microtime with microtime of last access
        if( $microTime - $lastTime < FACEBOOK_REQUEST_THROTTLE ) {
            // bail if requests are coming too quickly with http 503 Service Unavailable
            header( $_SERVER["SERVER_PROTOCOL"].' 503' );
            die;
        } else {
            // write out the microsecond time of last access
            rewind( $fh );
            fwrite( $fh, $microTime );
        }
        fclose( $fh );
    } else {
        header( $_SERVER["SERVER_PROTOCOL"].' 429' );
        die;
    }
}

You can test this from a command line with something like:

$ rm index.html*; wget -U "facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)" http://www.foobar.com/; less index.html

Improvement suggestions are welcome… I would guess their might be some concurrency issues with a huge blast.

Leave a Comment