User:Dsimic/Traffic stats calculation

Source: Wikipedia, the free encyclopedia.

Automated monthly statistics calculation

Below is a rather simple PHP program that fetches the page views statistics provided by the Pageview API in JSON format (a public API developed and maintained by the Wikimedia Foundation; see also its detailed REST API documentation) for a specified list of articles, and calculates their total monthly views and average views per day. The fetched statistics don't include spider- or bot-generated traffic. The program is intended to be run interactively from a command-line interface (CLI); instead of running it locally on a machine capable of executing PHP scripts, you may also use one of the freely available online PHP development environments.
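As a rough illustration of what one such request and its response look like, here is a minimal sketch. The helper names pageviews_url() and sum_views() are hypothetical and not part of the program, and the sample response is made up; only the URL layout and the "items"/"views" fields are taken from the program code below.

```php
<?php

// Build the per-article request URL in the same shape the program uses.
// pageviews_url() is a hypothetical helper name, used here for illustration.
function pageviews_url($project, $article, $start, $end) {
    return 'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/' .
           $project . '/all-access/user/' .
           rawurlencode(ucfirst($article)) . "/daily/{$start}/{$end}";
}

// Sum the "views" fields of the "items" array in a JSON response body;
// returns null if the body doesn't have the expected shape.
function sum_views($json_body) {
    $data = json_decode($json_body, true);

    if (!is_array($data) || !isset($data['items']) || !is_array($data['items']))
        return null;

    $views = 0;

    foreach ($data['items'] as $item)
        $views += $item['views'];

    return $views;
}

// made-up sample response, for illustration only
$sample = '{"items":[{"timestamp":"2016010100","views":120},' .
          '{"timestamp":"2016010200","views":80}]}';

echo pageviews_url('en.wikipedia.org', 'Row hammer', '20160101', '20160131'), "\n";
echo sum_views($sample), "\n";
```

No network access is needed to run this sketch, because it only builds the URL and parses a hard-coded response.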

Initially, this program used the page views statistics provided by stats.grok.se in JSON format, but that web service stopped being updated around mid-January 2016 and remains defunct as of June 2016. If needed, you can also have a look at the older version of the program code and documentation.

Like pretty much everything else here on Wikipedia, I'm releasing this program code under the terms of the CC BY-SA 3.0 license, so please feel free to use it and modify it according to your needs. Of course, feel free to leave me a message on my talk page in case you have any questions, suggestions, bug reports, etc.

Source code

Before running this program, you need to modify the list of articles contained in the $articles variable (the code below lists the articles I've created or started), and set the month and year for which statistics are to be fetched and calculated through the FETCH_MONTH and FETCH_YEAR constants, respectively. When the program is configured to calculate statistics for the current month, it takes into account only the whole (elapsed) days; as a result, running the program on the first day of a month to calculate that month's statistics isn't supported. Also, if whole days are missing from the statistics data available from the Pageview API, the program excludes such zero-page-view days when calculating the averages. The FETCH_PROJECT constant selects the encyclopedia: en.wikipedia.org is the English Wikipedia, de.wikipedia.org is the German Wikipedia, etc.
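The day-counting rules described above can be sketched in isolation as follows. This is only an illustration under stated assumptions: fetch_days() and daily_average() are hypothetical helper names that don't appear in the program, and the number of days in the month is passed in directly rather than looked up.

```php
<?php

// Number of days to fetch: the whole month for a past month, or only the
// elapsed (whole) days when the month being fetched is the current one.
// fetch_days() is a hypothetical helper name.
function fetch_days($days_in_month, $is_current_month, $today_day) {
    return $is_current_month ? ($today_day - 1) : $days_in_month;
}

// Average views per day, leaving out the days with no statistics data.
// daily_average() is a hypothetical helper name.
function daily_average($views_total, $days_total, $days_missing) {
    $days_counted = $days_total - $days_missing;

    if ($days_counted <= 0)
        return null;    // no data at all, so there is no meaningful average

    return intval($views_total / $days_counted);
}

echo fetch_days(31, false, 15), "\n";       // a past month: all of its days
echo fetch_days(31, true, 15), "\n";        // current month, run on the 15th
echo daily_average(183934, 31, 0), "\n";    // totals from the output example
echo daily_average(3000, 31, 1), "\n";      // one day missing from the data
```

On the first day of the current month fetch_days() yields zero, which is why the program refuses to run in that case.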

Just as a note, getting ready-to-run PHP code of this program is as easy as viewing the

bugfixes
are implemented.

<?php

define('FETCH_MONTH',   '01');                  // MM
define('FETCH_YEAR',    '2016');                // YYYY
define('FETCH_PROJECT', 'en.wikipedia.org');    // "en.wikipedia.org", "de.wikipedia.org", etc.

$articles = array('Stagefright (bug)',
                  'Row hammer',
                  'Address generation unit',
                  'UniDIMM',
                  'kdump (Linux)',
                  'kernfs (BSD)',
                  'kernfs (Linux)',
                  'ftrace',
                  'Android Runtime',
                  'WebScaleSQL',
                  'Intel X99',
                  'HipHop Virtual Machine',
                  'kpatch',
                  'kGraft',
                  'CoreOS',
                  'ARM Cortex-A17',
                  'Solid-state storage',
                  'Port Control Protocol',
                  'zswap',
                  'Emdebian Grip',
                  'ThinkPad 8',
                  'Laravel',
                  'OpenLMI',
                  'Open vSwitch',
                  'Distributed Overlay Virtual Ethernet',
                  'Management Component Transport Protocol',
                  'Buildroot',
                  'dm-cache',
                  'bcache',
                  'SATA Express',
                  'OpenZFS',
                  'List of Eurocrem packages',
                  'M.2',
                  'Eurocrem');

// ---------------------------------------------
// obviously, configurable stuff ends here
// ---------------------------------------------

define('CHUNK_SIZE',  10);    // articles, imposed by the Pageview API rate limit (see below)
define('CHUNK_SLEEP', 1);     // seconds, also related to the API rate limit

define('EXIT_SUCCESS', 0);    // program exit codes
define('EXIT_FAILURE', 1);

set_time_limit(0);
ini_set('memory_limit', 67108864);
ini_set('default_socket_timeout', 90);

// a few short helper functions

function plural_output($value, $unit) {
    return (number_format($value) . " {$unit}" . ((abs($value) != 1) ? 's' : ''));
}

function progress_message($message = '.') {
    static $last_message = null;

    $now     = microtime(true);
    $ret_val = false;

    if (($last_message === null) ||
        (($now - $last_message) > 0.5)) {    // one message every 0.5 seconds
        echo($message);

        $last_message = $now;
        $ret_val      = true;    // the message has been printed
    }

    return ($ret_val);
}

// prepare the cURL handles for all articles

echo("\nFetching statistics data: ");

$start_time     = microtime(true);
$handles        = array();

$articles_total = count($articles);
$day_of_month   = @date('j');
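$current_month  = ((FETCH_MONTH == @date('m')) &&    // the year has to match as well, otherwise the
                   (FETCH_YEAR  == @date('Y')));     // same month of a past year would be cut short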

if ($articles_total == 0) {    // a small sanity check
    echo("no articles specified!\n");
    exit(EXIT_FAILURE);
}

if ($current_month && ($day_of_month == 1)) {       // account only the whole days, also knowing
    echo("no elapsed days in current month!\n");    // that the Pageview API rejects invalid dates
    exit(EXIT_FAILURE);
}

$days_total  = !$current_month
               ? cal_days_in_month(CAL_GREGORIAN, FETCH_MONTH, FETCH_YEAR)
               : ($day_of_month - 1);

$fetch_range = FETCH_YEAR . FETCH_MONTH . '01/' .
               FETCH_YEAR . FETCH_MONTH . sprintf('%02d', $days_total);

for ($id = 0; $id < $articles_total; $id++) {
    $handles[$id] = curl_init();

    curl_setopt($handles[$id], CURLOPT_URL, 'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/' .
                                            FETCH_PROJECT . '/all-access/user/' .
                                            rawurlencode(ucfirst($articles[$id])) . "/daily/{$fetch_range}");

    curl_setopt($handles[$id], CURLOPT_HEADER, false);
    curl_setopt($handles[$id], CURLOPT_RETURNTRANSFER, true);
    curl_setopt($handles[$id], CURLOPT_SSL_VERIFYPEER, true);    // verify the server certificate

    curl_setopt($handles[$id], CURLOPT_CONNECTTIMEOUT, 20);
    curl_setopt($handles[$id], CURLOPT_TIMEOUT, 60);
    curl_setopt($handles[$id], CURLOPT_DNS_CACHE_TIMEOUT, 3600);

    curl_setopt($handles[$id], CURLOPT_FORBID_REUSE, false);
    curl_setopt($handles[$id], CURLOPT_FRESH_CONNECT, false);
    curl_setopt($handles[$id], CURLOPT_MAXCONNECTS, 10);
    
    curl_setopt($handles[$id], CURLOPT_USERAGENT, 'https://en.wikipedia.org/wiki/User_talk:Dsimic');
}

progress_message();

// run the cURL handles in chunks because the Pageview API imposes a rate limit,
// which, as of June 1, 2016, is specified at 10 requests per second, although
// it seems to be happily handling *much* higher rates

$handle_all     = curl_multi_init();
$chunks         = ceil(1.0 * $articles_total / CHUNK_SIZE);
$output         = array();
$error_messages = array('Parsing JSON data failed' => -1);

$views_total    = 0;
$failures       = 0;
$days_available = array();
$php_5_5        = version_compare(PHP_VERSION, '5.5.0', '>=');    // also true for PHP 7 and later

if ($php_5_5) {    // curl_multi_setopt() is available since PHP 5.5.0
    curl_multi_setopt($handle_all, CURLMOPT_PIPELINING, true);
    curl_multi_setopt($handle_all, CURLMOPT_MAXCONNECTS, 10);
}

for ($chunk = 0; $chunk < $chunks; $chunk++) {    // fetch one chunk at a time
    $id_limit = min(($chunk + 1) * CHUNK_SIZE, $articles_total);

    for ($id = $chunk * CHUNK_SIZE; $id < $id_limit; $id++)    // all articles in this chunk
        curl_multi_add_handle($handle_all, $handles[$id]);

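    do {    // fetch the articles stats data in JSON format...
        $status = curl_multi_exec($handle_all, $running);

        if ($running > 0)                           // wait for socket activity instead
            curl_multi_select($handle_all, 0.1);    // of spinning in a busy loop

        progress_message();
    } while (($status == CURLM_CALL_MULTI_PERFORM) ||
             ($running > 0));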

    for ($id = $chunk * CHUNK_SIZE; $id < $id_limit; $id++) {    // ... and process it
        $json = curl_multi_getcontent($handles[$id]);

        if (($json == '') ||    // is the JSON Ok?
            (($json = json_decode($json, true)) === null) ||
            !array_key_exists('items', $json) ||
            !is_array($json['items'])) {

            ++$failures;

            if (($message = curl_error($handles[$id])) != '') {        // for some reason, curl_errno()
                if (!array_key_exists($message, $error_messages)) {    // always returns zero here
                    $errno = -1 * count($error_messages) - 1;
                    $error_messages[$message] = $errno;
                }
                else    // already seen
                    $errno = $error_messages[$message];
            }
            else    // below -1 are the cURL errors
                $errno = -1;

            $output[$id] = $errno;
        }
        else {    // fetched JSON data is Ok
            $views = 0;

            foreach ($json['items'] as $json_item) {
                $views += $json_item['views'];

                if ($json_item['views'] > 0)    // complete days may be missing
                    $days_available[$json_item['timestamp']] = true;
            }

            $views_total += $views;
            $output[$id]  = $views;
        }

        curl_multi_remove_handle($handle_all, $handles[$id]);
        curl_close($handles[$id]);

        progress_message();    // done with this chunk
    }

    if ($chunk != ($chunks - 1)) {    // don't sleep after the last chunk
        $message = '#';               // all this results in smooth progress messages
        $limit   = CHUNK_SLEEP * 4;

        for ($i = 0; $i <= $limit; $i++) {
            if (progress_message($message) === true)    // print only one "marker"
                $message = '.';

            usleep(250000);
        }
    }
}

curl_multi_close($handle_all);
echo(" done.\n\n");

// done fetching all chunks of the stats data, generate and print the output...

arsort($output, SORT_NUMERIC);

$error_messages = array_flip($error_messages);
$articles_ok    = $articles_total - $failures;
$first_error    = true;

foreach ($output as $id => $views)
    if ($views >= 0)
        echo("- {$articles[$id]}: total " . plural_output($views, 'view') . "\n");
    else {
        if ($first_error && ($articles_ok > 0)) {    // display an empty line before
            echo("\n");                              // the first failure message
            $first_error = false;
        }

        echo("> {$articles[$id]}: failure ({$error_messages[$views]})\n");
    }

// ... and the final summary

$days_missing = $days_total - count($days_available);
$month_name   = @date('F', @strtotime(FETCH_YEAR . '-' . FETCH_MONTH . '-01'));

$elapsed_time = microtime(true) - $start_time;
$elapsed_min  = intval($elapsed_time / 60);
$elapsed_sec  = round($elapsed_time - $elapsed_min * 60);

echo("\nDone, {$month_name} " . FETCH_YEAR . ' statistics for ' . plural_output($articles_ok, 'article') .
     ' fetched in ' . (($elapsed_min > 0)
                       ? (plural_output($elapsed_min, 'minute') . ' and ')
                       : '') .
     plural_output($elapsed_sec, 'second') . ".\n" .
     (($failures > 0)
      ? ('Fetching the views statistics failed for ' . plural_output($failures, 'article') . ".\n")
      : ''));

if ($days_total > $days_missing) {                                          // it's entirely possible that
    $views_daily = intval($views_total / ($days_total - $days_missing));    // all days were missing

    echo('Total ' . plural_output($views_total, 'view') . ', averaging in ' .
         plural_output($views_daily, 'view') . ' per day (' .
         plural_output($days_total, ($current_month ? 'whole ' : '') . 'day') .
         ' in ' . ($current_month ? 'the current' : 'that') . ' month' .
         (($days_missing > 0)
          ? (', with the statistics unavailable for ' . plural_output($days_missing, 'day'))
          : '') .
         ").\n");
} else {                                                                    // no statistics data
    echo('Sorry, no statistics data is available at the moment for ' .
         ($current_month ? 'the current' : 'that') .  " month.\n");

    $errno = ((($days_total != $days_missing) ? 10 : 0) +    // just in case, perform some additional
              (($views_total != 0) ? 20 : 0));               // sanity checks on the internal logic

    if ($errno > 0) {
        echo("\nInternal errors detected (error code: {$errno}), please report on " .
             "https://en.wikipedia.org/wiki/User_talk:Dsimic by providing complete program output.\n");

        exit(EXIT_FAILURE);
    }
}

exit(EXIT_SUCCESS);

?>

Output example

Below is an example of the output produced when the program above is run. The program sorts the articles by their total page views in descending order, so the article that received the most page views comes first in the printed list. In the "Fetching statistics data" line, the dots (.) represent progress updates during the processing of each article chunk, while the hash marks (#) mark the beginning of processing for each new article chunk. This chunking is necessary because the Pageview API imposes a rate limit on the queries it receives, which, as of June 1, 2016, is specified at 10 requests per second.

Fetching statistics data: ...#.#.#. done.

- M.2: total 64,598 views
- SATA Express: total 21,724 views
- Laravel: total 16,115 views
- Stagefright (bug): total 12,717 views
- CoreOS: total 11,593 views
- Android Runtime: total 9,493 views
- Intel X99: total 7,928 views
- HipHop Virtual Machine: total 5,944 views
- Row hammer: total 3,896 views
- Open vSwitch: total 3,769 views
- Solid-state storage: total 3,006 views
- dm-cache: total 2,044 views
- OpenZFS: total 2,011 views
- kpatch: total 1,927 views
- UniDIMM: total 1,924 views
- ARM Cortex-A17: total 1,758 views
- Port Control Protocol: total 1,621 views
- Buildroot: total 1,397 views
- bcache: total 1,323 views
- kdump (Linux): total 1,184 views
- zswap: total 1,052 views
- Eurocrem: total 1,032 views
- Management Component Transport Protocol: total 961 views
- ftrace: total 921 views
- Address generation unit: total 723 views
- kGraft: total 630 views
- kernfs (Linux): total 598 views
- ThinkPad 8: total 427 views
- Distributed Overlay Virtual Ethernet: total 409 views
- WebScaleSQL: total 317 views
- Emdebian Grip: total 284 views
- kernfs (BSD): total 280 views
- OpenLMI: total 229 views
- List of Eurocrem packages: total 99 views

Done, January 2016 statistics for 34 articles fetched in 7 seconds.
Total 183,934 views, averaging in 5,933 views per day (31 days in that month).
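
The chunking described above can be sketched on its own as follows. This is a simplified illustration, not the program's actual loop: it uses array_chunk() instead of the program's index arithmetic, article_chunks() is a hypothetical helper name, and the article titles are placeholders.

```php
<?php

define('CHUNK_SIZE', 10);    // same value as in the program

// Split the article list into groups of at most CHUNK_SIZE titles, keeping
// the original indexes so that results can be matched back to the list.
// article_chunks() is a hypothetical helper name.
function article_chunks(array $articles, $chunk_size = CHUNK_SIZE) {
    return array_chunk($articles, $chunk_size, true);
}

$articles = array();

for ($i = 1; $i <= 34; $i++)
    $articles[] = "Article {$i}";    // placeholder titles

$chunks = article_chunks($articles);

foreach ($chunks as $number => $chunk)
    echo 'chunk ', $number + 1, ': ', count($chunk), " articles\n";
```

With 34 articles this gives four chunks, so there are three pauses between chunks, which is consistent with the three hash marks in the "Fetching statistics data" line above.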