Apache/cPanel Log Analysis using MySQL

Every once in a while I need to analyze log files from a web server. Sometimes people ask my what is sucking up all the bandwidth on their hosting plan, other times bots are driving them crazy so we try to identify easy ones to block. Sometimes we want to know how many times a certain URL, or group of URLs was loaded. The logs I work with usually come from LAMP servers running cPanel on a shared host but the techniques below are adaptable to any format by adjusting the import query.

Tools:

I usually use Sequel Pro to run queries. Even when writing the queries myself I prefer a GUI to the command line. It is much easier to inspect tables visually and my query history is saved. That said, we’re working with plain SQL here, (maybe with a bit of MySQL flavour), so any MySQL client will work.

You also need a MySQL server to process the queries. I use the same one as I use for development on my laptop, if you use MAMP, WAMP, or any other AMP you should have access to a local MySQL server.

Part 1: Setup

Before analyzing the data it must be imported, and before importing data there needs to be a table to put it in. Here is the CREATE TABLE statement I use:

CREATE TABLE `analysis_job` (
`ip` varchar(16) DEFAULT NULL,
`clientid` varchar(128) DEFAULT NULL,
`userid` varchar(128) DEFAULT NULL,
`timestamp` varchar(128) DEFAULT NULL,
`offset` varchar(128) DEFAULT NULL,
`request` varchar(250) DEFAULT NULL,
`status` int(128) DEFAULT '0',
`size` int(128) DEFAULT '0',
`referrer` varchar(250) DEFAULT NULL,
`useragent` varchar(250) DEFAULT NULL,
`ts` datetime DEFAULT NULL,
`uts` bigint(10) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

A few of these columns may need more explanation:

  • clientid: something Apache creates, almost alway blank
  • userid: something else Apache creates almost always blank
  • request: the contents of the HTTP request, includes the HTTP method, (usually GET or POST), the URL, and the HTTP version
  • status: the HTTP Status code the server sent
  • size: the size of the response. I believe this is in bytes.
  • ts: A timestamp with better formatting than the timestamp column to make it easier to do date operations, (we’ll compute this after importing the data)
  • uts: A Unix Timestamp representation of the ts column, handy for sorting.

Each line of the log file is a single HTTP request. The import statement tells the MySQL server that the fields are enclosed by double quotes and delimited by spaces, from left to right. This means that the timestamp gets broken into two columns, hence the timestamp and offset columns in the CREATE TABLE statement above.

And here’s a sample log entry for a request from Bingbot, (this is a single line from the log file, it is wrapped to multiple lines here):

127.0.0.1 - - [30/Nov/2014:04:00:45 -0800] "GET /directory/page.html HTTP/1.1" 200 9499 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

For this request the column values are:

ip:
127.0.0.1
clientid:
-
userid:
-
timestamp:
[30/Nov/2014:04:00:45
offset:
-0800]
request:
GET /directory/page.html HTTP/1.1
status:
200
size:
9499
referrer:
-
useragent:
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

The WHM/cPanel based servers that I have encountered always use this log format, but if the server you are working with is different you will have to modify the CREATE TABLE statement to match it.

Part 2: Importing Data

With the table set up it’s fairly quick & easy to import the data:

LOAD DATA LOCAL INFILE '/path/to/logfile/logfile'
INTO TABLE analysis_job
FIELDS TERMINATED BY ' ' ENCLOSED BY '"' LINES TERMINATED BY "\n";

This loads everything into the table. Now we have to extract an actual date from the timestamp column and turn it into a datetime for the ts column using MySQL’s STR_TO_DATE() function:

# to make the string into an actual datetime
UPDATE analysis_job SET ts = STR_TO_DATE(timestamp, '[%d/%b/%Y:%H:%i:%s');

Now there’s a small problem to solve: If the system timezone setting on the server is different from the timezone setting in MySQL on whatever computer you’re using as a database server, and if you want to be able to ask “what happened today in my timezone?” we need to shift the content of the ts column to match your timezone. You can see the UTC offset of the server by looking at the offset column. In my case the original server is on Pacific Time, and I’m on Eastern Time, so we need to add 3 hours to ts:

# adjust timezone
UPDATE analysis_job SET ts = DATE_ADD(ts, INTERVAL 3 HOUR);

If you want to have the Unix Timestamp version of ts available to you then you need to set it:

UPDATE analysis_job SET uts = UNIX_TIMESTAMP( ts );

Part 3: Analysis

Here’s the fun part. If you’re good with the ORDER BY and GROUP BY clauses of SQL, and their corresponding functions, you can find all sorts of interesting stuff. Here are some examples that I’ve used:

Find the total bandwidth used during the time your log file spans:

SELECT SUM(size) as totalsize, (SUM(size) / 1048576) as MB FROM analysis_job;

Or for the last week, (change the WHERE clause for different timeframes):

SELECT SUM(size) as totalsize, (SUM(size) / 1048576) as MB FROM analysis_job WHERE ts > DATE_SUB( NOW(), INTERVAL 1 WEEK );

Find the requests, (and therefore files), that use the most bandwidth:

SELECT request, SUM(size) as totalsize, (SUM(size) / 1048576) as MB FROM analysis_job GROUP BY request ORDER BY totalsize Desc;

Do you suspect there’s one computer out there hammering your site? Find the IP that’s using the most bandwidth:

SELECT ip, SUM(size) as totalsize, (SUM(size) / 1048576) as MB FROM analysis_job GROUP BY ip ORDER BY totalsize Desc;

Or the User-Agent:

SELECT useragent, SUM(size) as totalsize, (SUM(size) / 1048576) as MB FROM analysis_job GROUP BY useragent ORDER BY totalsize Desc;

Who is hammering you with the most requests?

SELECT ip, COUNT(*) as requests FROM analysis_job GROUP BY ip ORDER BY requests Desc;

Again, group by User-Agent to find the program, (assuming the User-Agent isn’t spoofed):

SELECT user agent, COUNT(*) as requests FROM analysis_job GROUP BY user agent ORDER BY requests Desc;

Any of these can be restricted by date, like we did with bandwidth above.

You can also zoom in on one User-Agent, for example:

SELECT * FROM analysis_job WHERE useragent='Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 6.0)';

Or certain file types:

SELECT * FROM analysis_job WHERE request LIKE '%.css%'; # CSS
SELECT * FROM analysis_job WHERE request LIKE '%.html%'; # HTML

Mix it up for detailed analysis. This finds the IPs that have sent the most hits in the past 5 hours that don’t declare themselves as bots and are accessing normal parts of the website, (not cPanel, stats packages, robots.txt, etc):

SELECT COUNT(*) AS cnt, ip FROM analysis_job
WHERE
ts > DATE_SUB(NOW(), INTERVAL 5 HOUR)
AND useragent NOT LIKE '%bot%'
AND useragent NOT LIKE '%spider%'
AND useragent NOT LIKE 'Mediapartners%'
AND request NOT LIKE '%/api/1/%'
AND request NOT LIKE '%/media/%'
AND request NOT LIKE '%/pub/%'
AND request NOT LIKE '%cpanel/%'
AND request NOT LIKE '%mint/%'
AND request NOT LIKE '%/robots.txt%'
GROUP BY ip
ORDER BY cnt Desc;

Part 4: Other Considerations

While this data can help find interesting traffic patterns and trends it is important to remember that IP addresses can be spoofed or proxied and User-Agents can be set to whatever a client wants so those values cannot be entirely trusted.

In addition, if you use CloudFlare, the IP address will almost always be the IP of a CloudFlare server unless your host has mod_cloudflare, or some equivalent, installed.

Even with these caveats server logs are an untapped information mine, and having a chunk of time in a database table makes it much easier to dig in that mine.

Really Purging a YouTube Iframe in Internet Explorer

I’ve been working on a website where we display YouTube videos in a modal overlay. This seems simple: absolutely position a div and centre YouTube’s iframe embed within it. When the user tires of the video remove the iframe and hide or remove the div. It works great, then I fired up Internet Explorer for testing.

In IE this is what appears in the Javascript console,  after the iframe is removed:

Errors in the IE Javascript Error Console.

These errors appear after html5player.js shouldn’t exist anymore.

With the iframe gone html5player.js and its code shouldn’t  exist anymore and therefore should not throw errors. It turns out that when an iframe is removed from a web page IE doesn’t completely purge the contents of the iframe from memory right away, if ever.

How do we fix it?

The errors occurred when using jQuery to set the innerHTML of the containing div to an empty string, so I started trying to explicitly remove the iframe’s DOM node using pure Javascript. Element.removeChild() the node instead of destroying it, so that didn’t kill the errors. The docs for IE’s removeNode() don’t specify if the node is held in memory, but the errors continued with it as well.

Some Googling turned up, (I forget where), that setting the iframe’s src attribute to an empty string might help. This makes sense because it’s the equivalent of navigating the iframe to a new page, which should unload any Javascript. Trying to set src=” worked, most of the time. When it didn’t work IE went crazy and became unusable.

In the end I took the approach of trying to wipe out the contents of the iframe as thoroughly as possible, first by changing the iframe’s src to an empty string, then by accessing the iframe’s DOM with Javascript and setting its document element to null. This did the trick. Here’s the function:

// note: modal is the div that contains the YouTube iframe
// jquery is used for some selecting, but the guts of the removal are pure Javascript

function purgeVideo() {
    // purge video must be called while the video is still displayed on the page, 
    // so doing a jQuery fadeOut(purgeVideo) may throw errors, (I think because 
    // fadeOut sets display:none before calling its callback).

    if( modal ) {
        // get the iframe, declare variables
        var ifr = $('iframe', modal), rifr;
        
        // if the iframe is found get the underlying DOM node from jQuery
        if ( ifr.length > 0 ) {
            rifr = ifr.get(0);
        }

        // make extra sure the DOM node exists before starting to work on it.
        if( rifr ) {
            // set src to '' to start removing YouTube and get around cross-domain access restrictions
            rifr.src = '';

            // now set the document element of the iframe to null, (I think this is IE-specific)
            if ( rifr.documentWindow ) {
                rifr.documentWindow.document = null;
                rifr.documentWindow = null;
            }

            // delete the iframe, (also removes it from DOM).
            delete rifr;
        }

        // this won't throw an error, even if the iframe is gone, (we're back to jQuery here).
        ifr.remove();

        // remove anything else, (titles, etc), from the modal overlay
        modal.html('');
    }

}

Xdebug is not friends with WordPress + create_function()

If you use Xdebug for debugging WordPress-based sites there’s something you should be aware of. If function created with PHP’s create_function() are hooked to WordPress actions or filters, any time $wp_filters is in the scope Xdebug sends invalid XML to the Xdebug client, (like your IDE). If the Xdebug client doesn’t deal with the invalid characters before attempting to parse the XML it will fail. IDEs deal with the parsing failure in different ways, SublimeTextXdebug doesn’t show the call stack or list of current variables but doesn’t crash entirely so debugging can continue. I don’t know how other IDEs handle the failure.

Why the XML is Invalid

create_function() makes a new function, gives it a random name that starts with a null character, and returns that function name as a string. When that string is passed to WordPress’s add_filter() or add_action() functions the name that create_function() returned is used as an array key in deep within the $wp_filters global. When Xdebug sends the list of variables back to an Xdebug client it is sent as XML and the null character encoded as � which is illegal in XML.

How to solve the Problem

Ideally Xdebug would only send valid XML to the Xdebug client, but the bug report has been open for a year and doesn’t seem to be high priority so developers should solve the problem by avoiding create_function()

Avoiding create_function() is Good Coding

create_function() eval()‘s the code that’s passed as the second argument, and eval() is something to be avoided, so create_function() is too. This Xdebug bug makes us avoid create_function() for actions and filters in order to keep Xdebug useful, with the side effect of making our code a bit more secure.

Adding to a WordPress page’s URL without changing the URL of its Child Pages.

I was recently asked to add an extra path component to the URL of a WordPress page for SEO reasons. It took some diving into the internals of WordPress. Here’s the setup:

  • WordPress is installed in the site root, at example.com
  • We are using a static homepage
  • The company blog is at example.com/blog/
  • I was asked to make the blog URL be example.com/blog/some-extra-keywords/ without changing the URL of the current and future blog posts, which are currently example.com/blog/post-name/

There are 3 things required to make this work:

  1. A filter to tell WordPress to process requests for /blog/some-extra-keywords/ as if they are for /blog/
  2. A filter to change permalinks from /blog/ to /blog/some-extra-keywords/ (Optional, but recommended).
  3. A redirect from /blog/ to /blog/some-extra-keywords/ (Also optional, but also recommended).

Treat /blog/some-extra-keywords/ requests like /blog/

We don’t want to change the URL of the blog page in WordPress’s Edit Page screen because that will change the URL of past and future all blog posts. We need to tell WordPress that a request for /blog/some-extra-keywords/ is a request for /blog/. This is done by adding a filter to WordPress’s request filter hook:


/**
* Make the URL /blog/some-extra-keywords/ respond as /blog/
* would.
* @param array $request The request array that WP generates.
* @return array The modified request array.
*/
function jb_filter_blog_url_request( $request ) {

if ( isset( $request['name'] ) && 'some-extra-keywords' == $request['name'] ) {
// it's a page we're requesting. If you're doing this to something other
// than a page, try setting $request['name'], and don't unset it 2 lines
// later
$request['pagename'] = 'blog';
unset( $request['name'] );
}

return $request;
}
add_filter( 'request', 'jb_filter_blog_url_request' );

Since WordPress identifies pieces of content by the last part of the URL, (the “name”), that’s all we have to change. In this case, because the blog is on a page, we had to set $request['pagename'].

Note: This is optional, but recommended. If you don’t do this you must do the next item, (redirect /blog/ to /blog/some-extra-keywords/), but you get the most SEO benefit if you do both.

With #1 it’s time to change links to the blog page to point to the new URL. The old URL still works, and you’ll be redirecting it in Step 3, but modifying the links cuts out a round-trip to the server, making your site faster, and it means that anytime a spider crawls the site it will find the correct URL.

Again, we create a filter function and hook it up. Because I’m doing this all to a page, I’m hooking on to the page_link filter, but the function also works with the post_link filter. In fact, I originally used the post_link filter, which doesn’t work for pages, and spent a while banging my head against a wall trying to figure out why it didn’t work.

By hooking into page_link or post_link you’ll modify the URL everywhere it is generated with get_permalink(), which is almost everywhere. Menus, lists of pages, and even the XML Sitemap that Yoast’s WordPress SEO plugin makes will be have the new URL. If you have typed the URL somewhere this filter will not change it, but that’s why you’ll set up a redirect in Step 3.

The code:

/**
* Modifies the blog URL when it's requested using get_permalink().
*
* Note that this filter is set up to work on both posts & pages, and the
* post_link and page_link filters don't pass quite the same parameters:
* - $post is an post object for post_link and a post ID for page_link, so we
* get the post object if $post is numeric
* - $dontuse is completely different between the two filters, but not needed,
* so we ignore it.
*
* @param string $url The URL to be filtered.
* @param mixed $post The post ID or post object that corresponds to $url
* @param mixed $leavename Differs depending on filter. Ignore.
* @return string The possibly-modified URL.
*/
function jb_modify_blog_url( $url, $post, $leavename ) {
$true = false;

// don't do it in the admin, I'm afraid the modified URL will get
// added to the URL slug field on the Edit Page screen, and get
// permanently added, with another copy of it being added every time
// the page is saved.
if( ! is_admin() ) {

if ( is_numeric( $post ) ) {
$post = get_post( $post );
}

if( 'blog' == $post->post_name ) {
$url .= 'some-extra-keywords/';
}
}
return $url;
}
add_filter( 'page_link', 'jb_modify_blog_url', 10, 3 );

The magic is in the third if condition: If the page name is ‘blog’ add the extra keywords to the URL.

Redirect from /blog/ to /blog/some-extra-keywords/

Note: This is optional, as well, but if you don’t do it the blog homepage will be available at both /blog and /blog/some-extra-keywords/ which could lead to a duplicate content penalty from Google.

We’re going to do a 301 Redirect to tell search engines that the blog is now at /blog/some-extra-keywords/. I’m going to do this in a .htaccess file because that way the web server doesn’t have to start PHP or WordPress.

If you’ve already got a rewrite section in your .htaccess file, add the following line to it:

RewriteRule ^blog/?$ /blog/some-extra-keywords/ [R=301,L]

This will redirect both /blog/ and /blog, (the first ? makes the slash right before it optional), to /blog/some-extra-keywords/. If you don’t have a rewrite section, add one:

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /

RewriteRule ^blog/?$ /blog/whole-house-fan-energy-saving-tips/ [R=301,L]

</IfModule>

And that’s it.

Improving Versioning of WordPress Static Files

I was recently asked to improve the PageSpeed and YSlow scores of a WordPress-based website that I work on. One of the items that PageSpeed checks is "Don't include a query string in the URL for static resources," (under the "Leverage Proxy Caching" heading). I suspect this is most important when you are using a CDN, or otherwise run through a proxy server. WordPress puts a version in the query string of any JavaScript or CSS that's enqueued using wp_enqueue_script() or wp_enqueue_style().

Interestingly, PageSpeed doesn't seem to complain about the WordPress-included static resources that have a query string on the URL, only about static resources on 3rd-party URLs. I'm not sure why this is, but let's remove the query strings from the WordPress-included static resources anyway, while still maintaining a version somewhere in the URL so we can set far-future Expires headers.

The format of a WordPress version tag is ver={version number}, like this: example.com/wp-content/themes/theme-name/style.css?ver=2.0.5.4, but example.com/ver2.0.5.4/wp-content/themes/theme-name/style.css should be more cacheable. There are two steps to making this happen:

1) Use Apache's mod_rewrite to Serve the Correct File to the Updated URL

In step 2 we'll be moving the version tag to the beginning of the URL, we only need one RewriteRule to make it work:


# This goes in your .htaccess file, before the WordPress section
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /

## Cache-Busting ##
RewriteRule ^ver([\d\.\-a-zA-Z]+)/(.*) $2?ver=$1 [L] #match something with the version tag right at the beginning of the URL
## Done cache-busting ##

</IfModule>

The rule above matches any URL that starts with "ver" followed by any number of letters, numbers, periods, dashes, and underscores, captures that collection of characters up to the first forward slash, then it captures everything after the first forward slash. The URL is rewritten as $2?ver=$1 where $2 is everything after the version tag, (the original URL, with no version tag), and $1 is the version number that's added, internally, to the query string, but never appears in the browser.

Note: It is probably possible to leave off the ?ver=$1, but I haven't tried it yet.

Now example.com/wp-content/themes/theme-name/style.css?ver=2.0.5.4 and example.com/ver2.0.5.4/wp-content/themes/theme-name/style.css should both serve up the same style.css file.

2) Filter the URLs for Enqueued Scripts & Styles to Move the Version Tag from the Querystring into the Path

The URL for every enqueued Javascript is filtered by the script_loader_src filter, and the URL for every enqueued stylesheet is filtered by the style_loader_src filter, so we'll use these filters to put the new, improved URL in the HTML. This is an absolute URL, so we can take it apart, move the parts around so that it looks how we like. Here's the function I'm using to change the URLs, plus two lines to hook on to the filters:


/**
* Changes WordPress's default versioned URLs from putting the verion in the
* querystring to putting the version as part of the file page, (requires
* .htaccess modification to work). If there's no version in the querystring
* one can be added by setting the $defaultversion.
* @param string $url The URL to change the location of the version.
* @param mixed $defaultversion A string default verion. Defaults to false, (
* no default version).
* @return string The new URL.
*/
function jb_better_versioned_url( $url, $defaultversion = false ) {

// don't bother for the admin
if ( ! is_admin() ) {

// parse the URL
$parts = parse_url( $url );

// and the querystring
$query = wp_parse_args( $parts['query'], array() );

// check if there's a version in the querystring. If so, do more.
if ( isset( $query['ver'] ) || false !== $defaultversion ) {

if ( isset( $query['ver'] ) ) {
// prepend "/ver" + {version value} to the *path*
$parts['path'] = '/ver' . $query['ver'] . $parts['path'];

// unset the version in the querystring, since it's in the path now.
unset( $query['ver'] );
} else {
$parts['path'] = '/ver' . $defaultversion . $parts['path'];
}

// if this is a PHP file just let it be.
if( ! preg_match( '~\.php$~', $parts['path'] ) ) {

// start rebuilding the URL string
$url = $parts['scheme'] . '://' . $parts['host'] . $parts['path'];

// see if there's still anything in the query
if ( count( $query ) > 0 ) {

// rebuild query with whatever remains
$parts['query'] = build_query( $query );

// append it to the URL string
$url .= '?' . $parts['query'];
}
}
}
}

return $url;

}

// hook onto the filters
add_filter( 'style_loader_src', 'jb_better_versioned_url' );
add_filter( 'script_loader_src', 'jb_better_versioned_url' );

With these two modifications all of your JavaScript & CSS will have the version in the path instead of the querystring. In addition, any images, fonts, or anything else referenced with relative URLs from your stylesheets will have the version in the beginning of their URL, which will be rewritten and the file served properly. This way if you ever have caching problems you can bump a version number and browsers will pull new versions of all your static resources.

A Word of Warning

This involves considerable coding acrobatics, and adds complexity to your WordPress-based site, for speed improvements that may not be very large, and may not exist at all, so use this method at your own risk.

I have this code in production, but would love to hear the opinions of others on it, that's what the comments are for, or Twitter, or elsewhere - just let me know where the technique is discussed.