Posts two days in a row. Why?

After many months of silence this is the second day in a row that a post has appeared here. Why?

The story starts on May 1 when a tweet by Tim Ferriss about Coach.me, caught my eye, (he’s an investor). I downloaded the app and am trying the “Habit of Waking Up On Time” challenge. One of the suggestions is to replace the habit of sleeping in or hitting snooze with a new habit, which brings me to John Saddington, the programmer behind Desk app. In November John posted 10 Days to a Better Blog! — a series of exercises to get people blogging. I’ve had it bookmarked, and blogging exercises are much more productive than sleeping, so I’m working through the exercises when I wake up. Today’s exercise is to post about why I write. I’ve read ahead and some of them are a bit navel-gazing, like this one.

So why do I write?

I write to avoid forgetting. There’s a sweet spot where I use a technique a few times a year, but not all the time, and it requires some almost-repeated code, often SQL. I like to get those things into a blog post so the queries are easy to find, such as analyzing web server logs in MySQL. Sometimes I solve difficult problems and I don’t want to forget how. My posts on transferring a Boot Camp partition to a new hard drive and transcoding live video streams from IP cameras are examples of these.

There’s a second reason I write: to rant, hopefully constructively. This doesn’t happen often, but it has, this is a blog after all.

The third reason I write is to try to refine ideas. This kind of writing rarely ends up anywhere public, but writing, then editing, helps me figure out things I’m having problems with, from code to business to gardening and beyond.

PhpStorm Review

Late last year I kept hearing a lot about PhpStorm 8, especially from WordPress people and wanted to try it out but the price was stopping me, (it turns out the price is probably worth paying – but it’s hard to know if the price is worth paying until you’ve already paid the price). Luckily for me WP e-Commerce has some licenses for core contributors, and apparently I count, (JetBrains, the people who make PhpStorm, provide licenses to open-source projects at no charge – thanks!).

One of the things that JetBrains asked in return for the licenses was that we try to write reviews of our experience, and after many months of using PhpStorm, here are my thoughts.

PhpStorm crashed several times in the 24 hours after I installed it, not a good start. Support suggested installing the EAP version, which is like a beta stream, which fixed the problem. Since then I’ve switched back & forth between the current release and EAP versions of PhpStorm, but I usually use the EAP, because I like shiny things.

Performance

I had heard that PhpStorm was fast, but it’s a Java app, (like Netbeans or Eclipse), so I had my doubts. When first installed there were some pretty big performance issues, especially related to scrolling. PhpStorm was exhibiting what I refer to as “that Java scroll lag” – one of the reasons I tend to avoid Java apps. There was also a problem that when scrolling with the mouse the text would disappear, which makes mouse scrolling nearly useless. Eventually I learned that there’s a version of PhpStorm for OS X that comes with a bundled, tweaked, Java Run-time Environment, and using that version of PhpStorm solved these problems.

When comparing PhpStorm’s performance to Sublime Text, (2 or 3), Sublime Text is still way faster. PhpStorm feels like a normal app, but ST3 has that amazing snappiness that Sublime Text users have come to love, and want to see everywhere. While PhpStorm’s performance is satisfactory, it’s not as blazing fast as the chatter led me to believe, but the combination of decent speed, and the other benefits that of using a real IDE, mean it helps me write cleaner code faster.

Setup

Like most coding programs, especially Java ones, PhpStorm hurt my eyes out of the box, but the look is tweakable enough to make it look how I want. There are two built-in themes, (“Default” and “Darcula”), and you can load 3rd-party colour schemes for the text editor. I think it might be possible to load 3rd-party themes as well but I haven’t investigated this. Switching to the Darcula theme, and switching to the Predawn colour scheme, (I can’t remember where I found the PhpStorm port, maybe it even came with it), and Inconsolata made PhpStorm feel like home.

Coming from ST3 I re-mapped some keyboard shortcuts to match ST3 make the transition easier and make it easier to switch between the two when needed.

I had to download extra drivers to connect to MySQL, which was kind of annoying, but I suspect might be license-related.

And Java. PhpStorm 8 requires Java 6, which is no longer installed on Macs because it’s so old, that’s why there’s a version offered for OS X with a bundled JRE. It’s possible to install Java 6 with a package from Apple, because we all love having several versions of Java installed, or it’s possible to tweak some plist files to let PhpStorm run with Java 8, (the current version). Fortunately, it looks like PhpStorm 9, (which I’m running the EAP of now), will do away with this requirement.

Code Inspection

This is an IDE, so it’s aware of any databases that get used. This is great most of the time, (the code inspector tells you if you mis-type a column name, for example), but the it also fusses when I build big SQL queries by concatenating strings. So far I’ve just ignored these complaints.

If I include a file like this in WP: include( get_stylesheet_directory() . ‘/myfile.php’); PhpStorm thinks that myfile.php doesn’t exist, even if it does. Since this is the “right way” to include files in WP themes, and PhpStorm advertises built-in WP knowledge, PhpStorm should understand it.

When making a commit to your preferred Source-Control system, (Git, right?), PhpStorm inspects the code you’re committing, and if there are errors or warnings it pops up a window saying so, and gives you a chance to fix these problems. This is a great way to slowly clean up the code of older projects with lots of legacy, (read: “written before I knew better”), code. However, it’s also kind of annoying when unfixable errors or warnings are included, like when I’ve concatenated together an SQL query.

I was working on a plugin that works with WooCommerce and PhpStorm complained that I was assigning the return value of a function that returns void to a variable. It turns out that WooCommerce has several PHP DocBlocks that state a function returns void when it actually returns a value. Thanks to a WP e-Commerce copy of PhpStorm for pointing out errors in WooCommerce!

Annoyances

If someone from Jetbrains reads this, consider these bug reports!

text-gone-bonkers

Sometimes the fonts go bonkers. When I was first setting PhpStorm to use Inconsolata they did, (that was using the release version), and a few times on the EAP I committed some code, and when the commit finished the display went crazy again.

PHPStorm Power Consumption

Power consumption – I’m not sure how it compares to Sublime Text, but it’s consuming more power than Photoshop today – #2 on the list behind Safari with 11 tabs open.

 

The first time I used the “Pull Up” refactor tool, which lets you pull methods up from a class into the parent class, I pulled up a bunch of static methods. PhpStorm didn’t actually go and find the places where the methods were called and change the calls, which I thought was the point of the “smart” IDE. Inheritance would mean that these methods would still be available, but there’s no reason to load the child class if everything I need is in the parent.

The auto-complete can be really bizarre. Often doesn’t pick the logical thing, or even the first thing in the suggestion list. For example, if I type “col” then press tab or enter autocomplete completes as “columns” not “color,” even though color is the first suggestion. It even insists on completing as “columns” if I type “colo” and press tab/enter. This drives me mad.

Search for any file, (I’ve got it mapped to Cmd+P to match Sublime Text): If a project has several files with the same name there doesn’t seem to be mapping that makes sense. To me it would make sense for the first option to be the one I used the most recently, or the one “closest” to whatever file I’m already looking at in the directory structure. If the files are sorted at the moment, it looks like it’s alphabetical, which should change.

Auto-change formatting: I try to use WordPress style in most of my PHP & Javascript – especially for WordPress-related work! WP style suggests putting a space around function parameters in both Javascript and PHP. However, if I’m doing chained calls in Javascript PhpStorm removes the space after the last function argument when I press the . key. For example:

$( 'body' )
// becomes
$( 'body').
// as soon as the period is typed

When PhpStorm 8.0.3 came out I started getting SSL error messages on startup. Apparently there’s a problem in some SSL-related part of the bundled JRE. 8.0.3 came out on February 13 and there’s still no fix. The current EAP doesn’t have this problem.

PhpStorm’s licensing scheme for open-source projects has changed this year, so I need a Jetbrains account, the licence is issued from WP e-Commerce to that account, then I’m supposed to sign in to the account with PhpStorm and it’ll activate. Unfortunately the SSL bug prevents PhpStorm from talking to the Jetbrains servers, so I can’t activate my PhpStorm license. Luckily the EAP comes with its own license, otherwise I would be out of luck. A bug that prevents customers from activating software is a very big bug.

Conclusion

PhpStorm is good, very good at some things, especially the IDE-type functionality I was looking for. Integrated code inspections & linting were a welcome surprise that is really helping my code look better and encouraging me to write better in-line documentation. There have been problems, though, that might have caused me to abandon PhpStorm I hadn’t pre-decided to give it a real chance. I nearly always use the EAP both because I like shiny things, and because the bugs in the release version of the software are annoying, or serious, enough to drive me into the arms of the EAP.

So, is PhpStorm worth the purchase price? Probably. It handles code completion, data-source awareness, most refactoring, and all its IDE-related functionality quickly and as promised, (or very nearly so). The problems tend to be in other areas, and from the state of the current PhpStorm 9 EAP, most of these problems will be solved soon.

Have Confidence in Algorithms

On Christmas Eve Eric Meyer wrote about Facebook’s Year in Review in a post titled Inadvertent Algorithmic Cruelty. Eric’s daughter died on her sixth birthday this year, and when Facebook suggested he share his Year in Review it featured photo of his daughter and the caption “It’s been a great year.” When he tried to dismiss the prompt the Facebook app crashed.

This is a huge oversight Facebook’s part. Some percentage of their over one billion users experience tragedy every year and shared a relevant photo on Facebook. By choosing the photo that generated the most responses as each user’s cover story it is highly likely that photos from tragic moments appear as the Year in Review prompt. Reading the comments on Eric’s post and on the mainstream media stories about his experience show that there are plenty of people who had a similar experience.

Other comments and responses show a lack of confidence that Facebook, and computer science in general, can do better. Comments say things like “the algorithmic approach… is almost certain to kick us at the end of a bad year” or a comment on the follow-up post: “the year in review algorithm brings things to the fore without understanding. Only humans can do that and the negative aspects only exist in the edge cases.” This is wrong. Facebook can do, and should have done, better, as can any algorithm-based system. In this case eliminating posts with a comment containing the text “for your loss” would have removed many, probably most, of the painful posts from the year in review. Surely Facebook is capable of doing sentiment analysis on each user’s year of posts then making a choice to prompt that user to create a Year in Review or not. If they want to really help maybe they can do something to help or console people who had a terrible year. After the media exposure this year I am sure that Facebook will make changes and avoid this problem next year, they may have fixed it already.

We can do better. When designing any system that will interact with people we can consider those in pain and try, at a minimum, to avoid deepening it. It’s not something that we have done well so far, but Eric is trying to change that, and so should we all.

Apache/cPanel Log Analysis using MySQL

Every once in a while I need to analyze log files from a web server. Sometimes people ask my what is sucking up all the bandwidth on their hosting plan, other times bots are driving them crazy so we try to identify easy ones to block. Sometimes we want to know how many times a certain URL, or group of URLs was loaded. The logs I work with usually come from LAMP servers running cPanel on a shared host but the techniques below are adaptable to any format by adjusting the import query.

Tools:

I usually use Sequel Pro to run queries. Even when writing the queries myself I prefer a GUI to the command line. It is much easier to inspect tables visually and my query history is saved. That said, we’re working with plain SQL here, (maybe with a bit of MySQL flavour), so any MySQL client will work.

You also need a MySQL server to process the queries. I use the same one as I use for development on my laptop, if you use MAMP, WAMP, or any other AMP you should have access to a local MySQL server.

Part 1: Setup

Before analyzing the data it must be imported, and before importing data there needs to be a table to put it in. Here is the CREATE TABLE statement I use:

 CREATE TABLE `analysis_job` (
`ip` varchar(16) DEFAULT NULL,
`clientid` varchar(128) DEFAULT NULL,
`userid` varchar(128) DEFAULT NULL,
`timestamp` varchar(128) DEFAULT NULL,
`offset` varchar(128) DEFAULT NULL,
`request` varchar(250) DEFAULT NULL,
`status` int(128) DEFAULT '0',
`size` int(128) DEFAULT '0',
`referrer` varchar(250) DEFAULT NULL,
`useragent` varchar(250) DEFAULT NULL,
`ts` datetime DEFAULT NULL,
`uts` bigint(10) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

A few of these columns may need more explanation:

  • clientid: something Apache creates, almost alway blank
  • userid: something else Apache creates almost always blank
  • request: the contents of the HTTP request, includes the HTTP method, (usually GET or POST), the URL, and the HTTP version
  • status: the HTTP Status code the server sent
  • size: the size of the response. I believe this is in bytes.
  • ts: A timestamp with better formatting than the timestamp column to make it easier to do date operations, (we’ll compute this after importing the data)
  • uts: A Unix Timestamp representation of the ts column, handy for sorting.

Each line of the log file is a single HTTP request. The import statement tells the MySQL server that the fields are enclosed by double quotes and delimited by spaces, from left to right. This means that the timestamp gets broken into two columns, hence the timestamp and offset columns in the CREATE TABLE statement above.

And here’s a sample log entry for a request from Bingbot, (this is a single line from the log file, it is wrapped to multiple lines here):

127.0.0.1 - - [30/Nov/2014:04:00:45 -0800] "GET /directory/page.html HTTP/1.1" 200 9499 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

For this request the column values are:

ip:
127.0.0.1
clientid:
userid:
timestamp:
[30/Nov/2014:04:00:45
offset:
-0800]
request:
GET /directory/page.html HTTP/1.1
status:
200
size:
9499
referrer:
useragent:
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

The WHM/cPanel based servers that I have encountered always use this log format, but if the server you are working with is different you will have to modify the CREATE TABLE statement to match it.

Part 2: Importing Data

With the table set up it’s fairly quick & easy to import the data:

#Import data
LOAD DATA LOCAL INFILE '/path/to/logfile/logfile' 
INTO TABLE analysis_job
FIELDS TERMINATED BY ' ' ENCLOSED BY '"' LINES TERMINATED BY "\n";

This loads everything into the table. Now we have to extract an actual date from the timestamp column and turn it into a datetime for the ts column using MySQL’s STR_TO_DATE() function:

# to make the string into an actual datetime
UPDATE analysis_job SET ts = STR_TO_DATE(timestamp, '[%d/%b/%Y:%H:%i:%s');

Now there’s a small problem to solve: If the system timezone setting on the server is different from the timezone setting in MySQL on whatever computer you’re using as a database server, and if you want to be able to ask “what happened today in my timezone?” we need to shift the content of the ts column to match your timezone. You can see the UTC offset of the server by looking at the offset column. In my case the original server is on Pacific Time, and I’m on Eastern Time, so we need to add 3 hours to ts:

# adjust timezone
UPDATE analysis_job SET ts = DATE_ADD(ts, INTERVAL 3 HOUR);

If you want to have the Unix Timestamp version of ts available to you then you need to set it:

# Set unix timestamp column
UPDATE analysis_job SET uts = UNIX_TIMESTAMP( ts );

Part 3: Analysis

Here’s the fun part. If you’re good with the ORDER BY and GROUP BY clauses of SQL, and their corresponding functions, you can find all sorts of interesting stuff. Here are some examples that I’ve used:

Find the total bandwidth used during the time your log file spans:

SELECT SUM(size) as totalsize, (SUM(size) / 1048576) as MB FROM analysis_job;

Or for the last week, (change the WHERE clause for different timeframes):

SELECT SUM(size) as totalsize, (SUM(size) / 1048576) as MB FROM analysis_job WHERE ts > DATE_SUB( NOW(), INTERVAL 1 WEEK );

Find the requests, (and therefore files), that use the most bandwidth:

SELECT request, SUM(size) as totalsize, (SUM(size) / 1048576) as MB FROM analysis_job GROUP BY request ORDER BY totalsize Desc;

Do you suspect there’s one computer out there hammering your site? Find the IP that’s using the most bandwidth:

SELECT ip, SUM(size) as totalsize, (SUM(size) / 1048576) as MB FROM analysis_job GROUP BY ip ORDER BY totalsize Desc;

Or the User-Agent:

SELECT useragent, SUM(size) as totalsize, (SUM(size) / 1048576) as MB FROM analysis_job GROUP BY useragent ORDER BY totalsize Desc;

Who is hammering you with the most requests?

SELECT ip, COUNT(*) as requests FROM analysis_job GROUP BY ip ORDER BY requests Desc;

Again, group by User-Agent to find the program, (assuming the User-Agent isn’t spoofed):

SELECT user agent, COUNT(*) as requests FROM analysis_job GROUP BY user agent ORDER BY requests Desc;

Any of these can be restricted by date, like we did with bandwidth above.

You can also zoom in on one User-Agent, for example:

SELECT * FROM analysis_job WHERE useragent='Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 6.0)';

Or certain file types:

SELECT * FROM analysis_job WHERE request LIKE '%.css%'; # CSS
SELECT * FROM analysis_job WHERE request LIKE '%.html%'; # HTML

Mix it up for detailed analysis. This finds the IPs that have sent the most hits in the past 5 hours that don’t declare themselves as bots and are accessing normal parts of the website, (not cPanel, stats packages, robots.txt, etc):

# Non-bot IPs that have sent the most hits in the last 5 hours
SELECT COUNT(*) AS cnt, ip FROM analysis_job 
WHERE
ts > DATE_SUB(NOW(), INTERVAL 5 HOUR) 
AND useragent NOT LIKE '%bot%' 
AND useragent NOT LIKE '%spider%' 
AND useragent NOT LIKE 'Mediapartners%' 
AND request NOT LIKE '%/api/1/%' 
AND request NOT LIKE '%/media/%'
AND request NOT LIKE '%/pub/%'
AND request NOT LIKE '%cpanel/%'
AND request NOT LIKE '%mint/%'
AND request NOT LIKE '%/robots.txt%'
GROUP BY ip
ORDER BY cnt Desc;

Part 4: Other Considerations

While this data can help find interesting traffic patterns and trends it is important to remember that IP addresses can be spoofed or proxied and User-Agents can be set to whatever a client wants so those values cannot be entirely trusted.

In addition, if you use CloudFlare, the IP address will almost always be the IP of a CloudFlare server unless your host has mod_cloudflare, or some equivalent, installed.

Even with these caveats server logs are an untapped information mine, and having a chunk of time in a database table makes it much easier to dig in that mine.

Really Purging a YouTube Iframe in Internet Explorer

I’ve been working on a website where we display YouTube videos in a modal overlay. This seems simple: absolutely position a div and centre YouTube’s iframe embed within it. When the user tires of the video remove the iframe and hide or remove the div. It works great, then I fired up Internet Explorer for testing.

In IE this is what appears in the Javascript console,  after the iframe is removed:

Errors in the IE Javascript Error Console.
These errors appear after html5player.js shouldn’t exist anymore.

With the iframe gone html5player.js and its code shouldn’t  exist anymore and therefore should not throw errors. It turns out that when an iframe is removed from a web page IE doesn’t completely purge the contents of the iframe from memory right away, if ever.

How do we fix it?

The errors occurred when using jQuery to set the innerHTML of the containing div to an empty string, so I started trying to explicitly remove the iframe’s DOM node using pure Javascript. Element.removeChild() the node instead of destroying it, so that didn’t kill the errors. The docs for IE’s removeNode() don’t specify if the node is held in memory, but the errors continued with it as well.

Some Googling turned up, (I forget where), that setting the iframe’s src attribute to an empty string might help. This makes sense because it’s the equivalent of navigating the iframe to a new page, which should unload any Javascript. Trying to set src=” worked, most of the time. When it didn’t work IE went crazy and became unusable.

In the end I took the approach of trying to wipe out the contents of the iframe as thoroughly as possible, first by changing the iframe’s src to an empty string, then by accessing the iframe’s DOM with Javascript and setting its document element to null. This did the trick. Here’s the function:

// note: modal is the div that contains the YouTube iframe
// jquery is used for some selecting, but the guts of the removal are pure Javascript

function purgeVideo() {
    // purge video must be called while the video is still displayed on the page, 
    // so doing a jQuery fadeOut(purgeVideo) may throw errors, (I think because 
    // fadeOut sets display:none before calling its callback).

    if( modal ) {
        // get the iframe, declare variables
        var ifr = $('iframe', modal), rifr;
        
        // if the iframe is found get the underlying DOM node from jQuery
        if ( ifr.length > 0 ) {
            rifr = ifr.get(0);
        }

        // make extra sure the DOM node exists before starting to work on it.
        if( rifr ) {
            // set src to '' to start removing YouTube and get around cross-domain access restrictions
            rifr.src = '';

            // now set the document element of the iframe to null, (I think this is IE-specific)
            if ( rifr.documentWindow ) {
                rifr.documentWindow.document = null;
                rifr.documentWindow = null;
            }

            // delete the iframe, (also removes it from DOM).
            delete rifr;
        }

        // this won't throw an error, even if the iframe is gone, (we're back to jQuery here).
        ifr.remove();

        // remove anything else, (titles, etc), from the modal overlay
        modal.html('');
    }

}