You don't even know what you like

Over the past year or so, I noticed that my Twitter usage started dropping off. I felt less compelled to check it regularly, usually didn't bother reading it on my morning walk and found myself "keeping up" with the stream less and less as time progressed.

I figured this was a fairly natural decline - perhaps the novelty had worn off, people had gotten repetitive, the signal-to-noise ratio had gotten worse or I'd just simply ended up following too many uninteresting people out of politeness. I noticed sometimes when I opened twitter on my phone that it had been almost two weeks since I'd last bothered to check it.

A pity, sure, but ultimately not a huge loss to my life - though I did wonder whether this was an endemic problem with the Twitter ecosystem itself, and could end up resulting in an eventual collapse (or decline) of the major value twitter has.

I had noticed that other people (friends, etc) did not seem to be consistently suffering this same issue, particularly more recent adopters, which tended to confirm my theory that it was some sort of time/usage based decline.

Then I upgraded from an iPhone 3G to a new 4S.

Suddenly, twitter was fun - I bothered to check it regularly, I started tweeting, the conversations got more engaging. I became an active twitter user again. The product had gotten so much better - and all because my new/faster phone made it easy and enjoyable to quickly check the stream.

What I thought was a decline in the community and product was actually a performance problem.

I didn't even know why I hadn't liked it, I just voted with my feet (or fingers/whatever) and stopped using the service.

Essentially, the irritation of waiting for the app/stream to load, the pausing and jerkiness in navigation and the general slowness of the product had added up to a poor enough user experience that it simply wasn't worth checking twitter anymore.

Interestingly, the awful mobile experience stopped my usage of twitter on the desktop as well, even though the twitter website is as fast as ever.

For those of you who are building products of your own, it's highly likely that your users don't even know what they like/dislike about your service (I didn't). It's probable that several of the irritating things that stop them from coming back are subtle, and that they're completely unaware of how those things are affecting their usage.

Consequently, good luck getting any user feedback about that.

The worst kind of problems are those that don't look like problems - "Oh, that feature works". Works, yes, but how many irritating little issues does it have? How many times does it make a user go through the same pain point again and again (clicking an extra sort button, navigating a hard-to-hit drop down menu)?

It's easy to not release an app because you haven't finished "major" basic features like password reset, but how many times is a user really going to need to reset their password? How likely is it to cause them to abandon your product?

Stop building the illusion of "core functionality" that will rarely/never be used and optimize your product for the thing it does. Ensure you're making something people want (and I doubt it's a really nice password reset form).

This is yet another of those things that Apple tends to get right. They'll leave seemingly obvious stuff out entirely (remember the lack of MMS in iPhones?) in exchange for real core functionality that is polished. Note that polished is not perfection - Apple are not perfectionists in the "we can't release it until it's perfect" camp - they release often, with many changes. Just look at the litany of iOS bugs and major improvements throughout the versions. Polish is iteration, not delay. Each day your product should be shinier than the last.

This is something we've realised whilst building adioso. Over time, through various feature experiments, it has most definitely developed many niggling issues. We're aware of a lot of them, some easy to solve, some very much not, but there's certainly a whole host of problems that we have no idea about yet.

So how do we track them down, if users don't even know what they like? They may not know, but they will respond to change so that you can learn. Eric Ries can tell you how to do this.

Watch that video, then go forth and iterate until you've made something people want.

 

Microcaching: Speed your app up 250x with no new code

I recently had the opportunity to help some friends out preparing a content site (wordpress) for a fairly hefty traffic hit. It was potentially going to be a big spike (national radio campaign, time sensitive content, etc) and they particularly didn't want it to go down at the critical time.

I put together a fairly typical "fast" PHP architecture: nginx, PHP-FPM, APC, front-end app cluster, load balancer, replicated DB, along with all the mess that comes with it - machine images, replicated filesystem, etc, etc. Additionally, installed/tested the various appropriate Wordpress Super-Hyper Cache Pro Blitzen 2000+ plugins.

After much mucking around, I had an awesome (read: complicated), linearly scalable (read: difficult to manage) app cluster that could scale to the stars (read: very easily develop non-obvious bottlenecks).

It turns out that you can throw all of this out and replace it with a 23 line nginx config. Oh yeah, and you get a 250x per-node performance increase too.

How? Caching. But only a little bit. Here's a technique (okay, cheap trick) that lets you get blinding performance and serve up-to-the-second fresh content, without having to write a whole bunch of app code.

Concept

Microcaching is like an insulation layer for your app - Let's say your wordpress install (or rails app) can handle 20 requests/sec fairly happily. This is fine, up until the point where you get on HN and Reddit at the same time (greatest day of your life) and right at the critical time, your site collapses spectacularly amidst the deafening snarky jeers of your peers.

The idea behind microcaching is to cap the number of requests that can make it through to your app, letting nginx bear the brunt of your pageviews by caching content for a very small amount of time (ie: 1 second or less).

From your app's point of view, it can only ever be hit by a maximum of 1 req/sec per page of content it needs to serve, so in wordpress terms, if everyone is hitting your front page or a specific post, the vast majority of requests can be served out of cache.

At the same time, the classic problem of stale content/cache invalidation is basically nil - nobody's going to realise if the content they're seeing is 800ms old. Probably...
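To make the "insulation layer" idea concrete, here is a minimal sketch in JavaScript. It is only an illustration of the concept, not how nginx implements its cache: the makeMicrocache helper, the fake clock and the simulated traffic are all made up for this example.

```javascript
// Illustration only: a tiny in-memory microcache with an injectable clock.
// render(key) stands in for the expensive app (wordpress, rails, ...).
function makeMicrocache(render, ttlMs, now = Date.now) {
  const cache = new Map(); // key -> { body, expires }
  return function get(key) {
    const t = now();
    const hit = cache.get(key);
    if (hit && hit.expires > t) return hit.body; // served from cache
    const body = render(key);                    // only here do we touch the app
    cache.set(key, { body, expires: t + ttlMs });
    return body;
  };
}

// Simulate 1000 requests for the front page, 3ms apart (roughly 3 seconds
// of traffic), against a 1 second microcache.
let simClock = 0;
let renderCount = 0;
const getPage = makeMicrocache(
  (key) => { renderCount += 1; return `rendered ${key}`; },
  1000,
  () => simClock
);

for (let i = 0; i < 1000; i++) {
  simClock = i * 3;
  getPage('/');
}

console.log(renderCount); // 3: the app was hit about once per second
```

However many requests arrive, the app only sees roughly one per second per page; everything else is answered from the cached copy.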

Changing Data

Where you often come unstuck on something like a blog is with comments - you don't want a user who has just submitted a comment to then have it disappear when the page reloads. Thankfully, nginx's request handling is smart enough to deal with this - An nginx microcaching config:

The Config

There's nothing particularly clever going on here but it may be worth breaking down a couple of the entries - 

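# $no_cache is "1" at this point when the request could modify content
# (ie: a POST), so hand the client a short-lived "no cache" cookie
if ($no_cache = "1") {
    add_header Set-Cookie "_mcnc=1; Max-Age=2; Path=/";
}

# If the client sent that cookie back, bypass the cache for this request
if ($http_cookie ~* "_mcnc") {
    set $no_cache "1";
}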

What we're saying is that if a request has been made that could modify the content (ie: a POST or PUT), add a special "no cache" cookie to the response, so that we know this user can't be served cached results for 2 seconds.

The other problem that's likely to occur with high traffic + caching is the thundering herd phenomenon, wherein as a cache expires, all subsequent requests then require a real response to be generated. If you're doing 1000 cached reqs/sec and your dynamic page generation time is 200ms, then you're going to receive 200 more requests in the time it takes to refresh your cache, quite possibly taking down your server/app anyway.

Nginx has a way of dealing with this too:

proxy_cache_use_stale updating;

Essentially, this allows nginx to serve (slightly) stale responses whilst it's waiting for a refresh to complete. Neat.
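As a rough sketch of why serving stale content tames the herd, the JavaScript below mimics the behaviour in miniature. Everything in it (makeStaleWhileUpdating, the fake clock, the 200ms slowRender) is invented for illustration and assumes the render never fails; it is not nginx's implementation.

```javascript
// Illustration only: serve the old copy while a single refresh is in flight.
function makeStaleWhileUpdating(render, ttlMs, now = Date.now) {
  let entry = null;      // { body, expires }
  let refreshing = null; // promise for the in-flight render, if any

  return function get() {
    if (entry && entry.expires > now()) return Promise.resolve(entry.body);
    if (!refreshing) {
      // Only the first request after expiry triggers a real render.
      refreshing = render().then((body) => {
        entry = { body, expires: now() + ttlMs };
        refreshing = null;
        return body;
      });
    }
    // Stale copy if we have one, otherwise wait for the first render.
    return entry ? Promise.resolve(entry.body) : refreshing;
  };
}

let staleClock = 0;
let slowRenders = 0;
function slowRender() {
  slowRenders += 1;
  const version = `v${slowRenders}`;
  return new Promise((resolve) => setTimeout(() => resolve(version), 200));
}

const getFresh = makeStaleWhileUpdating(slowRender, 1000, () => staleClock);

const staleDemo = getFresh() // cold cache: this caller waits for v1
  .then(() => {
    staleClock = 2000;       // the cached copy has now expired
    // 200 requests arrive while the refresh is still being generated.
    return Promise.all(Array.from({ length: 200 }, () => getFresh()));
  })
  .then((burst) => ({
    renders: slowRenders,
    allServedStale: burst.every((body) => body === 'v1'),
  }));

staleDemo.then((result) => console.log(result));
// { renders: 2, allServedStale: true }
```

The burst of 200 requests costs the app exactly one extra render, rather than 200.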

Benchmarks

Enough with the talking - Let's see some benchmarks. The setup below is a clean EC2 small instance with a fresh unmodified wordpress install running on a pretty standard LAMP stack (Apache2 + PHP5 + PHP-FPM + APC + MySQL).

Vanilla wordpress, microcaching disabled, 200 requests, concurrency of 4 (conclusions below):

Woaah, 9.94 reqs/sec. That's pretty woeful. Of course, I could install Wordpress caching plugins, buy a bigger EC2 instance, tweak my PHP config, etc, etc. Or I could enable microcaching:

Nginx installed, microcaching enabled, 10,000 requests, concurrency of 500:

2364 reqs/sec. That's more like it - and this is on a single-core, contended cloud-server. From my quick tests, you can get about 7500 reqs/sec out of a 4 core box, which, to be honest, should be enough for anyone (ie: if you're getting 648 million pageviews per day, put some ads on your site and buy another damn server).
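For anyone wanting to check the back-of-envelope numbers, here they are worked through in a few lines of JavaScript. The inputs are the figures quoted in this post; bear in mind the two benchmark runs used different request counts and concurrency, so the speedup ratio is a rough indication rather than a like-for-like measurement.

```javascript
// Back-of-envelope checks for the figures quoted in this post.
const secondsPerDay = 24 * 60 * 60;

// 7500 reqs/sec sustained for a whole day:
const pageviewsPerDay = 7500 * secondsPerDay;

// Requests arriving during one 200ms page render at 1000 reqs/sec:
const renderMs = 200;
const arrivalsDuringRender = (1000 * renderMs) / 1000;

// Microcached nginx vs vanilla wordpress, per node:
const speedup = 2364 / 9.94;

console.log(pageviewsPerDay);      // 648000000
console.log(arrivalsDuringRender); // 200
console.log(speedup.toFixed(1));   // 237.8
```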

Some of you will notice that the attached config is actually broken for the WP-admin (tho easily fixed). Some of you will also notice that this is obviously not a silver bullet -

If you have personalized pages (ie: majority logged-in users) this approach isn't going to work. Similarly if you have a very write-heavy workload or long-tail of content, it's going to have reduced (but still useful) utility.

That said, for the amount of effort required for implementation, it's a nice insurance policy.

Fast Free Speech Recognition using Google's Infrastructure

A few evenings ago, after a couple of glasses of red wine, I was wondering how Chrome's new(ish) x-webkit-speech voice input tag worked.

Scalable, reliable, speech-to-text in opensource environments is irritatingly hard to do and it seemed (as with spam filtering) that Google had done a lot of the hard work already here, and had a vested interest in it being accurate, performant and maintained.

A bit of digging through Chromium's source revealed a public API that will take audio data via HTTP and turn it into a JSON response. Cool.

A bit more digging revealed the expected audio formats and a few more internal details about the structure of the HTTP request.

All pretty straightforward - the API expects an HTTP POST of a 16-bit, 16kHz mono audio stream encoded as either FLAC or Speex binary data.

The first step is generating an appropriate audio stream. Sox is a great open source tool that lets you play, record and convert audio between a variety of formats. Recording is as easy as rambling into the microphone after running:

$ rec -r 16000 -b 16 -c 1 test.wav

Looking at the Chromium source, there is quite a lot of effort spent on normalizing and trimming the audio prior to uploading - Thankfully, sox also provides this functionality.

To convert to FLAC, normalize and trim:

$ sox test.wav test.flac gain -n -5 silence 1 5 2%

Breaking that down:

  • sox test.wav test.flac (convert to FLAC format)
  • gain -n -5 (normalize audio to -5 dB)
  • silence 1 5 2% (trim leading silence, using 2% as the quietness threshold; trimming the end of the file as well needs the effect's additional below-period arguments)

You can also convert to Speex format the same way.

Pulling the API details out of the source, it's quite easy to perform a transcription request using curl to submit the binary data to the API endpoint:

And voila, the output:

There are some great hacks and integrations you could do with this. Things I haven't tried:

  • Determining maximum length of audio transcription
  • Testing with low-quality audio (ie: telephone)
  • Playing with normalization/settings to maximize quality
  • Any sort of streaming-batching to attempt (close to) realtime transcription

Any other (semi) secret Google APIs I should know about?

 

node.js - A giant step backwards?

A week into my first serious real-world node.js project, I was convinced it was the biggest step backwards since the Great Leap Forward.

6 weeks on, I'm definitely not so sure.

As a quick preface, the bulk of my recent work has been largely in python, with only limited production experience in async frameworks (ie: Tornado - I'd heard enough horror stories about Twisted that I'd avoided it so far).

So right off the bat with node, I was very quickly put off by how messy very simple things got almost straight away.

Some Examples

Consider the trivial case of checking for an entry in cache then fetching it from a datastore if required.

Synchronous version:

Fairly standard stuff, now look at a naive async version:

Woah, now I've ended up with two different code-paths where my result could end up and I can't share common code. The only way to stick to DRY principles here is to create a new function like doOperationOnThing() and pass the result in. But you'd then have to do that for every case where you wish to check the cache.

Alternately, you could create a cachedDB.getFromCacheOrDB(...) type method and provide the functionality this way, but this is really only one of many very basic programming constructs (the "if" statement in this particular case) that don't work well with async I/O.
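As a minimal sketch of the two shapes being compared in the cache-then-datastore case, assume some made-up in-memory stand-ins for the cache and datastore (store, syncCache, syncDb, asyncCache and asyncDb are all hypothetical, as is the toUpperCase() "operation"):

```javascript
// Hypothetical in-memory stand-ins for a cache and a datastore.
const store = { 42: 'thing from db' };
const syncCache = new Map();
const syncDb = { get: (id) => store[id] };

// Synchronous shape: one code path, the result is handled in one place.
function getThingSync(id) {
  let thing = syncCache.get(id);
  if (thing === undefined) {
    thing = syncDb.get(id);
    syncCache.set(id, thing);
  }
  return thing.toUpperCase(); // the "operation on thing"
}

// Callback shape: both stand-ins deliver their answer "later".
const asyncCache = {
  data: new Map(),
  get(id, cb) { setImmediate(() => cb(null, this.data.get(id))); },
  set(id, value) { this.data.set(id, value); },
};
const asyncDb = {
  get(id, cb) { setImmediate(() => cb(null, store[id])); },
};

function getThingAsync(id, done) {
  asyncCache.get(id, (err, cached) => {
    if (err) return done(err);
    if (cached !== undefined) {
      return done(null, cached.toUpperCase()); // path 1: cache hit
    }
    asyncDb.get(id, (dbErr, fromDb) => {
      if (dbErr) return done(dbErr);
      asyncCache.set(id, fromDb);
      done(null, fromDb.toUpperCase()); // path 2: same operation, duplicated
    });
  });
}

console.log(getThingSync(42)); // THING FROM DB
getThingAsync(42, (err, result) => console.log(err, result)); // null THING FROM DB
```

The "operation on thing" appears once in the synchronous version and twice in the callback version, which is exactly the duplication complained about above.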

To my horror, it wasn't just if statements that didn't work well when you introduce async I/O into them, but loops as well. Any kind of loops. What kind of joke programming revolution was this?

An illustrative example:

Caveat Emptor

There is so much wrong with this code it's hard to explain to an async beginner where to start. The most immediate point being that this theoretical function would likely return at worst an empty array and at best a (probably incomplete) result set... in a random order.

Everything I'd heard was that node.js was supposed to make it easier to write concurrent, asynchronous apps. This didn't seem easier to me.

So what's going wrong in the above example? As a brief explanation: For those just starting out, it's sometimes easier to think of async I/O (or any async) operations as queueing the I/O request, where the results will arrive at a random time "later", at which point a callback is called (passing in the results).

So what's happening above is we're queueing a whole bunch of requests to get blog posts out of the DB (ignoring the fact that this is a fairly inefficient way to do things in the first place), which will all execute in parallel and return sometime "later".

Because of this, once the requests are queued (relatively instantly), the program execution reaches the return statement and bails out of the function before any of the queries have completed, leaving the result empty.

Subsequently, at some point in the future ("later"), each DB request will return a result, which will be put onto the array. But this could be milliseconds, seconds or minutes later (if your DB is slow), at which point it will probably be too late: your software has very possibly already returned the result array and discarded the variable (ie: from a web point of view, you'd likely render a blank page).
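A small sketch of the failure described above, plus one hand-rolled way around it. fetchPost is a made-up stand-in for a DB call that answers after a random short delay; the fix shown (count outstanding requests, write each result into its own slot) is just one option, not the only pattern.

```javascript
// Hypothetical async lookup: answers after a random short delay.
function fetchPost(id, cb) {
  setTimeout(() => cb(null, { id, title: `Post ${id}` }), Math.random() * 20);
}

// Broken: the function returns before any callback has fired.
function getPostsBroken(ids) {
  const posts = [];
  for (const id of ids) {
    fetchPost(id, (err, post) => posts.push(post)); // runs "later"
  }
  return posts; // still empty at this point
}

// One fix: accept a callback, keep each result in its original position,
// and only report back once every request has answered.
function getPosts(ids, done) {
  const posts = new Array(ids.length);
  let pending = ids.length;
  let failed = false;
  if (pending === 0) return done(null, posts);
  ids.forEach((id, i) => {
    fetchPost(id, (err, post) => {
      if (failed) return;
      if (err) { failed = true; return done(err); }
      posts[i] = post; // slot i, regardless of arrival order
      pending -= 1;
      if (pending === 0) done(null, posts);
    });
  });
}

console.log(getPostsBroken([1, 2, 3]).length); // 0
getPosts([1, 2, 3], (err, posts) => {
  console.log(posts.map((p) => p.id)); // [ 1, 2, 3 ]
});
```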

Now What?

So, if "if" statements don't work and loops don't work, why am I bothering with this?

Well, it turns out that all this pain and suffering in primitive behaviours makes for a lot of ease in genuinely hard stuff like parallel programming. Essentially, all this hoop jumping allows node.js (or any async environment) to deal with parallelism by largely throwing it out the window.

This is by no means new, it's a technique that's been used extensively for eons, great/popular examples being Twisted, EventMachine, Tornado, Nginx and lighttpd.

Thankfully, the node community is very strong & active, and everyone is running up against these sorts of problems. There is a whole spate of flow-control helper libraries that make working around these (initially baffling) caveats a lot more manageable.

Where's Your Head At?

Once you get your head around thinking in async terms, node.js starts to actually make a lot of sense. JavaScript as a language is an excellent fit for evented I/O - it has proper anonymous functions and closures, and is highly dynamic. It also comes from a browser environment, where any network operation could take some time, so it makes perfect sense that the language has been designed to work this way.

Possibly most importantly - node.js was built from the ground up to be async. Unlike Twisted or EventMachine, where you have to constantly find async versions of all of your drivers (ie: MySQL, memcache, etc), node.js only has async versions of its various I/O libraries available. You may be able to shoot yourself in the foot with a for-loop but you can't (easily) accidentally make a blocking call and block/destroy your entire server (all too easy to do in Python or Ruby, where things are always synchronous by default).

A nice canonical example (from http://www.catonmat.net/http-proxy-in-nodejs/) is the 20-line proxy server:

Whilst incredibly simple, this proxy is quite capable of handling very high request rates and (more importantly) concurrency, without fooling around with threads, forking, shared data structures, etc. Better still, you could easily introduce some sort of other I/O into the mix, ie: stats collection into redis, without worrying too much about breaking your concurrency capabilities.
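For a feel of how little code such a proxy needs, here is a rough sketch of a forward HTTP proxy. To be clear, this is not the code from the linked article: it's a minimal version written against node's current built-in http module, and createProxy is a name made up for this example. It only handles plain HTTP (no CONNECT/HTTPS) and does no header clean-up.

```javascript
const http = require('http');

// Forward proxy sketch: clients send absolute URLs in the request line,
// e.g. "GET http://example.com/page HTTP/1.1".
function createProxy() {
  return http.createServer((clientReq, clientRes) => {
    let target;
    try {
      target = new URL(clientReq.url);
    } catch (e) {
      clientRes.writeHead(400);
      return clientRes.end('absolute URL required');
    }

    const upstream = http.request({
      hostname: target.hostname,
      port: target.port || 80,
      path: target.pathname + target.search,
      method: clientReq.method,
      headers: clientReq.headers,
    }, (upstreamRes) => {
      // Stream the upstream response straight back to the client.
      clientRes.writeHead(upstreamRes.statusCode, upstreamRes.headers);
      upstreamRes.pipe(clientRes);
    });

    upstream.on('error', () => {
      if (!clientRes.headersSent) clientRes.writeHead(502);
      clientRes.end();
    });

    // Stream the client's request body (if any) to the upstream server.
    clientReq.pipe(upstream);
  });
}

// To run it: createProxy().listen(8080);
```

Nothing here blocks: every request and response is just a pair of streams piped together, so one process can juggle many slow connections at once.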

Conclusions

So what do I think about node.js now?

If I had to implement some sort of intelligent proxy system, unifying REST API or long-polling/comet system, I'd reach for node before anything else.

These guys agree, and, consequently, have a great fork of node.js on their github that includes Ubuntu packaging configurations, along with a whole lot of other cool node-related stuff.

Would I use node.js for my next major web app?

Perhaps not quite yet. But I might give you a different answer next week.

--

Anyone got any particular love/hate moments or use-cases for node?

 

 

String manipulation tricks in bash scripting

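I only recently realised you can do some pretty handy string manipulations in bash, specifically string slicing: ${var:offset:length} gives you length characters of $var starting at offset (counting from 0), and ${var:offset} gives you everything from offset onwards.

Embarrassingly, this arguably makes bash's string handling more powerful than PHP's (and more like Python's).

This is handy for things like adding/removing trailing slashes, or determining if a path is relative vs absolute (ie: ${path:0:1} is "/" for an absolute path).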

Why and how to use the Twitter Streaming API in PHP

About 2 weeks ago, Twitter promoted their little-known Streaming API to production status and, further to that, have now recommended that all high volume and repeated search queries should migrate to the new API.

In their own words, the streaming API is a better solution because of:

   - Complete corpus search: Search is focused on result set quality and 
   there are no guarantees to return all matching tweets. Complete results 
   are only available on the Streaming API. Search results are increasingly 
   filtered and reordered for relevance. 
   - Lower latency results: From tweet creation to delivery on the API, 
   latency is usually within a second. 
   - Predictable rate limits: Streaming is built upon well-defined elevated 
   access roles so that client rate-limit-avoidance heuristics are eliminated. 
   - Higher peak capacity: During a peak event, when tweets spike, the 
   Streaming API is less likely to fall behind or begin aggressive rate 
   limiting. Furthermore, the risk of a large client peak capacity emergency 
   blacklisting is reduced. 
   - More consistent results: Hosting a continuously updated REST API on a 
   large cluster inevitably leads to temporal result skew due to 
   internal propagation delay. This issue is largely eliminated by long-lived 
   connections. 
   - More efficient: Bandwidth and processing are not wasted 
   on identical results. Also, repetitive and long-tail queries are processed 
   more efficiently in the Streaming architecture. 
   - Improved Search experience: Shifting the heaviest users away from 
   Search should dramatically improve the overall Search experience. Resources 
   can be allocated to the search architecture's strength: historical, complex 
   and high value queries. 

That said, the stream API is harder to use than a traditional REST resource - Rather than firing off a GET request with a search query attached whenever you want some Twitter data, the streaming API works by connecting only once and then consuming the stream (or "drinking from the firehose") continuously (or at least until you want to change your stream configuration).

In the PHP-centric view of the web-world, this is not really compatible with the "normal" way of doing things, ie: being firmly in the world of HTTP requests, cron jobs and short-lived processes. It would be very easy to do a bad job of trying to consume the stream and end up banned from the service (which will happen if you hammer it with badly configured connections).

Consequently, about 3 months ago I released an alpha version of the Phirehose PHP library for the (then alpha) Twitter Streaming API. It allows you to very easily connect and consume the stream without having to worry about the complexity of connection handling, persistent HTTP connections and filter predicate updates. Consuming the stream is as easy as:

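require_once 'Phirehose.php'; // assumption: adjust the path to wherever the library lives

class MyStream extends Phirehose
{
  // Called by the library once for each status received from the stream
  public function enqueueStatus($status)
  {
    print_r($status);
  }
}

$stream = new MyStream('username', 'password');
$stream->consume(); // long-running: keeps consuming the stream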

You still need to understand the basics of running long-lived PHP CLI processes and how to process the stream once you've consumed it but the library does a lot of the messier bits for you that PHP is traditionally fairly bad at and includes some simple examples around handling and processing statuses.

Since release, I've had a few hundred downloads and some feedback that has allowed me to improve the library. It's now in production powering a variety of services and I'm considering it largely stable. If you're looking for a way to integrate with the (now recommended) Twitter Streaming API, it might make things easier for you.

The library is listed on the Twitter Libraries page or you can find it on Google Code here: http://code.google.com/p/phirehose/

If you're interested in using the streaming API in other languages, I suggest you check out:

Two iTunes accounts on one iPhone

A little known feature of the iPhone (at least from my discussions until recently) is the ability to have two (or more) iTunes/App store accounts linked to one iPhone. Why do this?

The main reason (for Australians anyway) is to be able to have an Australian App Store account for apps like Metlink, OzWeather, etc and a US account for apps like Photoshop for iPhone, Pandora and the BART timetables (if you visit San Francisco for example) or any other app that gets released US only for whatever reason.

Similarly, it may be possible to get US only media content (ie: TV shows) too though I'm not sure about either the ability or legality of doing this, so if you get in trouble, don't blame me.

Doing this is remarkably easy, to the point that I'm not even going to describe it here, but rather just link to two other guides that I simply jammed together.

First, create a US App Store Account:
http://www.iphoneworld.ca/iphone-guides/2009/02/22/how-to-create-a-us-itunes-...

Then, associate the account with your iPhone (OS 3.0+ required, I used 3.1.2 on an iPhone 3G):
http://theappleblog.com/2009/04/29/iphone-os-30-beta-4-multiple-itunes-accounts/

Does it work? Oh yes.

I'm currently listening to streaming music via Pandora (US only) on my iPhone whilst writing this :) Admittedly, I'm actually in the US at the moment - I have a feeling the stream will be blocked to non-US IP addresses.

Now, what I cannot tell you is what happens when you sync your phone with iTunes (including apps/etc). It may end up wiping the non-associated apps, it might not. What I can tell you is that providing you manage your apps ONLY from your iPhone (which I do anyway), it works like a charm.

Interested in hearing about other people's experiences.

Twitter Retweet API is all about TweetRank

Twitter's announcement today of a retweet API is all about search and not really about the current RT convention being (as Biz says) "a bit cumbersome".

I'm definitely not the first person to comment on this, in fact, Topsy has been working on this principle for 3 years, but Twitter's changes are a step towards ensuring that one of the most powerful features of Twitter (search) stays in house.

One of the main problems with Twitter Search is that it has little to no "rank" or reputation information - that is, a nameless-nobody spamming the twitter stream will come up just as readily in search as a high profile celebrity or commentator. Whilst this may be desirable in some cases, it seriously limits the usefulness of the search feature in terms of seeing what is going on that actually matters.

Google's famous algorithm, PageRank, solves this partly by using inbound links to build credibility, ie: the more people who link to you, the more important you are. Twitter has always had this to some degree, in that the number of followers could in theory be used to denote how important a given individual may be, but as we've all seen recently, it's way too easy to get lots of useless/fake followers through the use of various spam/follow-bots/etc.

Enter the retweet API - All of a sudden, the amount of social worth can be tracked by how often your tweet gets re-posted, the theory being that the more valuable what you have to say is, the more often it will be retweeted, hence building your reputation or "social worth". As the TechCrunch article (and many others) have said: "Retweeting becomes the currency of the web".

With social rank (TweetRank?) being considered, all of a sudden Twitter (or other realtime search engines) can return results based on importance rather than just recency. Being at the top of realtime search results is quickly becoming potentially as valuable as appearing on the first page of Google (imagine searching for "best restaurant" and getting nearby recommendations based on rank rather than recency).

Similarly, the retweet API opens up a whole new range of twitter abuse (ie: auto-retweet bots, retweet baiting and much, much more), so it's going to be interesting to see how this plays out. I'll be interested to see how much of the power of the API is opened up to third parties with respect to search - ie: will Topsy/Google get access to this data or is that portion going to remain in-house?