Bailing over the Long Tail
In my last posting I reported on the first version of ReScope. This tool seeks to help delicious users reflect on their web readings. After finishing the second prototype I started to play around and spied on some people's tagging interests. That way I found out that watching Adobe's tag cloud might reveal the company's real hot topics and products. I also found out, by talking to perigin and kjetil on IRC, that interpreting the data presented by ReScope correctly can be quite difficult for anyone other than the author of the tag cloud. Finally, I came across a long-tail-related problem with ReScope's tag cloud visualisation in general. In this posting I discuss this problem and two approaches to solving it.
While I used the first ReScope prototype, things looked quite pleasant with my own tag cloud. When I tested it with the accounts of other delicious users, I found that ReScope's tag cloud visualisation works quite well for some of them, too. Others, however, use so many tags that the entire tag cloud does not fit on a single screen. This is pretty bad, as ReScope already uses the entire screen for the tag cloud. As I plan to make ReScope pluggable into existing web pages, this problem is serious.
This problem is related to the long tail in two ways. Firstly, it is a typical long tail problem because it originates in the personal differences in how people tag web resources. Secondly, it is a content-related long tail problem because the tag cloud visualises the frequency of tags through different font sizes. The first aspect is not really a problem, as it reflects personal differences in tagging that I don't want to change. The second aspect is mainly an effect of ReScope's implementation of the tag cloud, so we should have a closer look at it.
Trailing Tags
While spying on the tag clouds of my peers I noticed that some of them use huge numbers of tags. They use so many that the tags would not fit on a single browser screen even if they were all printed in 9pt (the smallest font size I currently use with ReScope). But ReScope increases the font size for more frequently used tags, so the cloud needs even more room. As the ReScope visualisation orders the tags alphabetically, some important (brightly colored) tags will not appear on the screen. I also played with changing the order, but the results were rather disappointing. So I studied the big clouds more thoroughly and realized that most of the tags were at the smallest scale, while only rather few tags were more relevant on a global scale.
From the perspective of the long tail, these rarely used tags are what make the users different. So the problem is to reduce the number of tags in large tag clouds while keeping them meaningful and personal to the individual.
Approaching the Problem
The most primitive approach would be to cut off all tags below a given threshold if the tag cloud exceeds a defined number of tags. For example, if the tag cloud gets too big, the system may remove all tags that were not used at least five times. This approach is implemented in the user interface of del.icio.us as a user-configurable option. There a user can choose whether the tag cloud should include tags that were used at least once, twice, or five times.
If two different types of information are encoded in the same tag cloud, one needs to ensure that the threshold does not hide (possibly) relevant information from the user. For ReScope the problem lies only in displaying the global usage of the tags, which can involve any number of tags in the long tail. This is not the case for the most recently used tags, which are limited to the tags of the 20 most recently bookmarked links. Therefore, I decided to apply the long tail cut-off only to those tags that were not recently used. This ensures that no relevant information is hidden from the user.
Defining a fixed threshold, however, may not lead to a reasonable reduction of tags in the tag cloud. Although this approach is fully transparent to the user, it does not take the individual tagging differences into account. This implies that users may have to play with the threshold parameter until they get an appropriate result. The problem with this approach is that users who have large tag clouds will start with a non-optimal visualisation.
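To make the fixed cut-off concrete, here is a minimal Python sketch of it, including the exemption for recently used tags. It is not ReScope's actual code: the function name `cut_long_tail`, the parameters `min_uses` and `max_tags`, and the representation of the cloud as a dictionary of tag counts are all assumptions made for illustration.

```python
def cut_long_tail(tag_counts, recent_tags, min_uses=5, max_tags=100):
    """Fixed-threshold cut-off (illustrative sketch, not ReScope's code).

    tag_counts  -- dict mapping each tag to its global usage count
    recent_tags -- tags of the most recently bookmarked links
    min_uses    -- hypothetical fixed threshold, e.g. 1, 2 or 5
    max_tags    -- hypothetical cloud size above which the cut-off applies
    """
    # Small clouds are shown in full.
    if len(tag_counts) <= max_tags:
        return dict(tag_counts)

    recent = set(recent_tags)
    # Keep a tag if it is used often enough OR was used recently.
    return {
        tag: count
        for tag, count in tag_counts.items()
        if count >= min_uses or tag in recent
    }


counts = {"web": 12, "python": 7, "rdf": 2, "todo": 1}
print(cut_long_tail(counts, recent_tags={"todo"}, min_uses=5, max_tags=3))
```

In this example "rdf" is dropped because it is rare and not recent, while "todo" survives despite being used only once, because it belongs to a recent bookmark.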
A Non-naive Alternative
Due to the limitations of the fixed threshold approach, I thought about a better starting situation that allows users to get a quick overview at a single glance. The goal is therefore to offer a good visualisation from the very beginning and to free the users from playing with fixed threshold parameters. In the current version of ReScope I implemented a dynamic threshold for the tag cut-off.
The core idea is to identify the head, body, and long tail of a tag cloud. The head of the tag cloud holds the key tags of the user, maybe some four or five tags. The body of the tag cloud contains all essential tags that are relevant for a user. The long tail contains all the trailing additional information, and this is the part of the beast we are seeking to cut off.
The dynamic threshold depends on two parameters. The first parameter identifies whether the size of the tag cloud is problematic. This depends partially on the size of the user's screen, i.e. on the number of tags that are likely to fit in the viewport. The second parameter is the amount of trailing information. This parameter helps to decide whether a threshold is likely to produce a reasonable result.
The algorithm for the dynamic threshold of ReScope is as follows:
- check if the tag cloud exceeds the display limit of the user's view. This limit is given as a number of tags.
- if the tag cloud is smaller than this limit, display the entire tag cloud and stop.
- order the tags by their global frequency.
- check the tag usage at the display limit.
- calculate how many extra tags beyond the limit have the same tag usage.
- calculate how many recently used tags are below this usage level.
- reduce the display limit by the number of recently used tags in the long tail.
- if the additional tags exceed the display limit by more than 5%, cut off at the next higher tag usage; otherwise cut off at the tag usage found at the display limit.
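The steps above can be sketched in Python as follows. This is only one reading of the list, not ReScope's actual implementation: the function name `dynamic_cutoff`, its parameters, and the dictionary of tag counts are assumptions. In particular, the 5% test is interpreted here as "the tags at or above the usage level outnumber the reduced display limit by more than 5%", and recently used tags are always kept.

```python
def dynamic_cutoff(tag_counts, recent_tags, display_limit, tolerance=0.05):
    """Dynamic threshold (illustrative sketch, one reading of the steps).

    tag_counts    -- dict mapping each tag to its global usage count
    recent_tags   -- tags of the most recently bookmarked links
    display_limit -- number of tags likely to fit in the user's view
    tolerance     -- allowed overshoot of the limit (5% in the post)
    """
    if display_limit < 1:
        raise ValueError("display_limit must be at least 1")

    # Steps 1-2: small clouds are displayed completely.
    if len(tag_counts) <= display_limit:
        return set(tag_counts)

    # Step 3: order the tags by global frequency, most used first.
    ranked = sorted(tag_counts.items(), key=lambda item: item[1], reverse=True)

    # Step 4: usage of the tag sitting exactly at the display limit.
    usage_at_limit = ranked[display_limit - 1][1]

    # Step 5: tags beyond the limit that share this usage.
    extra = sum(1 for _, count in ranked[display_limit:] if count == usage_at_limit)

    # Step 6: recent tags below this usage level; they are shown anyway.
    recent = set(recent_tags) & set(tag_counts)
    recent_in_tail = sum(1 for tag in recent if tag_counts[tag] < usage_at_limit)

    # Step 7: the recent tail tags take room away from the other tags.
    reduced_limit = max(display_limit - recent_in_tail, 0)

    # Step 8: too many tags at this level -> cut at the next higher usage.
    at_or_above = display_limit + extra
    if at_or_above > reduced_limit * (1 + tolerance):
        kept = {tag for tag, count in ranked if count > usage_at_limit}
    else:
        kept = {tag for tag, count in ranked if count >= usage_at_limit}

    return kept | recent


counts = {"web": 40, "python": 30, "rdf": 10, "css": 5,
          "a": 1, "b": 1, "c": 1, "d": 1, "e": 1}
print(sorted(dynamic_cutoff(counts, recent_tags={"e"}, display_limit=5)))
```

With a display limit of five, the tag at the limit is used only once, and four more tags share that usage, so the sketch cuts at the next higher usage level and keeps the four frequent tags plus the recent tag "e". Note that under this reading a cloud in which every tag has the same usage would shrink to the recent tags only.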
The main improvement of this algorithm over the fixed threshold is that it reflects what is actually accessible to a user. The payoff for the additional computing is that the users can start with a reasonable tag cloud, instead of drowning in the sea of their homegrown meta-data.