cipherdyne.org

Michael Rash, Security Researcher



Trending Low-Volume Google Searches - Introducing Gootrude

UPDATE 06/16/08: After making Slashdot today and reading the comments, I feel I should clarify a few things regarding this post. First, it was pointed out (quite rightly) that Gootrude trends the number of documents in Google's index that contain each search term, whereas Google Trends tracks the search volume associated with a search term. These are not the same thing. However, I would wager that they are related, though certainly not in some nice 1:1 correspondence. That is, if a relatively unique word - say, "myspace" - is mentioned on 100,000 web sites, then it will be used as a Google search term more often than if it were mentioned on only 10 web sites, simply because it is exposed to many more people who would then be interested in all things "myspace". I would also wager that the reverse is true: if "myspace" is an extremely popular search term, then it will probably also appear on many more web sites around the Internet. Gootrude attempts to trend the number of hits a given search term returns in Google, and by the above reasoning there is a loose correspondence between this and the number of times the term is searched for in Google. Perhaps this is not useful, but only time will tell once Gootrude has been used by a lot of people. If you will allow the above interpretation, then the remainder of the post below makes sense.

Now for the original post:

The Google Trends project allows you to input search terms like "Myspace", "2008 election", or "Linux", and see how popular these search terms have been over time. The resulting graphs can be quite interesting - spikes in search volume can sometimes be correlated with particular news articles and world events, and the Google Trends interface points these out.

This is a handy tool, but there are many search terms for which Google Trends does not display any results. According to the error message that Google Trends generates, such terms (for example "Linux Firewalls" - with the quotes) have insufficient search volume to display graphs. Fair enough. I suppose that Google sets an internal threshold on search volume, and this threshold could be set for reasons that range anywhere from Google Trends still being experimental to Google not wanting to provide data on how it builds its massive search index for emerging search terms. Either way, I would like a way to see search term trends that Google doesn't currently make available to me.

Although I'm an open source developer and author, search terms related to my projects are not popular enough yet to be displayed by Google Trends. So, I had to roll my own trending mechanism, and this blog post announces the release of a new open source tool, Gootrude (see the quick start, source code, and download links), that I wrote to do just this. The basic strategy is to take a collection of search terms defined by the user, automatically query Google for the number of results associated with each of these search terms (this is displayed by Google when doing a web search), and graph these numbers over time with Gnuplot. Let me state up front that Gootrude only makes use of data that Google freely provides to everyone with normal web searches, and is meant to be run once per day (so as to not be a pest in terms of the number of queries it makes). As an example, if you type the word "security" into the Google web interface, it will return a string like "Results 1 - 10 of about 1,010,000,000". The "1,010,000,000" number is collected by Gootrude and stored in a file along with the current time.
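To make the collection step concrete, here is a minimal Python sketch of pulling the hit count out of a summary line and logging it with a timestamp. It is not Gootrude's actual code: the function names, the regular expression, and the one-line-per-sample log format are all assumptions made for illustration, and the wording of Google's summary line may differ from the example quoted above.

```python
import re
import time

# Matches the summary line quoted in the post, e.g.
# "Results 1 - 10 of about 1,010,000,000". The exact wording that
# Google returns is an assumption and may differ or change over time.
RESULT_RE = re.compile(r"of (?:about )?([\d,]+)")


def parse_result_count(summary):
    """Return the hit count in a results summary line as an int, or None."""
    match = RESULT_RE.search(summary)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


def append_sample(path, term, count, timestamp=None):
    """Append one 'epoch_seconds count term' line to a text log.

    The log format is hypothetical, not the format Gootrude uses.
    """
    if timestamp is None:
        timestamp = int(time.time())
    with open(path, "a", encoding="utf-8") as log:
        log.write(f"{timestamp} {count} {term}\n")
```

With these definitions, `parse_result_count("Results 1 - 10 of about 1,010,000,000")` returns `1010000000`, and a once-per-day job could pass that value to `append_sample` for each configured search term.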

For the past year, I have sent a set of search terms through Google once per day with Gootrude and the results are displayed below. Visible within the data returned from Google are strange oscillations that vary quite a bit more than I would have expected, and also evidence for what happens when a large site (like linux.com) posts an article about a Cipherdyne project.

First, below is the graph of the fairly unique word "cipherdyne" since late June 2007. The filled-in red curve is the absolute number of search results (taken each day around 1am), and the green line is the 10-day moving average.

[Figure: Gootrude plot of cipherdyne search term]

As you can see, at the beginning of the graph around July 1st Google steadily shows about 28,000 results for "cipherdyne", but towards the end of July this dips to well below 20,000 only to rebound in August to about 30,000. Then, beginning around March 1st, 2008 the results shoot up to over 100,000 briefly and then back to around 70,000 in May. How does one interpret this data? It seems unlikely that these fluctuations can be entirely explained by "actual" day-by-day changes in how external sites reference the term "cipherdyne" - there must be some index updating component that is internal to Google at work here, and we'll see a better example of this below.
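For readers who want to reproduce the smoothed line, the 10-day moving average can be sketched in a few lines of Python. This is an illustration under assumptions, not Gootrude's own implementation: it uses a trailing window (each point averages the current day and the preceding days), and for the first few days it averages over however many samples exist so far. The function name and the `window` parameter are hypothetical.

```python
def moving_average(values, window=10):
    """Trailing moving average over one-sample-per-day counts.

    Assumption: points before a full window is available are averaged
    over the samples collected so far.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    averages = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the sample that just fell out of the window.
            running_sum -= values[i - window]
        averages.append(running_sum / min(i + 1, window))
    return averages
```

For example, `moving_average([1, 2, 3, 4], window=2)` returns `[1.0, 1.5, 2.5, 3.5]`. Changing `window` is a cheap way to see how a longer averaging period would flatten the day-to-day swings.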

Now, here is the graph of the search term "gpgdir":

[Figure: Gootrude plot of gpgdir search term]

The most obvious feature of the gpgdir graph above is the large spike to around 60,000 results around May 1st. It turns out that an article was posted to linux.com on the 24th of April, so given that "gpgdir" is not a common word, the spike seems nicely correlated with the posting of the article as it got bounced around the Internet and blogosphere. A more interesting feature perhaps is the sharp cyclic oscillation between July and December 2007. During this time, search results for gpgdir bounced from 1,000 to around 10,000 and back again several times, and the transition each time was fast - making the jump to 10,000 over the course of two days and then stabilizing for about 10 days or so and then back down to 1,000. It is almost as though Google was trying to establish the proper order of magnitude for "gpgdir" search results during this time via a sort of step function.

Finally here is the graph of "single packet authorization":

[Figure: Gootrude plot of single packet authorization]

Again, we see a dramatic spike in search results - from around 5,000 to well over 50,000 and settling down to about 10,000 around the beginning of June. Although there has been some activity related to SPA in the Ubuntu forums and also in the Gentoo forums, if this caused Google to report the search results as over 50,000 why did this number return so precipitously back to around 10,000? The links have not gone away, but they were probably mentioned on other referencing sites and then moved to less important pages over time on those sites. Perhaps Google is trying to find the appropriate steady state for its search results, and there are many factors that Google takes into account that are not available to the public.

There are lots of unanswered questions this sort of data brings to mind:

  • All of the data for the above graphs was collected from a single Linux system. How different would the results be if several systems in different geographic locations collected the data and the average for each data point was used instead?
  • Each data point was collected around 1am every morning. If the data collection time were, say, 1pm, would the results have been significantly different?
  • What is the "optimal" time scale for the moving average? Given that Google's own Trends interface seems to show search results on the macro level, would a much longer moving average than 10 days - perhaps on the order of several weeks - be a more accurate reflection of search popularity?
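The first question above could be explored by merging per-day counts from several collection systems and averaging them. The Python sketch below shows one way to do that; the input layout (a dict mapping each host name to its own date-to-count dict) and the function name are hypothetical, and dates that only some hosts reported are averaged over the hosts that have them.

```python
from collections import defaultdict
from statistics import mean


def average_across_collectors(series_by_host):
    """Map each date to the mean count over the hosts that reported it.

    series_by_host: {host: {date_string: count}} (hypothetical layout).
    """
    counts_by_date = defaultdict(list)
    for samples in series_by_host.values():
        for date, count in samples.items():
            counts_by_date[date].append(count)
    # Sort by date string so the merged series is in chronological order
    # (assumes ISO-style dates such as "2008-06-01").
    return {date: mean(counts) for date, counts in sorted(counts_by_date.items())}
```

Comparing each host's raw series against this merged series would show how much of the oscillation is specific to one vantage point.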

One thing is clear - getting search results that are meaningful is much easier with unusual search terms. With the posting of this blog entry, the term "Gootrude" should evolve nicely within Google results, and the graph of these results will be updated daily on the main Gootrude page so you can see this evolution as it unfolds.

In closing, I would like to mention that Gootrude is just getting started, so there are lots of enhancements that need to be made. Some of the most important features to develop are:

  • Integration with the Google Charts API.
  • Development of an online web portal for Gootrude so that users don't have to have their own infrastructure to run Gootrude.
  • Ability to import search data from different Gootrude collection systems.
  • Add support for data collected from additional search engines.

If you are an open source developer and would like to contribute, see the TODO file for an updated list of development tasks, or send me an email (mbr[at]cipherdyne[dot]org). Also, if you have any ideas or feedback on why some of the graphs above look the way they do in the context of how Google builds its index, please email me.

Finally, here are a few additional graphs of search terms over the past year:

  • [Figure: Gootrude plot of fwsnort]
  • [Figure: Gootrude plot of michael rash]
  • [Figure: Gootrude plot of Linux Firewalls Attack Detection]
  • [Figure: Gootrude plot of single packet authentication]
  • [Figure: Gootrude plot of iptables attack visualization]