Cracking HTML with Unix

Our company uses the enterprise social networking site Yammer. For those who may not be familiar with it, it is a sort of internal Facebook.

Somebody had created a graph of how membership had increased over the months, but the person had left the company. People were interested to see how the membership had increased in the last year. I decided to look at the problem.

Yammer has a members page and the page shows the date the person joined.

If we look at the source of the page, it looks something like this:

<td class="identity">
<div class="info">
<img alt="Status_on" ... />
<a href="???" class="primary yj-hovercard-link" title="view profile">???</a>
</div>
</td>

<td class="joined_at">
    Nov 24, 2010
</td>

So, I saved the source for all the members in a file called members.txt. This file had about 75,000 lines.

The part I was interested in was the last three lines, or more particularly, the actual date joined. I figured that if I had all the dates joined, I could create the required histogram.

The easiest way to do this was to use grep to find all the lines containing joined_at and then use the context option (-A 1) to show the line following each match. This gave:

<td class="joined_at">
    Nov 24, 2010
--
<td class="joined_at">
    Nov 24, 2010
--
<td class="joined_at">
    Nov 28, 2010

To clean up the output, I used grep -v which gave:

$ grep -A 1 joined_at members.txt | grep -v joined | grep -v -- --
Nov 24, 2010
Nov 24, 2010
Nov 28, 2010
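To try this step without the real data, here is a minimal sketch that runs the same three greps over a tiny made-up file. The sample rows and the `sample` and `dates` variable names are my own stand-ins, not part of the original members.txt.

```shell
# Build a tiny stand-in for members.txt (hypothetical sample data).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<td class="identity">
<div class="info">...</div>
</td>
<td class="joined_at">
    Nov 24, 2010
</td>
<td class="identity">
<div class="info">...</div>
</td>
<td class="joined_at">
    Nov 28, 2010
</td>
EOF

# -A 1 prints each matching line plus the one line after it, and
# separates non-adjacent groups with a line containing "--".
# The first grep -v drops the <td class="joined_at"> lines.
# In the second, the first "--" ends option parsing, so the second
# "--" is the pattern: it drops the separator lines.
dates=$(grep -A 1 joined_at "$sample" | grep -v joined | grep -v -- --)
printf '%s\n' "$dates"
rm -f "$sample"
```

Note that the dates keep their leading indentation at this point. That does no harm, because awk ignores leading blanks when it splits a line into fields.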

I was not interested in the day on which people joined, only the month and year. Also, I wanted it in the format year month. This was easily accomplished using awk:

... | awk '{print $3, $1}'

In other words, print the third and first field of the output.
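A quick way to see how awk numbers the fields is to feed it a single date. This is only an illustration; the `line` and `year_month` variables are names I made up for it.

```shell
line='    Nov 24, 2010'
# awk splits on runs of whitespace and ignores leading blanks, so
# $1 is "Nov", $2 is "24," (comma included) and $3 is "2010".
year_month=$(printf '%s\n' "$line" | awk '{print $3, $1}')
echo "$year_month"   # prints: 2010 Nov
```

Because the day field is simply skipped, the trailing comma on "24," never needs to be cleaned up.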

We now have:

... | awk '{print $3, $1}' | more
2010 Nov
2010 Nov
2010 Nov
2010 Nov
2009 Oct

As you will notice, the data is not necessarily in sorted order. In order to sort numerically, we need the month number, not the name, for the month. Converting ‘Nov’ to ‘11’ is done easily using sed. I won’t type the full command, but it looks like this:

`sed 's/Jan/01/;s/Feb/02/;s/Mar/03/;...'`
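For completeness, one way the full substitution list could be spelled out is sketched below. It assumes the three-letter English abbreviations seen in the sample output, and `month_to_number` is a hypothetical helper name of my own.

```shell
# Hypothetical helper: replace each three-letter English month
# abbreviation on stdin with its two-digit month number.
month_to_number() {
  sed -e 's/Jan/01/' -e 's/Feb/02/' -e 's/Mar/03/' -e 's/Apr/04/' \
      -e 's/May/05/' -e 's/Jun/06/' -e 's/Jul/07/' -e 's/Aug/08/' \
      -e 's/Sep/09/' -e 's/Oct/10/' -e 's/Nov/11/' -e 's/Dec/12/'
}

printf '2010 Nov\n2009 Oct\n2011 Jan\n' | month_to_number
```

The leading zero on 01 to 09 keeps every month two characters wide, which matters later when the lines are sorted.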

So, we now have:

2010 11
2010 11
2010 11
2010 11
2009 10

All that is left to do is to sort the output numerically (sort -n) and then count the number of unique occurrences of each line (uniq -c).

Doing this gives us:

9 2009 10
1 2010 10
64 2010 11
112 2010 12
403 2011 01
60 2011 02
55 2011 03
23 2011 04
33 2011 05
36 2011 06
18 2011 07
60 2011 08
31 2011 09
42 2011 10
22 2011 11
21 2011 12
22 2012 01
23 2012 02
10 2012 03
40 2012 04
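The sort-then-count step can be tried on a handful of made-up lines, as in the sketch below. The input values are invented for illustration, and the final awk is my own addition that only strips the padding uniq -c puts in front of each count.

```shell
# uniq -c only merges adjacent identical lines, so the input must be
# sorted first. sort -n compares the leading number (the year); lines
# with the same year fall back to a whole-line comparison, which puts
# the zero-padded months in order.
counts=$(printf '2010 11\n2009 10\n2010 11\n2010 12\n2009 10\n2010 11\n' \
  | sort -n | uniq -c | awk '{print $1, $2, $3}')
echo "$counts"
```

If relying on the fallback comparison feels too implicit, `sort -k1,1n -k2,2n` states the two sort keys explicitly and should give the same order for this data.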

A histogram showing the number of people who joined each month.
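If a charting tool is not at hand, a rough text histogram can be drawn straight from the counts. This is just a sketch of one possible approach, not how the original chart was made; `text_histogram` and its scale argument are hypothetical names of my own.

```shell
# Hypothetical helper: for each "count year month" line on stdin,
# print the month, the count, and one '#' per `scale` members
# (default scale is 10).
text_histogram() {
  awk -v scale="${1:-10}" '{
    bar = ""
    n = int($1 / scale)
    for (i = 0; i < n; i++) bar = bar "#"
    printf "%s-%s %4d %s\n", $2, $3, $1, bar
  }'
}

printf '64 2010 11\n112 2010 12\n403 2011 01\n60 2011 02\n' | text_histogram 20
```

With a scale of 20, the 403 sign-ups of January 2011 become a bar of 20 characters, while the neighbouring months get 3 to 5.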

There is an interesting network effect in the above data. As soon as the site grew past a critical point, there was an explosion of new members (which took the number of members to just under half of all members of the organisation), and then the rate of new sign-ups slowed and became almost constant.

However, what I wanted to show in this post was how, with a basic knowledge of Unix tools, it is possible to do some reasonably advanced analytics on what may initially seem to be quite unstructured data.