Cracking HTML with Unix
Our company uses the enterprise social networking site Yammer. For those who may not be familiar with it, it is a sort of internal Facebook.
Somebody had created a graph of how membership had increased over the months, but the person had left the company. People were interested to see how the membership had increased in the last year. I decided to look at the problem.
Yammer has a members page and the page shows the date the person joined.
If we look at the source of the page, it looks something like this:
<td class="identity">
<div class="info">
<img alt="Status_on" ... />
<a href="???" class="primary yj-hovercard-link" title="view profile">???</a>
</div>
</td>
<td class="joined_at">
Nov 24, 2010
</td>
So, I saved the source for all the members in a file called members.txt. This file had about 75,000 lines.
The part I was interested in was the last three lines, or more particularly, the actual date joined. I figured that if I had all the dates joined, I could create the required histogram.
The easiest way to do this was to use grep to find all the lines containing joined_at and then use the context option (-A 1) to show the line following. This gave:
<td class="joined_at">
Nov 24, 2010
--
<td class="joined_at">
Nov 24, 2010
--
<td class="joined_at">
Nov 28, 2010
To clean up the output, I used grep -v, which gave:
$ grep -A 1 joined_at members.txt | grep -v joined | grep -v -- --
Nov 24, 2010
Nov 24, 2010
Nov 28, 2010
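The extraction step can be tried on a few lines before running it on the full file. The following is a minimal sketch: the function name extract_dates and the sample input are made up for illustration, and the input is assumed to have the same layout as the page source shown earlier. Note that -A is supported by GNU and BSD grep but is not required by POSIX.

```shell
# Read HTML on stdin and print only the line that follows each joined_at cell.
extract_dates() {
    grep -A 1 joined_at |   # the matching line plus one line of trailing context
        grep -v joined |    # drop the <td class="joined_at"> lines themselves
        grep -v -- --       # drop the "--" separators grep prints between groups
}

# Hypothetical sample input with the same structure as the members page.
extract_dates <<'EOF'
<td class="joined_at">
Nov 24, 2010
</td>
<td class="joined_at">
Nov 28, 2010
</td>
EOF
```

The first -- in the last grep ends option parsing, so the second -- is treated as the pattern to exclude.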
I was not interested in the day on which people joined, only the month and year. Also, I wanted it in the format year month. This was easily accomplished using AWK:
... | awk '{print $3, $1}'
In other words, print the third and first field of the output.
We now have:
... | awk '{print $3, $1}' | more
2010 Nov
2010 Nov
2010 Nov
2010 Nov
2009 Oct
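The comma after the day does not get in the way here: awk splits each line on whitespace, so the comma stays attached to the second field, which is never printed. A minimal sketch, with a made-up function name and made-up sample dates:

```shell
# Reorder "Mon DD, YYYY" into "YYYY Mon".
# Field 1 is the month, field 2 is the day with its comma, field 3 is the year.
year_month() {
    awk '{print $3, $1}'
}

# Hypothetical sample dates in the format used on the members page.
printf 'Nov 24, 2010\nOct 5, 2009\n' | year_month
```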
As you will notice, the data is not necessarily in sorted order. In order to sort numerically, we need the month number, not the name. Converting ‘Nov’ to ‘11’ is done easily using sed. I won’t type the full command, but it looks like this:
sed 's/Jan/01/;s/Feb/02/;s/Mar/03/;...'
So, we now have:
2010 11
2010 11
2010 11
2009 10
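For reference, one way the month conversion could be spelled out in full is shown below. This is a sketch rather than the exact command used for the post: the function name month_to_number is invented, and the input is assumed to contain only "year month" lines with English three-letter month names.

```shell
# Replace a three-letter English month name with its zero-padded number.
# Each s/// command is independent; a line contains only one month name,
# so at most one substitution applies per line.
month_to_number() {
    sed 's/Jan/01/; s/Feb/02/; s/Mar/03/; s/Apr/04/
         s/May/05/; s/Jun/06/; s/Jul/07/; s/Aug/08/
         s/Sep/09/; s/Oct/10/; s/Nov/11/; s/Dec/12/'
}

# Hypothetical sample input in "year month" form.
printf '2010 Nov\n2009 Oct\n' | month_to_number
```

Zero-padding the single-digit months keeps every line the same width, which matters for the sort that follows.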
All that is left to do is to sort the output numerically (sort -n) and then count the number of unique occurrences of each line (uniq -c). Doing this gives us:
9 2009 10
1 2010 10
64 2010 11
112 2010 12
403 2011 01
60 2011 02
55 2011 03
23 2011 04
33 2011 05
36 2011 06
18 2011 07
60 2011 08
31 2011 09
42 2011 10
22 2011 11
21 2011 12
22 2012 01
23 2012 02
10 2012 03
40 2012 04
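The counting step, together with a rough text rendering of the counts, can be sketched as follows. The function names count_per_month and draw_bars, the scale of one # per ten members, and the sample input are all assumptions made for illustration; the chart in the original post was not necessarily produced this way.

```shell
# Count how many times each "year month" line occurs.
# uniq -c only merges adjacent duplicates, so the input must be sorted first.
count_per_month() {
    sort -n | uniq -c
}

# Turn "count year month" lines into a simple text bar chart,
# drawing one # for every 10 members (rounded up).
draw_bars() {
    awk '{
        bar = ""
        for (i = 0; i < $1; i += 10) bar = bar "#"
        printf "%s-%s %4d %s\n", $2, $3, $1, bar
    }'
}

# Hypothetical sample input in "year month" form.
printf '2010 11\n2009 10\n2010 11\n2010 10\n' | count_per_month | draw_bars
```

sort -n compares only the leading number (the year); lines with the same year are then ordered by a plain comparison of the whole line, which puts the months in the right order because they are zero-padded.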
[Figure: a histogram showing the number of people that joined each month.]
There is an interesting network effect in the above data. As soon as the site grew past a critical point, there was an explosion of new members (which took the membership to just under half of the organisation), and then sign-ups slowed and became almost constant.
However, what I wanted to show in this post was how, with a basic knowledge of Unix tools, it is possible to do some reasonably advanced analytics on what may initially seem to be quite unstructured data.