Starting work on blog link analysis
Posted on Saturday, March 25, 2006
at 9:27 AM
A few weeks ago I proposed an analysis of linking patterns between bloggers to measure the frequency of links based on rank. I ran into a few snags, such as getting fed up with some limitations in Ruby and hitting the limit on the number of API calls Technorati allows per day. I've now finished reading up on a few Web languages, and I've decided to give PHP a try. I also plan to work my way through Perl and Python over the next few months. I've done some work with each of them, but that was before I got interested in XML. Technorati has also generously agreed to boost my daily allotment, so I should have no problem getting the blog rank data I need.
My basic goal is to determine whether bloggers tend to link mostly to others with a similar rank (Crosslinking), or to those with higher (Uplinking) or lower (Downlinking) rank. To do this I will first extract the set of bloggers listed at one time on Tech Memeorandum and use them as my sample. Let's call them the target list of bloggers, or target bloggers for short. I can use the Technorati API to determine the rank of each target blogger, and then split them into three groups: a rank of less than 1,000 signifies membership in the A-list, a rank between 1,000 and 10,000 the M-list, and the remainder the Z-list. This analysis alone will be interesting, and I may start compiling longer-term statistics on these results. I'll certainly publish this intermediate result here.
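As a rough illustration of the three-way split, here is a minimal Python sketch. The function names and the sample data are made up for the example, and it assumes that a rank of exactly 10,000 still counts as M-list, which the description above leaves open. The ranks themselves would come from Technorati. No API call is shown here.

```python
from collections import defaultdict

def classify_rank(rank: int) -> str:
    """Map a Technorati rank to a tier (lower rank number = more prominent)."""
    if rank < 1000:
        return "A"
    if rank <= 10000:  # assumption: 10,000 itself is treated as M-list
        return "M"
    return "Z"

def group_by_tier(ranks: dict) -> dict:
    """Group blog URLs by tier, given a {blog_url: rank} mapping."""
    groups = defaultdict(list)
    for blog, rank in ranks.items():
        groups[classify_rank(rank)].append(blog)
    return dict(groups)

# Hypothetical sample data, not real ranks.
sample = {
    "http://a.example/": 120,
    "http://m.example/": 4500,
    "http://z.example/": 250000,
}
tiers = group_by_tier(sample)
```

Counting the length of each list in `tiers` gives the intermediate result mentioned above: how many target bloggers fall into each group.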
The tricky part is determining the rank of the blogs that the target bloggers link to. I can autodiscover the target bloggers' RSS feeds and extract the links from their posts, but Technorati doesn't give a rank for just a post URL. It needs to know the URL of a blog's home page. So what I have to do is autodiscover the RSS feeds of blogs the target bloggers link to, and then look in those RSS feeds to find the home page URL, which can then be used to determine the rank on Technorati. At that point I can count the number of uplinks, downlinks, and crosslinks made by each target blogger. This data can then be analysed.
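The two lookups described above (post page to feed, feed to home page) might look something like this in Python, using only the standard library. This is a sketch under assumptions: it relies on the common convention of advertising feeds with a `<link rel="alternate">` tag in the page head, it only handles RSS 2.0 style feeds with a `channel/link` element, and the function names are invented for the example. Fetching the pages and querying Technorati are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedLinkFinder(HTMLParser):
    """Collects feed URLs advertised via <link rel="alternate" ...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        href = attr.get("href")
        if "alternate" in rel and mime in FEED_TYPES and href:
            # Resolve relative feed URLs against the page URL.
            self.feeds.append(urljoin(self.base_url, href))

def discover_feeds(html, page_url):
    """Return the feed URLs advertised in the HTML of a post or home page."""
    finder = FeedLinkFinder(page_url)
    finder.feed(html)
    return finder.feeds

def home_page_from_rss(rss_text):
    """Return the blog home page URL from an RSS 2.0 feed, or None."""
    root = ET.fromstring(rss_text)
    link = root.find("channel/link")
    if link is None or not link.text:
        return None
    return link.text.strip()

# Hypothetical post page and feed, for illustration only.
page = ('<html><head><link rel="alternate" type="application/rss+xml" '
        'href="/index.xml"></head><body>A post</body></html>')
feeds = discover_feeds(page, "http://blog.example/2006/03/post.html")

rss = ('<rss version="2.0"><channel><title>Example</title>'
       '<link>http://blog.example/</link></channel></rss>')
home = home_page_from_rss(rss)
```

The home page URL recovered this way is what would then be handed to the rank lookup.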
One systematic error is that A-listers can't uplink and Z-listers can't downlink, since there is no group above the A-list or below the Z-list. I may just keep the crosslinking results and use them to see how often A-listers, M-listers and Z-listers crosslink as a percentage of their total links.
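The crosslinking percentage could be computed along these lines. This is a sketch, assuming each link has already been reduced to a pair of tiers (source blog's tier, linked blog's tier) and that "similar rank" means "same tier". The names `link_direction` and `crosslink_share` are invented for the example.

```python
from collections import Counter

# Smaller number = higher tier.
TIER_ORDER = {"A": 0, "M": 1, "Z": 2}

def link_direction(source_tier, target_tier):
    """Classify one link by comparing the tiers of linker and linked blog."""
    s, t = TIER_ORDER[source_tier], TIER_ORDER[target_tier]
    if t < s:
        return "uplink"
    if t > s:
        return "downlink"
    return "crosslink"

def crosslink_share(links):
    """Return {tier: percent of that tier's outgoing links that are crosslinks}."""
    totals = Counter()
    cross = Counter()
    for source_tier, target_tier in links:
        totals[source_tier] += 1
        if link_direction(source_tier, target_tier) == "crosslink":
            cross[source_tier] += 1
    return {tier: 100.0 * cross[tier] / totals[tier] for tier in totals}

# Hypothetical links as (source tier, target tier) pairs.
links = [("A", "A"), ("A", "M"),
         ("M", "M"), ("M", "A"), ("M", "Z"), ("M", "M"),
         ("Z", "A")]
share = crosslink_share(links)
```

Because the share is taken against each tier's own total, the three tiers can be compared even though A-listers have no uplinks and Z-listers no downlinks to count.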
If you had a problem following this plan, here is the basic path I need to follow: Parse Tech Memeorandum -> Target Blog URLs -> Autodiscover Target Blogs RSS Feeds -> Parse Target blog posts -> Links from target blogs to other blog posts -> Autodiscover RSS Feeds of the linked blog posts -> URLs of blogs linked to by target blogs -> Rank of blogs linked to by target blogs.
You can follow the coding on my programming blog, where I'll post all of my source code and links to my intermediate data sets. I could store all the data in a MySQL database, but I want to make it publicly accessible, so I'll store it in XML files on the website. I'll report on any useful results here as well as on the code blog.
Revenge of the M-list
Posted on Wednesday, March 1, 2006
at 7:36 AM
I chose this site's title and tagline because I am fascinated by the social and technical forces causing change on the Internet. One aspect of that fascination is the role of the blogosphere's A-list. At first there was a single list covering all bloggers, but as the blogosphere has segmented along lines of interest, such as tech and politics, distinct A-lists have emerged. But the principle has stayed the same: within each niche there is a small clique that is read the most, and gets the most traffic and links.
It is easy to see how this clique's rank is maintained by reading Robert Scoble's blog (no general link is required, because you are already subscribed). I've never met Scoble, but I'm sure I'd like him in person. I don't think he'd let you not like him in person. He does, however, have an annoying habit of writing "That's great [A-lister]!" posts, a phenomenon known as getting scobled. For example, today he starts his post on the new Tech Memeorandum design with "Hey, Gabe, love the new design of Memeorandum!" For the technical quibblers, Gabe Rivera is not an official A-list blogger, since he rarely blogs, but his site gives him honorary status. (Oops, mustn't piss off Gabe. I love the redesign too, Gabe!)
Scobling is practiced by many A-listers, which is best demonstrated by the huge A-list threads that develop on Memeorandum whenever Mike Arrington has a party or releases a product. Since traffic, and therefore rank, is determined by the number of links, scobling is clearly self-reinforcing. This leads to accusations of A-listers being guilty of gatekeeping. Darwin explained a similar effect in terms of mating habits, where members of the same species have common behaviors, such as times of activity and feeding patterns, that encourage intra-species sexual activity. Ironically, even hate sites reinforce the rank of A-listers.
This raises the question: how did A-listers get on the list in the first place? Darwin's response to the same question, applied to the origin of new species, was that some new competitive pressure or geographic isolation must have emerged to cause the standard lines of association to break down. In the blogosphere this takes the form of a new technological advance in the tech world, or a new scandal in the political world. I've seen many blogosphere analysts use this argument to explain how an A-lister emerged after a burst of blogging surrounding a story of extreme interest.
I think there is also a more gradual phenomenon that I call the revenge of the M-list. When a new area of interest develops, such as what we are now seeing with OPML reading lists, a group of mutually linking bloggers emerges. If one of these bloggers is an A-lister, then the majority of the links point to his or her posts on the subject. If, on the other hand, the inter-linkers are all middle-ranked bloggers (let's call them M-listers), they tend to link to each other fairly liberally. As new people become interested in the subject, they find these clusters of posts (memetracking sites do a great job of revealing M-list clusters), and also link to many of the blogs in the cluster, since there is no one recognizable A-lister to link to exclusively. In time the M-lister who is most prolific on this subject, but not necessarily the best writer or scobler, acquires even more links. Eventually this blogger becomes the authority on the subject, and even A-listers take note and deliver links. The resulting accumulation of links is enough to reach A-list status. Thus we have a slow bubbling up from the middle, rather than the overnight success story so often told by analysts.
There is a simple experiment to test this theory. First, all the posts appearing on a memetracker can be extracted, and the bloggers separated by Technorati rank into A-list and M-list. Then the inter-links can be analyzed to see if M-listers do indeed form clusters of links with other M-listers. We already know that A-listers inter-link, but that can be tested for as well. Finally, these groups of M-list linkers can be followed over time to see if any individuals rise in rank to A-list status. A group of M-listers not in a cluster should also be followed as a control. I'm planning on working on Technorati rank analysis of Tech Memeorandum posts this week, so I'll see if I can contribute some scripts or XML data files that can assist in this experiment.
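One possible way to look for M-list clusters is sketched below in Python. It rests on assumptions the experiment above does not spell out: a "cluster" is taken to be a group of M-list blogs connected by links in either direction (a connected component), and the input is a list of (linking blog, linked blog) pairs plus a tier for each blog. The function name and the sample data are invented.

```python
from collections import defaultdict

def m_list_clusters(links, tiers):
    """Group M-list blogs that are connected to each other by links.

    links: iterable of (source_blog, target_blog) pairs
    tiers: {blog: "A" | "M" | "Z"}
    Returns a list of clusters, each a sorted list of blogs.
    """
    # Keep only M-to-M links and treat them as undirected edges.
    neighbors = defaultdict(set)
    for src, dst in links:
        if src != dst and tiers.get(src) == "M" and tiers.get(dst) == "M":
            neighbors[src].add(dst)
            neighbors[dst].add(src)

    seen = set()
    clusters = []
    for start in sorted(neighbors):
        if start in seen:
            continue
        # Depth-first walk to collect everything reachable from start.
        component = set()
        stack = [start]
        while stack:
            blog = stack.pop()
            if blog in component:
                continue
            component.add(blog)
            stack.extend(neighbors[blog] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters

# Hypothetical blogs and links, for illustration only.
tiers = {"a": "A", "m1": "M", "m2": "M", "m3": "M",
         "m4": "M", "m5": "M", "m6": "M", "z": "Z"}
links = [("m1", "m2"), ("m2", "m3"), ("m4", "m5"),
         ("m1", "a"), ("z", "m1")]
clusters = m_list_clusters(links, tiers)
```

M-listers that appear in no cluster (`m6` in the sample) are candidates for the control group, and re-checking ranks later for the blogs in `clusters` would cover the "follow over time" step.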