Darwinian Web
Adam Green's thoughts on the evolution of the Internet

Posts tagged as: link_analysis

Starting work on blog link analysis

Posted on Saturday, March 25, 2006 at 9:27 AM

A few weeks ago I proposed an analysis of linking patterns between bloggers to measure how link frequency varies with rank. I ran into a few snags: I got fed up with some limitations in Ruby, and I hit the limit on the number of API calls Technorati allows per day. I've now finished reading up on a few Web languages, and I've decided to give PHP a try. I also plan to work my way through Perl and Python over the next few months. I've done some work with each of them, but that was before I got interested in XML. Technorati has also generously agreed to boost my daily allotment, so I should have no problem getting the blog rank data I need.

My basic goal is to determine whether bloggers tend to link mostly to others with a similar rank (Crosslinking), or to those with higher (Uplinking) or lower (Downlinking) rank. To do this I will first extract the set of bloggers listed on Tech Memeorandum at a single point in time and use them as my sample. Let's call them the target list of bloggers, or target bloggers for short. I can use the Technorati API to determine the rank of each target blogger, and then split them into three groups: a rank of less than 1,000 signifies membership in the A-list, a rank between 1,000 and 10,000 the M-list, and the remainder the Z-list. This breakdown alone will be interesting, and I may start compiling longer-term statistics on it. I'll certainly publish this intermediate result here.
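The three-way split could be sketched in Python roughly as follows. This is only an illustration, not the code used for the analysis: the function name `classify_rank` is hypothetical, and treating a rank of exactly 10,000 as M-list is an assumption, since the boundaries above are stated loosely.

```python
def classify_rank(rank):
    """Map a numeric blog rank (1 = highest) to 'A', 'M', or 'Z'.

    Assumption: the M-list range is inclusive at both ends
    (1,000 through 10,000); everything beyond that is Z-list.
    """
    if rank < 1000:
        return "A"
    if rank <= 10000:
        return "M"
    return "Z"


# Hypothetical sample ranks, not real Technorati data.
for sample_rank in (12, 4500, 250000):
    print(sample_rank, classify_rank(sample_rank))
```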

The tricky part is determining the rank of the blogs that the target bloggers link to. I can autodiscover the target bloggers' RSS feeds and extract the links from their posts, but Technorati doesn't give a rank for just a post URL. It needs to know the URL of a blog's home page. So what I have to do is autodiscover the RSS feeds of blogs the target bloggers link to, and then look in those RSS feeds to find the home page URL, which can then be used to determine the rank on Technorati. At that point I can count the number of uplinks, downlinks, and crosslinks made by each target blogger. This data can then be analysed.
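The two lookups in this step, from a post page to its feed and from the feed to the blog's home page, might look something like the sketch below. It uses only the Python standard library and makes several assumptions: the page advertises its feed with a `<link rel="alternate">` tag in the usual autodiscovery style, the feed is RSS with a `<channel><link>` element (Atom feeds would need separate handling), and the names `FeedLinkParser`, `discover_feeds`, and `home_page_from_rss` are made up for illustration. Fetching the pages themselves is left out.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collect feed URLs advertised via <link rel="alternate" ...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rel_values = attr.get("rel", "").lower().split()
        is_feed = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rel_values and is_feed and attr.get("href"):
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def discover_feeds(html_text, page_url):
    """Return the feed URLs advertised by an HTML page."""
    parser = FeedLinkParser(page_url)
    parser.feed(html_text)
    return parser.feeds


def home_page_from_rss(rss_text):
    """Return the blog home page URL from an RSS <channel><link>, or None."""
    root = ET.fromstring(rss_text)
    link = root.find("channel/link")
    if link is None or not link.text:
        return None
    return link.text.strip()
```

The home page URL returned by the second function is what would then be passed to Technorati to look up the rank.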

One source of systematic error is that, by definition, A-listers can't uplink and Z-listers can't downlink. I may just keep the crosslinking results and use them to see how often A-listers, M-listers and Z-listers crosslink as a percentage of their total links.
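Counting link directions and the crosslink percentage could be sketched like this. The tier letters follow the A/M/Z grouping above; the function names and the sample data are hypothetical, and returning 0.0 for a blogger with no ranked links is an assumption.

```python
from collections import Counter

# Lower number = higher-ranked tier.
TIER_ORDER = {"A": 0, "M": 1, "Z": 2}


def link_direction(source_tier, target_tier):
    """Classify one link as 'uplink', 'downlink', or 'crosslink'."""
    source, target = TIER_ORDER[source_tier], TIER_ORDER[target_tier]
    if target < source:
        return "uplink"
    if target > source:
        return "downlink"
    return "crosslink"


def crosslink_share(source_tier, target_tiers):
    """Fraction of a blogger's links that point to their own tier."""
    counts = Counter(link_direction(source_tier, t) for t in target_tiers)
    total = sum(counts.values())
    return counts["crosslink"] / total if total else 0.0


# Made-up example: an M-list blogger linking to five ranked blogs.
print(crosslink_share("M", ["A", "A", "M", "Z", "M"]))
```

Note that `link_direction` can never return "uplink" for an A-list source or "downlink" for a Z-list source, which is the built-in asymmetry described above.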

If you had a problem following this plan, here is the basic path I need to follow: Parse Tech Memeorandum -> Target Blog URLs -> Autodiscover Target Blogs' RSS Feeds -> Parse Target blog posts -> Links from target blogs to other blog posts -> Autodiscover RSS Feeds of the linked blog posts -> URLs of blogs linked to by target blogs -> Rank of blogs linked to by target blogs.

You can follow the coding on my programming blog, where I'll post all of my source code and links to my intermediate data sets. I could store all the data in a MySQL database, but I want to make it publicly accessible, so I'll store it in XML files on the website. I'll report on any useful results here as well as on the code blog.
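As a rough idea of what one of those XML data files might contain, here is a minimal Python sketch using the standard library's ElementTree. The element and attribute names (`blogs`, `blog`, `url`, `rank`) and the sample record are invented for illustration; the real files may be laid out quite differently.

```python
import xml.etree.ElementTree as ET


def blogs_to_xml(blogs):
    """Serialize a list of {'url': ..., 'rank': ...} dicts to an XML string."""
    root = ET.Element("blogs")
    for blog in blogs:
        # XML attribute values must be strings, so the rank is converted.
        ET.SubElement(root, "blog", url=blog["url"], rank=str(blog["rank"]))
    return ET.tostring(root, encoding="unicode")


# Hypothetical record, not real rank data.
print(blogs_to_xml([{"url": "http://example.com/", "rank": 4500}]))
```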