Posts tagged as: technorati
Starting work on blog link analysis
Posted on Saturday, March 25, 2006
at 9:27 AM (permalink)
A few weeks ago I proposed an analysis of linking patterns between bloggers to measure the frequency of links based on rank. I ran into a few snags, such as getting fed up with some limitations in Ruby and hitting a limit on the number of API calls Technorati would allow per day. I've now finished reading up on a few Web languages, and I've decided to give PHP a try. I also plan on working my way through Perl and Python over the next few months. I've done some work with each of them, but that was before I got interested in XML. Technorati has also generously agreed to boost my daily allotment, so I should have no problem getting the blog rank data I need.
My basic goal is to determine whether bloggers tend to link mostly to others with a similar rank (Crosslinking), or to those with higher (Uplinking) or lower (Downlinking) rank. To do this I will first extract the set of bloggers listed at one time on Tech Memeorandum and use them as my sample. Let's call them the target list of bloggers, or target bloggers for short. I can use the Technorati API to determine the rank of each target blogger, and then split them into three groups, with a rank of less than 1,000 signifying membership in the A-list, between 1,000 and 10,000 being the M-list, and the remainder being the Z-list. This analysis alone will be interesting, and I may start compiling longer-term statistics on these results. I'll certainly publish this intermediate result here.
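The three-way split described above could be sketched roughly as follows. This is only an illustration in Python, not the PHP code planned for the project. The function names are hypothetical, and treating both 1,000 and 10,000 as part of the M-list is an assumption, since the post does not say which side the boundaries fall on.

```python
from collections import defaultdict


def rank_tier(rank: int) -> str:
    """Map a Technorati rank to the A/M/Z lists (lower number = better rank)."""
    if rank < 1_000:
        return "A"
    # Assumption: the boundaries 1,000 and 10,000 both count as M-list.
    if rank <= 10_000:
        return "M"
    return "Z"


def group_by_tier(ranks: dict) -> dict:
    """Group blog URLs by tier, given a {url: rank} mapping."""
    groups = defaultdict(list)
    for url, rank in ranks.items():
        groups[rank_tier(rank)].append(url)
    return dict(groups)
```

Calling `group_by_tier` on a mapping of target blog URLs to ranks gives the three groups whose sizes would make up the intermediate result.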
The tricky part is determining the rank of the blogs that the target bloggers link to. I can autodiscover the target bloggers' RSS feeds and extract the links from their posts, but Technorati doesn't give a rank for just a post URL. It needs to know the URL of a blog's home page. So what I have to do is autodiscover the RSS feeds of blogs the target bloggers link to, and then look in those RSS feeds to find the home page URL, which can then be used to determine the rank on Technorati. At that point I can count the number of uplinks, downlinks, and crosslinks made by each target blogger. This data can then be analysed.
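A rough sketch of the two lookups in that chain, using only the Python standard library. The function names are hypothetical. It assumes the page advertises its feed with a `<link rel="alternate" type="application/rss+xml">` tag and that the feed is RSS 2.0, where the home page sits in `channel/link`. Atom feeds and pages without an autodiscovery tag would need separate handling.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin


class FeedLinkParser(HTMLParser):
    """Collect RSS feed URLs advertised through <link rel="alternate"> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rel = attr.get("rel", "").lower().split()
        is_rss = attr.get("type", "").lower() == "application/rss+xml"
        if "alternate" in rel and is_rss and attr.get("href"):
            # Feed hrefs are often relative, so resolve against the page URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def discover_feeds(html, page_url):
    """Return the RSS feed URLs advertised by a page's HTML."""
    parser = FeedLinkParser(page_url)
    parser.feed(html)
    parser.close()
    return parser.feeds


def home_page_from_rss(rss_xml):
    """Return the blog home page URL from an RSS 2.0 document, or None."""
    link = ET.fromstring(rss_xml).find("channel/link")
    if link is None or not link.text:
        return None
    return link.text.strip()
```

Fetching the post and the feed over HTTP is left out. The home page URL returned by the second function is what would then be looked up on Technorati.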
One systematic error is that A-listers can't uplink, and Z-listers can't downlink. I may just keep the crosslinking results and use them to see how often A-listers, M-listers and Z-listers crosslink as a percentage of their total links.
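The counting step could look something like this. It is a sketch with hypothetical names, and it compares the A/M/Z lists rather than raw rank numbers, which is what makes uplinks impossible for A-listers and downlinks impossible for Z-listers.

```python
from collections import Counter

# Position of each list from highest ranked (A) to lowest ranked (Z).
TIER_ORDER = {"A": 0, "M": 1, "Z": 2}


def link_direction(source_tier, linked_tier):
    """Classify one link as an uplink, downlink, or crosslink."""
    diff = TIER_ORDER[linked_tier] - TIER_ORDER[source_tier]
    if diff < 0:
        return "uplink"
    if diff > 0:
        return "downlink"
    return "crosslink"


def crosslink_share(source_tier, linked_tiers):
    """Crosslinks as a percentage of all links made by one blogger."""
    counts = Counter(link_direction(source_tier, t) for t in linked_tiers)
    total = sum(counts.values())
    return 100.0 * counts["crosslink"] / total if total else 0.0
```

For example, `crosslink_share("A", ["A", "M", "Z", "A"])` counts two crosslinks out of four links.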
If you had a problem following this plan, here is the basic path I need to follow: Parse Tech Memeorandum -> Target Blog URLs -> Autodiscover Target Blogs RSS Feeds -> Parse Target blog posts -> Links from target blogs to other blog posts -> Autodiscover RSS Feeds of the linked blog posts -> URLs of blogs linked to by target blogs -> Rank of blogs linked to by target blogs.
You can follow the coding on my programming blog where I'll post all of my source code and links to my intermediate data sets. I could store all the data in a MySQL database, but I want to make it publicly accessible, so I'll store it in XML files on the website. I'll report on any useful results here as well as the code blog.
scrAPI for OPML
Posted on Wednesday, March 22, 2006
at 6:34 AM (permalink)
Now there's a title to conjure with. John Musser has an interesting ProgrammableWeb post on the use of screenscraping as a poor man's API. The idea is to use a script to parse a web page, and then return some specific set of data in an XML format. John credits the idea to Thor Muller, who provides some excellent details on the pros and cons of a scrAPI vs. an official API. Thor in turn recognizes Paul Bausch for coining the term SCRAPI in 2002. John notes that the result of the scrAPI can be returned "in some cleaner XML format." Yeah, like ... uh ... OPML?
A scrAPI is clearly your only alternative for a site that doesn't offer an API, but surely once an API is available you wouldn't need to adopt the scrAPI approach. Except when the API provider has a limit on the number of times you can call the API each day, and refuses to respond to email requests on how to get beyond the limit. I ran into exactly this problem with Technorati when I built my Tech Memeorandum - Technorati mashup. I hit an undocumented limit on daily Technorati API calls, and despite repeated emails to the company, including directly to Dave Sifry, I never got a response. I've been reading up on PHP the last few days, and this sounds like exactly the type of example project I should try out. Yes, it is a terrible kludge, but then what isn't on the Web? I think scrAPIs that return OPML can become a useful way of building a mashup, and I'll experiment with it until a better coder can implement the idea more cleanly.
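To make the idea concrete, here is a minimal sketch of a scrAPI that returns OPML, written in Python rather than PHP and using only the standard library. The names are hypothetical. It simply turns every link on a page into an OPML outline node. A real scraper would pick out only the part of the page it cares about, and would break whenever the site changed its markup.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a href="..."> on a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href leave _href as None and are skipped.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def scrape_to_opml(html, title):
    """Return an OPML document listing the links found in the HTML."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()

    opml = ET.Element("opml", {"version": "2.0"})
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(opml, "body")
    for href, text in collector.links:
        ET.SubElement(body, "outline",
                      {"text": text or href, "type": "link", "url": href})
    return ET.tostring(opml, encoding="unicode")
```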
Update: I guess it isn't too surprising that Dave Sifry would have a vanity feed at Technorati. At least I assume he has, because soon after this post pinged Technorati, I got an email response from him to my earlier messages. If they actually do give me a larger daily allotment of API calls, I'll be able to start posting some Technorati API scripts on my programming blog. If not, I'll just have to start scraping. Getting something done because of a blog post is fun, but it encourages bad behavior. This is probably how Scoble turned into the spoiled kid he can sometimes be. I love it when he starts yelling on his blog, "This blog is broken, and somebody better fix it right now!" I wonder if his wife lets him get away with that stuff at home, "I'm hungry, and someone better feed me right now!"
Update: Sean O'Hagan emailed me to point out the scrape command he wrote for YubNub. I guess this lazyweb thing really works.
Technorati - Memeorandum mashup
Posted on Sunday, March 5, 2006
at 4:10 PM (permalink)
I finally got a chance to explore the idea of storing API results in an OPML file. The window below shows the results in the Optimal Browser. You can also retrieve the raw OPML file here. To create this OPML file I took the list of blogs that my Ruby script has been collecting from Tech Memeorandum and searched the Technorati API with each blog's URL. I then combined the blog's rank, number of blogs linking in, and the most recent blog posts linking to the URL into a single OPML outline. I still have a number of programming details to work out, so I won't be publishing the source code for this for another few days, but I will start describing all the details of the API calls and how to construct the OPML file on my mashup blog. One problem I discovered is that Technorati apparently limits the number of API calls per day, a fact that doesn't seem to be mentioned anywhere on their website. Until I can get someone there to raise this limit, I will have to leave this OPML file as it is. With a higher limit I hope to have it refreshing every hour.
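The shape of such an outline can be sketched as below. This is not the Ruby script mentioned above, only a Python illustration of combining the three pieces of data once they have been fetched. The dictionary keys and the `rank` and `inboundBlogs` attribute names are made up for the example, and no Technorati API call is shown.

```python
import xml.etree.ElementTree as ET


def blog_outline(blog):
    """Build one outline node for a blog, with recent inbound posts as children."""
    node = ET.Element("outline", {
        "text": blog["name"],
        "type": "link",
        "url": blog["url"],
        # Hypothetical attribute names; OPML permits custom outline attributes.
        "rank": str(blog["rank"]),
        "inboundBlogs": str(blog["inbound_blogs"]),
    })
    for post in blog["recent_links"]:
        ET.SubElement(node, "outline",
                      {"text": post["title"], "type": "link", "url": post["url"]})
    return node


def build_opml(title, blogs):
    """Return an OPML document with one top-level outline per blog."""
    opml = ET.Element("opml", {"version": "2.0"})
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(opml, "body")
    for blog in blogs:
        body.append(blog_outline(blog))
    return ET.tostring(opml, encoding="unicode")
```

Regenerating the file every hour would then be a matter of refetching the data and calling `build_opml` again, API limit permitting.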
The coming SAPI war
Posted on Wednesday, December 14, 2005
at 9:10 AM (permalink)
If the Web, at least the interesting part of it, is going to look like a huge collection of search engine items, then everyone is going to start building search engines. It's easy to predict a two-tier business model in the future, with major search engines offering API access to their code and data, and a second layer of application developers building cool mashups, remixes, aggregates, whatever, on top of this worldwide database. A major choke point is going to be the Search API (SAPI) used to access this data. It is far too early to tell which API will win, but it is in the adoption of a de facto standard SAPI that the war will be fought.
There is a tradition within the computer trade press to describe such competitive situations as wars. The wide range of military metaphors this provides makes it an obvious choice. Headline writers alone are immensely grateful for its use. We have had spreadsheet wars, and OS wars, and browser wars. Now we can have a search engine war with SAPI as the ammunition.
Search engines have long been tools of individual habit and taste. I use Google; my wife uses Yahoo! There are toolbar schemes to lock people into one search engine, but users are still able to migrate or use multiple engines of their choosing. If there is a viable business model for an application layer on top of search engines, something still to be proven, then the battle for SAPI lock-in will become brutal, because it will make customer migration or multiple use more difficult. Users won't know, or care, what search engine is running under the hood. To be Web 1.0 about it, SAPI will become the superglue of search engine stickiness.