Trackback -- Seth Gordon -- ropine.com

Notes on the Background of Back-Links

In the Beginning was the Link

Every click on a hyperlink sends a message with a return address. That address, the referrer, is the URL of the page containing the link. It is also a compromise between two different visions of hypertext.

When Tim Berners-Lee was designing the protocols for the World-Wide Web, one issue he contemplated was: Should links be one-way or two-way? On the one hand, since a link communicates information about the relationship between two documents, readers of either document might want to know about the relationship, so links should work in both directions. On the other hand, if two documents are being managed by two different authorities (e.g., The New York Times and Spinsanity), a protocol that requires two-way links would require both authorities to cooperate with each other to maintain the links.

ENQUIRE, an earlier hypertext system that Berners-Lee wrote, used two-way links, but this was a system where all the linked information was on one server. For the Web, Berners-Lee adopted a technique proposed by Phillip Hallam-Baker: use one-way links, but allow the server to find out where a page is being linked from. The server of the page being linked to can use this information, or ignore it.

Using the Referrer

Any system for CGI scripting has a way to access the referrer (or, as the HTTP standard calls it, the "referer"). In Perl, for example, using the CGI module, you can do it like this...

#!/usr/bin/perl
use CGI qw/:standard/;
print header("text/plain"),
  "To get here, you clicked on a link from ", referer();

...to generate a plain-text page that names the page the visitor came from. This is how, for example, many custom 404 pages point back to the page with the dead link.
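The CGI module is a convenience here, not a requirement: under the CGI interface the server passes each request header to the script as an environment variable, so the Referer header arrives as HTTP_REFERER. Here is a minimal sketch of the same page without the module. The helper name referrer_from_env is made up for illustration, and since browsers may omit or falsify the header, the value is best treated as a hint.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the referrer from a CGI-style environment
# hash, or an empty string when the browser sent no Referer header.
sub referrer_from_env {
    my ($env) = @_;
    my $r = $env->{HTTP_REFERER};
    return defined $r ? $r : '';
}

my $from    = referrer_from_env(\%ENV);
my $message = ($from ne '')
    ? "To get here, you clicked on a link from $from\n"
    : "Your browser did not say where you came from.\n";

# A CGI response is a header block, a blank line, and then the body.
print "Content-Type: text/plain\r\n\r\n", $message;
```

Taking the environment as an argument, instead of reading %ENV inside the helper, makes it easy to try the function outside a Web server.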

If you don't mind information from complete strangers filling up your hard drive, you can write a slightly more complicated script, and make a permanent record of pages linked to your own:

#!/usr/bin/perl -T
use strict;
use warnings;
use CGI qw/:standard/;

# Ignore URLs with invalid characters. The regex capture also untaints
# the value, which taint mode (-T) requires before it is written to disk.
my $r = '';
if ((referer() // '') =~ m|^([-\w?.+\%/:&]+)$|) {
  $r = $1 . "\n";
}

# Read the referrers recorded so far; the file may not exist on the first run.
my @referers;
if (-e './referers') {
  open my $in, '<', './referers'
    or die "Can't open the referers file for input: $!";
  @referers = <$in>;
  close $in;
}

# Append the new referrer unless it is already on the list.
if ($r && !grep { $_ eq $r } @referers) {
  open my $out, '>>', './referers'
    or die "Can't open the referers file for output: $!";
  print $out $r;
  close $out;
  push @referers, $r;
}

print header(), start_html("Example 2"),
  p("Links to this page have been followed from the following URLs:"),
  ul(li(\@referers)), end_html();

Now, if you create a Web page with a link to this script, and click on the link, your page's URL will appear in the list.

Using the Referrer, Practically

That script is fine as a proof of concept, but it's not very useful. If our page accumulates inbound links over a long period of time, the only way to trim the list would be to manually edit the "referers" file. If a link to our page appears in a Slashdot comment, which can be viewed through a variety of URLs (as a standalone comment, as part of a threaded comments page, as part of a nested comments page, as part of a nested comments page with threshold 2, etc.), then every one of those URLs could appear in our list. Most importantly, since this is a CGI script, the only way we could append a list of inbound links to a regular HTML page is by turning that page into a CGI script.
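The many-URLs problem, at least, can be softened by reducing each referrer to a canonical form before recording it. The sketch below is a heuristic, not a cure: it lowercases the scheme and host, drops the fragment, and discards any query parameters that the caller declares to be "view only". The function names and the parameter names in the example (mode, threshold, sid) are invented for illustration; which parameters are safe to ignore has to be decided site by site.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a referrer URL to a canonical key.
# @ignore lists query parameters that only select a view of the page.
sub canonical_referrer {
    my ($url, @ignore) = @_;
    my %ignore = map { $_ => 1 } @ignore;
    $url = '' unless defined $url;
    $url =~ s/#.*\z//s;                       # drop the fragment
    return '' unless length $url;
    my ($base, $query) = split /\?/, $url, 2;
    # The scheme and host are case-insensitive, so lowercase them.
    $base =~ s{\A([a-z][a-z0-9+.-]*://[^/]+)}{\L$1}i;
    return $base unless defined $query && length $query;
    my @keep;
    for my $pair (split /[&;]/, $query) {
        next unless length $pair;
        my ($name) = split /=/, $pair, 2;
        push @keep, $pair unless $ignore{$name};
    }
    # Sort what is left so that parameter order does not matter.
    @keep = sort @keep;
    return @keep ? $base . '?' . join('&', @keep) : $base;
}

# Collapse raw referrers into unique canonical keys, in first-seen order.
sub unique_referrers {
    my ($ignore, @urls) = @_;
    my %seen;
    return grep { !$seen{$_}++ }
           map  { canonical_referrer($_, @$ignore) } @urls;
}

my @raw = (
    'http://Example.org/comments.pl?sid=1&mode=thread',
    'http://example.org/comments.pl?mode=nested&threshold=2&sid=1',
    'http://example.org/comments.pl?sid=1#c42',
);
print "$_\n" for unique_referrers([qw(mode threshold)], @raw);
```

All three sample URLs reduce to the same key, so the list would show one entry where the earlier script would show three.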

To work around these problems, a number of programmers have written self-contained scripts that keep track of inbound links and display them on your page, with a minimum of effort on your part. For example, Stephen Downes has a system that is based on Javascript: you add a little Javascript to your page, and when someone reads the page in a browser that supports Javascript, the browser calls a CGI script on Downes's site and shows a list of links where the HTML source had the Javascript. So, even though the server for this page does not support CGI scripts, a browser with Javascript enabled can still show a list of links to this page.

Log files provide another source of referral information. If your Web server uses the Combined Log Format, its log file should have lines like

192.168.1.1 - - [01/Apr/2003:14:58:54 -0500] "GET /tb/example2.pl HTTP/1.1" 200 390 "http://ropine.com/essays/trackback.html" "Mozilla/5.0 Galeon/1.2.5 (X11; Linux i686; U;) Gecko/20020623 Debian/1.2.5-0.woody.1"

The second-to-last field, the quoted URL "http://ropine.com/essays/trackback.html", is the referrer.

Mark Pilgrim has written PySiteStats, a Python system that generates reports from access logs. Among other things, this system can produce a list of referrers.
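To give a sense of what such a report involves, here is a minimal Perl sketch that pulls the referrer out of Combined Log Format lines and counts how often each one appears. It is not how PySiteStats works internally; the function names are made up for illustration, and the pattern assumes the standard field order shown in the sample line above.

```perl
use strict;
use warnings;

# Extract the referrer from one Combined Log Format line. Returns undef
# when the line does not match or the referrer is "-" (none was sent).
sub referrer_from_log_line {
    my ($line) = @_;
    return unless $line =~ m{
        \A \S+ \s+ \S+ \s+ \S+ \s+      # host, ident, user
        \[ [^\]]+ \] \s+                # [timestamp]
        " (?: [^"\\] | \\. )* " \s+     # "request line"
        \d{3} \s+ (?: \d+ | - ) \s+     # status code, bytes sent
        " ( (?: [^"\\] | \\. )* ) "     # "referrer" (captured)
    }x;
    my $ref = $1;
    return if $ref eq '-';
    return $ref;
}

# Count how often each referrer appears in a list of log lines.
sub count_referrers {
    my %count;
    for my $line (@_) {
        my $ref = referrer_from_log_line($line);
        $count{$ref}++ if defined $ref;
    }
    return \%count;
}

my $sample = '192.168.1.1 - - [01/Apr/2003:14:58:54 -0500] '
    . '"GET /tb/example2.pl HTTP/1.1" 200 390 '
    . '"http://ropine.com/essays/trackback.html" '
    . '"Mozilla/5.0 Galeon/1.2.5 (X11; Linux i686; U;) Gecko/20020623 Debian/1.2.5-0.woody.1"';

my $counts = count_referrers($sample);
print "$counts->{$_}  $_\n" for sort keys %$counts;
```

Unlike the CGI approach, this works for ordinary static HTML pages, because the Web server has already written the referrer down.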

From the Referrer to Trackback

The folks at Movable Type have created a lot of buzz in the weblog community with their Trackback standard. In a weblog that uses Trackback, every weblog entry or category has a "Trackback URL". This URL doesn't point to a normal Web page, or even to a CGI script that outputs HTML. Instead, the authors of other weblogs can use tools on their end (such as, ahem, Movable Type's blog-entry editor) to "ping" this URL with information about the relationship between two blog entries.
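As I read the published Trackback specification, a ping is an ordinary HTTP POST to the Trackback URL with a form-encoded body, and the reply is a tiny XML document in which an error value of 0 means success. The sketch below builds such a body and interprets such a reply without touching the network. The function names and the example URL are hypothetical; the field names url, title, excerpt, and blog_name are the ones the specification defines.

```perl
use strict;
use warnings;

# Percent-encode a string for an application/x-www-form-urlencoded body.
sub form_encode {
    my ($s) = @_;
    utf8::encode($s) if utf8::is_utf8($s);    # work on bytes
    $s =~ s/([^A-Za-z0-9\-_.~ ])/sprintf('%%%02X', ord $1)/ge;
    $s =~ tr/ /+/;
    return $s;
}

# Build the POST body for a ping. Only "url" (the permalink of the entry
# doing the commenting) is required; the other fields are optional.
sub trackback_ping_body {
    my (%ping) = @_;
    die "a ping needs a url\n"
        unless defined $ping{url} && length $ping{url};
    return join '&',
        map  { $_ . '=' . form_encode($ping{$_}) }
        grep { defined $ping{$_} }
        qw(url title excerpt blog_name);
}

# Interpret the XML reply: <error>0</error> means success; anything else
# is a failure, optionally explained by a <message> element.
sub trackback_reply_ok {
    my ($xml) = @_;
    my ($code) = $xml =~ m{<error>\s*(\d+)\s*</error>};
    my ($msg)  = $xml =~ m{<message>(.*?)</message>}s;
    return (defined $code && $code == 0, $msg);
}

my $body = trackback_ping_body(
    url       => 'http://example.org/blog/2003/04/my-reply.html',
    title     => 'My reply',
    excerpt   => 'I disagree, & here is why...',
    blog_name => 'Example Blog',
);
print "POST body: $body\n";
```

A real tool would hand this body to an HTTP client and pass the response text to trackback_reply_ok; the regular expressions are adequate only because the reply format is so small.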

Why would anyone go through all this trouble to announce that they've commented on someone else's weblog, when they have HTTP referrers for free? The Trackback protocol has a number of features that ordinary referrers don't have.

If the weblogs at both ends have Trackback configured properly, commenting on another weblog using the Trackback protocol is as easy as replying to an email message.
With most weblog systems, each entry in a weblog can be read through at least three URLs: one for the weblog's main page, one for the entry's "permalink", and one for the weekly or monthly archive that contains the entry. A referrer-based system cannot reliably tell that three URLs containing links to a certain page are actually three views of the same content. A Trackback ping will send the permalink for the entry containing the comment, and nothing else.
The Trackback ping may also contain an excerpt from the comment being made, the title of the entry with the comment, and the title of the blog with the comment. The referrer contains only a URL.
Weblog categories, as well as individual entries, can contain Trackback URLs. For example, The Red Kitchen, a cooks' weblog, has a Trackback-based system for guest recipe submissions, as well as Trackback URLs attached to individual recipes.
A referrer can announce a link between two Web pages, but not the intention of the person making a link. A Trackback ping sends a clear signal: "I made a comment on what you wrote, and I want you to see it." It's the difference between observing who can hear what you say, and knowing who has joined your conversation.
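The receiving end can be sketched the same way. Because every ping names exactly one permalink, the receiver can key its records on that URL, so a repeated ping updates a single record where a referrer log would accumulate near-duplicates. This is an illustration under stated assumptions, not Movable Type's implementation: the function names are invented, the store is an in-memory hash where a real weblog would use a file or database, and the reply follows the XML format in the Trackback specification as I understand it.

```perl
use strict;
use warnings;

# Escape text for inclusion in the XML reply.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Build the XML reply: error 0 for success, 1 plus a message for failure.
sub trackback_reply {
    my ($error_message) = @_;
    my $inner = defined $error_message
        ? "<error>1</error>\n<message>" . xml_escape($error_message) . "</message>"
        : "<error>0</error>";
    return qq{<?xml version="1.0" encoding="utf-8"?>\n<response>\n$inner\n</response>\n};
}

# Record one ping in %$store, keyed by the sender's permalink, and
# return the XML reply that should be sent back to the sender.
sub receive_ping {
    my ($store, %ping) = @_;
    my $url = $ping{url};
    return trackback_reply('Missing or malformed parameter: url')
        unless defined $url && $url =~ m{\Ahttps?://\S+\z};
    $store->{$url} = {
        title     => $ping{title}     // $url,
        excerpt   => $ping{excerpt}   // '',
        blog_name => $ping{blog_name} // '',
    };
    return trackback_reply();
}

my %pings;
print receive_ping(\%pings,
    url       => 'http://example.org/blog/2003/04/my-reply.html',
    title     => 'My reply',
    blog_name => 'Example Blog');
print receive_ping(\%pings, title => 'No permalink');
print scalar(keys %pings), " ping(s) stored\n";
```

The title, excerpt, and blog name stored with each permalink are what let a weblog display a readable list of replies, something a bare referrer URL cannot supply.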

On Beyond Trackback

Back at the dawn of the Web, when Berners-Lee wondered whether or not to allow two-way links, he also wondered: Should there be different types of links? Again, when he designed HTTP, he chose not to imitate the design of his earlier system.

The two-way links of ENQUIRE could describe six kinds of relationships:

uses / used-by
includes / part-of
made / made-by
describes / described-by
background / detail
similar-to / other

But in HTTP, a link is a link is a link, and any information about what kind of link has been made has to be carried "out of band".

The Trackback protocol provides one way to carry this information—to announce that one page comments on or is commented on by another. More importantly, the weblogging tools that implement Trackback make it easy for users to exploit the protocol.

What's next? What other protocols and authoring tools can enrich our experience of reading, writing, and linking on the Web? Which of these tools will actually become popular with the users, and which will remain stuck in the land of "if only everyone did this"? Stay tuned, true believers....

Seth Gordon // sethg@ropine.com // April 2003