Crawling for frames

From the edit.tf development blog. Comments are welcome via email.


Another benefit of having the data in the URL is that, to some extent at least, teletext frames will be scattered across the web in the form of edit.tf (and zxnet.co.uk) URLs. These URLs are trivial to identify and easy to convert into frames.
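To illustrate how such URLs might be picked out of arbitrary page text, here is a minimal Python sketch. It assumes only that a frame URL is an http or https URL on one of the two hosts named above and that it carries a non-empty fragment. The www variants in the host list, the function name find_frame_urls and the trailing-punctuation rule are illustrative assumptions, and the sketch makes no attempt to decode the fragment into a frame.

```python
import re
from urllib.parse import urlsplit

# Hosts named in the post. The "www." variants are an assumption.
FRAME_HOSTS = {"edit.tf", "www.edit.tf", "zxnet.co.uk", "www.zxnet.co.uk"}

# Candidate URLs: a scheme followed by a run of characters that are not
# whitespace, quotes or angle brackets (common delimiters in HTML and text).
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def find_frame_urls(text):
    """Return (url, fragment) pairs for candidate frame URLs found in text."""
    found = []
    for match in URL_RE.finditer(text):
        # Heuristic: drop punctuation that usually belongs to the sentence
        # around the URL rather than to the URL itself.
        url = match.group(0).rstrip(".,;)")
        try:
            parts = urlsplit(url)
        except ValueError:
            continue  # malformed URL, skip it
        host = (parts.hostname or "").lower()
        if host in FRAME_HOSTS and parts.fragment:
            found.append((url, parts.fragment))
    return found


if __name__ == "__main__":
    # "0:EXAMPLE" is a placeholder fragment, not real frame data.
    sample = 'A frame: <a href="http://edit.tf/#0:EXAMPLE">link</a>'
    for url, fragment in find_frame_urls(sample):
        print(url, fragment)
```

The fragment is returned unchanged so that whatever converts it into a frame can be kept separate from the harvesting step.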

The Google search operator site: seems to promise to return all URLs in Google's index for a given site or domain, but in our case this doesn't work, perhaps because the frame data is carried in the URL fragment identifier (the part of the URL after the # symbol). This would make sense: the conventional use of the fragment identifier is to identify a part of a document, so a search engine has little reason to index URLs that differ only in their fragment as separate pages. DuckDuckGo and Bing do not return these URLs either.

A web archivist could run a web crawler, or a network of web crawlers, to wander over the web looking for these URLs and harvesting any it finds. Though the web is big, perhaps teletext frames can be found on a small and predictable part of it, and a heuristic could guide the crawler. The HTTP referer field in edit.tf's server logs could be used to seed the crawler or to inform such a heuristic. The field is limited: the fragment identifier is removed before it is sent, and the browser may not pass the field at all because of privacy and security threats. What remains could nevertheless point to pages that link to the editor.
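As a rough sketch of the seeding idea, the following Python function takes referer values that have already been pulled out of a server log and ranks the referring hosts by how often they appear. The function name seed_hosts, the own_hosts parameter and the treatment of "-" as an empty referer (as written by common access-log formats) are assumptions made for illustration; nothing here reflects how edit.tf's logs are actually stored.

```python
from collections import Counter
from urllib.parse import urldefrag, urlsplit


def seed_hosts(referers, own_hosts=("edit.tf", "zxnet.co.uk")):
    """Rank referring hosts by frequency, most frequent first.

    referers: iterable of raw referer strings from a server log.
    own_hosts: hosts to ignore because they are the editor itself (assumed).
    """
    counts = Counter()
    for ref in referers:
        if not ref or ref == "-":
            continue  # the browser sent no referer
        try:
            # Browsers normally strip the fragment already; drop it anyway.
            url, _fragment = urldefrag(ref)
            host = (urlsplit(url).hostname or "").lower()
        except ValueError:
            continue  # malformed referer, skip it
        if not host or host in own_hosts:
            continue
        counts[host] += 1
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical referer values, for illustration only.
    sample = [
        "-",
        "http://forum.example/thread/1",
        "http://forum.example/thread/2#p3",
        "http://blog.example/post",
    ]
    for host, hits in seed_hosts(sample):
        print(host, hits)
```

A crawler could start from the highest-ranked hosts, on the assumption that a site which has linked to one frame is more likely than average to link to others.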