OK, well I’ve spent some time reviewing this new specification.
It’s a mixture of a couple of useful new qualifiers to the old robots.txt standard and a lot of anally-retentive control-freakery written by people who still don’t get “the internet”.
The good points:
- Extends robots.txt to allow site owners to define sets of folders/files by regular expression and apply instructions to bots “en masse” – nice idea.
- Extends robots.txt to add time-based crawl permissions (e.g. you can specify that the folder “/news/2007/” is only indexed until the end of 2007 – after which you’d want search engines to drop the last year’s news from their indices and start crawling the 2008 news instead).
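To make those two good ideas concrete, here’s roughly what they could look like grafted onto plain robots.txt. The field names below (`Disallow-pattern`, `Crawl-until`) are entirely hypothetical – ACAP’s actual vocabulary is its own – this is just a sketch of the useful bits:

```
User-agent: *
# Hypothetical pattern-matching rule: one line covers every draft page
Disallow-pattern: /drafts/.*\.html$

# Hypothetical time-limited permission: crawl the 2007 news folder
# only until year-end, then it falls out of the index naturally
Allow: /news/2007/
Crawl-until: 2007-12-31
```

That’s the sort of thing a sane extension to robots.txt could deliver without any of the baggage below.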
The bloody terrible points (control-freakery and lack of understanding):
- Tries to introduce a new protocol to disallow an aggregator site (e.g. Google) from displaying a thumbnail of a page – Why?!?
- Tries to introduce a new protocol to disallow an aggregator site (e.g. Google) from displaying a snippet of content – Again, why?!?
- Tries to introduce a new protocol that allows publishers to set the exact length (to the character) of any snippet shown on an aggregator site. Oh dear god, why are publishers so anal about this!
- Tries to introduce a new protocol which demands that aggregators don’t parse the content they’ve crawled outside of a specific context. No, no, no!
- Tries to embed take-down notices to crawlers within robots.txt. Oh, FFS – this is publishing at its most over-controlling (and therefore self-destructive).
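And for flavour, here’s what the control-freak end of the spec amounts to. Again, these field names are made up for illustration – the real ACAP syntax differs – but this is the gist of what publishers are demanding:

```
User-agent: *
Allow: /articles/
# Hypothetical versions of the restrictive directives:
Display-thumbnail: no
Display-snippet: no
Max-snippet-length: 140          # snippet length dictated to the character
Usage-context: search-results-only
Take-down: /articles/old-story.html
```

Every one of those lines is the publisher trying to micromanage what happens to a page *after* it has been crawled – which is exactly the part of the web they don’t control.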
Thankfully there is no legal requirement that the search engines take any notice whatsoever of this new “technical framework”.
I kinda hope that the tech community takes the good points out of this spec (pattern-matching, time constraints etc.) and just upgrades the good old robots.txt standard, ignoring the worst excesses of control-freakery that the publishing industry has slipped in.
A bit of blog reaction to ACAP:
“…Boiled down to the bottom line, I can’t help but sense that the intended shift in responsibility that appears to be associated with ACAP could lead to an entire new wave of litigation and possible information restrictions — enriching lawyers to be sure — but quite possibly being a significant negative development for Internet users in general.”
“…The new protocol focuses entirely on the desires of publishers, and only those publishers who fear what web users will do with the content if they don’t retain control over it at every point…ACAP might well be adopted by a lot of publishers (although not, so far, by any search engines anyone has heard of), but we’ll all be a little poorer as a result.”
“It seems like a weak electronic online DRM – with the vague promise that in the future more ‘stuff’ will be published, precisely because you can do less with it…”
In the interests of fairness I tried to find a positive article about ACAP, but there’s absolutely nothing.
Luckily this ACAP protocol does not have the support of the search engines and so is likely to fail and die.
The ACAP site does brag that “Major search engines are engaged in the project. Exalead, the world’s fourth largest search engine has been a full participant in the project.”
Exalead? Who the hell are they? If you can’t claim the involvement of Google and/or Yahoo in any search-engine-specific project then you’re dead on your feet.
And in the case of ACAP, I’m glad.