Google is currently engaged in a legal battle with the Belgian newspaper group Copiepresse. Well, I say engaged, but Google lost the first round because they appear not to have defended themselves. I assume that they'll appeal and that litigation will continue.
Copiepresse were quick to press their newly acquired advantage, applying pressure to MSN, which resulted in the removal of their content from MSN's indexes.
The gist of the situation (as I understand it) is that Copiepresse are attempting to use the legal system to force Google (et al) to pay them for indexing the content published on the websites of Copiepresse's member newspapers, and for republishing headlines and summaries on Google News and in search result pages.
Their claims involve copyright infringement, breaches of European law regarding caching, and the assertion that the publishers never granted permission to access the content.
Copiepresse didn't ask Google to remove their content, nor did they use the Robots Exclusion Standard (the accepted method of stating which content is to be indexed) to specify which content they wanted in, and out of, the index. I'm summarising here, but it seems this is largely because they don't want their content removed from search engine indexes - they want a cut of the money that search engines make from advertising.
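For anyone unfamiliar with the mechanism they declined to use: a well-behaved crawler fetches robots.txt and checks it before requesting anything else. Here's a minimal sketch using Python's standard library - the rules and URLs are hypothetical, purely for illustration.

    import urllib.robotparser

    # Suppose a publisher served a robots.txt containing these two lines;
    # a compliant crawler parses them before fetching any page.
    rp = urllib.robotparser.RobotFileParser()
    rp.parse([
        "User-agent: Googlebot",
        "Disallow: /archive/",
    ])

    print(rp.can_fetch("Googlebot", "http://example-paper.be/archive/story.html"))  # False
    print(rp.can_fetch("Googlebot", "http://example-paper.be/frontpage.html"))      # True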
Google is unlikely to pay for access to this content.
The World Association of Newspapers (WAN), the European Publishers Council (EPC) and the International Publishers Association (IPA) have developed ACAP, the Automated Content Access Protocol. This standard aims to provide an extended set of rules for access to content, with better resolution than the Robots Exclusion Standard (RES). One feature will be the ability to specify which content/media should be paid for.
ACAP's FAQ states that existing protocols can't be used (it refers to the RES) because they don't permit conditional access - but this is misleading, and I'll explain why shortly. Read the FAQs yourself and you'll spot a common trend: they're a tad defensive about ACAP being a financially motivated standard.
Copiepresse, WAN, the IPA and the rest all represent businesses that publish their content online.
Specifically, they publish via websites, all of which speak the HyperText Transfer Protocol (HTTP).
HTTP is more than just a delivery system for documents and media: it covers access control, caching, content negotiation, and payment. Yes, payment - you knew that, right?
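You can see most of these mechanisms in the headers of any response. A trivial sketch (the URL is just a placeholder):

    import urllib.request

    # Every response a crawler receives already carries HTTP's machinery.
    with urllib.request.urlopen("http://example.com/") as resp:
        print(resp.status)                        # 200, 401, 402, ...
        print(resp.headers.get("Cache-Control"))  # caching directives, if any
        print(resp.headers.get("Content-Type"))   # outcome of content negotiation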
Amongst Copiepresse's legal arguments were references to their content being cached illegally, or at least inappropriately, and to their content being used without their permission. I suggest that they already control the mechanisms they need to manage those problems, and that their case is therefore largely spurious - at most an attempt to limit the damage done by an earlier mistake of theirs, in not properly controlling access to the resources.
"402 Payment Required" is the error code returned to the client when access to the requested content requires payment. 402 is currently marked for future use - but the time for that future is rapidly approaching, I suggest the content owners push for the finalisation of protocols associated with this error code, rather than announcing new standards like ACAP.
no-cache, no-store, max-age - HTTP has quite complex cache-control directives available to it, and I understand that most search engines obey these directives.
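Serving a story that nobody is permitted to keep a copy of is a one-header affair. Another sketch along the same hypothetical lines:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class NoCacheHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = b"Today's headline...\n"
            self.send_response(200)
            self.send_header("Cache-Control", "no-store")  # keep no copies at all
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(("", 8001), NoCacheHandler).serve_forever()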
"401 Unauthorized" is the header returned when the client has requested a URL for a protected resource, which it has not already been authorised to access. It's the key to HTTP based authentication and authorization.
Cache-control and the 401 mechanism are in widespread use; many websites use the standard HTTP status codes to communicate to the client that an access control system is in place. In fact, HTTP can fairly be said to underpin all communication on the web (N.B. the web, not the internet).
The fact that this is the case, and that by using the HTTP standard both parties (client and server) are agreeing to communicate in a specific way, governed by specific rules, led me to an intriguing thought.
After reading Groklaw's interview with Copiepresse's representative, it seemed to me that everyone is focusing on the robots.txt file (from the RES): content owners are unsatisfied with the resolution and detail it's able to specify, while content readers can claim a 'right to access' because they ask for permission by reading the file first.
We have a situation where the server (i.e. the content owner) and the client have agreed to communicate in a specific way, on a request-response basis, using a de facto standard. By dint of the protocol in use, the server has complete control over access to all of its resources. The access control is already available, and already in use by the server - there's no need for ACAP (a replacement for the RES).
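To underline the point, one last sketch: with nothing but HTTP, a publisher can already turn a specific crawler away on every request. The policy below is a made-up example; "Googlebot" is simply the user-agent string Google's crawler advertises.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class SelectiveHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Deny the crawler, serve everyone else.
            if "Googlebot" in self.headers.get("User-Agent", ""):
                self.send_response(403, "Forbidden")
                self.end_headers()
                return
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"Welcome, human reader.\n")

    HTTPServer(("", 8003), SelectiveHandler).serve_forever()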
I suggest that it's the content owners' responsibility to ensure that they monitor and control access to their resources, given that they are already using tools which permit them to do so.
I don't believe that Copiepresse have a case (though admittedly I'm no lawyer, or judge), because they have already put an automated system in place that governs access to their content. The very nature of HTTP is such that they can deny access, specify that payment is required, or state that a resource must not be cached - and then determine, per request, whether a resource should be delivered to the client at all.
Search engines are not accessing resources using illegal methods; they are only exercising access that has already been negotiated, and permission already given.