The web and the structure of information

In the beginning, the web was generally thought of as a sort of library, where you could go to any shelf and pull a piece of paper down and look at it. In its simplest form, that is what it still is: type in a URL, get a document.

However, that form of web is all but useless. If you don't know the URL for something, you have no way of getting to it short of trying random URLs hoping to stumble across it by chance. That's where indexes come in.

Indexes arrange web content in some meaningful way, and provide a user with a method of searching for and finding specific content. The best known is Google, but there are others. Time was, there were probably ten different ones. Some have disappeared, while others have changed forms and now use Google to power their searches.

What indexes provide is an entry point. Most websites are not single pages, but are rather collections of pages with internal and external links. But you need an entry point, and you get it either from a link provided elsewhere, or by searching an index such as Google.

Once you're in, you can follow the links around. If there are links out, you can follow those into other linked webs of information. This is powerful technology, the more powerful because it is simple.

There is a tendency to think that everything that exists has been indexed by Google. This is not true, but it might as well be. Estimates of the amount of unindexed space floating around, all the personal websites people have built that Google doesn't know about, vary, but smart people guess that less than half of the web is indexed.

That's a lot of information that, effectively, doesn't exist. I'll be the first to admit that well over 90% of it is probably puerile, but that's not the point, is it? Well over 90% of the indexed space is, as well. But it's all useful to somebody. That it isn't indexed isn't Google's fault, because Google has no way of knowing about content without some form of reference to it. Google follows links, and Google will index a site that you submit to it, but if there aren't any links from a site Google has indexed and the site hasn't been submitted, Google has no way of finding it.
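That link-following behaviour is easy to sketch. Here's a toy crawl over a hypothetical link graph (all the hostnames are invented for illustration): a breadth-first walk from a seed page. Any page nothing links to never enters the index, no matter how many links it has going out.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
# "island.example" has no inbound links, so no crawl can ever reach it.
links = {
    "seed.example":   ["a.example", "b.example"],
    "a.example":      ["b.example", "c.example"],
    "b.example":      [],
    "c.example":      ["seed.example"],
    "island.example": ["a.example"],  # links out, but nothing links in
}

def crawl(seed):
    """Breadth-first walk of the link graph; returns the set of pages 'indexed'."""
    indexed, queue = set(), deque([seed])
    while queue:
        page = queue.popleft()
        if page in indexed:
            continue
        indexed.add(page)
        queue.extend(links.get(page, []))
    return indexed

print(sorted(crawl("seed.example")))
# island.example never appears: the crawler has no way to find it
```

Submitting a site to the engine is just manually adding it to the seed set; without that, or an inbound link, the page stays invisible.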

This, in itself, isn't a problem. The problem occurs when it is assumed that the indexed space is all there is. The conclusion is that if a thing doesn't exist on Google, then it doesn't exist at all. Those of us who have been around from the beginning remember when you would search with several search engines: you might not find something on the first two you tried, but it might be there on the third. With Google's ascendance (Google was around for quite a while as a not-very-good engine), the others are disappearing. I think AltaVista, long the first choice, now uses Google to power its searches. Lycos is probably similar.

The problem isn't search engines, though. It's the structure of information and systems of information. The web is a vast unordered sea of info-chaos (this is a good thing, BTW), in many ways the ultimate in democracy. It is no longer the exclusive playground of the privileged. A used computer and a dialup account can be had for less than the cost of cable television. If I had to choose between them, I know which I'd choose. I can get Survivor results on the CBS site.

As information systems go (which is what the web is, really), the web has little structure. Pretty much anyone can add content to it at any time: blog comments, for example. We (the web community as a whole) depend on search engines to organize that content for us. Increasingly, and now almost overwhelmingly, that engine is Google.

I hold that the dangers of Google are real. Having one world search engine carries with it more risk than having one world supplier of operating systems.

That's a provocative statement.

  • Suppose Google were to log the IP of any machine that requested, say, 'kiddie porn'. That's not too bad; we could all get behind that one. What if it were to log requests for information to do with bombs and airlines? Student radical organizations? Left-wing pressure groups? Militia organizations? Note: Google probably already does this. And if they don't, it could be added with minimal, and I mean minimal, effort.

  • Suppose that, instead of merely tracking requests, Google restricted some kinds of content. It is possible that they do this already, although if they do, they don't talk about it. For obvious reasons, I guess. Again, we would all agree that restricting access to child porn is a good thing, but what if it extended to left- or right-wing groups? Or groups critical of the government?

  • Google includes a news index. It's a very useful service. You just go to it and it provides multiple links to news stories. I don't know how it decides which of those stories are most important, possibly by the number of stories it finds that reference the same subject. If you spend a little time reading, you'll generally get a more balanced view of whatever the story is about, because you can read stories from North American, European, Asian, and Australian sources.

    But what if Google were to tweak the algorithm so that stories critical of the government were to either rise or fall in the rankings? Or to leave some stories out altogether?
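Just to show how little effort "minimal" means, here is a sketch of both scenarios. Everything in it is invented for illustration: the flagged terms, the story list, the tag names, the popularity signal. The point is that logging a query or burying a story is a one-line change, not an engineering project.

```python
# Hypothetical sketch of the two scenarios above. All data is invented.

SENSITIVE = {"bomb", "militia"}        # terms a logger might flag
DEMOTED = {"government-critical"}      # tags a ranker might quietly bury

def flag_query(query):
    """Logging scenario: return True if any flagged term appears in the query."""
    return any(term in query.lower().split() for term in SENSITIVE)

def rank(stories, demote_factor=0.1):
    """Ranking scenario: score by reference count, demoting tagged stories."""
    def score(story):
        s = story["references"]        # crude popularity signal
        if story["tag"] in DEMOTED:
            s *= demote_factor         # the one-line tweak that buries a story
        return s
    return sorted(stories, key=score, reverse=True)

stories = [
    {"title": "Trade talks stall",      "tag": "business",            "references": 40},
    {"title": "Ministry audit scandal", "tag": "government-critical", "references": 90},
    {"title": "Flood relief effort",    "tag": "weather",             "references": 25},
]

print(flag_query("how to build a bomb"))          # True: this query gets logged
print([s["title"] for s in rank(stories)])
# The most-referenced story now sits at the bottom of the results
```

Nobody outside the company would ever see that `demote_factor`; the results would simply look like the news.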

I'll have to stop there for now. As a way of wrapping this entry up, however, let me say that I don't think that Google is doing anything untoward right now. I'm just saying that they could, and that's the danger.

Gah. I started this as writing practice. I've done reasonably well, most days, managing to write a coherent piece. This particular one blows chunks. My apologies. 1000 words, unordered, and I still didn't say everything I wanted to say. I'll do better tomorrow.

(But hey, that's the beauty of this exercise: you get to see everything. I include the failures.)
