Volume 4, No. 4
October 2000
Search Engines Revisited
by Gabe Bokor
As the World-Wide Web continues to grow exponentially at 50% a year, an ever-increasing proportion of human knowledge is becoming available online. The number of Web sites today is estimated to have exceeded 1 billion. Finding a specific piece of information among such a mind-boggling amount of data would be impossible without powerful software that automatically and periodically combs the Web, collecting and indexing information.
Search engines are the tools allowing one to find the proverbial needle in the haystack (see my article in the January 1999 issue of the Translation Journal). Since the time that article was written, new, more powerful search engines have appeared and new features have been added to existing ones. The number of search engines has reached the point that there are now search engines for searching for other search engines (e.g. Directory Guide). As on the Web in general, there is constant change on the search engine scene: mergers, acquisitions, new features, new interfaces, and new URLs will certainly make some of the information contained in this article obsolete by the time you read it.
Which is the best search engine? No simple answer can be given to this question, since different engines excel by different criteria. These criteria include
- Size (the number of sites or pages indexed)
- Speed (how fast the engine can find the information requested)
- Relevance (how many of the "hits" are relevant to the actual purpose of the search)
- Update rate (how current is the information contained in the search engine's database)
Because some of these criteria are mutually contradictory, experienced researchers use different engines and different tools for different types of search.
Size
According to Search Engine Showdown, the largest engine currently (July 6-7, 2000) is iWon, with an estimated 356 million pages indexed (it also gives out prizes, hence the name), followed by Google! with 355 million pages and AltaVista with 331 million pages.
Of course, size is not the only factor that determines the number of hits for a given search. If the engine supports (and the researcher uses) the Boolean operator OR (AltaVista and Northern Light do), a search for "cars" can be extended to include "automobiles" and "motor vehicles," for example, linked by the OR operator, for a greater number of relevant hits. Excite automatically includes some synonyms in the search.
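As a rough illustration of the synonym expansion described above, the following Python sketch builds an OR query string from a list of terms. The function name `or_query` and the convention of quoting multi-word phrases are assumptions made for this example; the exact syntax each engine accepts may differ.

```python
def or_query(terms):
    """Join search terms with OR, quoting multi-word phrases.

    Hypothetical helper: the quoting convention is an assumption,
    not the documented syntax of any particular engine.
    """
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return " OR ".join(quoted)


print(or_query(["cars", "automobiles", "motor vehicles"]))
# cars OR automobiles OR "motor vehicles"
```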
Truncation has the same effect of increasing the number of hits. Truncation in the form of a wildcard (usually "*") allows one to find occurrences of a word in different inflected or compound forms. Northern Light supports "*" as the wildcard for any number of characters or "%" for a single character. AltaVista and Hot Bot support "*" wildcards; Northern Light and Go (formerly Infoseek) provide automatic plurals. iWon uses optional automatic word stemming in Advanced mode.
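To make the wildcard behavior concrete, here is a minimal Python sketch that translates a pattern using "*" (any number of characters) and "%" (exactly one character) into a regular expression. It only imitates the behavior described above; the helper names are hypothetical, and treating a "character" as a word character is an assumption of this example, not a description of how any engine is implemented.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a search wildcard pattern into a compiled regex.

    '*' matches any run of word characters (including none);
    '%' matches exactly one word character. Both rules are
    assumptions made for this illustration.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w*")
        elif ch == "%":
            parts.append(r"\w")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def matches(pattern, word):
    """Return True if the whole word matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None


words = ["translate", "translator", "translation", "transit"]
print([w for w in words if matches("translat*", w)])
# ['translate', 'translator', 'translation']
print(matches("wom%n", "woman"), matches("wom%n", "women"))
# True True
```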
Speed
To the best of my knowledge, no ranking or quantitative analysis of search engines by speed is available, since the speed of each search greatly depends on the type and complexity of the search (see below), the time of day, and the user's system. However, Fast, a new engine using the same software as Lycos, is usually considered the fastest for most types of search, beating the former speed champion AltaVista.
Relevance
Size is an important search engine feature, but a search result yielding 300,000 hits, 99% of which are irrelevant to the subject at hand, has little practical use. The relevance of the hits depends greatly on the search tools provided by the engine and used by the researcher. For example, if the engine supports proximity search (Boolean operator NEAR), it will provide more hits than is possible with phrase search only, and more relevant hits than a search for individual words. Example: If I wish to search for "Arabic dialects," a proximity search (Arabic NEAR dialects) will yield the occurrences of both "dialects of Arabic" and "Arabic dialects," while a phrase search ("Arabic dialects") will ignore "dialects of Arabic," and a simple search by words (Arabic AND dialects) will also find any articles dealing with "Arabic horses" and "dialects of German." Of the major search engines, only AltaVista and Web Crawler (now owned by Excite) support the proximity operator. Other Boolean operators are supported in some form by most search engines, either explicitly as AND, OR, AND NOT (Alltheweb/Fast and Google do not support OR), via menus, or via the signs + and -, which must be typed immediately before the word to which they refer.
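The difference between phrase, proximity, and AND searches can be sketched in a few lines of Python. This is a simplified model, not any engine's actual algorithm: the function names are hypothetical, and the 10-word window used for NEAR is an assumption chosen for the example.

```python
import re


def tokens(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z]+", text.lower())


def phrase_match(text, first, second):
    """True if 'first' is immediately followed by 'second'."""
    t = tokens(text)
    return any(a == first and b == second for a, b in zip(t, t[1:]))


def and_match(text, first, second):
    """True if both words occur anywhere in the text."""
    t = tokens(text)
    return first in t and second in t


def near_match(text, first, second, window=10):
    """True if the words occur within 'window' words of each other,
    in either order. The window size is an assumption."""
    t = tokens(text)
    pos_first = [i for i, w in enumerate(t) if w == first]
    pos_second = [i for i, w in enumerate(t) if w == second]
    return any(abs(i - j) <= window for i in pos_first for j in pos_second)


docs = [
    "A survey of Arabic dialects",
    "The dialects of Arabic differ widely",
    "Arabic horses are prized for endurance and speed and have been bred "
    "for centuries while the dialects of German vary by region",
]
for doc in docs:
    print(
        phrase_match(doc, "arabic", "dialects"),
        near_match(doc, "arabic", "dialects"),
        and_match(doc, "arabic", "dialects"),
    )
# True True True
# False True True
# False False True
```

The third document is the kind of irrelevant hit that an AND search lets through and a proximity search filters out.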
Since some engines (AltaVista, Excite) use OR as their default Boolean operator while others use AND, make your search as explicit as the given search engine allows for best results.
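One way to avoid depending on an engine's default operator is to mark every term with + or - explicitly. The Python sketch below builds such a query string; `explicit_query` and `quote` are hypothetical helper names, and the phrase-quoting convention is an assumption.

```python
def quote(term):
    """Wrap multi-word phrases in double quotes (assumed convention)."""
    return f'"{term}"' if " " in term else term


def explicit_query(required=(), excluded=()):
    """Prefix required terms with '+' and excluded terms with '-',
    with no space between the sign and the term."""
    parts = ["+" + quote(t) for t in required]
    parts += ["-" + quote(t) for t in excluded]
    return " ".join(parts)


print(explicit_query(["Arabic", "dialects"], ["horses"]))
# +Arabic +dialects -horses
```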
Another feature that enhances relevance is selection by language, a feature that is particularly useful when searching for abbreviations. Most search engines today support language selection, at least in the Advanced search mode (Go and Northern Light do not).
AltaVista, Fast, Go, and Northern Light allow you to restrict your search to certain URLs (for example, to .de sites for German sites only or .gov sites for government sites only) by adding a restriction such as "URL:de" to your search word(s).
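The effect of such a domain restriction can be imitated on a list of hits with a short Python filter. This is only a sketch of the idea: the function name `in_domain` and the sample URLs are hypothetical, and the engines themselves apply the restriction on their side, in ways not described here.

```python
from urllib.parse import urlparse


def in_domain(url, suffix):
    """True if the URL's host name is 'suffix' or ends in '.suffix'."""
    host = urlparse(url).hostname or ""
    return host == suffix or host.endswith("." + suffix)


# Hypothetical URLs, for illustration only.
hits = [
    "http://www.uebersetzer-beispiel.de/glossar.html",
    "http://www.sample-agency.gov/reports/",
    "http://www.sample-site.com/de/index.html",
]
print([u for u in hits if in_domain(u, "de")])
# ['http://www.uebersetzer-beispiel.de/glossar.html']
```

Note that the third URL contains "/de/" in its path but is not a .de site, so a filter on the host name rejects it.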
Most search engines sort their hits by relevance, although the engine's judgment about relevance (usually based on the number of occurrences of the word on the page) may not be the same as yours.
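A relevance ranking based purely on the number of occurrences of the search word can be sketched as follows. This is a deliberately naive model with hypothetical names and sample pages; real engines are likely to combine several signals that are not described in this article.

```python
import re


def occurrence_score(text, word):
    """Count whole-word, case-insensitive occurrences of 'word'."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, re.IGNORECASE))


def rank(pages, word):
    """Return page names sorted by occurrence count, highest first."""
    return sorted(
        pages,
        key=lambda name: occurrence_score(pages[name], word),
        reverse=True,
    )


# Hypothetical page contents, for illustration only.
pages = {
    "page_a": "Translation tips and translation tools for translation students",
    "page_b": "A journal about translation",
    "page_c": "Cooking recipes",
}
print(rank(pages, "translation"))
# ['page_a', 'page_b', 'page_c']
```

A page that merely repeats a word often scores highest here, which is one reason the engine's notion of relevance may differ from the researcher's.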
The date of creation of the page (shown by Go and, when selected as an option, by AltaVista) is a useful piece of information for evaluating the relevance of the information. Go also gives a percentage as a measure of relevance for each hit.
In my experience, Fast has the best record for a high degree of relevance among the major search engines.
Update Rate
The proportion of dead links, i.e., links that are no longer valid, is a good measure of the update rate of a search engine. Search Engine Showdown rates MSN Inktomi lowest in dead links (i.e., highest in update rate) at 1.7%, followed by Fast and Hot Bot at 2.3% each, with AltaVista in last place with 13.7% dead links.
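The dead-link percentage is simply the share of checked links that no longer resolve. The Python sketch below shows the arithmetic; the sample counts are hypothetical and are not the figures on which the ratings above were based.

```python
def dead_link_rate(dead, checked):
    """Percentage of checked links that are dead, to one decimal place."""
    if checked <= 0:
        raise ValueError("checked must be positive")
    return round(100 * dead / checked, 1)


# Hypothetical sample: (dead links, links checked) per engine.
sample = {"engine_a": (17, 1000), "engine_b": (137, 1000)}
for engine, (dead, checked) in sample.items():
    print(engine, dead_link_rate(dead, checked))
# engine_a 1.7
# engine_b 13.7
```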
Metasearch Engines
Metasearch engines such as Dogpile and MetaCrawler search several engines at the same time. In addition to the World-Wide Web, some metasearch engines also let you search newsgroups (Usenet) and FTP sites. Obviously, metasearch engines are limited to those features that all the search engines they query support.
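Both points can be illustrated with a small Python sketch: the usable feature set is the intersection of what each queried engine supports, and the combined result list is the union of the individual hit lists with duplicates removed. The engine names, feature sets, and URLs below are hypothetical, and this is not a description of how Dogpile or MetaCrawler work internally.

```python
def common_features(engine_features):
    """Return the features supported by every engine (set intersection)."""
    feature_sets = [set(f) for f in engine_features.values()]
    return set.intersection(*feature_sets) if feature_sets else set()


def merge_hits(results_by_engine):
    """Merge hit lists from several engines, dropping duplicate URLs
    while keeping the order in which they were first seen."""
    seen = set()
    merged = []
    for hits in results_by_engine.values():
        for url in hits:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Hypothetical feature sets and hit lists, for illustration only.
features = {
    "engine_a": {"AND", "OR", "NEAR", "wildcard"},
    "engine_b": {"AND", "OR", "language filter"},
    "engine_c": {"AND", "language filter"},
}
hits = {
    "engine_a": ["http://one.example/", "http://two.example/"],
    "engine_b": ["http://two.example/", "http://three.example/"],
}
print(common_features(features))
# {'AND'}
print(merge_hits(hits))
# ['http://one.example/', 'http://two.example/', 'http://three.example/']
```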
Features & Performance of Some Search Engines
In order to have some degree of objective comparison among the major search engines, I searched the term "Translation Journal." A hit was considered relevant if it linked to an actual page of the Translation Journal or to a page containing such a link.
AltaVista: Still one of my favorites, mainly due to its support of Boolean (AND, OR, NEAR, AND NOT) searches and wildcard truncation. Each hit contains about a line of text from the page. You can set the Result Options to display or not to display Description, URL, Last Modified, Web Site Language, Translate (link to BabelFish), More Pages From This Site, Related Pages, and Company Facts.
A search for "Translation Journal" provided 726 hits; 6 relevant hits among the first 10.
Fast (formerly Alltheweb): Although it does not support full Boolean searches, you can set three words or phrases, each of which "Must Be Included," "Should Be Included," or "Must Not Be Included" on the page. You can also select language, encoding (Cyrillic or Arabic, for example), and domain filters.
Each hit provides about two lines of text from the page plus the URL.
A search for "Translation Journal" yielded 1071 hits; 9 of the first 10 were relevant.
Google!: You can search by word or phrase, exclude pages that contain certain words, select language, and include or exclude certain domains. You can also find pages that link to a certain page or are similar to a certain page.
The hits are displayed by category. About a line of text and the URL are displayed. A link tells you if the page has been cached by Google. A separate link takes you to "Similar Pages."
A search for "Translation Journal" yielded 1380 hits; 6 of the first 9 were relevant.
iWon: Three words or phrases may be selected, each of which "Must Be Included," "Should Be Included," or "Must Not Be Included" on the page. Word stemming is an option and so is selection by age. No selection by language.
About a line of text and the URL are displayed for each hit.
A search for "Translation Journal" yielded 1829 hits; 10 of the first 10 were relevant.
Go: Three words, names, or phrases may be selected, each of which "Must Be Included," "Should Be Included," or "Must Not Be Included" on the page. No search by language. Option to show summaries.
About two lines of text, relevance (as estimated by Go in %), date, size, and URL are given for each hit. Separate links take you to "Similar Pages," "More Results" (other pages of the same site), and Babelfish ("Translate").
A search for "Translation Journal" yielded 249 hits; 3 of the first 10 were relevant.
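The relevance figures reported for the five engines can be put side by side by computing the fraction of relevant hits among the first results examined. The Python sketch below uses only the counts given in this section; the function name `precision` and the sorting by this single measure are choices made for the example, not a claim about overall engine quality.

```python
# (relevant hits, hits examined) for the "Translation Journal" search,
# taken from the results reported in this section.
results = {
    "AltaVista": (6, 10),
    "Fast": (9, 10),
    "Google": (6, 9),
    "iWon": (10, 10),
    "Go": (3, 10),
}


def precision(relevant, examined):
    """Fraction of the examined hits that were relevant."""
    return relevant / examined


ranked = sorted(results, key=lambda e: precision(*results[e]), reverse=True)
for engine in ranked:
    relevant, examined = results[engine]
    print(f"{engine}: {relevant}/{examined} = {precision(relevant, examined):.0%}")
# iWon: 10/10 = 100%
# Fast: 9/10 = 90%
# Google: 6/9 = 67%
# AltaVista: 6/10 = 60%
# Go: 3/10 = 30%
```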