AltaVista Search Public Service


How large is your catalog?
When do you expect to enlarge your index?
How long does it take a URL to be indexed?
After indexing, how long does it take a URL to be available in the catalog?
How long does it take for an entire site to be indexed?
What is the general indexing process?
How are sites not submitted found?
How does the spider crawl?
What is indexed?
What is the full-text?
How are pages ranked?
Does your engine detect and penalize for spamming?
How does your engine cope with duplicate pages?
Do you allow Boolean searches?
Do you support stemming?
Do you support phrase searches?
Do you support all/any searches?
Do you allow include/exclude searches?
Do you allow a search to be further refined or narrowed?
Do you allow pattern searching?
Do you allow proximity searching?
Can I search for a particular URL?
Can I search only in page titles?
Can I search for pages from a particular domain?
Can I search for text in hyperlinks?
Can I search for images?
Is there a 'find similar' function?
How is punctuation handled?
How is capitalization handled?
Are there other unique search abilities?
How are titles generated?
How are summaries generated?
What are the various display options?
Can relevancy be controlled by the user?
Any special features regarding non-American English words?

How large is your catalog? Specifically, how many individual web pages have been visited, summarized and added to your index?
A little more than 32M pages from over 476,000 servers, and four million articles from 14,000 Usenet newsgroups. For the time being, we have opted to provide service to more people rather than pack the index with more pages. Our index also tends to cover more sites (breadth) rather than going deep into a smaller number of sites.

When do you expect to enlarge your index?
Soon. However, we will not disclose the exact time, nor the size of the index we will go live with. Again, our index is strong in terms of breadth and accessibility to large numbers of end-users. When we do go live with a new index, we fully intend to maintain this integrity.

How long does it take a URL to be indexed, assuming it is submitted via an add form?
The page is fetched immediately and its content is added to the index overnight.

After indexing, how long does it take a URL to be available in the catalog? How often is the catalog updated?
Our index is updated every evening. If a user submits a URL, it will be in the index the next day.

How long does it take for an entire site to be indexed, as opposed to a URL that is submitted? Is the entire site indexed? How many levels deep will the spider penetrate? Why might some pages not be indexed?
The rest of the site is indexed whenever we run the spider. We run a full web crawl at least every quarter in combination with our continuous crawling feature. Potentially the whole site could be indexed (there is no restriction on depth), except that we have size limitations and stop the spider once the index reaches a certain size.

What is the general indexing process?
The submitted URL is indexed immediately. The rest of the site will be visited on the next spider run.

How are sites not submitted found?
Scooter (our spider) follows links from other pages to find new pages. Following links works much better than one might think: you can usually reach everything.

How does the spider crawl? Can it only follow standard hyperlinks? Can it follow frame links? Can it follow image maps? Can it penetrate an authenticated site (assuming access is granted)? What plans are there to improve crawl ability?
The spider (Scooter) follows hyperlinks, including those hidden in image maps.
It does not follow links inside protected sites (since most users of the index will not have access).

What is indexed? The full-text of a document? Only key portions of a document? And assuming full-text is indexed, when are words ignored during searches? Is there any way to get around this, if so?
The indexing software takes the text of a document and examines every word in it to create an index organized by word. It saves each instance of each word along with the URL of the page on which it appears, and information about its location in that document. That level of detail is necessary in order to do phrase searches, which depend on knowing the exact order of all the words within a document.

There are no stop words: "the" is indexed! In simple searches, very common words such as "the" are present in so many documents that they cannot contribute in any way to ordering the results, and are therefore "ignored" for ordering. They are still available for queries if one explicitly requests them: +the in a simple query will indeed bring back all pages with "the" in them. These words are also available for Boolean searches in advanced searches.

We also index all instances of a word, regardless of capitalization, as lowercase, and additionally index all words containing capitals exactly as their typography indicates. That allows users to do a general search or to narrow the search to a specific capitalization, as in trademarks. Similarly, all instances of words with accented letters, from non-English languages that use Latin characters, are also indexed under their English-letter equivalents. Once again, at the cost of enlarging the index, this approach gives users considerable flexibility in focusing their searches.

In addition, no order is imposed on the enormous body of information to make it accessible. AltaVista Search simply indexes words.
It takes the unstructured content of the Internet and, without adding some arbitrary or human-designed structure or categorization, makes it easy for users to find what they want.

What is the full-text? Does it include comment tags, alt text, etc.?
All text, ALT text in images, links (hrefs and images), anchors, title, description and keyword meta-tags, applet and ActiveX object names, the page's URL, its host name (www.foo.com) and its domain name (com). HTML comments are not indexed. Usenet postings are treated similarly, with different keywords.

How are pages ranked? Most engines favor pages with keywords in titles, then in high frequency on the pages. Are there other significant methods involved, such as page popularity as measured by links?
Ranking depends on how many keywords from the query match the page, how rare these words are, where they are (top of the page is better), and whether they are near each other. The number of occurrences is not a big factor, to avoid spamming. Links are not a factor.

Does your engine detect and penalize for spamming: word stacking, spoofing, multiple page submissions and other methods that you are no doubt familiar with? How is this combatted? Are some submissions reviewed? Do you watch for submission patterns, etc.?
Yes, we do spend a lot of effort on this, and we don't say much about the methods, to avoid counter-attacks. It is not based on content (i.e., we don't look for "sex" words) or on review (too many pages). The idea is to find sites (or subnets...) submitting large numbers of pages with essentially the same content, or leading to the same content: large bulk, little content.

How does your engine cope with duplicate pages?
We eliminate them. Only one copy remains.

Do you allow Boolean searches? Any significant limitations, if so? Is there a graphical way to do this, such as through selection boxes?
Yes. Full Boolean searches (AND, OR, NOT) and proximity (NEAR, phrases), with arbitrarily complex expressions.
Everything can be combined, so for example title:(apple OR pear) AND NOT (Steve NEAR Jobs) is a valid expression. We don't offer a graphical interface for our standard simple or advanced queries; such an interface is clumsy and limiting: too complex for the truly naive user, who will usually get what they want from the simple search, and too limiting for the power user. Our new feature, LiveTopics, uses a graphical interface that allows users to learn about or refine their queries in either the simple or advanced areas of the search engine.

Do you support stemming, vs. having to use a wildcard character or other workaround?
No stemming: our search engine is not limited to English. We have a "smart wildcard" which does roughly the right thing.

Do you support phrase searches?
"This is a phrase", and so are 415-617-3316, Win/NT and www.mumble.com. So either enclose the words in double quotes, or link them with any type of punctuation. No words are dropped from a phrase, so "to be or not to be" is a perfectly valid phrase.

Do you support all/any searches, such as being able to specify that results should match all keywords or any keyword?
Boolean searches can cover all of this and more. Simple searches by default are "any of", with an ordering that brings the "all of" matches to the top. You can make them "all of" by adding plus signs in front of words.

Do you allow include/exclude searches, such as being able to specify that results must include or must exclude a particular word?
Yes, +this -notThat. And of course Boolean searches do this and more.

Do you allow a search to be further refined or narrowed? Can I start a search, then narrow down the search set using other criteria?
Yes, the old expression is available in a window on the result page; just add new terms to it, either by typing or by using our new LiveTopics feature.

Do you allow pattern searching?
We have wildcard matching.

Do you allow proximity searching? Can I say, find xxx keywords within xxx words of each other?
Yes. NEAR.
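
The phrase and NEAR answers above both depend on the index recording where each word occurs in a document. The following is a minimal sketch of that idea, not AltaVista's implementation: the function names, the example URLs and documents, and the `window` parameter of `near_search` are all assumptions made for illustration (the text does not say how many words apart NEAR allows).

```python
import re
from collections import defaultdict

def tokenize(text):
    # Any punctuation or whitespace separates words, so "Win/NT" -> ["win", "nt"].
    # Lowercasing everything is a simplification for this sketch.
    return [w.lower() for w in re.findall(r"[^\W_]+", text)]

def build_index(docs):
    # word -> {url: [positions of the word in that document]}
    index = defaultdict(lambda: defaultdict(list))
    for url, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word][url].append(pos)
    return index

def phrase_search(index, phrase):
    # A document matches if the words occur at consecutive positions.
    words = tokenize(phrase)
    if not words:
        return set()
    hits = set()
    for url, starts in index.get(words[0], {}).items():
        for start in starts:
            if all(start + i in index.get(w, {}).get(url, [])
                   for i, w in enumerate(words)):
                hits.add(url)
                break
    return hits

def near_search(index, a, b, window=10):
    # A document matches if some occurrence of a is within `window`
    # words of some occurrence of b (window size is an assumption).
    a, b = a.lower(), b.lower()
    hits = set()
    for url, pos_a in index.get(a, {}).items():
        pos_b = index.get(b, {}).get(url, [])
        if any(abs(i - j) <= window for i in pos_a for j in pos_b):
            hits.add(url)
    return hits

# Hypothetical documents, for illustration only.
docs = {
    "http://a.example/": "To be, or not to be: that is the question.",
    "http://b.example/": "Steve founded the company; many years later Jobs came back.",
    "http://c.example/": "Steve Jobs likes apple and pear.",
    "http://d.example/": "Setup tips for Win/NT users.",
}
index = build_index(docs)
print(phrase_search(index, "to be or not to be"))
print(phrase_search(index, "Win/NT"))
print(near_search(index, "Steve", "Jobs", window=3))
```

Because positions are stored for every word, including very common ones, a phrase made up entirely of common words can still be matched, which is consistent with the statement that no words are dropped from a phrase.
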
Can I search for a particular URL?
Yes: url:foo.com/~blabla/mumble.html for example, or any subset of the URL. You can also look for just the host name, as in host:imaginet.fr, or just a domain name, as in domain:jp. The last one is a precise and efficient way to count all pages from Japan in the index, for example.

Can I search only in page titles?
Yes. title:"home page", or title:NetGuide.

Can I search for pages from a particular domain or exclude a particular domain?
Yes, see above. It depends on whether you mean domain or host, but both are possible.

Can I search for text in hyperlinks?
Yes, you can search for links (which pages point to ...) and anchors (the text in a link). So link:intel.com finds all pages pointing to a page on the intel.com sites. And anchor:"click here" finds all the pages with the dreaded words underlined (;-).

Can I search for images? If so, am I simply searching for words in alt text or words in file names matching common extensions?
The text in ALT is searchable as regular text, and the links in the images are searchable, as in image:comet.jpg. There is no content analysis; in fact nobody does it today.

Is there a 'find similar' function?
No.

How is punctuation handled? Can I search for a word, including exact punctuation?
All punctuation is seen as a separator, and so are spaces. So the engine treats "a b c", a-b-c, a/b;c and a$b(c) as equivalent; you get the idea.

How is capitalization handled? Will bill clinton find Bill Clinton? Will Bill Clinton find bill clinton? Any particular notes?
Lowercase matches any capitalization: bill clinton finds Bill Clinton. A query containing capitals matches only exactly, so Bill Clinton matches only itself, not bill clinton or BILL CLINTON. Accented letters in European languages work similarly: the same word without the accents matches the word with the accents, so I have a chance to search for the name of some Swedish guy without trying to coerce my keyboard into giving me an o with a slash through it.
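
The punctuation, capitalization and accent rules above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the engine's actual code: the names `tokenize`, `strip_accents`, `index_keys`, `matches` and the `EXTRA_FOLDS` table are hypothetical, and the sketch assumes that a query term written with accents matches only the accented form, which the text does not state explicitly.

```python
import re
import unicodedata

# Letters that Unicode does not decompose into "base letter + accent".
# Illustrative and incomplete; the real mapping is not given in the text.
EXTRA_FOLDS = {"ø": "o", "Ø": "O"}

def tokenize(text):
    # Every run of punctuation or spaces is a separator.
    return re.findall(r"[^\W_]+", text)

def strip_accents(word):
    # Map each accented letter to its unaccented equivalent.
    word = "".join(EXTRA_FOLDS.get(ch, ch) for ch in word)
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def index_keys(word):
    # A document word is stored exactly as written, in lowercase,
    # and in accent-free forms of both.
    plain = strip_accents(word)
    return {word, word.lower(), plain, plain.lower()}

def matches(query_term, doc_word):
    # The query term is looked up as typed, so a lowercase, unaccented
    # term is the most general and a capitalized term matches exactly.
    return query_term in index_keys(doc_word)

print(tokenize("a b c") == tokenize("a-b-c") == tokenize("a/b;c") == tokenize("a$b(c)"))
print(matches("bill", "Bill"), matches("Bill", "bill"), matches("Bill", "BILL"))
print(matches("soren", "Søren"), matches("Søren", "Soren"))
```

Storing each word under several keys enlarges the index, which matches the trade-off described in the answer on what is indexed.
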
Are there other unique search abilities?
You can search by dates.

How are titles generated? If from title tag, what happens if this is blank?
From the title tag. If the tag is empty, the title is "No Title".

How are summaries generated? Please indicate if these are from the first xx characters, vs. some means of making an abstract.
From the first few lines of text, cleaned up a bit to look better. If the description metatag is present, it is used instead.

What are the various display options and what do they allow: title only, 1 line description, 4 line description, etc.?
- Compact: title, date, abstract, but on one line.
- Detailed: title, URL, abstract, size, date.

Can relevancy be controlled by the user? Can certain words be assigned greater weight?
No/Yes, depending on what you mean. Relevancy is an interesting topic that is turning into a buzzword. We present results with no filters or assumptions about what the user intends to see. We give the 'power to the user' to control his or her own destiny when searching for information. Our new feature, LiveTopics, allows even a beginner to get quickly to what they are looking for (AKA 'relevant' information) through a new, easy-to-use point-and-click UI.

Any special features regarding non-American English words?
We properly handle all ISO Latin-1 (Western European) languages, and will support all others soon.
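
The title and summary rules above can be sketched with Python's standard `html.parser` module. This is only an illustration of the stated rules: the class and function names are hypothetical, and the 150-character cutoff is an assumption standing in for "the first few lines of text", whose exact length the answer does not give.

```python
from html.parser import HTMLParser

class PageInfo(HTMLParser):
    """Collects the title, the description meta tag and the visible text."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self.description = None
        self._in_title = False
        self._skip = 0  # nesting depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip += 1
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or None

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif not self._skip:
            self.text_parts.append(data)

def title_and_summary(html, max_chars=150):
    # max_chars is an assumed stand-in for "the first few lines of text".
    page = PageInfo()
    page.feed(html)
    page.close()
    title = " ".join("".join(page.title_parts).split()) or "No Title"
    if page.description:
        summary = " ".join(page.description.split())
    else:
        summary = " ".join(" ".join(page.text_parts).split())[:max_chars]
    return title, summary

untitled = "<html><head><title></title></head><body><p>Hello   world.</p><p>Second line.</p></body></html>"
described = ('<html><head><title>NetGuide</title>'
             '<meta name="description" content="A guide to the net.">'
             '</head><body>Body text</body></html>')
print(title_and_summary(untitled))
print(title_and_summary(described))
```

Collapsing runs of whitespace is one simple reading of "cleaned up a bit to look better"; the actual cleanup steps are not described in the text.
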

Digital Equipment Corporation
AltaVista Internet Software, 30 Porter Road,
Littleton, MA Fax: (978) 506-2017