Search engines make the Internet convenient and enjoyable. Without them, people would have trouble finding the information they are looking for online: there are vast numbers of websites, many of them titled on the author’s whim, and most of them sitting on servers with cryptic names.
Early search engines indexed a few hundred thousand pages and documents and received perhaps a few thousand queries each day. Today, a major Internet search engine processes a massive number of web pages and responds to millions of searches every day. In this chapter we will tell you how these important tasks are performed and how search engines put it all together so you can find the information you need online.
When most people talk about searching the Internet, they are actually referring to Internet search engines. Before the web became the most visible part of the Internet, search engines were already helping users find information online. Programs with names like “Archie” and “Gopher” maintained indexes of files stored on servers connected to the Internet, dramatically reducing the time required to find pages and documents. In the late 1980s and early 1990s, getting the most out of the Internet meant knowing how to use Archie, Gopher, Veronica, and the rest.
Today, most online users limit their searches to the web, so in this chapter we limit ourselves to search engines that focus on web page content. Before a search engine can tell you where a file or document is, it must be found. To locate information among the vast number of existing web pages, a search engine employs special software robots, called spiders, to build lists of the words found on websites. The process a spider uses to build those lists is called web crawling. To build and maintain a useful word list, a search engine’s spiders have to look at a great many pages.
How exactly does a spider begin its journey across the web? The usual starting points are lists of heavily used servers and very popular sites. The spider starts with a known website, indexes the words on its pages, and follows every link found on the site. In this way, the spider system quickly begins to travel across the most widely used portions of the web.
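To make the process concrete, here is a minimal sketch of that crawl loop in Python, using only the standard library: fetch a page, record it, and queue every link found on it. The seed URL and the page limit are illustrative placeholders, not part of any real engine’s configuration.

```python
# Minimal breadth-first crawl loop: fetch a page, record it, follow its links.
# Hypothetical seed URL and page cap; real spiders start from curated lists
# of popular servers and run continuously.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=50):
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    pages = {}                       # url -> raw HTML (a real spider would index words here)
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except Exception:
            continue                 # skip unreachable or malformed pages
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages

print(len(crawl(["https://example.com/"])), "pages crawled")
```

Real crawlers add politeness rules (robots.txt, rate limits) and deduplication, but the follow-every-link loop is the core of the idea.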
Google began as an academic search engine. The paper describing how the system was built (written by Sergey Brin and Lawrence Page) gives a sense of how quickly its spiders could operate. They built their initial system to use multiple spiders, usually three at a time. Each spider kept about 300 connections to web pages open at once. At peak performance, using four spiders, the system could crawl over 100 pages per second, generating around 600 kilobytes of data each second.
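As a rough illustration of what keeping hundreds of connections busy looks like, the sketch below fans a batch of fetches out across a pool of workers capped at a fixed connection limit. The thread pool stands in for the custom asynchronous I/O the original system used; the 300-connection figure is borrowed from the paper only as a parameter, and the URLs are hypothetical.

```python
# Fetch a batch of URLs concurrently, never exceeding max_connections in flight.
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen

def fetch(url):
    """Download one page; return None on any failure so the batch keeps going."""
    try:
        return url, urlopen(url, timeout=5).read()
    except Exception:
        return url, None

def crawl_batch(urls, max_connections=300):
    results = {}
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        futures = [pool.submit(fetch, u) for u in urls]
        for future in as_completed(futures):
            url, body = future.result()
            if body is not None:
                results[url] = body
    return results

batch = crawl_batch(["https://example.com/", "https://example.org/"])
print(len(batch), "pages fetched")
```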
To keep everything running quickly, a system had to be built to feed the spiders the necessary data. The early Google system had a server dedicated to serving URLs to the spiders. And rather than relying on an Internet service provider’s domain name server (DNS), which translates a server name into an address, Google ran its own DNS so that delays were kept to a minimum.
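The payoff of running your own resolver is simply that repeated name lookups stop crossing the network. The toy resolver below (a hedged illustration, not Google’s actual design) caches answers locally so only the first lookup for a host pays the full DNS delay.

```python
# Cache DNS answers so only the first lookup for a hostname hits the network.
import socket
from functools import lru_cache

@lru_cache(maxsize=100_000)
def resolve(hostname):
    """Translate a server name into an IP address, remembering the answer."""
    return socket.gethostbyname(hostname)

print(resolve("example.com"))  # first call: real DNS lookup
print(resolve("example.com"))  # second call: served from the local cache
```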
When a Google spider looked at an HTML page, it took note of two things:
The words within the page
Where those words were found
Words appearing in the title, subtitles, meta tags, and other positions of relative importance were noted for special consideration during a later user search. The Google spiders were built to index every significant word on a page, omitting the articles “a”, “an” and “the”. Other spiders take different approaches.
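The sketch below illustrates this style of indexing under simplified, assumed rules: words from the title and meta tags are flagged as important, every word’s position is recorded, and the articles “a”, “an” and “the” are skipped. The class, weighting, and sample page are illustrative, not Google’s actual scheme.

```python
# Record each word, its position, and whether it appeared somewhere "important"
# (title or meta tags), skipping the articles a/an/the.
from html.parser import HTMLParser

STOP_WORDS = {"a", "an", "the"}

class WordIndexer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.position = 0
        self.entries = []            # (word, position, is_important)

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            content = dict(attrs).get("content") or ""
            self._add_words(content, important=True)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        self._add_words(data, important=self.in_title)

    def _add_words(self, text, important):
        for word in text.lower().split():
            if word in STOP_WORDS:
                continue
            self.entries.append((word, self.position, important))
            self.position += 1

indexer = WordIndexer()
indexer.feed("<html><head><title>Spider basics</title>"
             "<meta name='keywords' content='crawler index'></head>"
             "<body>The spider indexes a page</body></html>")
print(indexer.entries)
```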
These different approaches are attempts to make the spider run faster and to let users search more efficiently. For example, some spiders keep track of the words in the titles, subheadings, and links, along with the 100 most frequently used words on the page and every word in the first 20 lines of text. Lycos is said to use this approach to spidering the web.
Other systems, such as AltaVista, go in the other direction, indexing every single word on a page, including “a,” “an,” “the” and other “insignificant” words. Other systems match the thoroughness of this approach through the attention they give to the unseen portion of the web page: the meta tags.
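The contrast between the two philosophies can be seen in a toy inverted index, which maps each word to the set of pages containing it. The keep_everything flag below switches between AltaVista-style completeness and the selective approach; the sample pages are made up.

```python
# Build word -> set-of-URLs mappings, with or without the common "insignificant" words.
from collections import defaultdict

STOP_WORDS = {"a", "an", "the"}

def build_index(pages, keep_everything):
    """pages: dict of url -> text. Returns a dict mapping word -> set of urls."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            if not keep_everything and word in STOP_WORDS:
                continue
            index[word].add(url)
    return index

pages = {
    "http://example.com/spiders": "a spider follows the links on a page",
    "http://example.com/dns": "the resolver translates a name to an address",
}
full = build_index(pages, keep_everything=True)    # AltaVista-style: every word
lean = build_index(pages, keep_everything=False)   # drops a / an / the
print(sorted(full))
print(sorted(lean))
```

Either way, it is the index, not the raw pages, that the engine actually searches when a query comes in.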
The major search engines (Google, Yahoo, and others) handle over 95% of online searches, which makes them a true marketing powerhouse for anyone who understands how they work and how to use them.
Thanks to Napoleon H Russ | #search #engines