Search engines have played a major role in the evolution of the WWW since its first steps in the early 90s: simply put, the technology behind them changed the way information retrieval operated on the WWW, and gave faster access to an increasingly large corpus of ever-expanding text-based information. Not all search engines operate the same way, of course: in its early days, Yahoo was based on web page cataloguing done by human labor.
Subsequent search engines were based on automatic cataloguing and indexing of web pages, performed by automatic web crawlers (also known as web spiders, web robots and so on). Furthermore, additional ranking and relevance techniques have been applied to the automatically-indexed web page dataset, in order to further index each web page’s content, as well as conceptual relations among different web pages. Google’s PageRank algorithm falls into this category, analyzing the existing link structure among websites. Nevertheless, the technology employed by major search engines is based on the same principles:
• Automatic keyword-based indexing by web crawlers, and
• Keyword-based user search.
The basic notion behind most such search methodologies is keyword-based search: documents are retrieved, indexed and queried based on keywords/terms found in their content. These keywords determine the concepts of each web page, which is then classified accordingly. In other words, traditional search engine technology is based on text analysis.
Textual documents are by nature highly unstructured sources of information, so the methodologies above come with standard limitations. For instance, imagine a web page about wind surfing that does not contain the term water sport. It would show up in searches for wind surfing, but not in searches for water sports, even though wind surfing is a water sport.
This kind of relation is not taken into account by traditional search engines, as it relies on the semantics of the two terms, not on syntactic similarity between the keywords (for an introduction to semantics in the context of search engine technology, also see http://webandrank.com/web-ranking-and-semantics/).
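The gap between keyword matching and semantic matching can be made concrete with a short sketch. The page name, keyword sets and the tiny "is-a" taxonomy below are illustrative assumptions, not data from any real search engine:

```python
# Sketch: why keyword matching misses the wind-surfing page, and how a
# simple semantic "is-a" relation fixes it. All data here is illustrative.
page_keywords = {"windsurfing-guide.html": {"wind surfing", "board", "sail"}}

is_a = {"wind surfing": "water sport"}  # hypothetical semantic relation

def keyword_search(query):
    # Traditional approach: page matches only on exact keyword presence.
    return [p for p, kws in page_keywords.items() if query in kws]

def semantic_search(query):
    # Also match pages whose keywords are sub-concepts of the query.
    return [p for p, kws in page_keywords.items()
            if any(k == query or is_a.get(k) == query for k in kws)]

print(keyword_search("water sport"))   # no textual match: []
print(semantic_search("water sport"))  # the wind-surfing page is found
```

The only difference between the two functions is the lookup into the `is_a` relation, which is exactly the piece of knowledge a purely text-based engine lacks.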
Today’s WWW (Web 2.0, as it is sometimes referred to) is very different from its early-90s predecessor. A broader discussion of the way the WWW has evolved over time is out of the scope of this post, so we will just focus on the following two points:
• Web content is much richer in volumes, media types and data formats.
• Rapid progress in semantic data technology.
Rapid advances in internet connectivity and computing equipment have made the WWW accessible to an ever-expanding portion of the worldwide population. In addition, today’s WWW is much more collaborative in nature, and web content authoring is more widespread across the population. Blogs, forums and social network websites are prominent examples of these trends. As a result, web content nowadays is much larger in volume, more diverse, and much richer in concepts and relations across data sources.
The rapidly evolving Computer Science research field of semantic technology, on the other hand, studies algorithmic methodologies for cataloguing, classifying and examining relations among different pieces of information. Semantic Web technologies focus on a variety of research problems that can be categorized in two broad areas:
• Compiling information in data formats that provide semantic structure to data items (a variety of such formats exist, from plain XML to heavily-structured RDF-S variants).
• Extracting semantic relations from unstructured information sources.
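The first area can be illustrated with a toy example: the same fact stated as free text, and then as an XML structure that makes the individual data items explicit. The element and attribute names below are illustrative assumptions, not part of any standard vocabulary:

```python
# Toy illustration: free text vs. a semantically structured representation.
# Element/attribute names are made up for the example.
import xml.etree.ElementTree as ET

free_text = "President X was born in 1917."  # unstructured: just a string

person = ET.Element("Person", name="President X")
ET.SubElement(person, "birthYear").text = "1917"

# The structured form exposes the entity type, its name, and the property
# as separately addressable items, which a machine can query directly.
print(ET.tostring(person, encoding="unicode"))
```

Real semantic formats such as RDF go further, standardizing the vocabulary so that independently produced documents can be merged, but the basic idea of naming the data items is the same.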
We strongly believe that these forces will be instrumental in the future of web search.
A NEW BREED OF SEARCH ENGINES
Imagine yourself in front of your favorite search engine. Imagine you want to find out which of all past American presidents died the youngest. Chances are you cannot just place a query along the lines of “which American president died the youngest?”. Unless this specific piece of information is textually written in one specific source / web page (e.g. “President X is the youngest one to have died in American history to date”), this particular piece of information cannot be retrieved by a single search query.
Of course, you could compile the information in a semi-automatic way, by following a variant of the following process:
1. Run a query for “all American presidents”. Make a note of the list.
2. For each (deceased) person on the list, run a query along the lines of “president X date of birth”.
3. Repeat Step 2 for the date of death.
4. Subtract the two retrieved dates to calculate the age at death.
5. Compare all ages to determine the youngest.
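The steps above can indeed be sketched as a single loop. The three presidents below are just a hand-picked subset standing in for the full list that step 1 would return:

```python
from datetime import date

# Illustrative subset of deceased presidents with real birth/death dates;
# a full run would cover every deceased president from step 1.
presidents = {
    "George Washington": (date(1732, 2, 22), date(1799, 12, 14)),
    "Abraham Lincoln":   (date(1809, 2, 12), date(1865, 4, 15)),
    "John F. Kennedy":   (date(1917, 5, 29), date(1963, 11, 22)),
}

def age_at_death(born, died):
    # Steps 2-4: full years lived between birth and death.
    years = died.year - born.year
    if (died.month, died.day) < (born.month, born.day):
        years -= 1
    return years

# Step 5, as the single loop the text mentions.
youngest = min(presidents, key=lambda p: age_at_death(*presidents[p]))
print(youngest, age_at_death(*presidents[youngest]))
```

The hard part, of course, is not this loop but steps 1-3: getting a machine-readable list of presidents and their dates out of web documents in the first place.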
It sounds like a straightforward process. In fact, in any programming language, it could be written within a single for-loop statement. However, most search engines are not capable of compiling such a composite answer. This is exactly how this new breed of search engines can operate: they compile composite answers out of single queries; answers that, in turn, draw on multiple underlying data sources and the relations among them. In addition, the returned result is no longer a single document that contains the monolithic answer to the query, but rather an answer calculated from all of the above. This means that the search engine must be capable of performing the following tasks:
• Crawling and indexing web documents.
• Extracting semantic data types out of the information that is contained within each document.
• Combining the extracted data items within its data repository, according to pre-defined semantic rules.
• Storing information and relations across all semantic data items, so they can be queried in a composite fashion.
For instance, the data indexing system of such a search engine would crawl each web page and extract all human names, which would then be marked as entities. If an age is mentioned, the :Age property of the corresponding entity would be set as well. This property could, of course, be extracted from a different document in which the same entity appears. The resulting data model is then stored in the form of a graph network, where links no longer represent links across web documents, but rather semantic relations among the extracted semantic data types.
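A minimal sketch of this graph model is to store extracted facts as (subject, predicate, object) triples, so that properties found in different documents attach to the same entity node. The entity name and predicate labels below are illustrative assumptions, not a real vocabulary:

```python
# Sketch: a toy triple store merging facts from separately crawled documents.
triples = set()

def add_triple(subj, pred, obj):
    triples.add((subj, pred, obj))

# Document A mentions the entity and its type.
add_triple("President X", "rdf:type", "Person")
# Document B, crawled independently, contributes the :Age property,
# which attaches to the same entity node in the graph.
add_triple("President X", ":Age", 46)

# A composite query over the merged graph: ages of all Person entities.
ages = {s: o for (s, p, o) in triples if p == ":Age"
        and (s, "rdf:type", "Person") in triples}
print(ages)
```

Because both documents contribute triples about the same subject, the query can combine information that no single source document contains, which is precisely the capability keyword indexing lacks.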
An example semantic graph network is shown in the following figure:
Such search systems are not the norm yet, but working prototypes do exist. A prominent example is Wolfram Alpha, a search engine (or computational knowledge engine, as it is referred to) developed by British scientist Stephen Wolfram, also the creator of the Mathematica software suite. We will keep you posted on new exciting services in this area.