Information Fusion with ProFusion*
Susan Gauch
Department of Electrical Engineering and Computer Science
The University of Kansas, Lawrence, KS 66045
sgauch@eecs.ukans.edu
*http://www.eecs.ukans.edu/~sgauch/ddih.html
Abstract
The explosion of World Wide Web pages led to the development
of search engines to manage the information overload. Today, there is a
mini-explosion in World Wide Web search engines, which has led to the
development of ProFusion. Military personnel, like other users, do not have
the time to evaluate multiple search engines to knowledgeably select the best
for their uses. Nor do they have the time to submit each query to
multiple search engines and wade through the resulting flood of good
information, duplicated information, irrelevant information, and
missing documents. ProFusion is a meta search engine which sends user
queries to multiple underlying search engines in parallel, then retrieves and
merges the resulting URLs. It identifies and removes duplicates and
creates one relevance-ranked list. If desired, the actual documents can be
pre-fetched to remove yet more duplicates and broken links. The
performance of ProFusion compared to the individual search engines and
other meta searchers is currently being evaluated. A paper submitted
to WebNet '96 which describes the existing prototype (and the prototype
itself) are available from http://www.eecs.ukans.edu/~sgauch/ddih.html.
Recent extensions allow ProFusion to operate as an ongoing
information filtering system which notifies users of new developments in
their field. Ongoing work is focusing on making search processes more intelligent,
creating independent search agents which retrieve and analyse the documents
themselves, not merely the document URLs.
The result of these two thrusts will be an intelligent search assistant
which periodically searches the Web, collects the retrieved documents,
compares the results with results already obtained, and notifies the user
only of new and interesting results.
1. Introduction
There are a huge number of documents on the World Wide Web,
making it very difficult to locate information that is relevant to a user's
interest. Search tools such as InfoSeek[12] and Lycos[13] index huge
collections of Web documents, allowing users to search the World Wide
Web via keyword-based queries. Given a query, such search tools search
their individual index and present the user with a list of items that are
potentially relevant, generally presented in ranked order. However large
these indexes are, each search tool still indexes only a subset of all
documents available on the WWW. As more and more search tools become
available, each covering a different (overlapping) subset of Web
documents, it becomes increasingly difficult to choose the right one to use
for a specific information need. ProFusion has been developed to help
deal with this problem.
2. Related Work
There are several different approaches to managing the
proliferation of Web search engines. One solution is to use a large Web
page that lists several search engines and allows users to query one search
engine at a time. One example of this approach is All-in-One Search Page
[11]. Unfortunately, users still have to choose one search engine to
which to submit their search.
Another approach is to use intelligent agents
to bring back documents that are relevant to a user's interest. Such agents
[3][4] provide personal assistance to a user. For example, [3] describes an
adaptive agent that can bring back web pages of interest to a user daily. The
user gives relevance feedback to the agent by evaluating the web pages that were
brought back, and the agent then adjusts its future searches accordingly.
However, these agents [3, 4] gather information from only their
own search index, which may limit the amount of information they have
access to.
A different approach is the meta search
method which builds on top of other search engines. Queries are
submitted to the meta search engine which in turn sends the query to
multiple single search engines. When items are returned by
the underlying search engines, the meta search engine further processes them
and presents the relevant items to the user. ProFusion [9], developed at the
University of Kansas, is one such search engine.
The idea of using a single user interface for multiple distributed
information retrieval systems is not new. Initially, this work
concentrated on providing access to distributed, heterogeneous database
management systems [5]. More recently, meta searchers for the WWW have been
developed. For example, SavvySearch [8] selects the most promising
search engines automatically and then sends the user's query to the selected
search engines (usually 2 or 3) in parallel. SavvySearch does very little
post-processing. For example, the resulting document lists are not merged.
MetaCrawler [6, 7], on the other hand, sends the user's query
to all search engines it handles and collates search results from all search
engines. What distinguishes ProFusion from others is that it uses
sophisticated yet computationally simple methods to do post-processing.
3. Current ProFusion Prototype
3.1 General Architecture
ProFusion accepts a single query from the user and sends it to
multiple search engines in parallel. The current implementation of
ProFusion supports the following search engines: InfoSeek [12], Lycos [13],
Alta Vista [14], OpenText [15], WebCrawler [16], and Excite [17]. By default,
ProFusion will send a query to InfoSeek, Lycos, and Excite, but the user
may select any or all of the supported search engines. Search results
returned by the selected search engines are then further processed by
ProFusion. This post-processing includes merging the results to produce a
single ranked list, removing duplicates and dead references, and pre-fetching
documents for faster viewing and further analysis.
3.2 User Interface
ProFusion queries are simple to form: they are merely a few words
describing a concept. Online help is available via a help button that leads
users to a page explaining the query syntax, including sample queries.
Users need only enter a query and press the "Search" button; however,
there are several options available which give the user more control over
their search. The first option specifies whether or not the user wants to
have a short summary displayed for each retrieved item. The benefit of
displaying retrieved items without a summary is that a user can more
quickly scan retrieved items by title. The second option allows users to
select search engine(s) to which their query is sent. If more than one is
selected, the query is sent to selected search engines in parallel. All six
search engines can be selected if a user desires. Currently, the system
waits a maximum of 60 seconds for the search engines to return results;
making this timeout user-controllable will be an option added in the future.
3.3 Duplicate Removal
Duplicate removal is based on a few simple rules. If two items
have exactly the same URL, they are duplicates. Similarly, if one URL is
"http://server/" and another one is "http://server/index.html", they are
duplicates. This removes approximately 10 - 20% of the retrieved URLs.
However, if two items have different URLs but the same title,
they might be duplicates. In this case, we break a URL into three parts:
protocol, server, and path. We then use an n-gram method to test the
similarity of two paths. If they are sufficiently similar, we consider them
as duplicates. This appears to work very well in practice, removing an
additional 10 - 20% of the URLs, but runs the risk
that the URLs point to different versions of the same document, where
one is more up-to-date than the other. To avoid this risk, we could
retrieve the potential duplicates in whole or in part, and then compare
the two documents. However, this would increase network traffic and
might be substantially slower. This capability has been developed, and
will soon be added as an option.
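The rules above can be sketched as follows. The paper does not specify the n-gram size or the similarity threshold, so this sketch assumes character bigrams, a Dice coefficient, and a threshold of 0.8; the function and parameter names are illustrative, not taken from the actual Perl implementation.

```python
from urllib.parse import urlparse

def _ngrams(s, n=2):
    """Set of character n-grams (bigrams by default) of a string."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def _normalize(url):
    """Treat 'http://server/' and 'http://server/index.html' as the same."""
    p = urlparse(url)
    path = p.path
    if path.endswith("/index.html"):
        path = path[:-len("index.html")]
    if not path:
        path = "/"
    return (p.scheme, p.netloc.lower(), path)

def path_similarity(a, b, n=2):
    """Dice coefficient over character n-grams of two URL paths."""
    ga, gb = _ngrams(a, n), _ngrams(b, n)
    if not ga or not gb:
        return float(a == b)
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def are_duplicates(item_a, item_b, threshold=0.8):
    """item = (url, title). Rule 1: same normalized URL.
    Rule 2: same title and sufficiently similar paths."""
    (url_a, title_a), (url_b, title_b) = item_a, item_b
    if _normalize(url_a) == _normalize(url_b):
        return True
    if title_a == title_b:
        pa, pb = urlparse(url_a).path, urlparse(url_b).path
        return path_similarity(pa, pb) >= threshold
    return False
```

As noted above, the same-title rule can misfire on distinct versions of a document, so the threshold trades recall of duplicates against that risk.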
3.4 Merge Algorithms
How to best merge individual ranked lists is an open question in
searching distributed information collections [2]. Callan [1]
evaluated merging techniques based on rank order, raw scores,
normalized statistics, and weighted scores. He found that the weighted
score merge is computationally simple yet as effective as the more
expensive normalized statistics merge. Therefore, in ProFusion, we use a
weighted score merging algorithm which is based on two factors: the
value of the query-document match reported by the search engine (Mdi)
and the estimated accuracy of that search engine (CFi).
For a search engine i, we calculated its confidence factor, CFi, by
evaluating its performance on a set of over 25 queries. The CFi reflects
the total number of relevant documents in the top 10 hits and the ranking
accuracy for those relevant documents. Based on the results, the search
engines were assigned CFis ranging from 0.75 to 0.85. More work needs
to be done to systematically calculate and update the CFis, particularly
developing CFis which vary for a given search engine based on the
domain of the query.
When a set of documents is returned by search engine i, we
calculate the match factor for each document d, Mdi, by normalizing all
scores in the retrieval set to fall between 0 and 1. We do this by
dividing all values by the match value reported for the top ranking document.
If the match values reported by the search engine fall between 0 and 1,
they are unchanged. Then, we calculate the relevance weight for each
document d, Rdi, by multiplying its match factor, Mdi, by the search engine's
confidence factor, CFi. The document's final rank is then determined by
merging the sorted document lists based on their relevance weights, Rdi.
Duplicate removal is done within the merging algorithm, and the remaining
document's weight is the maximum value of Rdi reported by the multiple search
engines.
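Under the definitions above, a minimal sketch of the weighted score merge might look like this (in Python rather than the prototype's Perl; the data shapes and names are assumptions):

```python
def merge_results(result_sets, confidence):
    """
    result_sets: {engine: [(url, title, raw_score), ...]}
    confidence:  {engine: CF_i}  (the paper reports values of 0.75-0.85)
    Returns one list of (R_di, url, title) sorted by relevance weight
    R_di = M_di * CF_i, keeping the maximum weight for any URL that is
    returned by several engines (duplicate removal during the merge).
    """
    best = {}
    for engine, items in result_sets.items():
        if not items:
            continue
        top = max(score for _, _, score in items)
        # Normalize by the top-ranked match value; scores already in
        # [0, 1] are left unchanged, as described in the text.
        scale = top if top > 1 else 1.0
        for url, title, score in items:
            m = score / scale              # match factor M_di in [0, 1]
            r = m * confidence[engine]     # relevance weight R_di
            if url not in best or r > best[url][0]:
                best[url] = (r, title)
    return sorted(((r, u, t) for u, (r, t) in best.items()), reverse=True)
```

For example, a document ranked mid-list by a high-confidence engine can outrank the top hit of a low-confidence one, which is the intended effect of the CFi weighting.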
3.5 Search Result Presentation
The merge process described in the previous section yields a single
sorted list of items, each composed of a URL, a title, a relevance weight,
and a short summary. These items are then displayed to the user in
sorted order, with or without the summary, depending on the user's
preference.
3.6 Other Implementation Details
ProFusion is written in Perl and is portable to any Unix platform.
It contains one Perl module for each search engine (currently six) which
forms syntactically correct queries and parses the search results to
extract each item's information. Other modules handle the user interface,
the document post-processing, and document fetching. Due to its
modular nature, it is easy to extend ProFusion to additional search
engines.
ProFusion's main process creates multiple parallel sub-processes and
each sub-process sends a search request to one search engine and extracts
information from the results returned by the search engine. The main process
begins post-processing when all sub-processes terminate by returning
their results or by timing out (60 seconds in the current prototype).
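A rough sketch of this fan-out-and-timeout structure, using Python threads in place of the prototype's Perl sub-processes (the engine callables and names are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor, wait

TIMEOUT = 60  # seconds, as in the current prototype

def query_all(engines, query, timeout=TIMEOUT):
    """
    engines: {name: callable(query) -> list of result items}.
    Runs one worker per engine in parallel; engines that have not
    returned within `timeout` seconds contribute nothing, and
    post-processing starts from whatever results arrived in time.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {pool.submit(fn, query): name for name, fn in engines.items()}
        done, not_done = wait(futures, timeout=timeout)
        for fut in done:
            try:
                results[futures[fut]] = fut.result()
            except Exception:
                pass  # a failed engine simply drops out of the merge
        for fut in not_done:
            fut.cancel()
    return results
```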
4. Information Filtering
We have extended the prototype so that the user can save a particular
search and have it automatically rerun on a periodic basis (e.g., daily,
weekly, or monthly). The results of previous searches are stored along with
feedback from the user, if given, about whether or not the documents were
of interest. When a search is rerun, the top URLs are examined. If there are new Web pages,
the user receives email announcing the availability of new information.
A query-specific Web page is built which summarizes the results, highlighting the new
documents. Thus, the system does the work continuously in the background, collecting
results for the user to view at his convenience. If the user marks documents
as irrelevant, they are remembered (so they will not be re-presented to the
user if they are identified by future searches) but are dropped from the
results page.
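The bookkeeping described above, keeping previously seen and user-rejected URLs out of the notification, might be sketched as follows (names and data shapes are assumptions, not the actual implementation):

```python
def new_results(current, seen, irrelevant):
    """
    current:    ranked list of (url, title) from the rerun search
    seen:       set of URLs already shown in previous runs
    irrelevant: set of URLs the user marked as not of interest
    Returns only the unseen, not-rejected items; the caller would
    email the user and rebuild the query-specific results page
    from these, then records the current run as seen.
    """
    fresh = [(u, t) for u, t in current
             if u not in seen and u not in irrelevant]
    seen.update(u for u, _ in current)
    return fresh
```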
Current work will increase the intelligence of the search engine by
analyzing the contents of retrieved documents to improve the ranking and by
incorporating user preferences (e.g., whether they prefer content-bearing pages,
which contain mostly text, or summary pages, which primarily contain
links to further pages). Drawing on background work in corpus linguistics
and information retrieval [18], we will identify words
from relevant documents which can be used to automatically expand and
improve the user's query. As the retrieval sets grow, they will be
clustered based on their contents for better scanning of the results.
Finally, for truly broad coverage of an area, automatic query-specific
spiders will be incorporated which search out relevant documents by
starting from user-identified relevant documents.
Acknowledgments
ProFusion development was funded by the University of Kansas General Research
Fund. It runs on equipment provided through National Science
Foundation Award CDA-9401021. The corpus linguistics project is
funded by the National Science Foundation Award IRI-9409263.
References
[1] James Callan, Zhihong Lu, Bruce Croft, "Searching Distributed Collections
With Inference Networks," 18th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, 1995
[2] E. M. Voorhees, N. K. Gupta, and B. Johnson-Laird, "The Collection Fusion
Problem," in The Third Text REtrieval Conference (TREC-3), NIST special
publication 500-225 (D. K. Harman, ed.)
[3] M. Balabanovic, Y. Shoham, Y. Yun, "An Adaptive Agent for Automated Web
Browsing," Journal of Image Representation and Visual Communication 6(4),
December 1995
[4] A. Knoblock, Y. Arens, C. Hsu, "Cooperating Agents for Information
Retrieval," Proceedings of the second international conference on cooperative
information systems, University of Toronto Press, Toronto, Canada, 1994
[5] Y. Arens, C. Chee, C. Hsu, C. Knoblock, "Retrieving and Integrating Data
From Multiple Information Sources," Journal on Intelligent and Cooperative
Information Systems, 2(2), 1993, pp. 127-158
[6] Erik Selberg, Oren Etzioni, "Multi-Service Search and Comparison Using the
MetaCrawler," WWW4 conference, December 1995
[7] MetaCrawler search home page
URL:
[8] Daniel Dreilinger, Savvy Search Home Page,
URL:
[9] ProFusion search home page
URL:
[10] Sun Microsystems, Inc., Multithreaded Query Page
URL:
[11] William Cross, All-in-one Search Page
URL:
[12] InfoSeek Corporation, InfoSeek Home Page,
URL:
[13] Lycos Inc., Lycos Home Page,
URL:
[14] Digital Equipment Corporation, Alta Vista Home Page,
URL:
[15] Open Text, Inc., Open Text Web Index Home Page,
URL:
[16] WebCrawler home page
URL:
[17] Excite home page
URL:
[18] Susan Gauch and Meng Kam Chong, "Automatic Word Similarity for
TREC4 Query Expansion," Proc. of TREC4, Nov. 1995, Gaithersburg, MD
(to appear).