Information Fusion with ProFusion*

Susan Gauch
Department of Electrical Engineering and Computer Science
The University of Kansas, Lawrence, KS 66045
sgauch@eecs.ukans.edu

*http://www.eecs.ukans.edu/~sgauch/ddih.html

Abstract

The explosion of World Wide Web pages led to the development of search engines to manage the information overload. Today, there is a mini-explosion in the number of World Wide Web search engines, which has in turn led to the development of ProFusion. Military personnel, like other users, do not have the time to evaluate multiple search engines to knowledgeably select the best one for their needs. Nor do they have the time to submit each query to multiple search engines and wade through the resulting flood of good information, duplicated information, irrelevant information, and missing documents. ProFusion is a meta search engine which sends user queries to multiple underlying search engines in parallel, then retrieves and merges the resulting URLs. It identifies and removes duplicates and creates one relevance-ranked list. If desired, the actual documents can be pre-fetched to remove yet more duplicates and broken links. The performance of ProFusion compared to the individual search engines and other meta searchers is currently being evaluated. A paper submitted to WebNet '96 describing the existing prototype, and the prototype itself, are available from http://www.eecs.ukans.edu/~sgauch/ddih.html. Recent extensions allow ProFusion to operate as an ongoing information filtering system which notifies users of new developments in their field. Ongoing work is focusing on making the search process more intelligent, creating independent search agents which retrieve and analyse the documents themselves, not merely the document URLs. The result of these two thrusts will be an intelligent search assistant which periodically searches the Web, collects the retrieved documents, compares the results with those already obtained, and notifies the user only of new and interesting results.

1. Introduction

There are a huge number of documents on the World Wide Web, making it very difficult to locate information that is relevant to a user's interests. Search tools such as InfoSeek [12] and Lycos [13] index huge collections of Web documents, allowing users to search the World Wide Web via keyword-based queries. Given a query, each tool searches its own index and presents the user with a list of potentially relevant items, generally in ranked order. However large these indexes are, each search tool still covers only a subset of all the documents available on the WWW. As more and more search tools become available, each covering a different (and overlapping) subset of Web documents, it becomes increasingly difficult to choose the right one for a specific information need. ProFusion has been developed to help deal with this problem.

2. Related Work

There are several different approaches to managing the proliferation of Web search engines. One solution is a large Web page that lists several search engines and allows users to query one search engine at a time; one example of this approach is the All-in-One Search Page [11]. Unfortunately, users still have to choose a single search engine to which to submit their search. Another approach is to use intelligent agents to bring back documents that are relevant to a user's interests. Such agents [3, 4] provide personal assistance to a user. For example, [3] describes an adaptive agent that can bring back Web pages of interest to its user on a daily basis.
The user gives relevance feedback to the agent by evaluating the Web pages that were brought back, and the agent then adjusts its future searches based on the pages judged relevant. However, these agents [3, 4] gather information from only their own search index, which may limit the amount of information they have access to.

A different approach is the meta search method, which builds on top of other search engines. Queries are submitted to the meta search engine, which in turn sends the query to multiple individual search engines. When the underlying search engines return their retrieved items, the meta search engine further processes these items and presents the relevant ones to the user. ProFusion [9], developed at the University of Kansas, is one such search engine.

The idea of using a single user interface for multiple distributed information retrieval systems is not new. Initially, this work concentrated on providing access to distributed, heterogeneous database management systems [5]. More recently, meta searchers for the WWW have been developed. For example, SavvySearch [8] selects the most promising search engines automatically and then sends the user's query to the selected search engines (usually 2 or 3) in parallel. SavvySearch does very little post-processing; for example, the resulting document lists are not merged. MetaCrawler [6, 7], on the other hand, sends the user's query to all the search engines it handles and collates the results from all of them. What distinguishes ProFusion from the others is its use of sophisticated yet computationally simple post-processing methods.

3. Current ProFusion Prototype

3.1 General Architecture

ProFusion accepts a single query from the user and sends it to multiple search engines in parallel. The current implementation of ProFusion supports the following search engines: InfoSeek [12], Lycos [13], Alta Vista [14], OpenText [15], WebCrawler [16], and Excite [17]. By default, ProFusion sends a query to InfoSeek, Lycos, and Excite, but the user may select any or all of the supported search engines. The results returned by the selected search engines are then further processed by ProFusion. This post-processing includes merging the results into a single ranked list, removing duplicates and dead references, and pre-fetching documents for faster viewing and further analysis.

3.2 User Interface

ProFusion queries are simple to form: they are merely a few words describing a concept. Online help is available via a help button that leads users to a page explaining the query syntax, including sample queries. Users need only enter a query and press the "Search" button; however, several options give the user more control over the search. The first option specifies whether or not a short summary is displayed for each retrieved item. The benefit of displaying retrieved items without summaries is that the user can scan the titles more quickly. The second option allows users to select the search engine(s) to which their query is sent. If more than one is selected, the query is sent to the selected search engines in parallel, and all six search engines may be selected if the user desires. Currently, the system waits a maximum of 60 seconds for the search engines to return results; letting the user control this time limit will be added as an option in the future.
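To make this parallel dispatch concrete, the following is a minimal sketch, in Perl (the language of the prototype), of a fork-per-engine dispatch with an overall 60-second wait. The engine list, the temporary-file convention, and query_engine() are illustrative stand-ins rather than the prototype's actual per-engine modules.

    #!/usr/bin/perl
    # Minimal sketch of the parallel dispatch: one child process per
    # selected search engine, with an overall timeout on the parent side.
    # query_engine() is a hypothetical stand-in for the per-engine modules.
    use strict;
    use warnings;

    my $TIMEOUT = 60;                              # seconds to wait for results
    my @engines = qw(InfoSeek Lycos Excite);       # default engine set
    my $query   = join ' ', @ARGV;

    my %pid_to_engine;
    for my $engine (@engines) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                           # child: query one engine
            query_engine($engine, $query);         # e.g. writes results to a temp file
            exit 0;
        }
        $pid_to_engine{$pid} = $engine;            # parent: remember the child
    }

    # Parent: wait for all children, but give up after $TIMEOUT seconds.
    eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm $TIMEOUT;
        while (%pid_to_engine) {
            my $pid = wait();
            last if $pid == -1;
            delete $pid_to_engine{$pid};
        }
        alarm 0;
    };
    kill 'TERM', keys %pid_to_engine if %pid_to_engine;   # drop engines that timed out

    sub query_engine {
        my ($engine, $query) = @_;
        # Placeholder: a real module would format the engine-specific query,
        # fetch the result page over HTTP, and parse out URL/title/score/summary.
    }

Because each engine runs in its own sub-process, a slow or unresponsive engine costs at most the shared timeout rather than delaying the others.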
3.3 Duplicate Removal

Duplicate removal is based on a few simple rules. If two items have exactly the same URL, they are duplicates. Similarly, if one URL is "http://server/" and the other is "http://server/index.html", they are duplicates. These rules remove approximately 10-20% of the retrieved URLs. However, if two items have different URLs but the same title, they might still be duplicates. In this case, we break each URL into three parts: protocol, server, and path, and use an n-gram method to test the similarity of the two paths. If the paths are sufficiently similar, we consider the items duplicates. This appears to work very well in practice, removing an additional 10-20% of the URLs, but it runs the risk that the URLs point to different versions of the same document, one more up-to-date than the other. To avoid this risk, we could retrieve the potential duplicates in whole or in part and compare the two documents directly; however, this would increase network traffic and might be substantially slower. This capability has been developed and will soon be added as an option.

3.4 Merge Algorithms

How best to merge individual ranked lists is an open question in searching distributed information collections [2]. Callan [1] evaluated merging techniques based on rank order, raw scores, normalized statistics, and weighted scores, and found that the weighted score merge is computationally simple yet as effective as the more expensive normalized statistics merge. Therefore, ProFusion uses a weighted score merging algorithm based on two factors: the value of the query-document match reported by the search engine (Mdi) and the estimated accuracy of that search engine (CFi). For each search engine i, we calculated its confidence factor, CFi, by evaluating its performance on a set of over 25 queries. The CFi reflects the total number of relevant documents in the top 10 hits and the ranking accuracy for those relevant documents. Based on the results, the search engines were assigned CFis ranging from 0.75 to 0.85. More work needs to be done to systematically calculate and update the CFis, particularly to develop CFis that vary for a given search engine depending on the domain of the query.

When a set of documents is returned by search engine i, we calculate the match factor, Mdi, for each document d by normalizing all scores in the retrieval set to fall between 0 and 1. We do this by dividing all values by the match value reported for the top-ranked document; if the match values reported by the search engine already fall between 0 and 1, they are left unchanged. We then calculate the relevance weight, Rdi, for each document d by multiplying its match factor by the search engine's confidence factor (that is, Rdi = Mdi * CFi). The document's final rank is determined by merging the sorted document lists based on their relevance weights, Rdi. Duplicate removal is done within the merging algorithm, and the remaining document's weight is the maximum Rdi reported across the search engines.

3.5 Search Result Presentation

The merge process described in the previous section yields a single sorted list of items, each composed of a URL, a title, a relevance weight, and a short summary. These items are displayed to the user in sorted order, with or without the summaries, depending on the user's preference.
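As an illustration of the merge, the sketch below normalizes each engine's scores by its top-ranked score to obtain Mdi, multiplies by the engine's confidence factor CFi to obtain Rdi, and keeps the maximum Rdi when several engines return the same URL. The CF values, the input data structure, and the simple URL canonicalization are illustrative only; in particular, the n-gram comparison of paths for same-title items is omitted here.

    #!/usr/bin/perl
    # Sketch of the weighted-score merge: Rdi = Mdi * CFi, duplicates kept
    # at their maximum Rdi.  CF values and input structure are illustrative.
    use strict;
    use warnings;

    my %CF = (InfoSeek => 0.85, Lycos => 0.80, Excite => 0.75);

    # merge_results() takes engine => reference-to-list of {url, title, score}
    # items (ranked, top item first) and returns one list sorted by Rdi.
    sub merge_results {
        my (%results) = @_;
        my %best;                                  # canonical URL => merged item
        for my $engine (keys %results) {
            my @items = @{ $results{$engine} };
            next unless @items;
            my $cf  = $CF{$engine} || 0.75;        # default CF for unlisted engines
            my $top = $items[0]{score} || 1;       # top-ranked score for this engine
            for my $item (@items) {
                # Mdi: divide by the top score unless scores already lie in [0, 1].
                my $m   = $top > 1 ? $item->{score} / $top : $item->{score};
                my $r   = $m * $cf;                # Rdi = Mdi * CFi
                my $key = canonical_url($item->{url});
                if (!exists $best{$key} || $r > $best{$key}{weight}) {
                    $best{$key} = { %$item, weight => $r };   # keep the maximum Rdi
                }
            }
        }
        return sort { $b->{weight} <=> $a->{weight} } values %best;
    }

    # A small part of the duplicate test: identical URLs, and a trailing
    # "index.html" treated the same as the bare server path.
    sub canonical_url {
        my ($url) = @_;
        $url =~ s{/index\.html?$}{/};
        return lc $url;
    }

Keeping only the maximum Rdi for a duplicate matches the behaviour described above: the most confident engine determines the merged document's final rank.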
3.6 Other Implementation Details

ProFusion is written in Perl and is portable to any Unix platform. It contains one Perl module for each search engine (currently six); each module forms syntactically correct queries and parses the search results to extract each item's information. Other modules handle the user interface, document post-processing, and document fetching. Due to its modular nature, ProFusion is easy to extend to additional search engines. ProFusion's main process creates multiple parallel sub-processes; each sub-process sends a search request to one search engine and extracts information from the results that engine returns. The main process begins post-processing when all sub-processes have terminated, either by returning their results or by timing out (60 seconds in the current prototype).

4. Information Filtering

We have extended the prototype so that the user can save a particular search and have it automatically rerun on a periodic basis (i.e., daily, weekly, or monthly). The results of previous searches are stored along with feedback from the user, if given, about whether or not the documents were of interest. When a search is rerun, the top URLs are examined. If there are new Web pages, the user receives email announcing the availability of new information, and a query-specific Web page is built which summarizes the results, highlighting the new documents. Thus, the system works continuously in the background, collecting results for the user to view at his convenience. If the user marks documents as irrelevant, they are remembered (so they will not be re-presented to the user if they are identified by future searches) but are dropped from the results page.

Current work will increase the intelligence of the search engine by analyzing the contents of retrieved documents to improve the ranking and by incorporating user preferences (e.g., whether they prefer content-bearing pages, which contain mostly text, or summary pages, which primarily contain links to further pages). Drawing on background work in corpus linguistics and information retrieval [18], we will identify words from relevant documents which can be used to automatically expand and improve the user's query. As the retrieval sets grow, they will be clustered based on their contents for easier scanning of the results. Finally, for truly broad coverage of an area, automatic query-specific spiders will be incorporated which search out relevant documents by starting from user-identified relevant documents.
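The following sketch illustrates the bookkeeping step of such a saved search under an assumed file-based storage scheme: a fresh result list is compared against the URLs already shown to the user, anything previously marked irrelevant is suppressed, and only genuinely new items are passed on for the notification email and the query-specific results page. The file layout and helper names are assumptions, not the prototype's actual storage.

    #!/usr/bin/perl
    # Sketch of the filtering step for a saved search: report only results
    # that are new and not previously marked irrelevant by the user.
    use strict;
    use warnings;

    sub filter_new_results {
        my ($new_items, $seen_file, $irrelevant_file) = @_;
        my %seen       = map { $_ => 1 } read_urls($seen_file);
        my %irrelevant = map { $_ => 1 } read_urls($irrelevant_file);

        my @new;
        for my $item (@$new_items) {
            my $url = $item->{url};
            next if $irrelevant{$url};             # remembered, never re-presented
            next if $seen{$url};                   # already reported on an earlier run
            push @new, $item;
        }
        append_urls($seen_file, map { $_->{url} } @new);   # remember for the next run
        return @new;                               # email / results page built from these
    }

    sub read_urls {
        my ($file) = @_;
        open my $fh, '<', $file or return ();
        chomp(my @urls = <$fh>);
        return @urls;
    }

    sub append_urls {
        my ($file, @urls) = @_;
        return unless @urls;
        open my $fh, '>>', $file or die "cannot write $file: $!";
        print {$fh} "$_\n" for @urls;
    }

Rerun at the user's chosen interval, a step like this reports each document at most once and keeps documents marked irrelevant from reappearing.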
Knoblock, "Retrieving and Integrating Data >From Multiple Information Sources," Journal on Intelligent and Cooperative Information Systems, 2(2), 1993, Page 127-158 [6] Erik Selberg, Oren Etzioni, "Multi-Service Search and Comparison Using the MetaCrawler," WWW4 conference, December 1995 [7] MetaCrawler search home page URL: [8] Daniel Dreilinger, Savvy Search Home Page, URL: [9] ProFusion search home page URL: [10] Sun Microsystems, Inc., Multithreaded Query Page URL: [11] William Cross, All-in-one Search Page URL: [12] InfoSeek Corporation, InfoSeek Home Page, URL: [13] Lycos Inc., Lycos Home Page, URL: [14] Digital Equipment Corporation, Alta Vista Home Page, URL: [15] Open Text, Inc., Open Text Web Index Home Page, URL: [16] WebCrawler home page URL: [17] Excite home page URL: [18] Susan Gauch and Meng Kam Chong, "Automatic Word Similarity for TREC4 Query Expansion," Proc. of TREC4, Nov. 1995, Gaithersburg, MD (to appear).