Information Fusion with ProFusion*

Susan Gauch
Department of Electrical Engineering and Computer Science
The University of Kansas, Lawrence, KS 66045
sgauch@eecs.ukans.edu

*http://www.eecs.ukans.edu/~sgauch/ddih.html

Abstract

The explosion of World Wide Web pages led to the development of search engines to manage the information overload. Today, there is a mini-explosion in the number of World Wide Web search engines, which has in turn led to the development of ProFusion. Military personnel, like other users, do not have the time to evaluate multiple search engines to knowledgeably select the best one for their needs. Nor do they have the time to submit each query to multiple search engines and wade through the resulting flood of good information, duplicated information, irrelevant information, and missing documents. ProFusion is a meta search engine which sends user queries to multiple underlying search engines in parallel, then retrieves and merges the resulting URLs. It identifies and removes duplicates and creates one relevance-ranked list. If desired, the actual documents can be pre-fetched to remove yet more duplicates and broken links. The performance of ProFusion compared to the individual search engines and other meta searchers is currently being evaluated. A paper submitted to WebNet '96 describing the existing prototype, and the prototype itself, are available from http://www.eecs.ukans.edu/~sgauch/ddih.html. Recent extensions allow ProFusion to operate as an ongoing information filtering system which notifies users of new developments in their field. Ongoing work is focusing on making the search process more intelligent, creating independent search agents which retrieve and analyse the documents themselves, not merely the document URLs. The result of these two thrusts will be an intelligent search assistant which periodically searches the Web, collects the retrieved documents, compares the results with those already obtained, and notifies the user only of new and interesting results.

1. Introduction

There are a huge number of documents on the World Wide Web, making it very difficult to locate information that is relevant to a user's interests. Search tools such as InfoSeek [12] and Lycos [13] index huge collections of Web documents, allowing users to search the World Wide Web via keyword-based queries. Given a query, each tool searches its own index and presents the user with a list of potentially relevant items, generally in ranked order. However large these indexes are, each search tool still covers only a subset of all the documents available on the WWW. As more and more search tools become available, each covering a different (and overlapping) subset of Web documents, it becomes increasingly difficult to choose the right one for a specific information need. ProFusion has been developed to help deal with this problem.

2. Related Work

There are several different approaches to managing the proliferation of Web search engines. One solution is a large Web page that lists several search engines and allows users to query one search engine at a time; one example of this approach is the All-in-One Search Page [11]. Unfortunately, users still have to choose a single search engine to which to submit their search. Another approach is to use intelligent agents to bring back documents that are relevant to a user's interests. Such agents [3, 4] provide personal assistance to a user. For example, [3] describes an adaptive agent that can bring back Web pages of interest to its user on a daily basis.
The user gives relevance feedback to the agent by evaluating the Web pages that were brought back, and the agent then adjusts its future searches based on the pages judged relevant. However, these agents [3, 4] gather information from only their own search index, which may limit the amount of information they have access to.

A different approach is the meta search method, which builds on top of other search engines. Queries are submitted to the meta search engine, which in turn sends the query to multiple individual search engines. When the underlying search engines return their retrieved items, the meta search engine further processes these items and presents the relevant ones to the user. ProFusion [9], developed at the University of Kansas, is one such search engine.

The idea of using a single user interface for multiple distributed information retrieval systems is not new. Initially, this work concentrated on providing access to distributed, heterogeneous database management systems [5]. More recently, meta searchers for the WWW have been developed. For example, SavvySearch [8] selects the most promising search engines automatically and then sends the user's query to the selected search engines (usually 2 or 3) in parallel. SavvySearch does very little post-processing; for example, the resulting document lists are not merged. MetaCrawler [6, 7], on the other hand, sends the user's query to all the search engines it handles and collates the results from all of them. What distinguishes ProFusion from the others is its use of sophisticated yet computationally simple post-processing methods.

3. Current ProFusion Prototype

3.1 General Architecture

ProFusion accepts a single query from the user and sends it to multiple search engines in parallel. The current implementation of ProFusion supports the following search engines: InfoSeek [12], Lycos [13], Alta Vista [14], OpenText [15], WebCrawler [16], and Excite [17]. By default, ProFusion sends a query to InfoSeek, Lycos, and Excite, but the user may select any or all of the supported search engines. The results returned by the selected search engines are then further processed by ProFusion. This post-processing includes merging the results into a single ranked list, removing duplicates and dead references, and pre-fetching documents for faster viewing and further analysis.

3.2 User Interface

ProFusion queries are simple to form: they are merely a few words describing a concept. Online help is available via a help button that leads users to a page explaining the query syntax, including sample queries. Users need only enter a query and press the "Search" button; however, several options give the user more control over the search. The first option specifies whether or not a short summary is displayed for each retrieved item. The benefit of displaying retrieved items without summaries is that the user can scan the titles more quickly. The second option allows users to select the search engine(s) to which their query is sent. If more than one is selected, the query is sent to the selected search engines in parallel, and all six search engines may be selected if the user desires. Currently, the system waits a maximum of 60 seconds for the search engines to return results; letting the user control this time limit will be added as an option in the future.
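To make this parallel dispatch concrete, the following is a minimal sketch, in Perl (the language of the prototype), of a fork-per-engine dispatch with an overall 60-second wait. The engine list, the temporary-file convention, and query_engine() are illustrative stand-ins rather than the prototype's actual per-engine modules.

    #!/usr/bin/perl
    # Minimal sketch of the parallel dispatch: one child process per
    # selected search engine, with an overall timeout on the parent side.
    # query_engine() is a hypothetical stand-in for the per-engine modules.
    use strict;
    use warnings;

    my $TIMEOUT = 60;                              # seconds to wait for results
    my @engines = qw(InfoSeek Lycos Excite);       # default engine set
    my $query   = join ' ', @ARGV;

    my %pid_to_engine;
    for my $engine (@engines) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                           # child: query one engine
            query_engine($engine, $query);         # e.g. writes results to a temp file
            exit 0;
        }
        $pid_to_engine{$pid} = $engine;            # parent: remember the child
    }

    # Parent: wait for all children, but give up after $TIMEOUT seconds.
    eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm $TIMEOUT;
        while (%pid_to_engine) {
            my $pid = wait();
            last if $pid == -1;
            delete $pid_to_engine{$pid};
        }
        alarm 0;
    };
    kill 'TERM', keys %pid_to_engine if %pid_to_engine;   # drop engines that timed out

    sub query_engine {
        my ($engine, $query) = @_;
        # Placeholder: a real module would format the engine-specific query,
        # fetch the result page over HTTP, and parse out URL/title/score/summary.
    }

Because each engine runs in its own sub-process, a slow or unresponsive engine costs at most the shared timeout rather than delaying the others.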
3.3 Duplicate Removal

Duplicate removal is based on a few simple rules. If two items have exactly the same URL, they are duplicates. Similarly, if one URL is "http://server/" and the other is "http://server/index.html", they are duplicates. These rules remove approximately 10-20% of the retrieved URLs. However, if two items have different URLs but the same title, they might still be duplicates. In this case, we break each URL into three parts: protocol, server, and path, and use an n-gram method to test the similarity of the two paths. If the paths are sufficiently similar, we consider the items duplicates. This appears to work very well in practice, removing an additional 10-20% of the URLs, but it runs the risk that the URLs point to different versions of the same document, one more up-to-date than the other. To avoid this risk, we could retrieve the potential duplicates in whole or in part and compare the two documents directly; however, this would increase network traffic and might be substantially slower. This capability has been developed and will soon be added as an option.

3.4 Merge Algorithms

How best to merge individual ranked lists is an open question in searching distributed information collections [2]. Callan [1] evaluated merging techniques based on rank order, raw scores, normalized statistics, and weighted scores, and found that the weighted score merge is computationally simple yet as effective as the more expensive normalized statistics merge. Therefore, ProFusion uses a weighted score merging algorithm based on two factors: the value of the query-document match reported by the search engine (Mdi) and the estimated accuracy of that search engine (CFi). For each search engine i, we calculated its confidence factor, CFi, by evaluating its performance on a set of over 25 queries. The CFi reflects the total number of relevant documents in the top 10 hits and the ranking accuracy for those relevant documents. Based on the results, the search engines were assigned CFis ranging from 0.75 to 0.85. More work needs to be done to systematically calculate and update the CFis, particularly to develop CFis that vary for a given search engine depending on the domain of the query.

When a set of documents is returned by search engine i, we calculate the match factor, Mdi, for each document d by normalizing all scores in the retrieval set to fall between 0 and 1. We do this by dividing all values by the match value reported for the top-ranked document; if the match values reported by the search engine already fall between 0 and 1, they are left unchanged. We then calculate the relevance weight, Rdi, for each document d by multiplying its match factor by the search engine's confidence factor (that is, Rdi = Mdi * CFi). The document's final rank is determined by merging the sorted document lists based on their relevance weights, Rdi. Duplicate removal is done within the merging algorithm, and the remaining document's weight is the maximum Rdi reported across the search engines.

3.5 Search Result Presentation

The merge process described in the previous section yields a single sorted list of items, each composed of a URL, a title, a relevance weight, and a short summary. These items are displayed to the user in sorted order, with or without the summaries, depending on the user's preference.
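As an illustration of the merge, the sketch below normalizes each engine's scores by its top-ranked score to obtain Mdi, multiplies by the engine's confidence factor CFi to obtain Rdi, and keeps the maximum Rdi when several engines return the same URL. The CF values, the input data structure, and the simple URL canonicalization are illustrative only; in particular, the n-gram comparison of paths for same-title items is omitted here.

    #!/usr/bin/perl
    # Sketch of the weighted-score merge: Rdi = Mdi * CFi, duplicates kept
    # at their maximum Rdi.  CF values and input structure are illustrative.
    use strict;
    use warnings;

    my %CF = (InfoSeek => 0.85, Lycos => 0.80, Excite => 0.75);

    # merge_results() takes engine => reference-to-list of {url, title, score}
    # items (ranked, top item first) and returns one list sorted by Rdi.
    sub merge_results {
        my (%results) = @_;
        my %best;                                  # canonical URL => merged item
        for my $engine (keys %results) {
            my @items = @{ $results{$engine} };
            next unless @items;
            my $cf  = $CF{$engine} || 0.75;        # default CF for unlisted engines
            my $top = $items[0]{score} || 1;       # top-ranked score for this engine
            for my $item (@items) {
                # Mdi: divide by the top score unless scores already lie in [0, 1].
                my $m   = $top > 1 ? $item->{score} / $top : $item->{score};
                my $r   = $m * $cf;                # Rdi = Mdi * CFi
                my $key = canonical_url($item->{url});
                if (!exists $best{$key} || $r > $best{$key}{weight}) {
                    $best{$key} = { %$item, weight => $r };   # keep the maximum Rdi
                }
            }
        }
        return sort { $b->{weight} <=> $a->{weight} } values %best;
    }

    # A small part of the duplicate test: identical URLs, and a trailing
    # "index.html" treated the same as the bare server path.
    sub canonical_url {
        my ($url) = @_;
        $url =~ s{/index\.html?$}{/};
        return lc $url;
    }

Keeping only the maximum Rdi for a duplicate matches the behaviour described above: the most confident engine determines the merged document's final rank.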
3.6 Other Implementation Details

ProFusion is written in Perl and is portable to any Unix platform. It contains one Perl module for each search engine (currently six); each module forms syntactically correct queries and parses the search results to extract each item's information. Other modules handle the user interface, document post-processing, and document fetching. Due to its modular nature, ProFusion is easy to extend to additional search engines. ProFusion's main process creates multiple parallel sub-processes; each sub-process sends a search request to one search engine and extracts information from the results that engine returns. The main process begins post-processing when all sub-processes have terminated, either by returning their results or by timing out (60 seconds in the current prototype).

4. Information Filtering

We have extended the prototype so that the user can save a particular search and have it automatically rerun on a periodic basis (i.e., daily, weekly, or monthly). The results of previous searches are stored along with feedback from the user, if given, about whether or not the documents were of interest. When a search is rerun, the top URLs are examined. If there are new Web pages, the user receives email announcing the availability of new information, and a query-specific Web page is built which summarizes the results, highlighting the new documents. Thus, the system works continuously in the background, collecting results for the user to view at his convenience. If the user marks documents as irrelevant, they are remembered (so they will not be re-presented to the user if they are identified by future searches) but are dropped from the results page.

Current work will increase the intelligence of the search engine by analyzing the contents of retrieved documents to improve the ranking and by incorporating user preferences (e.g., whether they prefer content-bearing pages, which contain mostly text, or summary pages, which primarily contain links to further pages). Drawing on background work in corpus linguistics and information retrieval [18], we will identify words from relevant documents which can be used to automatically expand and improve the user's query. As the retrieval sets grow, they will be clustered based on their contents for easier scanning of the results. Finally, for truly broad coverage of an area, automatic query-specific spiders will be incorporated which search out relevant documents by starting from user-identified relevant documents.
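The following sketch illustrates the bookkeeping step of such a saved search under an assumed file-based storage scheme: a fresh result list is compared against the URLs already shown to the user, anything previously marked irrelevant is suppressed, and only genuinely new items are passed on for the notification email and the query-specific results page. The file layout and helper names are assumptions, not the prototype's actual storage.

    #!/usr/bin/perl
    # Sketch of the filtering step for a saved search: report only results
    # that are new and not previously marked irrelevant by the user.
    use strict;
    use warnings;

    sub filter_new_results {
        my ($new_items, $seen_file, $irrelevant_file) = @_;
        my %seen       = map { $_ => 1 } read_urls($seen_file);
        my %irrelevant = map { $_ => 1 } read_urls($irrelevant_file);

        my @new;
        for my $item (@$new_items) {
            my $url = $item->{url};
            next if $irrelevant{$url};             # remembered, never re-presented
            next if $seen{$url};                   # already reported on an earlier run
            push @new, $item;
        }
        append_urls($seen_file, map { $_->{url} } @new);   # remember for the next run
        return @new;                               # email / results page built from these
    }

    sub read_urls {
        my ($file) = @_;
        open my $fh, '<', $file or return ();
        chomp(my @urls = <$fh>);
        return @urls;
    }

    sub append_urls {
        my ($file, @urls) = @_;
        return unless @urls;
        open my $fh, '>>', $file or die "cannot write $file: $!";
        print {$fh} "$_\n" for @urls;
    }

Rerun at the user's chosen interval, a step like this reports each document at most once and keeps documents marked irrelevant from reappearing.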
Knoblock, "Retrieving and Integrating Data >From Multiple Information Sources," Journal on Intelligent and Cooperative Information Systems, 2(2), 1993, Page 127-158 [6] Erik Selberg, Oren Etzioni, "Multi-Service Search and Comparison Using the MetaCrawler," WWW4 conference, December 1995 [7] MetaCrawler search home page URL: [8] Daniel Dreilinger, Savvy Search Home Page, URL: [9] ProFusion search home page URL: [10] Sun Microsystems, Inc., Multithreaded Query Page URL: [11] William Cross, All-in-one Search Page URL: [12] InfoSeek Corporation, InfoSeek Home Page, URL: [13] Lycos Inc., Lycos Home Page, URL: [14] Digital Equipment Corporation, Alta Vista Home Page, URL: [15] Open Text, Inc., Open Text Web Index Home Page, URL: [16] WebCrawler home page URL: [17] Excite home page URL: [18] Susan Gauch and Meng Kam Chong, "Automatic Word Similarity for TREC4 Query Expansion," Proc. of TREC4, Nov. 1995, Gaithersburg, MD (to appear).