1. Introduction
Nowadays, information overload hinders the discovery of business intelligence on the World Wide Web. A study found that most of the world's information¹ is stored on computer hard drives or departmental servers [2], which form the repository of the Internet. Business analysts often suffer from information overload because the Internet is one of the top five sources of business intelligence information [3]. For example, a business analyst in the database technology field may ask the following questions: What is the landscape of the database technology field? What are the different subgroups inside the community of database technology companies? Which group of communities does our company belong to in the entire competitive environment? Which competitors in our field resemble us most? Answers to these questions reveal business intelligence, defined as “the acquisition, interpretation, collation, assessment, and exploitation of information” [4] in the business domain.
Web search engines are commonly used to locate information for business analysis. They usually retrieve a large number of Web pages in response to a simple query. Overwhelmed by the many results, business analysts often browse only individual pages; they can neither find the Web communities among the results nor visualize the landscape of the results as a whole. The textual list display of search engines also buries many relevant results, making it hard to sift relevant results from irrelevant ones. Thus, result list display often leads to the information overload problem.
Apart from the textual result list display of search engines, business analysts need new browsing methods that enable automatic visualization of the landscape and discovery of communities on the Web. Such methods can potentially support better analysis while reducing information overload. To develop these new browsing methods, we need to review past research on the following issues: What are the existing business intelligence tools? How do they perform? How can information visualization techniques help to discover underlying patterns in documents (e.g., Web pages)? In particular, what analysis techniques, algorithms, and visualization metaphors are used in the literature?
2. Literature Review
2.1. Business Intelligence Tools
Business intelligence (BI) tools enable organizations to understand their internal and external environments through the systematic acquisition, collation, analysis, interpretation, and exploitation of information. Two classes of intelligence tools have been defined [5]. The first class is used to manipulate massive operational data and to extract essential business information from them. Examples include decision support systems, executive information systems, online analytical processing (OLAP), data warehouses, and data mining systems. They are built on database management systems and are used to reveal trends and patterns that would otherwise be buried in huge operational databases [6]. The second class of tools, sometimes called competitive intelligence tools, aims at systematically collecting and analyzing information from the competitive environment to assist organizational decision making. This review focuses on the second class of tools, where information is mainly gathered from public sources such as the Web.
Fuld et al. found that global interest in intelligence technology has increased significantly over the past four years [7]. They compared 13 BI tools based on a five-stage intelligence cycle: (1) planning and direction, (2) published information collection, (3) source collection from humans, (4) analysis, and (5) report and inform. We are interested in steps 2, 4, and 5, which can be automated using information technologies. They concluded that more BI tools should use intelligent agents to dynamically retrieve information (step 2), that the analysis capabilities of existing BI tools are still weak (step 4), and that these tools generally provide good reporting capabilities in textual, table, or chart formats (step 5).
A closer look at BI tools reveals their weaknesses in content collection, analysis, and the interfaces used to display large amounts of information. In general, many BI tools simply provide different views of the collected information (e.g., Market Signal Analyzer, BrandPulse) rather than more thorough analysis. Some more advanced tools use text-mining and rule-based techniques to process the collected information. For example, ClearResearch Suite extracts information from documents and shows a visual layout of relationships between entities such as people, companies, and events. However, such analysis capability is not commonly provided in BI tools. In terms of the interface for displaying results, many BI tools integrate their reports with Microsoft Office products and present them in textual format. Owing to their limited analysis capability, they are not capable of showing the landscape of the large numbers of documents collected from the Web.
2.2. Document Visualization Techniques
To deal with the problems of the result list display of hypertext, researchers in human-computer interaction and information retrieval have proposed frameworks and techniques to create visual displays for textual information stored in computers. Shneiderman proposed a task by data type taxonomy (TTT) for information visualization [8] consisting of seven data types (1D, 2D, 3D, temporal, multidimensional, tree, network) and seven tasks (overview, zoom, filter, details on demand, extract, history, relate). Traditional result list display of hypertext belongs to the 1-dimensional data type. While the result list is still widely used in many Web search engines and information retrieval systems, it allows only limited room for browsing (e.g., scrolling a long list of results). In contrast, data types such as 2-dimensional, tree, and network data allow more browsing tasks to be done and support human visual capabilities more effectively. Four types of visual display format are identified in [9]: hierarchical displays, network displays, scatter displays, and map displays. Compared with the data types in the TTT, hierarchical displays are similar to the tree data type, network displays are similar to the network data type, and both scatter displays and map displays are similar to the 2-dimensional data type. Among these four, hierarchical (tree) displays were shown to be an effective information access tool, particularly for browsing [10]; scatter displays most faithfully reflect the underlying structure of the data among the first three displays [9]; and map displays can provide a view of the entire collection of items at a distance, according to one of the earliest researchers to propose the use of map displays for information retrieval [11]. In summary, hierarchical and map displays of Web search results can potentially alleviate the problems of the traditional result list display of hypertext. However, they are not widely used in existing search engines.
Document visualization is primarily concerned with the task of gaining insight into information obtained from one or more documents, without users having to read those documents [12]. Most document visualization schemes involve three stages: analysis, algorithms, and visualization [13]. In the analysis stage, essential features of a collection of text are extracted according to users' interests expressed as keywords. In the algorithms stage, an efficient and flexible structure of the document set is created by clustering and projecting the high-dimensional structure into a two- or three-dimensional space. In the visualization stage, the data is presented to users and made sensitive to interaction. The following subsections review techniques used in the three stages of document visualization.
Stage 1 Document Analysis
In this part, Web mining techniques and meta-searching are discussed in the context of document analysis. Web mining techniques have been applied to analyze documents on the Web. Web mining involves the tasks of resource discovery on the Web, information extraction from Web resources, and uncovering general patterns at individual Web sites and across multiple sites [14]. Two categories of Web mining are usually applied in the analysis of Web documents: Web content mining and Web structure mining. Many efforts have been made to combine the two to improve the quality of analysis. For example, using a similarity metric that incorporates textual information, hyperlink structure, and co-citation relations, He et al. proposed an unsupervised clustering method that was shown to effectively identify relevant topics [15]. Their clustering method employs a graph-partitioning method based on the normalized cut criterion, first developed for image segmentation [16]. Bharat and Henzinger augmented a previous connectivity-analysis-based algorithm with content analysis and improved precision by at least 45% over pure connectivity analysis [17]. Chakrabarti et al. augmented the HITS algorithm by considering anchor texts and showed that their system can be used to compile large topic taxonomies automatically [18]. For the purpose of finding Web pages that form communities, the approach used by He et al. has the advantage of combining both Web content information and Web structure information for clustering. In addition to document analysis, meta-searching has been shown to be a highly effective method of resource discovery and collection on the Web [19]–[21]. It also facilitates further processing such as clustering and visualization.
Stage 2 Algorithms
Many algorithms for creating an efficient and flexible structure of the document set are available. In particular, clustering algorithms and multidimensional scaling algorithms are frequently used in visualization. Clustering algorithms classify objects into meaningful disjoint subsets or partitions. An important objective of clustering is to achieve high homogeneity within each cluster and large disassociation between different clusters. Two categories of clustering algorithms are commonly used: hierarchical and partitional methods [22]. Both have their strengths and weaknesses. While no theory exists to select the best clustering method for a particular application (p. 88, [22]), factors such as computational efficiency, quality of the clusters formed, and visual impact can be considered. Grabmeier and Rudolph pointed out that hierarchical methods are good for initial partitioning, whereas partitional methods try to achieve optimization [23]; partitional methods therefore tend to produce higher-quality partitions. However, hierarchical methods are often very efficient and provide a visual dendrogram. For use in visualization and Web browsing, it appears that a combination of hierarchical and partitional clustering would provide better clustering quality.
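As a concrete illustration of such a combination (not the graph-partitioning procedure adopted later in this paper), the following Python sketch seeds a partitional k-means pass with the centroids of an initial hierarchical (Ward) clustering; the data, library choices, and parameters are our own illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

def hybrid_cluster(X, k):
    """Cluster the rows of X into k groups: a hierarchical pass yields an
    initial partition, and a partitional (k-means) pass refines it."""
    # Stage 1: agglomerative (Ward) clustering gives an initial partition
    Z = linkage(X, method="ward")
    initial_labels = fcluster(Z, t=k, criterion="maxclust")

    # Use the centroid of each initial cluster to seed k-means
    centroids = np.vstack([X[initial_labels == c].mean(axis=0)
                           for c in range(1, k + 1)])

    # Stage 2: k-means refines the partition toward a (local) optimum
    km = KMeans(n_clusters=k, init=centroids, n_init=1).fit(X)
    return km.labels_

# Example: 60 random 10-dimensional "document vectors"
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
print(hybrid_cluster(X, k=4))
```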
Multidimensional scaling (MDS) algorithms are a family of techniques that portray the structure of data in a spatial fashion [24]. MDS constructs a geometric representation of the data (such as a similarity matrix), usually in a Euclidean space of low dimensionality. Building on earlier work on multidimensional psychophysics [25] and point mutual distances [26], Torgerson provided the first systematic procedure for determining the multidimensional map of points (the metric solution) from errorful interpoint distances [27]. Kruskal introduced nonmetric multidimensional scaling using his least-squares monotonic transformation and the method of steepest descent [28]. Takane, Young, and de Leeuw consolidated many preceding developments into a single approach capable of either metric or nonmetric analysis using either a weighted or an unweighted Euclidean model [29]. Their approach, ALSCAL, became the standard MDS procedure in many statistical packages such as SPSS. Apart from its theoretical development, MDS has been applied in many different domains. He and Hui applied MDS to display author cluster maps in their author co-citation analysis [30]. Eom and Farris applied MDS to an author co-citation analysis of the decision support systems (DSS) literature from 1971 through 1990 in order to find the fields contributing to DSS [31]. McQuaid et al. used MDS to visualize relationships between documents on group memory in an attempt to overcome information overload [32]. Kanai and Hakozaki used MDS to lay out 3D thumbnail objects in an information space for visualizing user preferences [33]. Kealy applied MDS to study the change in the knowledge maps of groups over time to determine the influence of a computer-based collaborative learning environment on conceptual understanding [34]. Although much has been done using MDS to visualize relationships among objects in different domains, none of this work applies it to the discovery of business intelligence on the Web. In addition, no existing search engine applies MDS to facilitate Web browsing.
Stage 3 Visualization
Visualization is the process of displaying the encoded data in a visual format that can be perceived by human eyes. The output often takes the form of a knowledge map, a knowledge representation that reveals the underlying relationships among knowledge sources, such as Web page content, newsgroup messages, business market trends, newspaper articles, and other textual and numerical information. Early work on creating knowledge maps involved manual drawing of blocks and connecting lines representing concepts and relationships respectively, such as Concept Map [35] and Mind Map [36]. As the Web emerged in the 1990s and became a major knowledge repository, automatically generated knowledge maps have been proposed. The Galaxy of News system displays news articles and titles in a three-dimensional space and allows users to move through it in a continuous fashion [37]. Users can control the zooming level to decide how much detail of the news content they want to browse. The problems of the system are the use of a single color scheme (white text on a black background) to display the content and the serious overlapping of text when users choose to browse more content. Another three-dimensional colored display, called Themescape, is a landscape showing a set of news articles on a map with contour lines [12]. Documents in a Themescape are represented by small points, and those with similar content are placed close together using proprietary lexical algorithms. Peaks represent a concentration of closely related documents; valleys contain fewer documents and more unique content. Themescape was implemented as Cartia's NewsMap (http://www.cartia.com/) to show articles related to the financial industry. It allows users to specify a focus circle, to flag certain points, and to display details of articles. However, when many articles are placed close to a peak, it is difficult for users to distinguish them without viewing the details of all the articles. A neural network technique, Kohonen's self-organizing map (SOM), takes a set of input objects and maps them onto the nodes of a two-dimensional grid [38]. Lin et al. used a single-layer SOM to cluster concepts in a small database of 140 abstracts related to artificial intelligence [39]. Using manually indexed descriptors as concepts, they found that the SOM was able to create a semantic map that captures the relationships between concepts. Chen et al. applied the SOM to automatically generate a hierarchical knowledge map by categorizing around 110,000 Web pages contained in the Yahoo! entertainment subcategory [40]. In a subsequent experiment, Chen et al. showed that the SOM performed very well on broad browsing tasks and that subjects liked the visual and graphical aspects of the map [41]. Lin applied the SOM to display 1,287 documents found in DIALOG's INSPEC database, using the 637 most frequently occurring words to index the documents [9]. Yang et al. showed that both fisheye and fractal views can increase the effectiveness of SOM visualization [42]. Kartoo, a commercial search engine, presents search results as interconnected objects on a map (http://www.kartoo.com/). Each page shows ten results, represented as circles whose sizes correspond to their relevance to the search query. The circles are interconnected by lines showing common keywords of the results. Different details (such as summaries of results and related keywords) are provided as users move the mouse over the screen. However, Kartoo's placement of results on the screen does not bear specific meaning, nor does it reflect the similarity of the Web pages.
From our literature review, we identified three research gaps. First, existing BI tools lack analysis and visualization capabilities; better methods are needed to enable visualization of the landscape and discovery of communities from public sources such as the Web. Second, hierarchical and map displays have been shown to be effective ways to access and browse information, but they have not been widely applied to discovering business intelligence on the Web. Third, none of the existing search engines allows users to visualize the relationships among search results in terms of their relative closeness. Therefore, we identified two research questions: (1) How can document analysis techniques be used to assist in the business intelligence cycle? (2) How can hierarchical and map displays of information help to discover business intelligence on the Web?
To address our research questions, we followed the system development methodology [43]. We started by selecting appropriate algorithms and techniques, proceeded to system development, and finally empirically evaluated our proof-of-concept prototype.
3. Business Intelligence Explorer: a Knowledge Map Framework
This section presents Business Intelligence Explorer, which implements the steps in a knowledge map framework for discovering business intelligence on the Web. Figure 1 shows our proposed knowledge map framework and Figure 2 shows the user interface of Business Intelligence Explorer. Six steps are involved in implementing our prototype; they are detailed in the following subsections.
3.1. Identifying key terms
The purpose of this step is to identify key terms that are used as queries to search for Web pages. These queries are all related to “business intelligence” because we want to demonstrate the capability of our framework in discovering business intelligence on the Web. To identify the queries, we entered the term “business intelligence” into the INSPEC literature indexing system. INSPEC is one of the leading English-language bibliographic information services providing access to the world's scientific and technical literature. It is used by IT professionals, business practitioners, and researchers to search for business and technical articles. The INSPEC system returned 281 article abstracts published between 1969 and 2002, with the majority (230 articles) published in the most recent five years. The earliest was written by H. Luhn on the topic “A business intelligence system” in 1969 [44]; he is considered a pioneer in developing business intelligence systems. Based on the keywords appearing in the titles and abstracts, we identified the following nine key terms: knowledge management, information management, database technology, customer relationship management, enterprise resource planning, supply chain management, e-commerce solution, data warehousing, and business intelligence technology. These became the nine key topics on business intelligence and are shown on the front page of the system's user interface (Figure 2).

Figure 1.

Figure 2.
3.2. Meta-searching and Web Page Filtering
Using the nine business topics, we performed meta-searching on seven major search engines: Alta Vista, All the Web, Yahoo, MSN, LookSmart, Teoma, and Wisenut. These are the major search engines also used by Kartoo, the meta-search engine that presents results in a map format and against which our knowledge map was compared; we wanted to create a collection comparable to the one used by Kartoo. From each of the seven search engines, we collected the top 100 results. As page redirection is used on the front page of many Web sites, our spider automatically followed these URLs to fetch the redirected pages. Since we are only interested in business Web sites, URLs from educational, government, and military domains (with host domains “edu”, “gov”, and “mil” respectively) were removed. Further filtering was applied to remove non-English Web sites, academic Web sites that do not use the “edu” domain name, Web directories and search engines, online magazines, newsletters, general news articles, discussion forums, case studies, etc. In total we collected 3,149 Web pages from 2,860 distinct Web sites, or around 350 Web pages for each of the nine topics. Each Web page represents one Web site.
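The following sketch illustrates, under our own simplifying assumptions, the kind of domain filtering and one-page-per-site deduplication described above; the manual filtering of non-English sites, directories, magazines, and similar pages is not modeled.

```python
from urllib.parse import urlparse

EXCLUDED_TLDS = {"edu", "gov", "mil"}   # non-business domains filtered out

def filter_results(urls):
    """Keep at most one URL per Web site, dropping edu/gov/mil hosts.
    A simplified sketch of the filtering step; the further manual filtering
    described in the text is not shown."""
    kept, seen_sites = [], set()
    for url in urls:
        host = urlparse(url).hostname or ""
        tld = host.rsplit(".", 1)[-1].lower()
        if tld in EXCLUDED_TLDS:
            continue
        if host in seen_sites:          # one page represents one Web site
            continue
        seen_sites.add(host)
        kept.append(url)
    return kept

print(filter_results([
    "http://www.example.com/products",
    "http://www.example.com/about",     # same site, dropped
    "http://cs.arizona.edu/research",   # edu domain, dropped
]))
```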
3.3. Automatic Parsing and Indexing
Since Web pages contain both textual content and HTML tag information, we need to parse out this information to facilitate further analysis. In this step, we used a parser to automatically extract keywords and hyperlinks from the Web pages collected in the previous step. A stop word list of 444 words was used to remove non-semantic-bearing words (e.g., “the”, “a”, “of”, “and”). Using HTML tags (such as <TITLE>, <H1>, and <IMG SRC="car.gif" ALT="Ford">), the parser also identified the type of each word and indexed the words appearing in each Web page. Four types of words are identified (in descending order of importance): title, heading, content text, and image alternate text. If a word belongs to more than one type, the most important type is used to represent that term in the Web page. The word type information was used in the co-occurrence analysis step (discussed below). We then used the Arizona Noun Phraser (AZNP) to automatically extract and index all the noun phrases from each Web page based on part-of-speech tagging and linguistic rules [45]. Developed at the University of Arizona, AZNP has three components. The tokenizer takes the full text of each Web page as input and creates output that conforms to the Penn Treebank word tokenization rules by separating all punctuation and symbols from text [46]. The tagger module assigns a part of speech to every word in the Web page. The last module, the phrase generation module, converts the words and associated part-of-speech tags into noun phrases by matching tag patterns to noun phrase patterns given by linguistic rules. For example, the phrase “strategic knowledge management” is considered a valid noun phrase because it matches the noun phrase rule: adjective + noun + noun. We then treated each keyword or noun phrase as a subject descriptor. Based on a revised automatic indexing technique [47], we computed the importance of each descriptor or term in representing the content of the Web page. We measured a term's level of importance by term frequency and inverse Web page frequency. Term frequency measures how often a particular term occurs in a Web page. Inverse Web page frequency indicates the specificity of the term and allows terms to acquire different strengths or levels of importance based on their specificity. A term can be a one-, two-, or three-word phrase.
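The exact weighting formula appears in Figure 3 and is not reproduced here; the sketch below shows our reading of a term frequency × inverse Web page frequency weight adjusted by a word type factor, with illustrative factor values that are not taken from the paper.

```python
import math

# Illustrative type factors in descending order of importance;
# the actual values used by the system are not given in the text.
TYPE_FACTOR = {"title": 4.0, "heading": 3.0, "content": 2.0, "alt_text": 1.0}

def term_importance(tf, n_pages, page_freq, word_type):
    """Weight of a term in a Web page: term frequency x inverse Web page
    frequency, scaled by the most important word type the term appears in."""
    idf = math.log(n_pages / page_freq, 2)   # inverse Web page frequency
    return tf * idf * TYPE_FACTOR[word_type]

# A term occurring 5 times in a page, appearing in 12 of 350 collected pages,
# and found in the page's title
print(term_importance(tf=5, n_pages=350, page_freq=12, word_type="title"))
```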
3.4. Co-occurrence analysis
Co-occurrence analysis converts the raw data (indexes and weights) obtained from the previous step into a matrix that shows the similarity between every pair of Web sites. The similarity between every pair of Web sites incorporates both content and structural (connectivity) information. He et al. computed the similarity between every pair of Web pages by a combination of hyperlink structure, textual information, and co-citation [15]. However, their algorithm places a stronger emphasis on co-citation than on hyperlink and textual information. When a hyperlink does not exist between a pair of Web pages, the similarity weight only includes the co-citation weight, even if their textual content is very similar. The same situation occurs when no common word appears in the pair of Web pages, even if many hyperlinks exist between them. In order to impose a more flexible weighting on the three types of information, we modified He et al.'s algorithm for computing the similarity. Figure 3 shows the formulae used in our co-occurrence analysis. We normalized each of the three parts in computing the similarity and assigned a weighting factor to each of them independently. We computed the similarity of textual information (S_{ij}) by an asymmetric similarity function, which was shown to perform better than the cosine function [1]. When computing the term importance value (d_{ij}), we included a term type factor that reflects the importance of the term inside a Web page. Using the formulae in Figure 3, a similarity matrix for every pair of Web sites in each of the nine business intelligence topics was generated.
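Because the formulae of Figure 3 are not reproduced here, the sketch below only illustrates the general idea of normalizing the three similarity components and weighting each independently; the weighting factors and the normalization by the maximum entry are our illustrative choices, not the paper's.

```python
import numpy as np

def combined_similarity(S_text, S_link, S_cocite, alpha=0.4, beta=0.3, gamma=0.3):
    """Combine content, hyperlink, and co-citation similarity matrices.
    Each matrix is normalized and weighted independently, so a missing
    hyperlink (or missing shared terms) does not zero out the score."""
    def normalize(M):
        m = M.max()
        return M / m if m > 0 else M
    return (alpha * normalize(S_text)
            + beta * normalize(S_link)
            + gamma * normalize(S_cocite))

# Toy 3-site example with arbitrary component matrices
rng = np.random.default_rng(1)
S_text, S_link, S_cocite = (rng.random((3, 3)) for _ in range(3))
print(combined_similarity(S_text, S_link, S_cocite))
```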
3.5. Identifying Web communities
We define a Web community as a group of Web sites that exhibit high similarity in their textual and structural information. The term “Web community” was originally used by researchers in Web structure mining to refer to a set of Web sites that are closely related through the existence of links among them [48]–[51]. In contrast, researchers in Web content mining prefer the term “cluster” to refer to a set of Web pages that are closely related through the co-occurrence of keywords or phrases among them [40],[52]. We chose “Web community” for our groups of similar Web sites because it connotes common interests and mutual references (and not just a group of similar objects, as connoted by “cluster”). To identify Web communities for each business intelligence topic, we model the set of Web sites as a graph consisting of nodes (Web sites) and edges (similarities). Based on previous research, hierarchical and partitional clustering are applicable for different purposes, and it is not likely that any one method is the best [53]. Moreover, contradictory results have been obtained from previous studies comparing clustering methods [54]. We therefore decided to use a combination of hierarchical and partitional clustering so as to obtain the benefits of both methods. We used a partitional cluster method to recursively partition the Web graph in order to create a hierarchy of clusters. This way, we obtain the clustering quality of a partitional method while being able to show the results in a visual dendrogram. However, partitional clustering is computationally intensive. As graph partitioning has been shown to be NP-complete [55], search heuristics are required to find good solutions. To obtain high-quality clustering using partitional methods, we need an optimization technique that finds the “best” partition point in each clustering task. Being a global search technique, genetic algorithms (GA) can perform a more thorough search of the space than some other optimization techniques (such as tabu search and simulated annealing). GA is also suitable for large search spaces such as the Web graph. Therefore, we selected GA as the optimization technique in our graph partitioning. During each iteration, the algorithm tries to find the bipartition of the graph that optimizes a certain criterion (the fitness function). Based on previous work on Web page clustering [15] and image segmentation [16], we used the normalized cut criterion to find the best partitioning. The normalized cut criterion measures both the total dissimilarity between different partitions and the total similarity within the partitions. It has been shown to outperform the minimum cut criterion, which favors cutting small sets of isolated nodes in the graph [56].
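As a minimal sketch of this partitioning criterion (the genetic algorithm itself, with its chromosome encoding, crossover, and mutation, is omitted), the function below scores a candidate bipartition of the weighted Web graph by its normalized association, i.e., 2 minus the normalized cut, which the GA would seek to maximize; the toy matrix is our own example.

```python
import numpy as np

def normalized_association(W, in_a):
    """Fitness of a bipartition of a weighted similarity graph.
    W    : symmetric similarity matrix (the Web graph)
    in_a : boolean mask, True for nodes assigned to partition A
    Returns normalized association = 2 - normalized cut; higher is better."""
    a, b = in_a, ~in_a
    cut = W[np.ix_(a, b)].sum()     # similarity across the cut
    assoc_a = W[a].sum()            # total similarity from A to all nodes
    assoc_b = W[b].sum()
    if assoc_a == 0 or assoc_b == 0:
        return 0.0
    ncut = cut / assoc_a + cut / assoc_b
    return 2.0 - ncut

# Toy example: two dense blocks weakly connected across the cut
W = np.array([[0.0, 0.9, 0.1, 0.1],
              [0.9, 0.0, 0.1, 0.1],
              [0.1, 0.1, 0.0, 0.8],
              [0.1, 0.1, 0.8, 0.0]])
print(normalized_association(W, np.array([True, True, False, False])))
```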

Figure 3.

Figure 4.
In our GA partitioning, we used the normalized association (which equals 2 − normalized cut [16]) as the fitness function, and the GA tried to maximize this value. Figure 4 illustrates how the GA works with a simplified example. Suppose we set the maximum number of levels in the hierarchy to 2 and the maximum number of nodes in the bottom level to 5. Ten nodes are initially partitioned into graph A and graph B, which are on the first level of the hierarchy. Since graph B contains fewer than 5 nodes, its partitioning stops after the first iteration. Graph A continues to be partitioned to create graphs C and D, which are on the second level of the hierarchy. Then the whole procedure stops because the maximum number of levels and the maximum number of nodes in the bottom level have been reached. The graphs (A, B, C, D) partitioned in this process are considered Web communities. In our actual Web site partitioning, the maximum number of levels is 5 and the maximum number of nodes in the bottom level is 30. Web communities are labeled by the top 10 phrases with the highest term importance value (d_{ij} shown in Figure 3). Manual selection among these 10 phrases was used to select the one that best described the community of Web sites.
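The recursive construction of the community hierarchy can be sketched as follows; `bipartition` stands for any routine (such as the GA described above) that splits a set of nodes into two groups, and the stopping logic is a simplified reading of the procedure, not the system's actual code.

```python
import numpy as np

def recursive_partition(nodes, W, bipartition, level=1,
                        max_levels=5, max_leaf_size=30):
    """Recursively bipartition a Web graph into a hierarchy of communities.
    Stops splitting a group when the maximum depth is reached or the group
    is already small enough to be a bottom-level community."""
    if level > max_levels or len(nodes) <= max_leaf_size:
        return {"community": nodes}
    group_a, group_b = bipartition(nodes, W)
    return {
        "community": nodes,
        "children": [
            recursive_partition(group_a, W, bipartition, level + 1,
                                max_levels, max_leaf_size),
            recursive_partition(group_b, W, bipartition, level + 1,
                                max_levels, max_leaf_size),
        ],
    }

# Example with a trivial "split in half" bipartition on 12 nodes
W = np.ones((12, 12))
halve = lambda nodes, W: (nodes[:len(nodes) // 2], nodes[len(nodes) // 2:])
print(recursive_partition(list(range(12)), W, halve, max_leaf_size=4))
```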
3.6. Creating knowledge map
The term “map” has been used in different contexts to refer to the landscape of Web services [57], maps describing mental concepts [34], and maps revealing groups of documents [9]. We herein define a knowledge map as a spatial map display showing the underlying patterns of Web sites in terms of their similarity. To create a knowledge map, we used multidimensional scaling (MDS) to transform a high-dimensional similarity matrix into a 2-dimensional representation of points and displayed them on a map. As described in our literature review, MDS has been applied in different domains for visualizing the underlying structure of data. We used Torgerson's classical MDS procedure, which does not require iterative improvement [27]. The procedure has been shown to work with a non-Euclidean distance matrix (such as the one we used here) by giving an approximation of the coordinates [24].
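A minimal sketch of Torgerson's classical MDS as we understand its use here: double-center the squared dissimilarities and keep the top two eigenvectors. Converting the similarity matrix to dissimilarities as 1 − similarity is our assumption for illustration, not necessarily the transformation used in the system.

```python
import numpy as np

def classical_mds(D, dims=2):
    """Torgerson's classical MDS: embed items in `dims` dimensions from a
    dissimilarity matrix D via double centering and an eigendecomposition.
    With a non-Euclidean D the coordinates are an approximation."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:dims]   # keep the largest eigenvalues
    L = np.sqrt(np.clip(eigvals[order], 0, None))
    return eigvecs[:, order] * L               # n x dims coordinates

# Example: place three Web sites from a similarity matrix S
S = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
print(classical_mds(1.0 - S))
```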
Figures 5–7 show screen shots of the different browsing methods provided by Business Intelligence Explorer, and Figure 8 shows the Kartoo map display, which was used for comparison with our knowledge map display. To study the browsing capability of our knowledge map and Web community, we also implemented a result list display that mimics a typical search engine list display.

Figure 5.

Figure 6.
4. Preliminary User Study
This section reports the observational results of a preliminary user study of our prototype. We conducted a study with 5 users to compare our knowledge map (KM) and Web community (WC) with a result list (RL) and Kartoo (KT, http://www.kartoo.com/). Kartoo was selected for comparison with the knowledge map because it is the only search engine we found that displays results in a map format, and it most resembles the knowledge map display. We asked each user to use each of the four browsing methods to perform browsing tasks and to give comments on each method.

Figure 7.

Figure 8.
In general, users liked the interfaces of WC and KM because they allowed users to overview all the results before getting into details. Regarding users' comments on the strengths and weaknesses of RL, users were familiar with RL's display of results because it was similar to a typical search engine's display. However, RL provided too much information and could create information overload. As user #4 said, RL required “too much reading at one time,” and it was hard to search for a specific word or phrase.
Regarding the comments on WC, users liked the guiding labels, clustering, and visual effects. As user #3 said, “Once I spot the label, I can move to the relevant topics very easily … (WC) save(s) time, (I) don't need to read all the summaries and Web pages to decide which are relevant.” They also pointed out that clustering Web sites helped them find the results faster. Regarding WC's visualization effects, user #1 said: “visualization helps to navigate faster and easier”. As for the weaknesses of WC, two users said that the labels overlapped and looked crowded when they browsed at the root level.
As for the comments on KM, several users expressed a preference for KM because it allowed them to visualize the landscape intuitively in the format of a map. In addition, they liked the zooming and navigation functions of KM, which, unlike WC, did not require them to open the nodes to view the details of results. As for the comments on KT, most users liked its professional graphical design, while two users pointed out that the flashing links, circles, and keywords confused them when browsing on KT. They also expressed that KT's presentation created information overload.
5. Summary and Future Directions
In this paper, we introduced Business Intelligence Explorer (BIE), a tool that implements the steps in a knowledge map framework for discovering business intelligence on the Web. The tool applies techniques in content collection, text mining, and document visualization to address the problem of information overload on the Web. Two browsing methods were developed: Web community and knowledge map. Results from our preliminary user study are encouraging. Users liked the clustering and visualization capabilities of Web community, and they found the intuitive point placement of knowledge map helpful in reducing the information overload they experienced with the result list display and with the map display of Kartoo, a commercial search engine with a graphical result display.
A future direction is to conduct a full-scale empirical evaluation to study the effectiveness, efficiency, and usability of our tool. It will also be useful to explore faster algorithms in the different stages of the document visualization process in BIE, so that the prototype can evolve into a real-time business intelligence analyzer. Currently, a limitation of BIE is the high computational cost of co-occurrence analysis and Web community identification; faster algorithms can be explored to increase the speed of computation. In addition, new visualization metaphors such as 3D displays and animations can be studied for Web browsing.
Footnotes
- 1 It was found that the world produces between 635,000 and 2.12 million terabytes of unique information per year.
References
- [1]H. Chen and K. J. Lynch, “Automatic construction of networks of concepts characterizing document databases”, IEEE Transactions on Systems, Man, and Cybernetics, vol. 22, pp. 885–902, 1992.
- [2]P. Lyman and H. Varian, “How much information”, available at http://www.sims.berkeley.edu/how-much-info, 2000.
- [3]The Futures Group, “Ostriches & Eagles 1997,” in The Futures Group Articles, 1998.
- [4]P. H. J. Davies, “Intelligence, information technology, and information warfare”, Annual Review of Information Science and Technology, vol. 36, pp. 313–352, 2002.
- [5]R. Carvalho and M. Ferreira, “Using information technology to support knowledge conversion processes”, Information Research, vol. 7, 2001.
- [6]C. W. Choo, The Knowing Organization. Oxford: Oxford University Press, 1998.
- [7]L. Fuld, K. Sawka, J. Carmichael, J. Kim and K. Hynes, Intelligence Software Report™ 2002. Cambridge, MA, USA: Fuld & Company Inc., 2002.
- [8]B. Shneiderman, “The eyes have it: a task by data type taxonomy for information visualization”, presented at the IEEE Symposium on Visual Languages, Boulder, CO, 1996.
- [9]X. Lin, “Map displays for information retrieval”, Journal of the American Society for Information Science, vol. 48, pp. 40–54, 1997.
- [10]D. R. Cutting, D. R. Karger, J. O. Pederson and J. W. Tukey, “Scatter/gather: a cluster-based approach to browsing large document collections”, presented at Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York. 1992.
- [11]L. B. Doyle, “Semantic road maps for literature searcher”, Journal of the Association of Computing Machinery, vol. 8, pp. 553–578, 1961.
- [12]J. A. Wise, J. J. Thoma, K. Pennock, D. Lantrip, M. Pottier, A. Schur and V. Crow, “Visualizing the non-visual: spatial analysis and interaction with information from text documents”, presented at IEEE, Proceedings of Information Visualization, 1995.
- [13]R. Spence, Information Visualization: ACM Press, 2001.
- [14]O. Etzioni, “The World-Wide Web: Quagmire or Gold Mine?”, Communications of the ACM, vol. 39, pp. 65–68, 1996.
- [15]X. He, C. Ding, H. Zha and H. Simon, “Automatic topic identification using Webpage clustering”, presented at Proceedings of 2001 IEEE International Conference on Data Mining, Los Alamitos, CA, 2001.
- [16]J. Shi and J. Malik, “Normalized cuts and image segmentation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 888–905, 2000.
- [17]K. Bharat and M. R. Henzinger, “Improved Algorithms for Topic Distillation in Hyperlinked Environments”, presented at Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 1998.
- [18]S. Chakrabarti, B. Dom, S. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson and J. Kleinberg, “Mining the Web's link structure”, IEEE Computer, vol. 32, pp. 60–67, 1999.
- [19]H. Chen, H. Fan, M. Chau and D. Zeng, “MetaSpider: Meta-searching and categorization on the Web”, Journal of the American Society for Information Science and Technology, vol. 52, pp. 1134–1147, 2001.
- [20]E. Selberg and O. Etzioni, “The MetaCrawler architecture for resource aggregation on the Web”, IEEE Expert, vol. 12, pp. 8–14, 1997.
- [21]C. Palmer, J. Pesenti, R. Valdes-Perez, M. Christel, A. Hauptmann, D. Ng and H. Wactlar, “Demonstration of hierarchical document clustering of digital library retrieval results”, presented at Proceedings of the 1st ACM/IEEE Joint Conference on Digital Libraries, Roanoke, VA, USA, 2001.
- [22]A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ, USA: Prentice-Hall, 1988.
- [23]J. Grabmeier and A. Rudolph, “Techniques of cluster algorithms in data mining”, Data Mining and Knowledge Discovery, vol. 6, pp. 303–360, 2002.
- [24]F. W. Young, Multidimensional Scaling: History, Theory, and Applications. Hillsdale, NJ, USA: Lawrence Erlbaum Associates, 1987.
- [25]M. W. Richardson, “Multidimensional psychophysics (Abstract)”, Psychological Bulletin, vol. 35, p. 659, 1938.
- [26]G. Young and A. S. Householder, “Discussion of a set of points in terms of their mutual distances”, Psychometrika, vol. 3, pp. 19–22, 1938.
- [27]W. S. Torgerson, “Multidimensional scaling: I. Theory and method”, Psychometrika, vol. 17, pp. 401–419, 1952.
- [28]J. B. Kruskal, “Nonmetric multidimensional scaling: a numerical method”, Psychometrika, vol. 29, pp. 115–129, 1964.
- [29]Y. Takane, F. W. Young and J. de Leeuw, “Nonmetric individual differences multidimensional scaling: an alternative least squares method with optimal scaling features”, Psychometrika, vol. 42, pp. 7–67, 1977.
- [30]Y. He and S. C. Hui, “Mining a Web citation database for author co-citation analysis”, Information Processing and Management, vol. 38, pp. 491–508, 2002.
- [31]S. B. Eom and R. S. Farris, “The contributions of organizational science to the development of decision support systems research subspecialties”, Journal of the American Society for Information Science, vol. 47, pp. 941–952, 1996.
- [32]M. J. McQuaid, T. H. Ong, H. Chen and J. F. Nunamaker, “Multidimensional scaling for group memory visualization”, Decision Support Systems, vol. 27, pp. 163–176, 1999.
- [33]H. Kanai and K. Hakozaki, “A browsing system for a database using visualization of user preferences”, presented at Proceedings of the 2000 IEEE International Conference on Computer Visualization and Graphics, Los Alamitos, CA, USA, 2000.
- [34]W. A. Kealy, “Knowledge maps and their use in computer-based collaborative learning”, Journal of Educational Computing Research, vol. 25, pp. 325–349, 2001.
- [35]J. D. Novak and D. B. Gowin, Learning How to Learn. New York: Cambridge University Press, 1984.
- [36]T. Buzan and B. Buzan, The Mind Map Book: How to Use Radiant Thinking to Maximize Your Brain's Untapped Potential. New York: Plume Books (Penguin), 1993.
- [37]E. Rennison, “Galaxy of news: an approach to visualizing and understanding expansive news landscapes”, presented at Proceedings of ACM Symposium on User Interface Software and Technology, 1994.
- [38]T. Kohonen, Self-organizing Maps. Berlin: Springer-Verlag, 1995.
- [39]X. Lin, D. Soergel and G. Marchionini, “A self-organizing semantic map for information retrieval”, presented at Proceedings of the Fourteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, Chicago, IL, 1991.
- [40]H. Chen, C. Schuffels and R. Orwig, “Internet categorization and search: a self-organizing approach”, Journal of Visual Communication and Image Representation, vol. 7, pp. 88–102, 1996.
- [41]H. Chen, A. Houston, R. Sewell and B. Schatz, “Internet browsing and searching: user evaluation of category map and concept space techniques”, Journal of the American Society for Information Science, Special Issue on AI Techniques for Emerging Information Systems Applications, vol. 49, pp. 582–603, 1998.
- [42]C. C. Yang, H. Chen and K. Hong, “Internet browsing: visualizing category map by fisheye and fractal views”, presented at Proceedings of the IEEE International Conference on Information Technology: Coding and Computing, Los Alamitos, CA, USA, 2002.
- [43]J. F. Nunamaker, M. Chen and T. Purdin, “Systems development in information systems research”, Journal of Management Information Systems, vol. 7, pp. 89–106, 1991.
- [44]H. P. Luhn, “A business intelligence system”, in Pioneer of information science, selected works. London, UK: Macmillan, 1969, pp. 132–139.
- [45]K. M. Tolle and H. Chen, “Comparing noun phrasing techniques for use with medical digital library tools”, Journal of the American Society for Information Science (Special Issue on Digital Libraries), vol. 51, pp. 352–370, 2000.
- [46]M. Marcus, “http://www.cis.upenn.edu/~treebank/tokenization.html”, University of Pennsylvania, 1999.
- [47]G. Salton, Automatic Text Processing: the transformation, analysis, and retrieval of information by computer. Reading, MA: Addison-Wesley, 1989.
- [48]G. W. Flake, S. Lawrence and C. L. Giles, “Efficient identification of Web communities”, presented at Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, USA, 2000.
- [49]G. W. Flake, S. Lawrence, C. L. Giles and F. M. Coetzee, “Self-organization and identification of Web communities”, IEEE Computer, vol. 25, pp. 66–71, 2002.
- [50]D. Gibson, J. Kleinberg and P. Raghavan, “Inferring Web communities from link topology”, presented at Proceedings of the Ninth ACM Conference on Hypertext and Hypermedia: Links, Objects, Time and Space - Structure in Hypermedia Systems, Pittsburgh, PA, 1998.
- [51]R. Kumar, P. Raghavan, S. Rajagopalan and A. Tomkins, “Trawling emerging cyber-communities automatically”, Proceedings of the 8th WWW Conference, Elsevier Science, pp. 403–415, 1999.
- [52]B. Schatz, “The Interspace: concept navigation across distributed communities”, IEEE Computer, vol. 35, pp. 54–62, 2002.
- [53]J. A. Hartigan, “Statistical theory in clustering”, Journal of Classification, vol. 2, 1985.
- [54]G. W. Milligan, “A Monte-Carlo study of 30 internal criterion measures for cluster-analysis”, Psychometrika, vol. 46, pp. 187–195, 1981.
- [55]M. Garey and D. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.
- [56]Z. Wu and R. Leahy, “An optimal graph theoretic approach to data clustering: theory and its application to image segmentation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, pp. 1101–1113, 1993.
- [57]M. Dodge and R. Kitchin, Atlas of cyberspace. Harlow, England: Addison-Wesley, 2001.

