Welcome to MSDN Blogs Sign in | Join | Help

Concept-Based Navigation of the World-Wide Web

While digging through some old files, I found this paper that I co-authored with Rick Kazman back in '97.  We never published it, but I thought it would be interesting to post...

Concept-Based Navigation of the
World-Wide Web

S. Jeromy Carrière, Rick Kazman

Software Engineering Institute

Carnegie Mellon University

Pittsburgh, PA 15213

{sjc,rkazman}@sei.cmu.edu

Introduction and Background

When using the World Wide Web, how often do you attempt to navigate from the page that you’re currently browsing to some interesting and related, but hypothetical page? If you’re browsing a page about object-oriented design, will you attempt to navigate to a page about programming language theory? Probably not. As we have all learned (or taken for simple fact), the web is not a navigable structure: from any given "node", there is really no way to know which outgoing link to follow to achieve the goal of reaching another node. Such a link may not even exist. So, are we forever doomed to the world of the search engine? We hope not. One potential solution is that of concept-based navigation: rather than navigating in Web-space, following only user-provided pointers, we advocate navigating through a space of concepts. These concepts characterize Web pages and point the user back into Web-space.

 

Effective View Traversibility

Furnas [Furnas 97] identifies two properties that contribute to the navigability of an information structure. These are Effective View Traversibility (EVT) and View Navigability (VN).

The EVT of a structure is defined in terms of two requirements:

The number of outgoing links form a given node is small relative to the overall size of the structure being viewed.

The distance between pairs of nodes in the structure is small relative to the overall size of the structure being viewed.

In a worst case analysis, we consider the maximum number of outgoing links of any node in the structure and the longest path required to connect any two nodes. An example is a simple "list box" of the sort used in many graphical user interfaces, in which a subset of the overall list is visible at any given time. In this case, the number of outgoing links is fixed: the number of visible items; this is presumably small relative to the total number of items available. The distance between pairs of nodes is, however, linear in the total number of items: one must take steps through the list as determined by the size of the visible list. For this reason, a scrolling list does not provide an effectively traversable view. A balanced tree, on the other hand, may be considered effectively traversable: the maximal number of outgoing links is still fixed and small relative to the overall size of the structure and the longest path between two nodes is limited by the depth of the tree, which is logarithmic in its total size.

View Navigability

View navigability addresses the issue suggested above: given a structure to navigate, how do we choose the next step along a path to our desired destination? In an EVT structure, this path should be short, but how are we to find it? Furnas defines view navigability in terms of the "outlink-info" for a given node. The outlink-info for a given outgoing link suggests which target nodes are efficiently reachable via that link. For a structure to be view navigable, we require that the outlink-info for a node, as a whole, is not misleading (that is, the outlink-info will not suggest a non-optimal path for any target node) and that every target appears in the outlink-info for the node.

This is the problem. How are we to construct this outlink-info such that every node in the structure (remember, we’re talking about the web) is represented at every other node?

Furnas further characterizes a structure as Effectively View Navigable if it meets the requirements for both effective view traversibility and view navigability.

Concept Clustering

The first step toward a solution is the idea of "concept clustering" [Kominek 97]. Concept clustering is an efficient mechanism for the identification of a set of concepts occuring within a body of text. The process requires the existence of an ontology: an organization of the concepts into an "is-a" hierarchy, from a root concept general enough to describe anything to specific concepts that (ideally) may not be specified further. The ontology currently in use is an "is-a" hierarchy of approximately 60000 concepts, derived from WordNet [Kominek 97]. The key point here is that a concept hierarhy of this form is effectively view navigable. It meets the requirements of EVT: the maximum number of outgoing links from a concept is small (more than 99% of concepts have fewer than 50 outgoing links) and the length of the longest path is logarithmic in the size of the tree (the depth of the tree is 16). It also meets the requirements of VN, as follows. At any node (concept) in the tree, there are two choices for the next step toward a target concept: we may choose to go up the tree or we may choose a child concept that is a specialization of the current concept. Thus we may define the outlink-info for a concept to be "choose the child concept most like the target concept" if the target is a specialization of the current concept and "go up" otherwise.

Effective View Navigability and the WWW

Making the Web Navigable

A technique for improving the navigability and traversibility of a structure is the augmentation of the structure with another structure that does exhibit these properties. Navigation through the augmented structure provides a mechanism for navigation through the original structure, but in an effective fashion. So, how can we augment the web with a navigable structure? Put simply, we may translate web navigation into concept navigation by navigating to a set of target concepts and identifying web pages that are characterized by these concepts. In other words, navigation from one web page to another has been reduced to the problem of navigating to a set concepts of interest.

In navigating the concept hierarchy for the purposes of web navigation, we are not performing a single task of navigation. Instead, we are attempting to navigate to a set of concepts that together characterize our target of interest. In effect, we are transforming the problem of effective view navigation into one of concept clustering.

Implementation

To this end we have been working on a prototype, called Cheddar, to test navigating the concept space represented by the Web via concept clustering. A prerequisite for the operation of the prototype is the collection and analysis (concept clustering) of a large number of web pages. The task of web indexing is, of course, a venture not undertaken lightly. A number of academic and commercial organizations have devoted and are devoting a great deal of time and money to the problem. We have taken the somewhat less ambitious approach of downloading and analyzing a set of pages in a particular locality, in a hope to verify the ideas before tackling the more general problem. Initial data collection was performed by collecting pages at most two links from the Yahoo "Business_and_Economy/Companies/Real_Estate" page. Just under 2000 pages were collected. After converting HTML to text, concept clustering was applied to generate a concept set for each page.

The user interface is implemented as a Java application. Because the application requires online access to the entire 60000 node concept hierarchy, the use of Java posed some interesting technical problems. For example, the garbage collector in the Sun Java Developer’s Kit virtual machine employs sweeping—a time consuming process when memory allocation is significant. These long garbage collection delays are unacceptable in an interactive application, particularly when garbage collection is frequent. Frequent garbage collection is the norm when memory is allocated and deallocated often, as is characteristic of graphical applications, particularly in Java. Thus the application was separated into two processes, one to act as a server for the concept hierarchy and the other to provide user interaction. Because the concept server process has a fairly static memory signature it requires infrequent garbage collection; the dynamic memory usage of the user interface thus remains sufficiently fast for interactive use.

<picture should go here ... >

The fundamental idea in the user interface is the "slice": a region of interest in the concept tree. At any given time, the interface may be displaying a number of slices, each containing a set of related concepts that have been selected as interesting by the navigator. A slice is created by specifying a focus concept, and initially displays concepts within a fixed locality around the focus concept. At the moment, this locality is defined to be all concepts at most two links from the focus. The snapshot above shows the Cheddar application displaying two slices; the foci are highlighted: shore and apartment. As you move outward from the center of the window, concepts move from general to specific. For example, seashore is more specific than shore which is more specific than natural_object.

The user can move through the concept tree by double-clicking a concept, making it the focus, or by control-clicking to have the children of a concept displayed. Because no explicit links are shown between concepts (to keep the visualization of the slices uncluttered), the user may determine parent-child relationships between concepts by shift-clicking on a concept; the parent and children of the concept are highlighted.

In a slice, concepts other than the focus may be selected. As the user navigates through the concept tree, selected concepts remain in the slice. If a concept is deselected and it is no longer in the locality of the focus, it is removed from the slice. As the user navigates using several slices, it is possible for slices to collide. When such a collision occurs, the slices are merged with selections from each preserved in the new slice. In addition, slices may be split; a new slice is created for every selected concept with it as the focus.

The user interface includes a second window that displays the currently selected concepts along with the URLs of Web pages that contain them (that is, their conjunction):

Selecting a URL from the list causes all visible concepts contained within the associated page to be selected. Double-clicking a URL causes an external HTML browser to be invoked to display the page.

In an effort to manage the amount of information displayed to the user, we currently only display concepts that have URLs associated with them. Because our initial data collection was sparse, this elision causes a navigational problem: it is possible for the user to get "stranded", that is, unable to navigate because no parent or child concepts are currently visible. This problem is simply one of scale however, and goes away when a larger (and hence richer) set of Web pages is considered.

Conclusions

Obviously, a great deal of experimentation and refinement is required to validate both the ideas and the user interface presented here. Our current stumbling block, as mentioned above, is a lack of data. We intend to proceed, in the short term, with the approach of collecting data in a particular locality of the Web for further experimentation. Beyond refinement of the user interface, concept clustering-related work is required: there are many parameters that control the analysis and the ontology provided by WordNet is fairly limited.

However, we firmly believe that concept-based navigation of the World Wide Web will alleviate many of the problems faced by its users.

References

[Furnas 97] G. Furnas, "Effective View Navigation", Proceedings of CHI 97. Available WWW: <http://www.acm.org/sigchi/chi97/proceedings/paper/gwf.htm>.

[Kominek 97] J. Kominek, R. Kazman, "Accessing Multimedia Through Concept Clustering", Proceedings of CHI 97. Available WWW: http://www.acm.org/sigchi/chi97/ proceedings/paper/rnk.htm

Published Tuesday, November 02, 2004 9:00 PM by jeromyc

Comments

Anonymous comments are disabled
 
Page view tracker