The World Wide Web has become the primary repository for all
information: scientific, business and entertainment. Because no
central authority validates or authenticates the content, too
much information can be worse than too little. The authors of
Web pages are not aware of the ways their content is used, and
innocent information published on the Web could be used for
malicious purposes. Some implications
of our approach are that the author of a webpage cannot completely
define that document's semantics and that semantics emerge through
use. Contextual document semantics emerge through identification
of various users' browsing paths through this multimedia collection.
In this paper, we present techniques that use multimedia information
as part of this determination. We attempt to derive the
semantics of Web pages from users' browsing paths, analyzing
the link information together with the users' actual navigation
paths to derive emergent semantics that the author of a Web page
may not have intended. Each Web page has some meaning that
can be derived from the static link analysis of the connected
graph generated using the incoming and outgoing links. This
is the approach used by most search engines. Our research effort
is focused on the dynamic link analysis of the users' browsing
paths.
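
The contrast between static and dynamic link analysis can be
illustrated with a minimal sketch. The page names, link structure
and browsing paths below are hypothetical, and the two counting
functions are simplified stand-ins for illustration, not the
method used by any particular search engine or by this research.

```python
from collections import Counter

# Hypothetical static link structure: page -> pages it links to.
links = {
    "home": ["news", "sports"],
    "news": ["sports"],
    "sports": [],
}

# Hypothetical browsing paths: ordered pages visited by each user.
paths = [
    ["home", "news", "sports"],
    ["home", "sports"],
    ["news", "sports", "home"],
]


def static_in_degree(links):
    """Static analysis: count incoming links per page in the link graph."""
    counts = Counter({page: 0 for page in links})
    for targets in links.values():
        counts.update(targets)
    return counts


def dynamic_transitions(paths):
    """Dynamic analysis: count page-to-page transitions users actually made."""
    counts = Counter()
    for path in paths:
        counts.update(zip(path, path[1:]))
    return counts


in_degree = static_in_degree(links)
transitions = dynamic_transitions(paths)
```

In this toy data the transition from "sports" to "home" appears in
the browsing paths even though no such link exists in the static
graph, which is the kind of usage information that static link
analysis alone cannot see.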
The primary goal is to derive the emergent semantics of the
page(s) using the Web browsing patterns of the users. The ultimate
goal is to derive the semantics of the browsing path of the
user. In the case of a search engine, a user enters a query string
and the search engine retrieves a list of URLs that match the
query in order of relevance. Our research can be considered
the reciprocal of a search engine: the problem is to
derive the semantics given the ordered sequence of Web pages
visited by a user. Using an iterative process, we derive the
semantic breakpoints of long browsing paths. This identifies
short sub-paths with uniform semantics. Using the coherent uniform
semantics exhibited by the sub-path, we attempt to derive the
high-level semantics from the Web activity of a user. With additional
training data, one specific application of this research is
terrorist trend detection. Web usage log data is used
for this analysis. Using the WordNet database for high-level concepts,
we attempt to derive high-level semantics. Preliminary
results are promising.
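
The idea of semantic breakpoints can be sketched as follows. This
is a simplified single-pass illustration, not the iterative process
used in this research: it assumes each page is already described by
a set of keywords, and it places a breakpoint wherever the Jaccard
similarity between consecutive pages falls below a threshold. The
page identifiers, keyword sets, threshold value and function names
are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two keyword sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def split_path(path, keywords, threshold=0.2):
    """Split a browsing path into sub-paths at low-similarity transitions."""
    if not path:
        return []
    sub_paths = [[path[0]]]
    for prev, cur in zip(path, path[1:]):
        if jaccard(keywords[prev], keywords[cur]) < threshold:
            # Assumed breakpoint: consecutive pages share too few keywords.
            sub_paths.append([cur])
        else:
            sub_paths[-1].append(cur)
    return sub_paths


# Hypothetical keyword sets for four pages on one browsing path.
keywords = {
    "p1": {"football", "league", "score"},
    "p2": {"football", "score", "team"},
    "p3": {"recipe", "pasta", "sauce"},
    "p4": {"pasta", "sauce", "cheese"},
}

sub_paths = split_path(["p1", "p2", "p3", "p4"], keywords)
```

Here the path is split between p2 and p3, giving two sub-paths
whose pages share keywords. A fuller treatment would replace the
raw keyword overlap with concept-level similarity, for example
through a lexical database such as WordNet.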
A typical user these days has several windows open: several
browser sessions, several instant messaging windows
and at least one email client. The event-driven activity of
the user can be analyzed only when every single activity of
the user is monitored, rather than just the Web usage
logs studied by most Web mining efforts. The context of the
browsing activity is entirely dependent on the various other
events occurring from the various applications opened by the
user. Our research currently focuses on the Web browsing paths
only. Processing other network activity, such as instant messaging
and email, is just a matter of sniffing other ports. However,
we are ignoring the various other events that could contribute
to the activity, such as multiple computers at the user's desk,
a telephone call from another user, or an event/alarm from a
calendar or Palm Pilot.