| The paradigm of the Web is radically different
from the paradigm for the data warehouse. Adapting an old programming term, you might say
that web content is spaghetti data. That is, it can link to anything with little
discipline. Furthermore, web content is highly volatile and constantly changing. The Web's
diversity challenges our imagination and appreciation for new forms of creative
expression. The problem is cultivating from that diversity those few nuggets with real
business value. To do so requires a discipline to
transform raw data into validated information -- much like a farmer transforms seed into a
harvest. With web farming, this discipline is called Information Refining and
consists of four processes: discovery, acquisition, structuring, and dissemination.
Discovery is the exploration
of available Web resources to find those items that relate to specific topics. Discovery
involves considerable "detective" work far beyond searching generic directory
services (such as Yahoo) or indexing services (such as AltaVista). Furthermore, the
discovery activity must be a continuous process because data sources are continually
appearing (and disappearing) from the Web. A business analyst is the central figure in
this activity and requires advanced search and indexing tools to be productive.
Acquisition is the collection and
maintenance of content identified by its source. The main goal of acquisition is to
maintain the historical context so you can analyze content in the context of past changes.
Acquisition requires a secured server platform with large storage capacity.
Structuring is the analysis, validation, and transformation of content
into a more useful format and into a more meaningful structure. The formats can be Web
pages, spreadsheets, word processing documents, and database tables. As we move toward
loading data into a warehouse, the structures must be compatible with the star-schema
design and with key identifier values.
Dissemination is the packaging and delivery of information to the
appropriate consumers, either directly or through a data warehouse. It requires a range of
dissemination mechanisms from predetermined schedules to ad hoc queries. Newer
technologies such as information brokering and preference matching may be desirable.
There is a bi-directional flow to the processes. The left-to-right flow refines the
content of information, which becomes more structured and validated. The right-to-left
flow refines the control of the processes, which become more selective and discriminating. |