Unstructured Data: Creating the Analytical Environment

Venue: To be advised

Location: London, United Kingdom

Event Date/Time: Dec 04, 2008 End Date/Time: Dec 05, 2008


It is estimated that 80% of data in the corporation is textual. There are emails, medical records, contracts, safety reports, patents, and a whole host of other forms of textual data. For years, managing textual data has meant placing documents in some form of Enterprise Content Management (ECM) system. But analyzing the data held in ECM is a very different matter from placing the documents there in the first place.

With DW 2.0, the idea arose that unstructured data is best placed in a data warehouse, where it can be analyzed alongside the structured data already held there.

This seminar is about the work that needs to be done to take textual data out of the confines of documents and integrate it into a data warehouse. This is a very down-to-earth seminar/workshop. The first day is a lecture covering the background material needed to understand the architecture surrounding the placement of text in an analytical, data warehouse environment. The second day is a workshop that shows, step by step, how text is converted into a database that can then be placed into a data warehouse.

There is a big difference between searching text and analyzing text. The seminar brings out this important distinction.

The hardest part of moving text into a data warehouse is the integration of the text. Anyone can read a text file and toss the text into a database, but doing so is an exercise in futility: the resulting database cannot be usefully processed by a BI tool. To produce a meaningful result, the analyst must carefully transform the text. Some of the basic issues of transformation include:

reading and understanding semi-structured data
applying external categories to text
creating internal taxonomies of text
standardizing dates for BI processing
identifying patterned variables
identifying named variables
resolving homographs, and so forth.

There is a special emphasis on the management of corporate contracts and oil and gas pipeline and refinery safety data in this seminar.
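
To give a flavour of three of the transformation issues listed above (standardizing dates, identifying patterned variables, and resolving homographs), here is a minimal Python sketch. It is not the seminar's own method or tooling. The date layouts, the `CN-` contract number format, the pound-sterling amount pattern, and the homograph table are all hypothetical and chosen only for illustration; a real corpus would need its own rules.

```python
import re
from datetime import datetime

# Hypothetical date layouts (day-first for numeric dates, as an assumption).
DATE_FORMATS = ("%b %d, %Y", "%d/%m/%Y", "%Y-%m-%d", "%d %B %Y")


def standardize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known layout matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


# Hypothetical patterned variables: values recognizable by their shape alone.
PATTERNS = {
    "contract_id": re.compile(r"\bCN-\d{6}\b"),           # assumed ID format
    "amount_gbp": re.compile(r"£\d[\d,]*(?:\.\d{2})?"),   # e.g. £1,250,000.00
}


def find_patterned_variables(text):
    """Return (variable name, matched value, character offset) rows."""
    rows = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            rows.append((name, match.group(0), match.start()))
    return rows


# Hypothetical homograph table: the same abbreviation means different
# things depending on the domain of the document it appears in.
HOMOGRAPHS = {
    ("ha", "cardiology"): "heart attack",
    ("ha", "hepatology"): "hepatitis A",
}


def resolve_homograph(term, domain):
    """Expand a term using its document's domain; unknown terms pass through."""
    return HOMOGRAPHS.get((term.lower(), domain), term)


if __name__ == "__main__":
    sample = "Contract CN-004512 is valued at £1,250,000.00."
    print(standardize_date("Dec 04, 2008"))
    print(find_patterned_variables(sample))
    print(resolve_homograph("HA", "cardiology"))
```

Each function returns plain values (an ISO date string, rows of name, value and offset, an expanded term), which is the kind of regular, typed output a BI tool can process, in contrast to raw text loaded as-is.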