Unstructured data is a catch-all term for data with no pre-defined arrangement or data model dictating the layout of the information it contains. Such data is typically very text-heavy and often free-form or open-ended in nature, e.g. blog posts, survey responses, or whole word-processed documents. By common estimates, 70%-90% of all data within an organisation exists in an unstructured form. In addition, with the rise of social media, blogging and electronic storage of documents, there is now a wealth of unstructured text data available for analysis from outside sources.
Analysing such text sources in order to glean business insight is a challenging problem that many companies now face. The irregularities and ambiguities of written language make it difficult to process with traditional programs, which are designed to deal with more structured data sources such as databases. Fortunately, there are many tools and approaches for conducting this type of analysis. They draw upon techniques from Natural Language Processing (NLP), a field of research that aims to teach computers to process and extract meaning from human language. Some common tasks in NLP include:
- Sentiment Analysis – Assessing the overall polarity of a document based on its use of positive or negative words.
- Information Extraction – A broad area that is concerned with mining for important named entities, the phrases they co-occur with, and relationships between entities.
- Topic Modelling – Given a collection of documents, automatically categorise them into a number of different topics.
- Machine Translation – Automatically translate documents based on statistical models of the two languages concerned.
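
To make the first of these tasks more concrete, the following is a minimal sketch of word-based sentiment scoring in Python. It assumes a tiny, hand-picked lexicon: the word lists and the function names `polarity_score` and `classify` are hypothetical and chosen purely for illustration, not taken from any particular library. A practical system would use a far larger lexicon or a trained model.

```python
import re

# Hypothetical, hand-picked word lists for illustration only.
# Real sentiment lexicons contain thousands of scored entries.
POSITIVE_WORDS = {"good", "great", "excellent", "love", "happy", "helpful"}
NEGATIVE_WORDS = {"bad", "poor", "terrible", "hate", "slow", "unhelpful"}


def polarity_score(document: str) -> int:
    """Return (number of positive words) minus (number of negative words)."""
    # Lower-case the text and split it into simple alphabetic tokens.
    tokens = re.findall(r"[a-z']+", document.lower())
    positives = sum(1 for token in tokens if token in POSITIVE_WORDS)
    negatives = sum(1 for token in tokens if token in NEGATIVE_WORDS)
    return positives - negatives


def classify(document: str) -> str:
    """Map the raw score onto a coarse polarity label."""
    score = polarity_score(document)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    review = "The support team was great and very helpful, but delivery was slow."
    print(polarity_score(review), classify(review))
```

Counting words in this way is easy to understand but ignores context. For example, it treats "not good" as positive because it only sees the word "good", which is one illustration of why the ambiguities of written language make this kind of analysis difficult.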