A peek at Natural Language Processing
NLP - NLU - Computational Linguistics - Text Mining - Automated Understanding - Information Extraction - This field has many names.
First of all - we are not talking about Neuro-Linguistic Programming! Sorry…
The World Wide Web made the difficulty of understanding unstructured data obvious. Information (let us not even think of knowledge at the moment) is very difficult to pull out of unstructured textual data. Before the web, pretty much all the data that people worked with on computers was in databases, and generally that data was structured. Obviously, there were pockets of textual data in electronic format out there, but realistically most textual data was on paper or in WordPerfect (remember WordPerfect or even WordStar?) documents residing on unconnected PCs. So, a good deal of AI research really concentrated on STRUCTURED data - and who can blame the researchers?
Then came the web.
More unstructured textual data than we could have ever imagined, all accessible via the web. Suddenly, the ability of computers to extract information from this vast source of unstructured data became really important. Okay, so what is Natural Language Processing (NLP)? I like this definition from http://nlp.shef.ac.uk/: NLP “…is the use of computers to process written and spoken language for some practical, useful, purpose: to translate languages, to get information from the web on text data banks so as to answer questions, to carry on conversations with machines…”
Notice the inclusion of “spoken language” as well as written language in this definition of NLP - that is a whole other discussion. Let’s just assume that we are talking about unstructured text that has been captured electronically. Just capturing human speech and converting it into text (let alone automatically understanding it) is a field by itself, so just to simplify, we make the above assumption.
What types of research or applications are going on in NLP?
Textual Search - Gasp! This isn’t NLP, is it? Well, yes and no. In the spectrum of understanding unstructured data, simple keyword searching sits at the low-tech end. Is Google using NLP? I am sure they would tell you that they are, and by virtue of sorting search results by PageRank they are at the edge of the Information Extraction field, but again at the low end of the scale. What Google has on its side is a massive number of documents (over 8.1 billion) indexed so that keywords can be found quickly, ten thousand machines brute-forcing the search, and an innovative (at least it was back in 1998) way to rank documents based on keyword relevance. Not to beat a dead horse, but the whole idea of AI is to do things smarter, not harder - Google is wonderful, but technology from 1998 is not going to be able to carry it much longer.
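To make the "low-tech end" concrete, here is a minimal sketch of keyword search over an inverted index, written in Python. It is only an illustration and not how Google or any real search engine works: the sample documents and the function names build_index and search are made up for this example, and there is no ranking at all.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for raw in text.lower().split():
            word = raw.strip(".,!?;:\"'()")  # crude punctuation trimming
            if word:
                index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample documents, invented for this sketch.
docs = {
    1: "Google indexes billions of web pages.",
    2: "Keyword search finds pages that contain the query words.",
    3: "Information extraction pulls names and dates from text.",
}
index = build_index(docs)
print(sorted(search(index, "pages")))      # documents mentioning "pages"
print(sorted(search(index, "web pages")))  # documents mentioning both words
```

Notice that the index has no idea what a "page" is. It only knows which documents contain that exact string of letters, which is why plain keyword search counts as the low end of understanding.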
Information Extraction - Pulling relevant and specific information from text. For instance, culling your vast store of emails for very specific information. People - Places - Things - Dates - Times - these are the types of things that would be useful for an office worker. Imagine a process that automagically extracts this information from text, such as emails or even instant messages. So, here we have a process that is generally immutable: a predefined type of information (a person, a place, an event) is being extracted from text. This differs from keyword searching because it constrains the user. In text search the user can search for any keyword; however, the results may not be useful for what the user was trying to find out. IE makes the question much more clearly defined: Give me all the NAMES of PEOPLE in this document. Give me all the NAMES of LOCATIONS in this web page. Give me all the PRICES of COMPUTERS in this web site. Specific questions, specific answers.
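As a rough sketch of "specific questions, specific answers", the Python snippet below pulls three predefined types of information (prices, dates and times) out of a message using regular expressions. The patterns, the labels and the sample message are all assumptions made up for illustration. They cover only a few simple formats, and extracting names of people or places generally needs much more than regular expressions.

```python
import re

# Hypothetical patterns: each label is one predefined type of information.
PATTERNS = {
    "PRICE": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),   # e.g. $1,299.00
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),           # e.g. 2005-03-14
    "TIME": re.compile(r"\b\d{1,2}:\d{2}\s?(?:am|pm)\b", re.IGNORECASE),
}

def extract(text):
    """Return every match in the text, grouped by information type."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}

# Made-up message standing in for an email.
message = ("Meeting moved to 2005-03-14 at 2:30 pm. "
           "The new laptop costs $1,299.00, the desktop $849.")

for label, values in extract(message).items():
    print(label, values)
```

Unlike keyword search, the user cannot ask for anything they like here. The set of questions is fixed in advance by the patterns, and in exchange each answer is a specific value rather than a list of documents to read.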
Just another set of tools to add to your toolbox for the analysis of the markets. How do you do it? How do you do it effectively?
