Natural Language Processing (NLP)

Text Analytics | Text Processing::
Generally, textual data is simply a collection of characters and sentences. Starting with this raw data, you go through various processes to understand it - to convert the raw data into a machine-readable format.
Lexical Processing: First, you convert the raw text into words and, depending on your application's needs, into sentences or paragraphs as well.
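As an illustration, here is a minimal sketch of lexical processing in Python using the NLTK library; it assumes nltk is installed and its tokenizer data (e.g. 'punkt') has been downloaded, and the sample sentence is only an example.

# Minimal lexical-processing sketch with NLTK (assumption: nltk is installed
# and the 'punkt' tokenizer data has been downloaded via nltk.download('punkt')).
from nltk.tokenize import sent_tokenize, word_tokenize

raw_text = "NLP is fun. It turns raw text into a machine readable format."

sentences = sent_tokenize(raw_text)   # break the raw text into sentences
words = word_tokenize(raw_text)       # break the raw text into word tokens

print(sentences)  # ['NLP is fun.', 'It turns raw text into a machine readable format.']
print(words)      # ['NLP', 'is', 'fun', '.', 'It', 'turns', 'raw', ...]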
Syntactic Processing: The next step after lexical analysis is to extract more meaning from the sentence by using its syntax. Instead of looking only at the words, we look at the syntactic structure, i.e., the grammar of the language, to understand the meaning.
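A small sketch of one syntactic-processing step is part-of-speech tagging with NLTK; this assumes the 'averaged_perceptron_tagger' model has been downloaded, and the sentence and tags shown are only an example.

# Part-of-speech tagging sketch with NLTK (assumption: the
# 'averaged_perceptron_tagger' model has been downloaded).
import nltk
from nltk.tokenize import word_tokenize

tokens = word_tokenize("The quick brown fox jumps over the lazy dog")
tagged = nltk.pos_tag(tokens)  # attach a grammatical (POS) tag to every token

print(tagged)
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
#  ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]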
Semantic Processing: Lexical and syntactic processing do not suffice when it comes to building advanced NLP applications such as language translation, chatbots, etc. After the two steps above, the machine will still be incapable of actually understanding the meaning of the text. At some point, your machine should be able to identify synonyms, antonyms, etc. on its own.
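One common semantic task is looking up synonyms and antonyms; a possible sketch uses WordNet through NLTK, assuming the 'wordnet' corpus has been downloaded. The word "good" is just a sample input.

# Synonym/antonym lookup sketch using WordNet via NLTK (assumption: the
# 'wordnet' corpus has been downloaded via nltk.download('wordnet')).
from nltk.corpus import wordnet

synonyms, antonyms = set(), set()
for synset in wordnet.synsets("good"):
    for lemma in synset.lemmas():
        synonyms.add(lemma.name())       # every lemma name is a synonym candidate
        for ant in lemma.antonyms():
            antonyms.add(ant.name())     # antonyms are attached to lemmas

print(sorted(synonyms)[:5])  # a few synonyms of 'good'
print(sorted(antonyms))      # e.g. words like 'bad', 'evil', 'ill'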
Text Encoding::
Text data is collected in many languages. With so many languages in the world and the internet being accessed from many countries, there is a lot of text in non-English languages. To work with non-English text, we need to understand how all the other characters are stored. Computers can handle numbers directly and store them in registers, but they cannot store non-numeric characters as is. Alphabets and special characters have to be converted to a numeric value before they can be stored. Hence, the concept of encoding came into existence: every non-numeric character is encoded to a number using a code. These encoding techniques also had to be standardized so that different computer manufacturers would not use different schemes.
The first encoding standard to come into existence was ASCII (American Standard Code for Information Interchange). The ASCII standard assigned a unique code, known as the ASCII code, to each character on the keyboard. Later a new standard called Unicode was introduced, which supports all the languages in the world. If you are working on text processing, knowing how to handle encodings is very important: for any text processing task, we need to know what encoding the text uses and, if required, convert it to another encoding format. The two most popular encoding standards are:
1. American Standard Code for Information Interchange (ASCII)
2. Unicode
    a. UTF-8
    b. UTF-16
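To make this concrete, here is a small Python 3 sketch (Python 3 strings are Unicode by default); the sample string is only an illustration, chosen because it contains a non-ASCII character.

# Encoding/decoding sketch in Python 3, where str objects are Unicode text
# and .encode() converts them to bytes in a chosen encoding.
text = "Hola señor"                    # contains the non-ASCII character 'ñ'

utf8_bytes = text.encode("utf-8")      # 'ñ' is stored as two bytes in UTF-8
utf16_bytes = text.encode("utf-16")    # every character takes at least two bytes

print(utf8_bytes)                      # b'Hola se\xc3\xb1or'
print(len(utf8_bytes), len(utf16_bytes))

# ASCII has no code for 'ñ', so encoding to ASCII fails:
try:
    text.encode("ascii")
except UnicodeEncodeError as err:
    print("ASCII cannot encode this text:", err)

# Decoding with the matching encoding recovers the original string:
print(utf8_bytes.decode("utf-8"))      # Hola señor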