Text Analytics | Text Processing::
Generally, raw textual data is simply a collection of characters and sentences. Starting with this raw data, you will go through various processes to understand it and to convert it into a machine-readable format.
Lexical Processing: First, you will just convert the raw text into words and, depending on your application's needs, into sentences or paragraphs as well.
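As a minimal sketch of lexical processing in Python, the snippet below splits raw text into sentences and words using simple regular expressions. The sample sentence and the splitting heuristics are illustrative assumptions, not a production tokenizer:

    import re

    raw_text = "Text analytics is fun. It turns raw text into insight!"

    # Sentence split on terminal punctuation -- a crude heuristic.
    sentences = re.split(r'(?<=[.!?])\s+', raw_text)

    # Word split: lowercase alphabetic tokens, punctuation dropped.
    words = [re.findall(r"[a-z']+", s.lower()) for s in sentences]

    print(sentences)  # ['Text analytics is fun.', 'It turns raw text into insight!']
    print(words)      # [['text', 'analytics', 'is', 'fun'], ['it', 'turns', ...]]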
Syntactic Processing: The next step after lexical analysis is to extract more meaning from the sentence by using its syntax. Instead of looking only at the words, we look at the syntactic structures, i.e., the grammar of the language, to understand the meaning.
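As a rough illustration of syntactic processing, the sketch below uses NLTK's part-of-speech tagger. It assumes NLTK is installed and its tokenizer and tagger resources are downloaded (resource names may vary slightly across NLTK versions); the sample sentence is made up:

    import nltk

    # One-time resource downloads (these resource names are NLTK's own).
    nltk.download('punkt', quiet=True)
    nltk.download('averaged_perceptron_tagger', quiet=True)

    sentence = "The machine reads the grammar of the sentence."
    tokens = nltk.word_tokenize(sentence)

    # Each token is paired with a part-of-speech tag,
    # e.g. DT = determiner, NN = noun, VBZ = verb.
    tagged = nltk.pos_tag(tokens)
    print(tagged)
    # e.g. [('The', 'DT'), ('machine', 'NN'), ('reads', 'VBZ'), ...]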
Semantic Processing: Lexical and syntactic processing don't suffice when it comes to building advanced NLP applications such as language translation, chatbots, etc. Even after the two steps above, the machine will still be incapable of actually understanding the meaning of the text. At some point, your machine should be able to identify synonyms, antonyms, etc. on its own.
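One common way to give a machine access to synonyms and antonyms is a lexical database such as WordNet. Here is a minimal sketch using NLTK's WordNet interface, assuming NLTK and its wordnet corpus are available; the word 'good' is just an example:

    import nltk
    from nltk.corpus import wordnet

    nltk.download('wordnet', quiet=True)  # one-time corpus download

    synonyms, antonyms = set(), set()
    for synset in wordnet.synsets('good'):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name())
            for antonym in lemma.antonyms():
                antonyms.add(antonym.name())

    print(sorted(synonyms)[:5])  # a few of the synonyms WordNet knows for 'good'
    print(sorted(antonyms))      # e.g. ['bad', 'badness', 'evil', 'evilness', 'ill']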
Text Encoding::
Text data is collected in many languages. With so many languages in the world and the internet being accessed in many countries, there is a lot of text in non-English languages. To work with non-English text, we need to understand how all the other characters are stored. Computers can handle numbers directly and store them in registers, but they cannot store non-numeric characters as is. Alphabets and special characters have to be converted to a numeric value before they can be stored. Hence, the concept of encoding came into existence: every non-numeric character is encoded to a number using a code. These encoding techniques also had to be standardized, so that different computer manufacturers would not use different, incompatible schemes.
The first encoding standard that came into existence was ASCII (American Standard Code for Information Interchange). The ASCII standard assigned a unique code, known as the ASCII code, to each character on the keyboard. Later, a new standard called the Unicode standard emerged; it supports all the languages in the world. If you are working on text processing, knowing how to handle encodings is very important: for any text you process, you need to know what encoding it uses and, if required, convert it to another encoding format. The two most popular encoding standards are: 1. American Standard Code for Information Interchange (ASCII), and 2. Unicode, with its a. UTF-8 and b. UTF-16 encoding forms.
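A small Python sketch of what this looks like in practice. The sample string is an illustrative assumption; ord(), str.encode() and bytes.decode() are standard Python built-ins:

    text = "café ₹100"

    # ord() returns a character's numeric code; ASCII covers only 0-127.
    print(ord('A'))  # 65 -- the ASCII code of 'A'
    print(ord('₹'))  # 8377 -- a Unicode code point, outside ASCII's range

    # The same string encoded under the two common Unicode schemes.
    utf8_bytes = text.encode('utf-8')    # variable width: 1-4 bytes per character
    utf16_bytes = text.encode('utf-16')  # 2 or 4 bytes per character, plus a BOM

    print(len(utf8_bytes), len(utf16_bytes))  # UTF-8 is the more compact here

    # Decoding reverses the process, provided the correct encoding is named.
    print(utf8_bytes.decode('utf-8'))  # café ₹100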