What is Tokenization in NLP?


 Have you heard this term before?

 Are you familiar with this term?

 If not, do not worry, I’ll explain it in an easy way. Suppose I have one document. A document can contain multiple paragraphs, sentences, and words. For simplicity, suppose I have only one paragraph. Now I want to break this paragraph into words. Each of these words or units is called a token, and the process of splitting the text is called tokenization.
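To make the idea concrete, here is a minimal sketch that uses only Python's built-in re module. The sample paragraph and the regular expression are my own assumptions, chosen just for illustration; a real tokenizer handles many more cases.

```python
import re

# A made-up one-paragraph "document", used only for illustration.
paragraph = "Tokenization is easy. It breaks text into tokens!"

# \w+ matches a run of letters or digits (a word);
# [^\w\s] matches a single punctuation character.
tokens = re.findall(r"\w+|[^\w\s]", paragraph)

print(tokens)       # the list of tokens
print(len(tokens))  # how many tokens we got
```

Notice that punctuation marks come out as separate tokens here. That is a design choice of this sketch, not a rule that every tokenizer follows.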

That is, in word tokenization we break each sentence into words. For this we can use nltk.word_tokenize(). In the same manner, if we want to break our paragraph into sentences, we can use nltk.sent_tokenize(). Both functions return a list, so to find out how many sentences (or words) we got, we can pass that list to the len() function.

