What is Tokenization in NLP?

Tokenization

Have you heard this term before?

Are you familiar with it?

If not, do not worry, I'll explain it in an easy way. Suppose I have one document. A document can contain multiple paragraphs, sentences, and words. For simplicity, suppose I have only one paragraph. Now I want to break this paragraph into words. Each of these words or units is called a token, and the process is called tokenization.

That is, in tokenization we break each sentence into words, and for this we use nltk.word_tokenize(). In the same manner, if we want to break our paragraph into sentences, we can use nltk.sent_tokenize() for sentence tokenization, and the len() function to find out how many sentences we got.
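Here is a minimal sketch of how this looks in practice. The sample paragraph is made up for illustration, and you may need to run nltk.download('punkt') once so the tokenizer models are available.

```python
import nltk

# Uncomment on first use to fetch the tokenizer models
# nltk.download('punkt')

# A small made-up paragraph for illustration
paragraph = ("Tokenization breaks text into smaller units. "
             "Each unit is called a token. NLTK makes this easy.")

# Break the paragraph into sentences
sentences = nltk.sent_tokenize(paragraph)
print(sentences)
print("Number of sentences:", len(sentences))

# Break each sentence into words (tokens)
for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    print(words)
```

Running this prints the list of sentences, their count, and then the word tokens of each sentence.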
