Python | Stemming words with NLTK

Last Updated : 15 Apr, 2023

Stemming is the process of reducing the morphological variants of a word to its root or base form. Programs that perform stemming are commonly referred to as stemming algorithms or stemmers. For example, a stemming algorithm reduces the words "chocolates", "chocolatey" and "choco" to the root word "chocolate", and "retrieval", "retrieved", "retrieves" to the stem "retrieve".

Prerequisite: Introduction to Stemming

Some more examples of words that stem to the root word "like" include (a quick check with NLTK's PorterStemmer follows the list):

-> "likes"
-> "liked"
-> "likely"
-> "liking"

Errors in Stemming: There are mainly two errors in stemming – over-stemming and under-stemming. Over-stemming occurs when two words with different meanings are reduced to the same stem (the stemmer removes too much). Under-stemming occurs when two words that share the same root are reduced to different stems (the stemmer removes too little).
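
A small sketch of both errors with the Porter stemmer; the word sets below ("universal"/"university"/"universe" and "alumnus"/"alumni") are common textbook illustrations, not examples from this article:

Python3

from nltk.stem import PorterStemmer

ps = PorterStemmer()

# over-stemming: words with different meanings collapse to one stem
# (all three typically reduce to "univers")
print([ps.stem(w) for w in ["universal", "university", "universe"]])

# under-stemming: words that share a root end up with different stems
# (typically "alumnu" vs "alumni")
print([ps.stem(w) for w in ["alumnus", "alumni"]])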

Applications of stemming are:  

  • Stemming is used in information retrieval systems like search engines.
  • It is used to determine domain vocabularies in domain analysis.

Stemming is desirable because it reduces redundancy: most of the time, a word's stem and its inflected or derived forms convey the same meaning.
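
As a minimal illustration of this in a retrieval setting, the sketch below stems both the query and the document words before matching; the documents and the query are made up for this example:

Python3

from nltk.stem import PorterStemmer

ps = PorterStemmer()

# toy documents and query, made up for illustration
docs = {
    "doc1": "I love writing programs",
    "doc2": "The cat sat on the mat",
}
query = "programming"

query_stem = ps.stem(query)
for name, text in docs.items():
    # compare stems rather than surface forms
    doc_stems = {ps.stem(word) for word in text.lower().split()}
    if query_stem in doc_stems:
        print(name, "matches the query", query)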

Below is the implementation of stemming words using NLTK:

Code #1:  

Python3




# import these modules
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
 
ps = PorterStemmer()
 
# choose some words to be stemmed
words = ["program", "programs", "programmer", "programming", "programmers"]
 
for w in words:
    print(w, " : ", ps.stem(w))


Output: 

program  :  program
programs  :  program
programmer  :  program
programming  :  program
programmers  :  program

Code #2: Stemming words from sentences

Python3




# importing modules
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
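# word_tokenize needs the 'punkt' tokenizer models;
# if they are missing, run nltk.download('punkt') once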
 
ps = PorterStemmer()
 
sentence = "Programmers program with programming languages"
words = word_tokenize(sentence)
 
for w in words:
    print(w, " : ", ps.stem(w))


Output : 

Programmers  :  program
program  :  program
with  :  with
programming  :  program
languages  :  language

Code #3: Using reduce():

Algorithm :

  1. Import the necessary modules: PorterStemmer and word_tokenize from nltk, and reduce from functools.
  2. Create an instance of the PorterStemmer class.
  3. Define a sample sentence to be stemmed.
  4. Tokenize the sentence into individual words using word_tokenize.
  5. Use reduce to apply the PorterStemmer to each word in the tokenized sentence, and join the stemmed words back into a string.
  6. Print the stemmed sentence.
Before running the code, install NLTK with pip install nltk.

Python3




from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from functools import reduce
 
ps = PorterStemmer()
 
sentence = "Programmers program with programming languages"
words = word_tokenize(sentence)
 
# using reduce to apply stemmer to each word and join them back into a string
# strip() removes the leading space left by the empty initial accumulator
stemmed_sentence = reduce(lambda x, y: x + " " + ps.stem(y), words, "").strip()
 
print(stemmed_sentence)
# This code is contributed by Pushpa.


Output:

Programm program with program language

Time complexity:
Tokenizing the sentence and stemming each token are both linear in the length of the input, i.e. O(n). The reduce call applies the stemmer once per token, but because each step concatenates onto a growing string (copying the accumulated result), building the output can take up to O(n^2) time in the worst case for very long sentences.

Space complexity:
The space complexity is O(n), where n is the length of the input sentence: the token list and the resulting stemmed sentence are both proportional to the size of the input.
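
For comparison, the same result can be built with str.join, which stems each token in a single pass and avoids repeatedly copying a growing string (a sketch using the same sentence as Code #3):

Python3

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()

sentence = "Programmers program with programming languages"
words = word_tokenize(sentence)

# join concatenates all stems in one pass
stemmed_sentence = " ".join(ps.stem(w) for w in words)
print(stemmed_sentence)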


