Try caching the stopwords object, as shown below. Constructing this each time you call the function seems to be the bottleneck.

    from nltk.corpus import stopwords

    cachedStopWords = stopwords.words("english")

    def testFuncOld():
        # Reloads the stopword list for every word tested
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split() if word not in stopwords.words("english")])

    def testFuncNew():
        # Reuses the list built once at import time
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split() if word not in cachedStopWords])

    if __name__ == "__main__":
        for i in range(10000):
            testFuncOld()
            testFuncNew()

I ran this through the profiler with `python -m cProfile -s cumulative` followed by the script name. The relevant lines are posted below.

    nCalls  Cumulative Time (s)  Function
    10000   7.723                testFuncOld
    10000   0.140                testFuncNew

So, caching the stopwords list gives a ~55x speedup (7.723 s down to 0.140 s over 10,000 calls).
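If you want to try the pattern without installing NLTK, here is a minimal sketch. `load_stopwords()` is a hypothetical stand-in for `stopwords.words("english")`, and the word list is illustrative only. It also stores the cached words in a `frozenset`, which is an addition beyond the answer above: membership tests on a set are O(1) on average, whereas a list is scanned linearly. Timings will vary by machine, so none are quoted here.

```python
import timeit

# Illustrative word list only; the real NLTK list is longer.
_RAW = "i me my we our you he she it they the a an and but if or of at by for with to in on"


def load_stopwords():
    """Hypothetical stand-in for stopwords.words("english"): rebuilds the list on each call."""
    return _RAW.split()


# Build once at import time. The frozenset is an assumption added here,
# not something the original code does (it caches a plain list).
CACHED_STOPWORDS = frozenset(load_stopwords())


def strip_uncached(text):
    # Calls load_stopwords() again for every word in the text.
    return " ".join(w for w in text.split() if w not in load_stopwords())


def strip_cached(text):
    # Looks each word up in the set that was built once.
    return " ".join(w for w in text.split() if w not in CACHED_STOPWORDS)


if __name__ == "__main__":
    sample = "hello bye the the hi"
    print(strip_cached(sample))
    print("uncached:", timeit.timeit(lambda: strip_uncached(sample), number=10000))
    print("cached:  ", timeit.timeit(lambda: strip_cached(sample), number=10000))
```

With the real corpus you would replace `load_stopwords()` by `stopwords.words("english")` and keep the rest unchanged.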


