Processing Text with NLTK

NLTK (Natural Language Toolkit) is a powerful tool for processing text.

Installation

pip install nltk

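Start a Python interpreter, run import nltk, and call nltk.download() to open the downloader and fetch the corpora and models that NLTK needs. You can also pass a package identifier, for example nltk.download('brown') for the Brown corpus used later in this post.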

Tokenization

Split into words (pass in a sentence)

import nltk

sentence = 'hello,world!'

tokens = nltk.word_tokenize(sentence)

tokens is now a list of the individual tokens:

['hello', ',', 'world', '!']

Split into sentences (pass in a document made up of several sentences)

text = 'This is a text. I want to split it.'

sens = nltk.sent_tokenize(text)

sens is a list of the sentences that were found:

['This is a text.', 'I want to split it.']
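The two steps are usually chained: split a document into sentences, then split each sentence into tokens. The sketch below shows that chaining. The helper tokenize_document and the two regex stand-ins are hypothetical names introduced here so the example runs without NLTK data; they only approximate NLTK's tokenizers, and with NLTK installed you would pass nltk.sent_tokenize and nltk.word_tokenize instead.

```python
import re

def tokenize_document(text, sent_tokenize, word_tokenize):
    """Split text into sentences, then split every sentence into tokens."""
    return [word_tokenize(sentence) for sentence in sent_tokenize(text)]

def simple_sent_tokenize(text):
    # Rough stand-in: a sentence is a run of characters ending in . ! or ?
    # (text after the last terminator is ignored).
    return [s.strip() for s in re.findall(r'[^.!?]+[.!?]', text)]

def simple_word_tokenize(sentence):
    # Rough stand-in: runs of word characters, or single punctuation marks.
    return re.findall(r'\w+|[^\w\s]', sentence)

words = tokenize_document('This is a text. I want to split it.',
                          simple_sent_tokenize, simple_word_tokenize)
print(words)
# [['This', 'is', 'a', 'text', '.'], ['I', 'want', 'to', 'split', 'it', '.']]
```

The result is a list of token lists, one per sentence, which is the shape a per-sentence tagger expects.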

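Part-of-speech tagging

nltk.pos_tag takes one tokenized sentence (a list of tokens). Here words is assumed to be a list of such token lists, one per sentence of a two-sentence document:

tags = [nltk.pos_tag(tokens) for tokens in words]

Each sentence becomes a list of (token, tag) pairs: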

[[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('text', 'NN'), ('for', 'IN'), ('test', 'NN'), ('.', '.')], [('And', 'CC'), ('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'), ('learn', 'VB'), ('how', 'WRB'), ('to', 'TO'), ('use', 'VB'), ('nltk', 'NN'), ('.', '.')]]
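Because the result is a list of sentences, each holding (token, tag) pairs, it often helps to flatten it before filtering. The following sketch copies the tagged output shown above as literal data (so it runs without NLTK) and pulls out the nouns and verbs; the variable names are illustrative.

```python
from collections import Counter

# Tagged output copied from the example above.
tags = [
    [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('text', 'NN'),
     ('for', 'IN'), ('test', 'NN'), ('.', '.')],
    [('And', 'CC'), ('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'),
     ('learn', 'VB'), ('how', 'WRB'), ('to', 'TO'), ('use', 'VB'),
     ('nltk', 'NN'), ('.', '.')],
]

# Flatten the per-sentence lists into one list of (token, tag) pairs.
tagged_words = [pair for sentence in tags for pair in sentence]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged_words)

# All noun tags start with NN, all verb tags with VB.
nouns = [word for word, tag in tagged_words if tag.startswith('NN')]
verbs = [word for word, tag in tagged_words if tag.startswith('VB')]

print(tag_counts['NN'])  # 3
print(nouns)             # ['text', 'test', 'nltk']
print(verbs)             # ['is', 'want', 'learn', 'use']
```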

Appendix: part-of-speech tags used by NLTK (Penn Treebank tag set)

CC      Coordinating conjunction

CD      Cardinal number

DT      Determiner (e.g. the, a, this, that, these, those, some, any, each, every, no, either, neither, another)

EX      Existential there

FW      Foreign word

IN      Preposition or subordinating conjunction

JJ      Adjective (also used for ordinal numerals)

JJR     Adjective, comparative

JJS     Adjective, superlative

LS      List item marker

MD      Modal (modal auxiliary verb)

NN      Noun, singular or mass (common noun)

NNS     Noun, plural (common noun)

NNP     Proper noun, singular

NNPS    Proper noun, plural

PDT     Predeterminer

POS     Possessive ending

PRP     Personal pronoun

PRP$    Possessive pronoun

RB      Adverb

RBR     Adverb, comparative

RBS     Adverb, superlative

RP      Particle

SYM     Symbol

TO      to (as a preposition or as the infinitive marker)

UH      Interjection

VB      Verb, base form

VBD     Verb, past tense

VBG     Verb, gerund or present participle

VBN     Verb, past participle

VBP     Verb, non-3rd person singular present

VBZ     Verb, 3rd person singular present

WDT     Wh-determiner (e.g. which, what)

WP      Wh-pronoun (e.g. who, whom, what)

WP$     Possessive wh-pronoun (whose)

WRB     Wh-adverb (e.g. how, where, when)
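Many tags in this list share a two-letter prefix (NN, VB, JJ, RB), so a fine-grained tag can be reduced to a coarse word class by looking at its first two characters. The helper below is a hypothetical illustration of that grouping, not part of NLTK.

```python
# Coarse word classes keyed by tag prefix (an illustrative grouping).
COARSE_CLASSES = {
    'NN': 'noun',
    'VB': 'verb',
    'JJ': 'adjective',
    'RB': 'adverb',
}

def coarse_pos(tag):
    """Map a Penn Treebank tag to a coarse class, or 'other' if none applies."""
    return COARSE_CLASSES.get(tag[:2], 'other')

print([coarse_pos(t) for t in ['NNPS', 'VBZ', 'JJR', 'RBS', 'WRB', 'DT']])
# ['noun', 'verb', 'adjective', 'adverb', 'other', 'other']
```

Note that WRB falls under 'other' because the lookup uses the first two characters (WR), not the last two.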

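Keyword extraction

How do we extract keywords from a passage? The main idea is to tokenize the text, tag each token with its part of speech, and then keep the nouns, merging adjacent tags into noun phrases where a rule allows it.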
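The listing that follows tags words with a chain of taggers whose last fallback is a list of regular-expression rules. NLTK's RegexpTagger is documented to try its patterns in order and to use the first one that matches. The sketch below mimics only that fallback in plain Python, so it runs without NLTK or the Brown corpus; fallback_tag is a hypothetical name, and the rules mirror those of the regexp_tagger in the listing.

```python
import re

RULES = [
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),    # numbers; \. matches a literal decimal point
    (r'(-|:|;)$', ':'),                  # a single dash, colon, or semicolon
    (r'\'*$', 'MD'),                     # tokens made only of apostrophes
    (r'(The|the|A|a|An|an)$', 'AT'),     # articles (Brown tag AT)
    (r'.*able$', 'JJ'),
    (r'^[A-Z].*$', 'NNP'),               # capitalized words
    (r'.*ness$', 'NN'),
    (r'.*ly$', 'RB'),
    (r'.*s$', 'NNS'),
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*', 'NN'),                       # default: everything else is a noun
]
COMPILED = [(re.compile(pattern), tag) for pattern, tag in RULES]

def fallback_tag(word):
    """Return the tag of the first rule whose pattern matches from the start of word."""
    for pattern, tag in COMPILED:
        if pattern.match(word):
            return tag
    return None

for w in ['3.14', 'the', 'London', 'quickly', 'dogs', 'running', 'walked', 'table', 'cat']:
    print(w, fallback_tag(w))
```

Rule order matters: 'happiness' is tagged NN because the ness rule comes before the plural s rule, while 'table' is tagged JJ by the able rule. Rules like these are only a crude last resort, which is why the listing puts taggers trained on the Brown corpus in front of them.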

# -*- coding=UTF-8 -*-

import nltk

from nltk.corpus import brown

from nltk.stem import SnowballStemmer

from nltk.corpus import stopwords

# This is our fast Part of Speech tagger

#############################################################################

brown_train = brown.tagged_sents(categories='news')

regexp_tagger = nltk.RegexpTagger(

    [(r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),

    (r'(-|:|;)$', ':'),

    (r'\'*$', 'MD'),

    (r'(The|the|A|a|An|an)$', 'AT'),

    (r'.*able$', 'JJ'),

    (r'^[A-Z].*$', 'NNP'),

    (r'.*ness$', 'NN'),

    (r'.*ly$', 'RB'),

    (r'.*s$', 'NNS'),

    (r'.*ing$', 'VBG'),

    (r'.*ed$', 'VBD'),

    (r'.*', 'NN')

])

unigram_tagger = nltk.UnigramTagger(brown_train, backoff=regexp_tagger)

bigram_tagger = nltk.BigramTagger(brown_train, backoff=unigram_tagger)

#############################################################################

# This is our semi-CFG; Extend it according to your own needs

#############################################################################

cfg = {}

cfg["NNP+NNP"] = "NNP"

cfg["NN+NN"] = "NNI"

cfg["NNI+NN"] = "NNI"

cfg["JJ+JJ"] = "JJ"

cfg["JJ+NN"] = "NNI"

#############################################################################

class NPExtractor(object):

    # Split the sentence into single words/tokens

    def tokenize_sentence(self, sentence):

        tokens = nltk.word_tokenize(sentence)

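        # Drop punctuation, numbers, and one-character tokens, then lowercase.
        # Stop words are kept on purpose: with TF-IDF weighting they do not need to be removed.

        tokens = [w.lower() for w in tokens if w.isalpha() and len(w) > 1]

        # Stemming. Note that lowercased stems are often missing from the Brown
        # vocabulary, so many of them fall back to the regexp tagger.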

        stemmer=SnowballStemmer('english')

        tokens=[stemmer.stem(w) for w in tokens]

        return tokens

    # Normalize brown corpus' tags ("NN", "NN-PL", "NNS" > "NN")

    def normalize_tags(self, tagged):

        n_tagged = []

        for t in tagged:

            if t[1] == "NP-TL" or t[1] == "NP":

                n_tagged.append((t[0], "NNP"))

                continue

            if t[1].endswith("-TL"):

                n_tagged.append((t[0], t[1][:-3]))

                continue

            if t[1].endswith("S"):

                n_tagged.append((t[0], t[1][:-1]))

                continue

            n_tagged.append((t[0], t[1]))

        return n_tagged

    # Extract the main topics from the sentence

    def extract(self,sentence):

        tokens = self.tokenize_sentence(sentence)

        tags = self.normalize_tags(bigram_tagger.tag(tokens))

        merge = True

        while merge:

            merge = False

            for x in range(0, len(tags) - 1):

                t1 = tags[x]

                t2 = tags[x + 1]

                key = "%s+%s" % (t1[1], t2[1])

                value = cfg.get(key, '')

                if value:

                    merge = True

                    tags.pop(x)

                    tags.pop(x)

                    match = "%s %s" % (t1[0], t2[0])

                    pos = value

                    tags.insert(x, (match, pos))

                    break

        matches = []

        for t in tags:

            if t[1] == "NNP" or t[1] == "NNI" or t[1]=="NN":

                matches.append(t[0])

        return matches

Calling the extract method of an NPExtractor instance on a piece of text returns its keywords as a list of strings.
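To see what the merge loop inside extract does without loading the Brown corpus, the sketch below runs the same rule table on tokens that were tagged by hand. The function name merge_noun_phrases and the sample tags are assumptions made for illustration; the rule table is copied from the listing.

```python
# Rule table from the listing: "left tag+right tag" -> tag of the merged phrase.
cfg = {
    "NNP+NNP": "NNP",
    "NN+NN": "NNI",
    "NNI+NN": "NNI",
    "JJ+JJ": "JJ",
    "JJ+NN": "NNI",
}

def merge_noun_phrases(tags):
    """Merge adjacent (word, tag) pairs until no rule applies, then keep the nouns."""
    tags = list(tags)  # work on a copy
    merged = True
    while merged:
        merged = False
        for i in range(len(tags) - 1):
            (w1, t1), (w2, t2) = tags[i], tags[i + 1]
            new_tag = cfg.get("%s+%s" % (t1, t2))
            if new_tag:
                # Replace the two neighbours with one merged entry and rescan.
                tags[i:i + 2] = [("%s %s" % (w1, w2), new_tag)]
                merged = True
                break
    return [word for word, tag in tags if tag in ("NNP", "NNI", "NN")]

# Hand-assigned tags; BEZ is the Brown corpus tag for "is".
hand_tagged = [("natural", "JJ"), ("language", "NN"), ("processing", "NN"),
               ("is", "BEZ"), ("fun", "NN")]
print(merge_noun_phrases(hand_tagged))
# ['natural language processing', 'fun']
```

Here JJ+NN first merges "natural language" into an NNI phrase, and NNI+NN then absorbs "processing". Restarting the scan after every merge keeps the logic simple at the cost of some repeated work.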

For more details, see the official NLTK documentation.
