Extract a candidate's name from a text file using Python and NLTK

Asked: 2017-11-30 10:30:48

Tags: python nltk

import re
import spacy
import nltk
from nltk.corpus import stopwords
stop = stopwords.words('english')
from nltk.corpus import wordnet

inputfile = open('inputfile.txt', 'r')
String = inputfile.read()
inputfile.close()
nlp = spacy.load('en_core_web_sm')

def candidate_name_extractor(input_string, nlp):
    input_string = str(input_string)

    doc = nlp(input_string)

    # Extract entities
    doc_entities = doc.ents

    # Subset to person type entities
    doc_persons = filter(lambda x: x.label_ == 'PERSON', doc_entities)
    doc_persons = filter(lambda x: len(x.text.strip().split()) >= 2, doc_persons)
    doc_persons = list(map(lambda x: x.text.strip(), doc_persons))
    print(doc_persons)
    # Assume the first PERSON entity with at least two tokens is the candidate's name
    candidate_name = doc_persons[0]
    return candidate_name

if __name__ == '__main__':
    names = candidate_name_extractor(String, nlp)

print(names)

I want to extract the candidate's name from a text file, but it returns the wrong value. And when I remove the list() around map, the map also doesn't work and gives an error.
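Regarding the map error: in Python 3, map() returns a lazy iterator, not a list, so it cannot be indexed (e.g. doc_persons[0]) unless it is first materialized with list(). A minimal illustration (the sample names are hypothetical):

```python
# map() gives an iterator in Python 3; indexing it raises TypeError.
persons = map(str.strip, [' Ram Prasad ', ' Sita Devi '])
try:
    persons[0]
except TypeError as e:
    print('cannot index a map object:', e)

# Wrapping the map in list() materializes it, so indexing works.
persons = list(map(str.strip, [' Ram Prasad ', ' Sita Devi ']))
print(persons[0])  # prints "Ram Prasad"
```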

2 Answers:

Answer 0 (score: 2):

import re
import nltk
from nltk.corpus import stopwords
stop = stopwords.words('english')
from nltk.corpus import wordnet

String = 'Ravana was killed in a war'

Sentences = nltk.sent_tokenize(String)
Tokens = []
for Sent in Sentences:
    Tokens.append(nltk.word_tokenize(Sent)) 
Words_List = [nltk.pos_tag(Token) for Token in Tokens]

Nouns_List = []

for List in Words_List:
    for Word in List:
        if re.match('NN.*', Word[1]):  # noun tags: NN, NNS, NNP, NNPS
            Nouns_List.append(Word[0])

Names = []
for Nouns in Nouns_List:
    if not wordnet.synsets(Nouns):
        Names.append(Nouns)

print (Names)

Check this code. I got Ravana as the output.

Edit:

I created a text file from a few sentences of a résumé and used it as the input to my program. Only the changed part of the code is shown below:

import io

File = io.open("Documents\\Temp.txt", 'r', encoding = 'utf-8')
String = File.read()
String = re.sub(r'[/.@%\d]', '', String)  # strip slashes, dots, @, %, and digits

It returns all the names that are not in the WordNet corpus, such as my name, other people's names, places, and university names and locations.
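A quick check of this cleanup step on a sample line (the sample string is hypothetical; note that inside a regex character class, | and + are literal characters, so [/.@%\d] is a tidied equivalent of the class above):

```python
import re

# Strip slashes, dots, @ signs, percent signs, and digits before tagging,
# so emails and phone numbers don't end up in the noun list.
sample = 'john.doe@example.com / 98765 / 100% match'
cleaned = re.sub(r'[/.@%\d]', '', sample)
print(cleaned.split())  # prints "['johndoeexamplecom', 'match']"
```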

Answer 1 (score: 0):

From the list of words obtained after part-of-speech tagging, use a regular expression to extract all words that have a noun tag:

Nouns_List = []
for Word in nltk.pos_tag(Words_List):
    if re.match('NN.*', Word[1]):
        Nouns_List.append(Word[0])

For each word in Nouns_List, check whether it is an English word. This can be done by checking whether the word has any synsets available in wordnet.

Since Indian names are unlikely to be entries in an English dictionary, this can be a possible way of extracting them from the text.
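The noun-plus-dictionary filtering described above can be sketched without the NLTK data files by standing in a plain set for the WordNet lookup. The tagged pairs and the english set below are hypothetical stand-ins for nltk.pos_tag output and wordnet.synsets:

```python
import re

def non_dictionary_nouns(tagged_words, dictionary):
    """Keep noun-tagged words that the dictionary does not know.
    tagged_words: (word, POS-tag) pairs as produced by nltk.pos_tag;
    dictionary: a set standing in for wordnet.synsets lookups."""
    nouns = [w for w, tag in tagged_words if re.match('NN.*', tag)]
    return [w for w in nouns if w.lower() not in dictionary]

# Hypothetical pre-tagged sentence (the shape nltk.pos_tag would return):
tagged = [('Ravana', 'NNP'), ('was', 'VBD'), ('killed', 'VBN'),
          ('in', 'IN'), ('a', 'DT'), ('war', 'NN')]
english = {'war'}  # stand-in for words WordNet recognises
print(non_dictionary_nouns(tagged, english))  # prints "['Ravana']"
```

With the real corpora, the membership test would be `not wordnet.synsets(word)` instead of the set lookup.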