Question

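I am working on natural language processing and need to preprocess some data. My data is in a text file; I have to read it and change every name to Male or Female.

After reading the data and tokenizing it, I apply POS tagging, check each name against a file containing a list of names, and change the name to 'Male' or 'Female'.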

For example:

['Jack', 'and', 'Jill', 'Went', 'up', 'the', 'hill']

should be changed to

['Male', 'and', 'Female', 'Went', 'up', 'the', 'hill']

based on the following POS tags:

[('Jack', 'NNP'), ('and', 'CC'), ('Jill', 'NNP'), ('Went', 'NNP'), ('up', 'IN'), ('the', 'DT'), ('hill', 'NN')]

My code is as follows:

import nltk

text = open('collegegirl.txt').read()

with open('male_names.txt') as f1:
    male = nltk.word_tokenize(f1.read())

with open('female_names.txt') as f2:
    female = nltk.word_tokenize(f2.read())  

data = nltk.pos_tag(nltk.word_tokenize(text))
for word, pos in data:
    if(pos == 'NNP'):
        if word in male:
            word = 'Male'
        if word in female:
            word = 'Female'

The code above only checks the words and does not write anything back. How can I edit the names in data? I am new to Python. Thanks in advance.

Answer 1

Tokenize the text and do the replacement in a for loop, writing back to the list by index:

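# enumerate gives the index, so the element in data can be replaced in place
for i, (word, pos) in enumerate(data):
    if pos == 'NNP':
        if word in male:
            data[i] = ('Male', pos)
        if word in female:
            data[i] = ('Female', pos)
# keep only the tokens, dropping the POS tags
array = [token for (token, pos) in data]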
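
The loop in the question has no effect because assigning to `word` only rebinds the loop variable; the list itself is never touched. A minimal sketch with a small hand-tagged list (hypothetical data, so it runs without NLTK) shows the difference:

```python
# Hypothetical, hand-tagged data so the sketch runs without NLTK.
data = [('Jack', 'NNP'), ('and', 'CC'), ('Jill', 'NNP')]

# Rebinding the loop variable leaves the list untouched.
for word, pos in data:
    if word == 'Jack':
        word = 'Male'
unchanged = list(data)

# Assigning through the index replaces the element in place.
for i, (word, pos) in enumerate(data):
    if word == 'Jack':
        data[i] = ('Male', pos)

print(unchanged[0])  # ('Jack', 'NNP')
print(data[0])       # ('Male', 'NNP')
```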

A more Pythonic way:

array = [("Female" if x in female else "Male" if x in male else x) if pos == "NNP" else x for (x, pos) in data]
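
If the one-liner is hard to read, the same idea can be wrapped in a small function. The sketch below is only one possible arrangement: `replace_names` is a hypothetical helper name, the name lists are converted to sets so each lookup is cheap, and the tagged input is written by hand so the example runs without NLTK. A name found in both lists is labelled 'Female', matching the loop above, where the female check runs last.

```python
def replace_names(tagged, male_names, female_names):
    """Return the tokens of `tagged`, with listed proper nouns replaced."""
    male = set(male_names)
    female = set(female_names)
    result = []
    for word, pos in tagged:
        if pos == 'NNP' and word in female:
            result.append('Female')
        elif pos == 'NNP' and word in male:
            result.append('Male')
        else:
            result.append(word)
    return result

# Hand-tagged input taken from the example in the question.
tagged = [('Jack', 'NNP'), ('and', 'CC'), ('Jill', 'NNP'), ('Went', 'NNP'),
          ('up', 'IN'), ('the', 'DT'), ('hill', 'NN')]
tokens = replace_names(tagged, ['Jack'], ['Jill'])
print(tokens)
# ['Male', 'and', 'Female', 'Went', 'up', 'the', 'hill']
```

With real data, `tagged` would be the output of `nltk.pos_tag(nltk.word_tokenize(text))` and the two name lists would come from the name files.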

Answer 2

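Personally, I think it is better to use spaCy for POS tagging; in my experience it is faster and more accurate. You can also use its named entity recognition to check whether a word is a PERSON. Install spaCy and download the en_core_web_lg model from https://spacy.io/usage/

Your problem can be solved as follows: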

import spacy
from functools import reduce

nlp_spacy = spacy.load('en_core_web_lg')

NAMELIST = {'Christiano Ronaldo':'Male', 'Neymar':'Male', 'Messi':'Male', "Sandra":'Female'}

with open("input.txt") as f:
    text = f.read()

doc = nlp_spacy(text)

names_in_text = [(entity.text, NAMELIST[entity.text])  for entity in doc.ents if entity.label_ in ['PERSON'] and entity.text in NAMELIST]
print(names_in_text)       #------- prints [('Christiano Ronaldo', 'Male'), ('Messi', 'Male')]

replaced_text = reduce(lambda x, kv: x.replace(*kv), names_in_text, text)
print(replaced_text)       #------- prints Male scored three. Male scored one. Female is an athlete. I am from US.
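
One thing to watch with `str.replace` is that it replaces substrings, so a short name can also match inside a longer word. A possible alternative, sketched here with only the standard library, is a single regular expression with word boundaries. `replace_whole_names` is a hypothetical helper, the mapping mirrors `NAMELIST` above, and the sketch skips the entity check entirely, so it replaces every whole-word occurrence of a listed name.

```python
import re

# Hypothetical name-to-label mapping, mirroring NAMELIST above.
NAMELIST = {'Christiano Ronaldo': 'Male', 'Messi': 'Male', 'Sandra': 'Female'}

def replace_whole_names(text, mapping):
    # Longest names first so multi-word names win over their parts.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, names)) + r')\b')
    # Look up the label for whichever name matched.
    return pattern.sub(lambda m: mapping[m.group(0)], text)

sample = 'Messi met Sandra. The Messiah was not there.'
print(replace_whole_names(sample, NAMELIST))
# Male met Female. The Messiah was not there.
```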
