I have a machine learning task involving a large amount of text data. I want to identify and extract noun phrases from the training text so I can use them for feature construction later in the pipeline. I've worked out how to extract the kind of noun phrases I want, but I'm fairly new to NLTK, so I approached the problem in a way that lets me break out each step in the list comprehensions, as shown below.
But my real question is: am I reinventing the wheel here? Is there a faster way to do this that I'm not seeing?
import nltk
import pandas as pd
myData = pd.read_excel(r"\User\train_.xlsx")
texts = myData['message']
# Defining a grammar & Parser
NP = "NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunkr = nltk.RegexpParser(NP)
tokens = [nltk.word_tokenize(i) for i in texts]
tag_list = [nltk.pos_tag(w) for w in tokens]
phrases = [chunkr.parse(sublist) for sublist in tag_list]
leaves = [[subtree.leaves() for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP')] for tree in phrases]
# Flatten the list of lists of lists of tuples that we end up with
# into just a list of lists of tuples
leaves = [tupls for sublists in leaves for tupls in sublists]
# Join the extracted terms into one bigram
nounphrases = [unigram[0][0] + ' ' + unigram[1][0] for unigram in leaves]
Answer 0 (score: 4)
Take a look at Why is my NLTK function slow when processing the DataFrame? If you don't need the intermediate steps, there's no need to loop through all the rows multiple times.
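For instance, the question's separate list comprehensions can be collapsed into a single pass over each row, roughly like this (a sketch, not from the linked answer, reusing the chunkr parser, texts Series and myData DataFrame defined in the question; extract_noun_phrases is a hypothetical helper name):

import nltk

def extract_noun_phrases(sent):
    # Tokenize, tag and chunk one message in a single pass.
    tree = chunkr.parse(nltk.pos_tag(nltk.word_tokenize(sent)))
    # Join the words of every NP subtree into a phrase string.
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP')]

myData['noun_phrases'] = texts.apply(extract_noun_phrases)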
Using ne_chunk and the function below:

[code]:
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import pandas as pd
def get_continuous_chunks(text, chunk_func=ne_chunk):
    # Tokenize, POS-tag and chunk the text in one pass.
    chunked = chunk_func(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for subtree in chunked:
        if type(subtree) == Tree:
            # Inside a chunk: collect its tokens.
            current_chunk.append(" ".join([token for token, pos in subtree.leaves()]))
        elif current_chunk:
            # Outside a chunk: flush what has been collected so far.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
                current_chunk = []
        else:
            continue

    return continuous_chunk
df = pd.DataFrame({'text': ['This is a foo, bar sentence with New York city.',
                            'Another bar foo Washington DC thingy with Bruce Wayne.']})

df['text'].apply(lambda sent: get_continuous_chunks(sent))
[OUT]:
0 [New York]
1 [Washington, Bruce Wayne]
Name: text, dtype: object
Using a custom RegexpParser (the same function as above, this time passing chunker.parse as chunk_func):
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import pandas as pd
# Defining a grammar & Parser
NP = "NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunker = RegexpParser(NP)
def get_continuous_chunks(text, chunk_func=ne_chunk):
    chunked = chunk_func(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for subtree in chunked:
        if type(subtree) == Tree:
            current_chunk.append(" ".join([token for token, pos in subtree.leaves()]))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
                current_chunk = []
        else:
            continue

    return continuous_chunk
df = pd.DataFrame({'text': ['This is a foo, bar sentence with New York city.',
                            'Another bar foo Washington DC thingy with Bruce Wayne.']})
df['text'].apply(lambda sent: get_continuous_chunks(sent, chunker.parse))
[OUT]:
0 [bar sentence, New York city]
1 [bar foo Washington DC thingy, Bruce Wayne]
Name: text, dtype: object
Answer 1 (score: 1)
I'd suggest looking at this earlier thread: Extracting all Nouns from a text file using nltk
They suggest TextBlob as the easiest way to do this (if not the most efficient in terms of processing), and the discussion there should address your question.
from textblob import TextBlob
txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages."""
blob = TextBlob(txt)
print(blob.noun_phrases)
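To run this per row of the question's DataFrame, the same apply pattern works (a sketch; the file path and the 'message' column are the ones from the question, and noun_phrases is returned by TextBlob as a list-like WordList):

from textblob import TextBlob
import pandas as pd

myData = pd.read_excel(r"\User\train_.xlsx")  # path as given in the question
myData['noun_phrases'] = myData['message'].apply(lambda t: list(TextBlob(t).noun_phrases))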
Answer 2 (score: 0)
The methods above didn't give me the results I needed. Here is the function I would suggest:
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import re
def get_noun_phrases(text):
    pos = pos_tag(word_tokenize(text))

    count = 0
    half_chunk = ""
    for word, tag in pos:
        if re.match(r"NN.*", tag):
            # Consecutive nouns are accumulated into the same phrase.
            count += 1
            if count >= 1:
                half_chunk = half_chunk + word + " "
        else:
            # Any non-noun tag ends the current phrase.
            half_chunk = half_chunk + "---"
            count = 0

    # Split on the separators and drop empty strings.
    half_chunk = re.sub(r"-+", "?", half_chunk).split("?")
    half_chunk = [x.strip() for x in half_chunk if x != ""]
    return half_chunk
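A quick usage sketch with one of the sentences from the answer above (illustrative only; the exact phrases depend on the POS tagger's output):

sentence = "Another bar foo Washington DC thingy with Bruce Wayne."
print(get_noun_phrases(sentence))  # prints the runs of consecutive NN* tokens found by the tagger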