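Yep, still playing with Python.
I decided to try out Gensim, a tool for finding topics for a chosen word and its context.
So I was wondering how to find a word in a piece of text and extract 20 words along with it (as in the 10 words before that word and the 10 words after it), then save that together with other such extracts so that Gensim can be run on it.
What seems hard to me is finding a way to extract the 10 words before and after once the chosen word is found. I played with nltk before, and just by tokenizing the text into words or sentences it was easy to get hold of the sentences. Still, getting the words or sentences before and after that particular sentence seems hard to figure out.
For those who are confused (it's 1 am here, so I may be confusing), I'll show it with an example: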
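As soon as it had finished, all her blood rushed to her heart, for she was so angry to hear that Snow-White was yet living. "But now," thought she to herself, "will I make something which shall destroy her completely." Thus saying, she made a poisoned comb by arts which she understood, and then, disguising herself, she took the form of an old widow. She went over the seven hills to the house of the seven Dwarfs, and[15] knocking at the door, called out, "Good wares to sell to-day!"
If we say the word is Snow-White, then I want to extract this part:
her heart, for she was so angry to hear that Snow-White was yet living. "But now," thought she to herself, "will
That is the 10 words before and the 10 words after Snow-White.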
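If it can be done in nltk and that is easier, then getting the sentences before and after the sentence in which Snow-White appears would also be cool enough.
I mean whatever works best; I'd be happy with either of the two solutions if someone could help me.
If this can be done with Gensim too, and that is easier, then I'd be happy with that as well. So any of the three ways will be fine. I just want to try and see how this can be done, because my head is blank.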
Answer 0 (score: 7)
The process is called Keyword in Context (KWIC).
The first step is to split your input into words. There are many ways to do that with the regular expressions module; see re.split or re.findall, for example.
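As a minimal sketch of that first step, re.findall can pull the words out of a string. The pattern, the helper name tokenize, and the sample string are assumptions chosen for illustration; the pattern treats letters, apostrophes, and hyphens as word characters so that "Snow-White" stays in one piece.

```python
import re

# Illustrative pattern (an assumption): letters, apostrophes and hyphens
# count as word characters, so hyphenated words are kept whole.
WORD_PATTERN = r"[A-Za-z'\-]+"

def tokenize(text):
    """Return the words found in text, in order, without punctuation."""
    return re.findall(WORD_PATTERN, text)

sample = 'so angry to hear that Snow-White was yet living. "But now," thought she'
words = tokenize(sample)
print(words)
```

Splitting on whitespace with str.split would also work, but punctuation would then stay attached to the words and would have to be stripped before comparing.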
Having located a particular word, you can use slicing to find the ten words before it and the ten words after it.
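The slicing step could look roughly like this. The function name context_window and the pre-tokenized sample are hypothetical; the sketch handles only the first occurrence and clamps the start index so that a match near the beginning does not wrap around to the end of the list.

```python
def context_window(words, target, width=10):
    """Return up to `width` words either side of the first `target`, or None."""
    try:
        i = words.index(target)
    except ValueError:
        return None  # target does not occur in words
    # A negative start index would count from the end of the list, so clamp it.
    start = max(i - width, 0)
    return words[start:i + width + 1]

# Hypothetical sample: the passage with punctuation already removed.
words = ("all her blood rushed to her heart for she was so angry to hear that "
         "Snow-White was yet living But now thought she to herself will I make").split()
print(" ".join(context_window(words, "Snow-White")))
```

Slicing past the end of a list is safe in Python, so only the lower bound needs special care.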
To build an index for all words, a deque with a maxlen is convenient for implementing a sliding window.
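A rough sketch of the sliding-window idea, assuming the text has already been split into a list of words. The name kwic_windows and the None padding are illustrative choices, and the sketch reports the windows for one target word rather than building a full index.

```python
from collections import deque

def kwic_windows(words, target, width=10):
    """Yield the words around every occurrence of target (up to 2*width + 1)."""
    # A deque with maxlen drops the oldest word automatically on append.
    window = deque(maxlen=2 * width + 1)
    # Pad with None so words near either end still reach the centre slot.
    padded = [None] * width + list(words) + [None] * width
    for word in padded:
        window.append(word)
        if len(window) == window.maxlen and window[width] == target:
            yield [w for w in window if w is not None]

for context in kwic_windows("a b c d e f g".split(), "d", width=2):
    print(" ".join(context))
```

Because the deque discards the oldest word itself, the loop never has to compute slice bounds, and each word is visited once.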
Here is one way to do it efficiently using itertools:
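# Python 2 code: it relies on itertools.izip and the print statement.
from re import finditer
from itertools import tee, islice, izip, chain, repeat

def kwic(text, tgtword, width=10):
    'Find all occurrences of tgtword and show the surrounding context'
    # Character spans (start, end) of every word in the text.
    matches = (mo.span() for mo in finditer(r"[A-Za-z\'\-]+", text))
    # Pad both ends so words near the edges still get a full-width window.
    padded = chain(repeat((0,0), width), matches, repeat((-1,-1), width))
    # Three iterators over the same spans: window start, centre word, window end.
    t1, t2, t3 = tee((padded), 3)
    t2 = islice(t2, width, None)
    t3 = islice(t3, 2*width, None)
    for (start, _), (i, j), (_, stop) in izip(t1, t2, t3):
        if text[i: j] == tgtword:
            context = text[start: stop]
            yield context

# text is assumed to hold the passage from the question.
print list(kwic(text, 'Snow-White'))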
Answer 1 (score: 6)
text = """
As soon as it had finished, all her blood rushed to her heart, for she was so angry to hear that Snow-White was yet living. "But now," thought she to herself, "will I make something which shall destroy her completely." Thus saying, she made a poisoned comb by arts which she understood, and then, disguising herself, she took the form of an old widow. She went over the seven hills to the house of the seven Dwarfs, and[15] knocking at the door, called out, "Good wares to sell to-day!"
"""
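spl = text.split()

def ans(word):
    for ind, x in enumerate(spl):
        # Strip surrounding punctuation before comparing.
        if x.strip(",'\".!") == word:
            # max() stops a negative start index from wrapping around
            # when the word is among the first ten tokens.
            return " ".join(spl[max(ind - 10, 0):ind + 11])
    return None  # word not found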
>>> ans('Snow-White')
her heart, for she was so angry to hear that Snow-White was yet living. "But now," thought she to herself, "will
Answer 2 (score: 0)
Just wanted to update Raymond Hettinger's fantastic answer for Python 3. All you have to do is change izip to zip:
from re import finditer
from itertools import chain, islice, repeat, tee

def kwic(text, tgtword, width=20):
    'Find all occurrences of tgtword and show the surrounding context'
    matches = (mo.span() for mo in finditer(r"[A-Za-z\'\-]+", text))
    padded = chain(repeat((0,0), width), matches, repeat((-1,-1), width))
    t1, t2, t3 = tee((padded), 3)
    t2 = islice(t2, width, None)
    t3 = islice(t3, 2*width, None)
    for (start, _), (i, j), (_, stop) in zip(t1, t2, t3):
        if text[i: j] == tgtword:
            context = text[start: stop]
            yield context
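Also, for completeness, NLTK and Textacy both have this functionality built in. However, both are inferior to Raymond's answer, because both measure the window in characters rather than in tokens.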
import nltk
test = """
As soon as it had finished, all her blood rushed to her heart, for she was so angry to hear that Snow-White was yet living. "But now," thought she to herself, "will I make something which shall destroy her completely." Thus saying, she made a poisoned comb by arts which she understood, and then, disguising herself, she took the form of an old widow. She went over the seven hills to the house of the seven Dwarfs, and[15] knocking at the door, called out, "Good wares to sell to-day!"
"""
tokens = nltk.word_tokenize(test)
text = nltk.Text(tokens)
text.concordance('Snow-White', width=100)
Displaying 1 of 1 matches:
er heart , for she was so angry to hear that Snow-White was yet living . `` But now , '' thought she
from textacy.text_utils import KWIC
test = """
As soon as it had finished, all her blood rushed to her heart, for she was so angry to hear that Snow-White was yet living. "But now," thought she to herself, "will I make something which shall destroy her completely." Thus saying, she made a poisoned comb by arts which she understood, and then, disguising herself, she took the form of an old widow. She went over the seven hills to the house of the seven Dwarfs, and[15] knocking at the door, called out, "Good wares to sell to-day!"
"""
snow_white = KWIC(test, "Snow-White", window_width=50)
print(next(snow_white, ["Finished"]))
d to her heart, for she was so angry to hear that Snow-White was yet living. "But now," thought she to herself
['Finished']