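I am working on a text mining use case in Python. These are the sentences of interest:
As a result may continue to be adversely impacted by fluctuations in foreign currency exchange rates. Certain events, such as the threat of additional tariffs on imported consumer goods from China, have increased global economic and political uncertainty. Stores are primarily located in shopping malls and other shopping centers.
How can I extract the sentence that contains the keyword "China"? I also need the sentences before and after it, at least two sentences on each side.
I have tried the following, based on the answer here: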
import nltk
from nltk.tokenize import word_tokenize
sents = nltk.sent_tokenize(text)
my_sentences = [sent for sent in sents if 'China' in word_tokenize(sent)]
Please help!
Answer 0: (score: 1)
Use sent_tokenize, keep track of the index of the sentence that contains the focus word, and take a window of sentences around that index to get the desired result.
from nltk import sent_tokenize, word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

word_detokenize = TreebankWordDetokenizer().detokenize

text = """As a result may continue to be adversely impacted, by fluctuations in foreign currency exchange rates. Certain events such as the threat of additional tariffs on imported consumer goods from China, have increased global economic and political uncertainty and caused volatility in foreign currency exchange rates. Stores are primarily located in shopping malls and other shopping centers, certain of which have been experiencing declines in customer traffic."""

# Split the text into sentences, then split each sentence into word tokens.
tokenized_text = [word_tokenize(sent) for sent in sent_tokenize(text)]

# Indices of the sentences that contain the focus word.
sent_idx_with_china = [idx for idx, sent in enumerate(tokenized_text)
                       if 'China' in sent or 'china' in sent]

window = 2  # If you want 2 sentences before and after.

for idx in sent_idx_with_china:
    start = max(idx - window, 0)
    # The slice end is exclusive, so add 1 to keep `window` sentences after the match.
    end = min(idx + window + 1, len(tokenized_text))
    result = ' '.join(word_detokenize(sent) for sent in tokenized_text[start:end])
    print(result)
Another example, first pip install wikipedia:
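[out]:
Ashdown Forest in England, where the Winnie-the-Pooh stories are set, is a popular tourist attraction, and includes the wooden Pooh Bridge where Pooh and Piglet invented Poohsticks. The Oxford University Winnie the Pooh Society was founded by undergraduates in 1982. == Censorship in China == In China, images of Winnie-the-Pooh were censored from social media websites in mid-2017, when internet memes comparing Chinese president Xi Jinping to Pooh became popular. The 2018 film Christopher Robin was also denied a release in China.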
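The index-and-window idea from this answer can also be wrapped in a small reusable helper. The following is only a sketch under stated assumptions: `keyword_windows` and `naive_sent_split` are hypothetical names that do not come from NLTK, and the regex splitter is a rough stand-in so the snippet runs without NLTK data. In practice you would probably pass in the output of `sent_tokenize(text)` instead.

```python
import re


def naive_sent_split(text):
    """Naive stand-in for nltk.sent_tokenize (an assumption for this sketch):
    split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


def keyword_windows(sentences, keyword, window=2):
    """Return one string per matching sentence, including up to `window`
    sentences before and after it. Matching is whole-word, case-insensitive."""
    pattern = re.compile(r'\b' + re.escape(keyword) + r'\b', re.IGNORECASE)
    results = []
    for idx, sent in enumerate(sentences):
        if pattern.search(sent):
            start = max(idx - window, 0)
            # Slice ends are exclusive, hence the + 1.
            end = min(idx + window + 1, len(sentences))
            results.append(' '.join(sentences[start:end]))
    return results


sentences = naive_sent_split(
    "Rates moved. Costs rose. Tariffs on goods from China were threatened. "
    "Stores are in malls. Traffic declined. Sales fell."
)
windows = keyword_windows(sentences, "china", window=1)
print(windows)
```

Each match yields its own window, so two nearby matches produce overlapping results; merging them would be a separate step if you need deduplicated text.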