Question

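I used scrapy to crawl a website and collect thousands of .txt files, each containing natural-language text (descriptions of drug-induced experiences). Each file is named with a unique number. I also have a .csv file holding the metadata associated with each of these unique numbers (that is, a text_number column plus further columns with the metadata for that number). One of the metadata fields is the dose (in mg).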

Here is what I am trying to do:

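1. Find which .txt files contain an occurrence of a specific word ("self") within 5 words (to the left or right) of any one of 100 specific context words (I have an exact list).
2. Compute the mean dose (from the metadata) of the .txt files selected in step 1, so that I can compare it with the mean dose of all the .txt files.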

I really don't know how to go about this...

Answer 1

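I think regular expressions could be a good solution. They are fast, and you have a lot of data. I'm not sure what the best approach is, but here is one solution.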

Say your target word ('self') and your list of context words look like this:

target_word = 'self'
context_words = ['one', 'hundred', 'context', 'words']
#mine is much shorter than yours! ;)

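You can then build a regular expression, assuming the words are separated by spaces. I used one pattern for a context word that comes before the target word and one for a context word that comes after it, then combined them with or ('|'). I'm not sure that is necessary; I just couldn't easily think of another way.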

import re

# Up to 4 words in between, so the two words are at most 5 words apart.
matches_up_to_4_words = r'(?: [^ ]*){0,4} ?'
# \b stops 'one' from matching inside 'someone'; re.escape guards special characters.
matches_context_word = r'\b(?:' + '|'.join(map(re.escape, context_words)) + r')\b'
matches_target_word = r'\b' + re.escape(target_word) + r'\b'
context_before = matches_context_word + matches_up_to_4_words + matches_target_word
context_after = matches_target_word + matches_up_to_4_words + matches_context_word
pattern = re.compile('(?:' + context_before + '|' + context_after + ')')

matching_metadata = []
for filename in filenames:
    # Read as text so the str pattern applies in Python 3 (UTF-8 is an assumption).
    with open(filename, encoding='utf-8') as f:
        filestring = f.read()
    ## you can tokenize here for better word segmentation
    ## http://www.nltk.org/api/nltk.tokenize.html
    if pattern.search(filestring):
        print("the target word appeared near a context word")
        ## get the metadata (get_the_metadata is a placeholder you need to write)
        metadata = get_the_metadata(filename, filestring)
        matching_metadata.append(metadata)
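
If the texts contain punctuation or line breaks, the space-separated assumption behind the regex gets shaky. As a rough alternative, here is a minimal sketch that tokenizes first and then checks a window of tokens around each occurrence of the target word. The function name target_near_context and the crude regex tokenizer are my own assumptions for illustration; a proper tokenizer (for example from NLTK) could be swapped in.

```python
import re


def target_near_context(text, target, context_words, window=5):
    """Return True if `target` occurs within `window` tokens of any context word."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    target = target.lower()
    context = {w.lower() for w in context_words}
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        # Up to `window` tokens on the left and on the right of this occurrence.
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        if any(n in context for n in left + right):
            return True
    return False
```

Calling target_near_context(filestring, target_word, context_words) could replace the pattern.search(filestring) test in the loop. Because it compares whole tokens, 'myself' is not counted as an occurrence of 'self'.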

The metadata collected in matching_metadata for the matching files can then be stored and used.
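
For the second step of the question (comparing mean doses), here is a minimal sketch using only the standard library. The text_number column comes from the question; the dose column name dose_mg and the function name mean_doses are assumptions, so adjust them to match the real .csv file.

```python
import csv
from statistics import mean


def mean_doses(csv_path, matching_numbers, id_col="text_number", dose_col="dose_mg"):
    """Return (mean dose of the matching files, mean dose of all files)."""
    matching = {str(n) for n in matching_numbers}
    all_doses, matched_doses = [], []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            try:
                dose = float(row[dose_col])
            except (TypeError, ValueError):
                continue  # skip rows with a missing or non-numeric dose
            all_doses.append(dose)
            if (row.get(id_col) or "").strip() in matching:
                matched_doses.append(dose)
    # None signals that there was nothing to average.
    matched_mean = mean(matched_doses) if matched_doses else None
    overall_mean = mean(all_doses) if all_doses else None
    return matched_mean, overall_mean
```

The matching numbers would be the file names of the matched .txt files with the extension removed, assuming the files are named like 123.txt. Rows without a usable dose are left out of both means.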
