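Skipping an unknown number of words in Python

Time: 2015-06-29 13:19:41

Tags: python regex text-mining

Normally I just extract phrases and print them in a pre-specified format after running my script over my documents.

I use this code to split my text into sentences: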

import re

def iterphrases(text):
    # Drop the final period, then split on "period + whitespace".
    return re.split(r'\.\s', re.sub(r'\.\s*$', '', text))
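
As a quick check of what this splitter returns, here is a minimal sketch. The `sample` string is made up for illustration; the function body is the same as above.

```python
import re

def iterphrases(text):
    # Drop the final period, then split on "period + whitespace".
    return re.split(r'\.\s', re.sub(r'\.\s*$', '', text))

# Hypothetical sample text, not taken from the real documents.
sample = 'There is no foo. I have a bar.'
print(iterphrases(sample))  # ['There is no foo', 'I have a bar']
```

Note that the returned phrases no longer carry their trailing period.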

Then I read in the files, and if a keyword is in a file I append the sentence to a dictionary.

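import os
from collections import defaultdict

def find_keywords(OutputFile, keys):
    # Add upper-case, lower-case, and capitalized variants of each keyword, then de-duplicate.
    phrase_combos = keys + [x.upper() for x in keys] + [x.lower() for x in keys] + [x.capitalize() for x in keys]
    keys = list(set(phrase_combos))
    cwd = os.getcwd()
    print('Working in current directory : ', cwd)
    cwdfiles = os.listdir(cwd)

    # Only look at .txt files in the working directory.
    filenames = []
    for item in cwdfiles:
        if item[-4:] == '.txt':
            filenames.append(item)

    out = defaultdict(list)
    for filename in filenames:
        with open(filename) as f:
            text = f.read()
        for phrase in iterphrases(text):
            for keyword in keys:
                # str.find returns -1 when the substring is absent,
                # whereas str.index would raise ValueError.
                neg_pos = phrase.lower().find('no')
                key_pos = phrase.find(keyword)
                if 0 <= neg_pos < key_pos:
                    out[keyword].append((filename, phrase))
    my_dict = dict(out)
    return my_dict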

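I have done some work on this and it has worked well for a long time, but now I need to find things that are stated as not being something. I can find many phrases, but some have other words in between, so they will not match exactly if my phrase is the word foo.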

No. Not foo. Not foo or bar. No foo and no bar. are all in my dictionary, but I also need:

Not bar or foo. Not bar or foo or banana. Not bar or banana or foo. Not bar, banana, or foo. Not bar, foo, or banana. 

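to be results as well. Right now these cannot match, because in "bar foo" the word foo is not next to a negation word. Is there a way to say 'match if a negation word occurs, no matter how many other words sit between it and the word/phrase of interest, as long as they are in the same sentence'?

For example, given something like this: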

This is a group of Text. There is no foo. There is no bar. There is no foo 
or bar. There is no bar or foo. I have coffee. I have a bar. No bar for you. 

it should return:      {'bar': There is no bar. , There is no bar or foo. , There is no foo or bar. , No bar for you.}

1 Answer:

Answer 0: (score: 1)

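Try searching with a regular expression. You can search for a list of keywords and pair them with a list of negation words. The trick is to compile a regex that searches inside your sentence for 'a negation word somewhere before my keyword'. That means: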

re.compile(r'\b{!s}\b.+\b{!s}\b'.format(neg, keyword), re.I)

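\b means 'word boundary'. So the pattern is a whole word, followed by one or more arbitrary characters (.+), followed by another whole word. With format we fill in the negation word and the keyword. re.I sets the ignore-case flag.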
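
To see what the word boundaries buy you, here is a small sketch of the pattern on a few sentences. The helper name `negation_regex` is made up for this illustration, and the `re.escape` calls are an added precaution for keywords that contain regex metacharacters, not part of the pattern above.

```python
import re

def negation_regex(neg, keyword):
    # Hypothetical helper: "negation word, then anything, then keyword".
    # re.escape is an added safeguard, not part of the original pattern.
    return re.compile(
        r'\b{!s}\b.+\b{!s}\b'.format(re.escape(neg), re.escape(keyword)),
        re.I,
    )

regex = negation_regex('no', 'foo')

print(bool(regex.search('There is no bar or foo')))         # True
print(bool(regex.search('Anonymous foo')))                  # False, 'no' is inside a word
print(bool(regex.search('Nonono, this is the wrong foo')))  # False, same reason
print(bool(regex.search('I have a foo but no bar')))        # False, negation comes after foo
```

Because .+ is unbounded, any number of words may sit between the negation word and the keyword, as long as both are in the string being searched.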

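Using all of your examples plus a few more that I assume you do not want to match, such as 'Nonono this is the wrong foo' or 'Anonymous foo ...', I came up with the following, which should give you a starting point: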

import re
text = 'Not foo. Not No foo. Not foo or bar. No foo and no bar. Not bar or foo. Not bar or foo or banana. Not bar or banana or foo. Not bar, banana, or foo. Not bar, foo, or banana. This is a group of Text. There is no foo. There is no bar. There is no foo or bar. There is no bar or foo. I have coffee. I have a bar. No bar for you. Nonono, this is the wrong foo. Nono this is also a wrong foo. Anonymous foo.'
keywords = ['foo']
negated = ['no', 'not']

phraselist = re.split(r'\.\s', text)

out = {}

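for phrase in phraselist:
    for keyword in keywords:
        for neg in negated:
            # Negation word, then anything, then the keyword, all within one phrase.
            regex = re.compile(r'\b{!s}\b.+\b{!s}\b'.format(neg, keyword), re.I)
            # re.I already ignores case, so the phrase does not need lower().
            if regex.search(phrase):
                matches = out.setdefault(keyword, [])
                # Several negation words can hit the same phrase; store it once.
                if phrase not in matches:
                    matches.append(phrase)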

print(out)

expected = 'Not foo. Not No foo. Not foo or bar. No foo and no bar. Not bar or foo. Not bar or foo or banana. Not bar or banana or foo. Not bar, banana, or foo. Not bar, foo, or banana. There is no foo. There is no foo or bar. There is no bar or foo.'
print(expected)

The output is:

{'foo': ['Not foo', 'Not No foo', 'Not foo or bar', 'No foo and no bar', 'Not bar or foo', 'Not bar or foo or banana', 'Not bar or banana or foo', 'Not bar, banana, or foo', 'Not bar, foo, or banana', 'There is no foo', 'There is no foo or bar', 'There is no bar or foo']}
Not foo. Not No foo. Not foo or bar. No foo and no bar. Not bar or foo. Not bar or foo or banana. Not bar or banana or foo. Not bar, banana, or foo. Not bar, foo, or banana. There is no foo. There is no foo or bar. There is no bar or foo.
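
If you want this as a reusable function, one possible variation is to fold all negation words into a single alternation, so each phrase is searched once per keyword. This is only a sketch under my own assumptions: the name `find_negated`, its parameters, and the use of `re.escape` are not from the code above.

```python
import re
from collections import defaultdict

def find_negated(text, keywords, negations=('no', 'not')):
    # Hypothetical helper: map each keyword to the phrases in which
    # a negation word appears somewhere before it.
    neg_alt = '|'.join(re.escape(n) for n in negations)
    out = defaultdict(list)
    # Same sentence splitting as in the question: drop the final period,
    # then split on "period + whitespace".
    for phrase in re.split(r'\.\s', re.sub(r'\.\s*$', '', text)):
        for keyword in keywords:
            pattern = r'\b(?:{})\b.+\b{}\b'.format(neg_alt, re.escape(keyword))
            if re.search(pattern, phrase, re.I):
                if phrase not in out[keyword]:
                    out[keyword].append(phrase)
    return dict(out)

text = ('This is a group of Text. There is no foo. There is no bar. '
        'There is no foo or bar. There is no bar or foo. I have coffee. '
        'I have a bar. No bar for you.')
result = find_negated(text, ['bar'])
print(result)
# {'bar': ['There is no bar', 'There is no foo or bar', 'There is no bar or foo', 'No bar for you']}
```

This is still a purely lexical check: a sentence like 'I have no time but I have a bar' would also be reported for bar, because a negation word precedes the keyword even though it does not negate it.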