Text processing to get conditional types from a string

Time: 2019-05-07 05:23:33

Tags: regex python-3.x text-parsing python-textprocessing

First of all, sorry for the odd question title. I couldn't express it in one line.

So, the problem statement is:

If I am given the following string:

"('James Gosling'/jamesgosling/james gosling) , ('SUN Microsystem'/sunmicrosystem), keyword"

I have to parse it into:

list1 = ["'James Gosling'", 'jamesgosling', 'james gosling']

list2 = ["'SUN Microsystem'", 'sunmicrosystem']

list3 = [list1, list2, 'keyword']

So if I enter James Gosling Sun Microsystem keyword, it should tell me that my input is 100% correct.

If I enter J Gosling Sun Microsystem keyword, it should say that I am only 66.66% correct.

This is what I have tried so far.

import re

def main():
    print("starting")
    sentence = "('James Gosling'/jamesgosling/jame gosling) , ('SUN Microsystem'/sunmicrosystem), keyword"
    splited = sentence.split(",")
    number_of_primary_keywords = len(splited)
    #print(number_of_primary_keywords, "primary keywords length")
    number_of_brackets = 0
    inside_quotes = ''
    inside_quotes_1 = ''
    inside_brackets = ''
    for n in range(len(splited)):
        #print(len(re.findall('\w+', splited[n])), "length of splitted")
        inside_brackets = splited[n][splited[n].find("(") + 1: splited[n].find(")")]
        synonyms = inside_brackets.split("/")
        for x in range(len(synonyms)):
            try:
                inside_quotes_1 = synonyms[x][synonyms[x].find("\"") + 1: synonyms[n].find("\"")]
                print(inside_quotes_1)
            except:
                pass
            try:
                inside_quotes = synonyms[x][synonyms[x].find("'") + 1: synonyms[n].find("'")]
                print(inside_quotes)
            except:
                pass
            #print(synonyms[x])
        number_of_brackets += 1

    print(number_of_brackets)


if __name__ == '__main__':
    main()

The output is as follows:

'James Gosling

jamesgoslin

jame goslin

'SUN Microsystem
SUN Microsystem
sunmicrosyste
sunmicrosyste
3

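As you can see, the last letter of some words is missing.

So, if you have read this far, I hope you can help me get the expected output.

1 answer:

Answer 0: (score: 0)

Unfortunately, your code has a logic problem that I couldn't pin down, but it is probably in the following lines: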

inside_quotes_1 = synonyms[x][synonyms[x].find("\"") + 1: synonyms[n].find("\"")]

inside_quotes = synonyms[x][synonyms[x].find("'") + 1: synonyms[n].find("'")]

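You could simply write the quote characters as hex escapes (\x22 for the double quote, \x27 for the single quote):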

inside_quotes_1 = synonyms[x][synonyms[x].find("\x22") + 1: synonyms[n].find("\x22")]

inside_quotes = synonyms[x][synonyms[x].find("\x27") + 1: synonyms[n].find("\x27")]
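For what it's worth, the missing last letters seem to come from two details in those slices: the end index is computed from synonyms[n] rather than synonyms[x], and str.find returns -1 when the quote is not present, so the slice ends at -1 and drops the final character. Below is a minimal sketch of one way around it, assuming the goal is just the text inside the quotes. The helper name strip_quotes is made up for illustration and is not part of the original code.

```python
def strip_quotes(text):
    """Return text without surrounding whitespace and one pair of matching quotes."""
    text = text.strip()
    # Slice only when the same quote character both opens and closes the string.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\x22\x27":
        return text[1:-1]
    return text


# Why the last letter disappears: find() returns -1 for a missing substring,
# and a slice that ends at -1 stops one character before the end.
word = "jamesgosling"
print(word.find("'"))          # -1
print(word[0:word.find("'")])  # jamesgoslin

synonyms = "'James Gosling'/jamesgosling/jame gosling".split("/")
print([strip_quotes(s) for s in synonyms])
# ['James Gosling', 'jamesgosling', 'jame gosling']
```

Checking the first and last character directly avoids find() altogether, so an unquoted synonym is returned unchanged instead of being truncated.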

Beyond that, it seems you want to extract the words along with their indices; a basic expression is enough to extract the words:

(\w+)

Then you may want to find a simple way to get the index, that is, the position of each word, and associate each word with the index you need.
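One possible way to pair each word with its position is re.finditer, whose match objects expose start(). This is only a sketch: it assumes "index" means the character offset in the string, and the variable name words_with_positions is illustrative.

```python
import re

string = "('James Gosling'/jamesgosling/james gosling) , ('SUN Microsystem'/sunmicrosystem), keyword"

# re.finditer yields one match object per word; start() is its character offset.
words_with_positions = [
    (m.start(), m.group(1)) for m in re.finditer(r"(\w+)", string)
]

for position, word in words_with_positions:
    print(position, word)
```

If you need the ordinal position of each word instead of the character offset, wrapping the same iterator in enumerate() would give that.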


Sample test

# -*- coding: UTF-8 -*-
import re

string = "('James Gosling'/jamesgosling/james gosling) , ('SUN Microsystem'/sunmicrosystem), keyword"
expression = r'(\w+)'
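# re.search returns only the first match, here the word "James".
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(1) + "\" is a match  ")
else:
    print(' Sorry! No matches! Something is not right! Call 911 ')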
    print(' Sorry! No matches! Something is not right! Call 911 ')
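To get closer to the output the question asks for, here is a rough sketch that parses the string into synonym groups and then scores an input against them. Several things in it are assumptions rather than anything stated in the question: a bare word such as keyword is treated as a one-item group, matching is case-insensitive and ignores quote characters, and a group counts as correct when any of its synonyms appears as whole words in the input. The names parse_groups, normalize and score are made up for illustration.

```python
import re


def parse_groups(spec):
    """Split the spec into groups of synonyms; a bare word forms a one-item group."""
    groups = []
    # Match either a parenthesised group "(a/b/c)" or bare text between commas.
    for match in re.finditer(r"\(([^)]*)\)|([^,()]+)", spec):
        body = match.group(1) if match.group(1) is not None else match.group(2)
        synonyms = [part.strip() for part in body.split("/") if part.strip()]
        if synonyms:  # skip the whitespace-only pieces around the commas
            groups.append(synonyms)
    return groups


def normalize(text):
    """Lowercase, drop quote characters, and collapse runs of whitespace."""
    text = text.replace("'", "").replace('"', "")
    return " ".join(text.lower().split())


def score(spec, user_input):
    """Percentage of groups for which at least one synonym occurs in user_input."""
    groups = parse_groups(spec)
    if not groups:
        return 0.0
    # Pad with spaces so that only whole words can match.
    haystack = f" {normalize(user_input)} "
    hits = sum(
        any(f" {normalize(s)} " in haystack for s in synonyms)
        for synonyms in groups
    )
    return 100.0 * hits / len(groups)


spec = "('James Gosling'/jamesgosling/james gosling) , ('SUN Microsystem'/sunmicrosystem), keyword"
print(parse_groups(spec))
print(round(score(spec, "James Gosling Sun Microsystem keyword"), 2))  # 100.0
print(round(score(spec, "J Gosling Sun Microsystem keyword"), 2))      # 66.67
```

Two of the three groups match in the second input, so the score is 2/3; rounding to two decimals gives 66.67 rather than the truncated 66.66 mentioned in the question.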