How do I write a regular expression that matches pronouns in a text file?

Time: 2019-05-29 11:40:35

Tags: python regex

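I am trying to write a program that calculates the pronoun/proper noun ratio.

I look for tokens that start with an uppercase letter to match proper nouns, and I use a second pattern to match pronouns. However, my RE for pronouns does not work well: it matches not only the pronouns but also words that contain the characters of a pronoun. See the code below: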

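import re
from pathlib import Path

from nltk.tokenize import wordpunct_tokenize


def pron_propn():
    # Keep asking until both the text file and the dictionary file open.
    while True:
        try:
            file_to_open = Path(input("\nPlease, insert your file path: "))
            dic_to_open = Path(input("\nPlease, insert your dictionary path: "))
            with open(file_to_open, "r", encoding="utf-8") as f:
                words = wordpunct_tokenize(f.read())
            with open(dic_to_open, "r", encoding="utf-8") as d:
                dic = wordpunct_tokenize(d.read())
            break
        except FileNotFoundError:
            print("\nFile not found. Better try again")

    # Capitalised or all-caps tokens are treated as proper-noun candidates.
    patt = re.compile(r"^[A-Z][a-z]+\b|^[A-Z]+\b")
    c_n = list(filter(patt.match, words))

    # Pronoun pattern as originally written; this is the one that over-matches.
    patt2 = re.compile(r"\bhe|she|it+\b")
    pronouns = list(filter(patt2.match, words))

    propn_new = []
    propn = []
    other = []
    pron = []

    # Sort every token into proper-noun candidates or pronouns.
    for i in words:
        if i in c_n:
            propn.append(i)
        elif i in pronouns:
            pron.append(i)

    # Capitalised tokens found in the dictionary are not proper nouns.
    for j in propn:
        if j not in dic:
            propn_new.append(j)
        else:
            other.append(j)

    print(propn_new)
    print(pron)
    print(len(pron) / len(propn))


pron_propn()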

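When I print the list of pronouns, I get: ['he', 'he', 'he', 'he', 'hearing', 'he', 'it', 'hear', 'it', 'he', 'it']

But I would like a list such as: ['he', 'he', 'he', 'he', 'he', 'it', 'it', 'he', 'it']

I would also like to get the result of the division: the number of pronouns divided by the number of proper nouns.

Can anyone help me capture only the pronouns?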

1 answer:

Answer 0: (score: 0)

We can use a capturing group with word boundaries and add the pronouns we want to it, with an expression similar to:

(\b(s?he|it)\b)

We can add more constraints if we wish.
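For context on why the pattern in the question over-matches: alternation binds more loosely than everything else in a regex, so `\bhe|she|it+\b` is read as the three alternatives `\bhe`, `she` and `it+\b`, and `match` only anchors at the start of each token. Below is a minimal sketch comparing the two patterns on single tokens. The token list is made up for illustration, and using `fullmatch` on tokens is my assumption about how the grouped pattern would be applied to already-tokenized input.

```python
import re

# Hypothetical tokens standing in for the output of wordpunct_tokenize.
tokens = ["he", "hearing", "she", "sheep", "it", "item", "hear", "He"]

# Pattern from the question: parsed as `\bhe` OR `she` OR `it+\b`.
loose = re.compile(r"\bhe|she|it+\b")

# Grouped alternation: the boundaries now apply to every alternative.
strict = re.compile(r"\b(s?he|it)\b")

# match() only requires the pattern to succeed at the start of the token.
loose_hits = [t for t in tokens if loose.match(t)]

# fullmatch() requires the whole token to be a pronoun.
strict_hits = [t for t in tokens if strict.fullmatch(t)]

print(loose_hits)
print(strict_hits)
```

Both patterns are case-sensitive, so a sentence-initial "He" is skipped unless `re.IGNORECASE` is passed.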

Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"(\b(s?he|it)\b)"

test_str = "Anything she wish before it. Anything he wish after it. Then, we repeat. Anything she wish before it. Anything he wish after it. Then, we repeat. Anything she wish before it. Anything he wish after it. Then, we repeat. Anything she wish before it. Anything he wish after it. Then, we repeat. Anything she wish before it. Anything he wish after it. Then, we repeat. "

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):

    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))

    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1

        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

Then we can script the rest of the problem: count the pronouns, count all the words, and simply divide the two counts to get the ratio.
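The counting step described above could be sketched roughly as follows. The function name `pronoun_ratio`, the word pattern `\b\w+\b` and the sample sentence are assumptions made for illustration, not part of the original answer. To get the pronoun/proper noun ratio the question asks for, the denominator would be swapped for the proper-noun count.

```python
import re

PRONOUN_RE = re.compile(r"\b(s?he|it)\b")  # expression from the answer
WORD_RE = re.compile(r"\b\w+\b")           # assumed definition of a "word"


def pronoun_ratio(text):
    """Return (pronoun_count, word_count, ratio) for the given text."""
    pronoun_count = sum(1 for _ in PRONOUN_RE.finditer(text))
    word_count = sum(1 for _ in WORD_RE.finditer(text))
    # Guard against empty input so the division cannot fail.
    ratio = pronoun_count / word_count if word_count else 0.0
    return pronoun_count, word_count, ratio


sample = "Anything she wish before it. Anything he wish after it. Then, we repeat."
pronoun_count, word_count, ratio = pronoun_ratio(sample)
print(pronoun_count, word_count, ratio)
```

Returning the raw counts alongside the ratio makes it easy to check the numbers by hand on a short sample before running the function on a full file.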


RegEx Circuit

jex.im visualizes regular expressions.