Question

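I have a program that displays a frequency list of the words in a text (tokenized text), but I want to do two more things. First, detect the proper nouns in the text and append them to a separate list (Cap_nouns). Second, append the nouns that are not in the dictionary to another list (errors).

Later, I want to build one frequency list for those errors and another for the proper nouns that were found.

My idea for detecting proper nouns is to find the items that start with a capital letter and append them to this list, but the regular expression I wrote for this does not seem to work.

Can anyone help? My code is below.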

from collections import defaultdict
import re
import nltk
from nltk.tokenize import word_tokenize



with open('fr-text.txt') as f:
    freq = word_tokenize(f.read())

with open ('Fr-dictionary_Upper_Low.txt') as fr:
    dic = word_tokenize(fr.read())


#regular expression to detect words with apostrophes and separated by hyphens    
pat=re.compile(r".,:;?!-'%|\b(\w'|w’)+\b|\w+(?:-\w+)+|\d+") 
reg= list(filter(pat.match, freq))
#regular expression for words that start with a capital letter
patt=re.compile(r"\b^A-Z\b")  
c_n= list(filter(patt.match, freq))

d=defaultdict(int)

#Empty list to append the items not found in the dictionary
errors=[ ]
Cnouns=[ ] #Empty list to append the items starting with a capital letter


for w in freq:
    d[w]+=1
    if w in reg:
        continue
    elif w in c_n:
        Cnouns.append(w)
    elif w not in dic:
        errors.append(w)



for w in sorted(d, key=d.get):
    print(w, d[w])


print(errors)
print(Cnouns)

If there are any other problems with my code, please let me know.

Answer 1

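As for the regex part, your patterns are "a bit off". Mostly, you are missing the concept of a character class: [abc] is a pattern that matches a single char from the set defined in the class.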
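To make the character class point concrete, here is a small sketch comparing the pattern from the question with a bracketed version. The token list is made up for illustration and is not taken from the asker's files.

```python
import re

# Pattern from the question: without brackets, "A-Z" is the literal text
# "A-Z", and "^" is a start-of-string anchor, not a negation.
literal = re.compile(r"\b^A-Z\b")
# With brackets, [A-Z] is a character class: one uppercase ASCII letter.
char_class = re.compile(r"\b[A-Z]")

# Hypothetical tokens, for illustration only.
tokens = ["Paris", "ville", "A-Z", "Marie"]

literal_hits = [t for t in tokens if literal.match(t)]
class_hits = [t for t in tokens if char_class.match(t)]

print(literal_hits)  # ['A-Z']
print(class_hits)    # ['Paris', 'A-Z', 'Marie']
```

The unbracketed pattern only accepts a token that literally reads A-Z, which is why c_n comes out empty on ordinary text.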

A regex to detect words with apostrophes and words separated by hyphens:

pat=re.compile(r"(?:\w+['’])?\w+(?:-(?:\w+['’])?\w+)*")

See the regex demo. However, it will also match regular numbers and simple words. To avoid matching them, you can use

pat=re.compile(r"(?:\w+['’])?\w+(?:-(?:\w+['’])?\w+)+|\w+['’]\w+")

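See this regex demo.

Details

(?:\w+['’])? - an optional non-capturing group matching 1 or 0 occurrences of 1 or more word characters followed by ' or ’
\w+ - 1 or more word characters
(?:-(?:\w+['’])?\w+)* - 0 or more repetitions of:
- a literal hyphen (-)
- (?:\w+['’])? - an optional non-capturing group matching 1 or 0 occurrences of 1 or more word characters followed by ' or ’
- \w+ - 1 or more word characters

In the second pattern, the same group is quantified with + instead of *, so at least one hyphenated part is required, and the alternative \w+['’]\w+ covers apostrophe words that contain no hyphen.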
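The difference between the two patterns can be checked with a short sketch. The tokens are hypothetical, and fullmatch is used here only so that the whole token has to fit the pattern; that is a choice made for this example, not part of the answer.

```python
import re

# First pattern: also accepts plain words and numbers.
loose = re.compile(r"(?:\w+['’])?\w+(?:-(?:\w+['’])?\w+)*")
# Second pattern: requires a hyphen or an apostrophe.
strict = re.compile(r"(?:\w+['’])?\w+(?:-(?:\w+['’])?\w+)+|\w+['’]\w+")

# Hypothetical tokens, for illustration only.
tokens = ["l'homme", "arc-en-ciel", "aujourd’hui", "maison", "2024"]

loose_hits = [t for t in tokens if loose.fullmatch(t)]
strict_hits = [t for t in tokens if strict.fullmatch(t)]

print(loose_hits)   # all five tokens
print(strict_hits)  # ["l'homme", 'arc-en-ciel', 'aujourd’hui']
```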

Next, reg = list(filter(pat.match, freq)) may not do what you need, because re.match only matches at the start of the string. You most likely want to use re.search:

reg = list(filter(pat.search, freq))
                      ^^^^^^
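A minimal sketch of the difference, assuming a token that carries a leading punctuation character (the token and the simplified pattern are invented for this example):

```python
import re

# Simplified hyphenated-word pattern, for illustration only.
pat = re.compile(r"\w+(?:-\w+)+")

# Hypothetical token with a leading parenthesis.
token = "(arc-en-ciel"

at_start = pat.match(token)   # anchored at index 0, where "(" is not \w
anywhere = pat.search(token)  # scans forward until a match is found

print(at_start)          # None
print(anywhere.group())  # arc-en-ciel
```

When every token already begins with the text of interest, match and search give the same result, so the switch only matters for tokens like the one above.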

A regex for words starting with a capital letter can be written as

patt=re.compile(r"\b[A-Z][a-z]*\b")  
c_n= list(filter(patt.search, freq))

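See this regex demo.

\b matches a word boundary, [A-Z] matches any uppercase ASCII letter, [a-z]* matches 0 or more lowercase ASCII letters, and the final \b ensures there is a word boundary after them.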
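Putting this pattern to work, the two frequency lists asked for in the question could be built roughly as follows. This is only a sketch: the token list and the dictionary set are hypothetical stand-ins for the two files, and collections.Counter is swapped in for the defaultdict used in the question. Note that [A-Z] is ASCII only, so a token such as Émile would not be picked up by this pattern.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the tokenized text and the dictionary file.
tokens = ["Marie", "habite", "à", "Paris", "et", "Marie", "aime", "habitte", "aime"]
dictionary = {"habite", "à", "et", "aime"}  # a set gives fast lookups

patt = re.compile(r"\b[A-Z][a-z]*\b")

cap_nouns = Counter()  # frequency of capitalized tokens
errors = Counter()     # frequency of tokens missing from the dictionary
for tok in tokens:
    if patt.search(tok):
        cap_nouns[tok] += 1
    elif tok not in dictionary:
        errors[tok] += 1

print(cap_nouns.most_common())  # [('Marie', 2), ('Paris', 1)]
print(errors.most_common())     # [('habitte', 1)]
```

Counting directly while looping avoids the repeated list membership tests (w in reg, w in c_n) from the original code.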
