Question

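I am trying to iterate over a French text file and count the words that occur in it (including words with accented characters). The following code picks up all the words, but it does not handle accented characters: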

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

wordcount={}

f = open("verbatim2.txt", "r")
regex = re.compile(r'\b\w{4,}\b')
#regex = re.compile(r'[A-Z]\p{L}+\s*')

for line in f:
    words = regex.findall(line)
    for word in words:
        print word
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1

for k,v in wordcount.items():
    print k, v

How can I correctly include accented characters in my "wordcount" dictionary?

Thanks!

Answer 1

To count/tally words of four or more characters without using a regular expression:

import collections
d = collections.Counter()

with open('file') as f:
    for line in f:
        line = line.strip()
        line = line.split()
        # Keep only words of four or more characters
        words = (word for word in line if len(word) >= 4)
        d.update(words)
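As a rough illustration of what the whitespace-splitting approach produces, here is a small sketch in Python 3 that runs on in-memory lines instead of a file. The sample sentences are made up for this example. Note that split() only separates on whitespace, so punctuation stays attached to the neighbouring word.

```python
from collections import Counter

# Hypothetical sample lines, standing in for the contents of the text file
lines = [
    "L'été dernier, les élèves",
    "étaient très fatigués. Les élèves dormaient.",
]

counts = Counter()
for line in lines:
    # split() cuts on whitespace only; keep tokens of four or more characters
    counts.update(word for word in line.split() if len(word) >= 4)

print(counts["élèves"])    # 2
print(counts["dernier,"])  # 1 (the comma is still part of the token)
print(counts["dernier"])   # 0
```

If tokens such as "dernier," and "dernier" should be counted together, the punctuation would have to be stripped first, or a regular expression used instead.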

From the documentation for \w:

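When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.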

If you want to stick with regular expressions, add flags=re.UNICODE.
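To make the effect of the flag concrete, here is a minimal sketch, assuming Python 3 and a made-up sample sentence. In Python 3, str patterns are Unicode-aware by default, so re.UNICODE is redundant there but harmless; re.ASCII is used only to reproduce the ASCII-only behaviour described in the question.

```python
import re

# Hypothetical sample sentence with accented characters
text = "L'été dernier, les élèves étaient très fatigués."

# \w follows the Unicode character database
unicode_words = re.compile(r'\b\w{4,}\b', flags=re.UNICODE)
# \w is restricted to [a-zA-Z0-9_], so accented letters act as word breaks
ascii_words = re.compile(r'\b\w{4,}\b', flags=re.ASCII)

unicode_matches = unicode_words.findall(text)
ascii_matches = ascii_words.findall(text)

print(unicode_matches)  # ['dernier', 'élèves', 'étaient', 'très', 'fatigués']
print(ascii_matches)    # ['dernier', 'taient', 'fatigu']
```

In ASCII mode the accented letters split words into fragments, which is why words such as "élèves" disappear from the count entirely.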

Answer 2

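Sticking as closely as possible to your code (and fixing the syntax and usage errors), I arrived at the following. As noted above, this has already been answered in "Python + Regex + UTF-8 doesn't recognize accents".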

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

wordcount={}

f = open("verbatim2.txt", "r")
regex = r'\b\w{4,}\b'
#regex = re.compile(r'[A-Z]\p{L}+\s*')

for line in f:
    # Decode the raw bytes to unicode so that re.UNICODE applies to \w
    words = re.findall(regex, line.decode('utf8'), re.UNICODE)
    for word in words:
        print word
        if word not in wordcount:
            wordcount[word] = 1
        else:
            # Increment once per occurrence
            wordcount[word] += 1
print wordcount
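The code above targets Python 2. For comparison, here is a sketch of how the same count might look in Python 3, where opening the file with an explicit encoding yields str lines and \w already matches accented letters. The function name count_words and the default UTF-8 encoding are assumptions made for this example, not something from the original posts.

```python
import re
from collections import Counter

# Words of four or more characters; Unicode-aware by default for str in Python 3
WORD_RE = re.compile(r'\b\w{4,}\b')


def count_words(path, encoding="utf-8"):
    """Return a Counter of words with four or more characters in the file.

    `count_words` is a hypothetical helper; the encoding is assumed to be UTF-8.
    """
    counts = Counter()
    # Decoding happens in open(), so each line is already a str
    with open(path, encoding=encoding) as f:
        for line in f:
            counts.update(WORD_RE.findall(line))
    return counts
```

Calling count_words("verbatim2.txt").most_common(10) would then list the ten most frequent words, assuming the file is in fact encoded as UTF-8.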

Parsing a text file with accented characters [Python]
