Question

I have a simple word counter that works, with one exception: it is splitting on the \n character.

A small sample text file is:

'''
A tree is a woody perennial plant,typically with branches.
I added this second line,just to add eleven more words.
'''

Line 1 has 10 words and line 2 has 11 words. Total word count = 21.

This code produces a count of 22, because it includes the \n character at the end of line 1:

import re


testfile = "d:\\python\\workbook\\words2.txt"

number_of_words = 0

with open(testfile, "r") as datafile:
    for line in datafile:
        number_of_words += len(re.split(",|\s", line))

print(number_of_words)

If I change my regex to: number_of_words += len(re.split(",|^\n|\s", line)) the word count (22) stays the same.

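My question is: why does excluding the newline with [^\n] fail? Or, more broadly, what is the correct way to write my regex so that it excludes the trailing \n and the code above arrives at the correct total of 21 words?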

Answer 1

You can simply use:

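import re

number_of_words = 0
with open(testfile, "r") as datafile:  # testfile as defined in the question
    for line in datafile:
        # Count runs of word characters instead of splitting on separators
        number_of_words += len(re.findall(r'\w+', line))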

The pattern \w+ matches only runs of word characters, so the newline is not included in the count.
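
To see where the extra count comes from, here is a minimal sketch comparing the two approaches on the sample text. It assumes the file has no newline after the last line, and the in-memory `lines` list is a stand-in for the file rather than part of the original post. Note that `^\n` inside an alternation is a start-of-string anchor followed by a newline, not the negated class `[^\n]`, so adding it does not change the split result.

```python
import re

# Stand-in for the sample file (assumption: no trailing newline after line 2).
lines = [
    "A tree is a woody perennial plant,typically with branches.\n",
    "I added this second line,just to add eleven more words.",
]

# re.split cuts at every separator, including the final "\n" of line 1,
# which leaves an empty string as the last element of the result.
split_tokens = re.split(r",|\s", lines[0])
print(split_tokens[-1] == "")  # True: the extra "word" is an empty string

# Adding "^\n" as an alternative changes nothing: the newline is at the end
# of the line, and "\s" already matches it there.
split_total = sum(len(re.split(r",|\s", line)) for line in lines)
alt_total = sum(len(re.split(r",|^\n|\s", line)) for line in lines)

# re.findall returns only the matches themselves, so no empty strings appear.
findall_total = sum(len(re.findall(r"\w+", line)) for line in lines)

print(split_total, alt_total, findall_total)  # 22 22 21
```

If splitting is preferred over matching, discarding empty strings from the `re.split` result before counting would give the same total of 21 for this sample.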
