Question

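I am running some processing tasks on a medium-sized (1.7 MB) Persian text corpus. I want to list three groups of characters found in the text:

letters,
whitespace (including newlines, tabs, spaces, non-breaking spaces, etc.), and
punctuation. I wrote this: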

# -*- coding:  utf8 -*-
TextObj = open ('text.txt', 'r', encoding = 'UTF8')
import string
LCh = LSpc = LPunct = []
TotalCh = TotalPunct = TotalSpc = 0
TempSet = 'ابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی'
#TempSet variable holds alphabets of Persian language.
ReadObj = TextObj.read ()
for Char in ReadObj:
    if Char in TempSet: #This's supposed to count & extract alphabets only.
        TotalCh += 1 
        LCh.append (Char)
    elif Char in string.punctuation: #This's supposed to count puncts.
        TotalPunct += 1
        LPunct.append (Char)
    elif Char in ('', '\n', '\t'): #This counts & extracts spacey things.
        TotalSpc += 1
        LSpc.append (Char)
    else: #This'll ignore anything else.
        continue

But when I try:

print (LPunct)
print (LSpc)

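I tried this code on both Linux and Windows 7. On both systems the result is not what I expected: the punctuation list and the whitespace list both contain Persian letters.

Another question:

How can I improve the condition elif Char in ('', '\n', '\t'): so that it covers every kind of whitespace character?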

Answer 1

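On the line that creates the lists (LCh = LSpc = LPunct = []), you have assigned all three names to the same list!

Don't do this: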

LCh = LSpc = LPunct = []

Do this instead:

LCh = []
LSpc = []
LPunct = []
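
To see why the chained assignment misbehaves, here is a minimal sketch (the names shared_a, shared_b, separate_a and separate_b are illustrative and do not come from the question). Chained assignment binds every name to one list object, so an append made through any of the names is visible through all of them:

```python
# Chained assignment: both names refer to the SAME list object.
shared_a = shared_b = []
shared_a.append('x')
same_object = shared_a is shared_b

# Separate list literals: each name gets its own list.
separate_a, separate_b = [], []
separate_a.append('x')

print(same_object)  # True
print(shared_b)     # ['x'], although only shared_a was appended to
print(separate_b)   # []
```

This is why every character appended in the question ended up in all three lists.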

The string module has a built-in whitespace constant.

elif Char in string.whitespace:
    TotalSpc += 1
    LSpc.append (Char)

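In your example you did not actually put a space inside the '' string, which may also be causing it to fail. Shouldn't that be ' '?

Also, taking the other answers into account, this code is not very Pythonic.

I would write it like this: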
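
One caveat worth checking: string.whitespace only contains the ASCII whitespace characters, so it would not match the non-breaking space mentioned in the question. A possible alternative is str.isspace(), which also recognizes Unicode whitespace. The sketch below is an illustration under that assumption; the helper name is_space is hypothetical.

```python
import string

def is_space(ch):
    """Return True for any character Python treats as whitespace."""
    return ch.isspace()

nbsp = '\u00a0'   # NO-BREAK SPACE
zwnj = '\u200c'   # ZERO WIDTH NON-JOINER, common in Persian text

samples = [' ', '\t', '\n', nbsp]
ascii_only = [ch in string.whitespace for ch in samples]
unicode_aware = [is_space(ch) for ch in samples]

print(ascii_only)      # [True, True, True, False]
print(unicode_aware)   # [True, True, True, True]
print(is_space(zwnj))  # False: ZWNJ is a format character, not whitespace
```

If the zero-width non-joiner should be counted with the whitespace group, it would need to be tested for explicitly.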

# -*- coding:  utf8 -*-
import fileinput
import string
persian_chars = 'ابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی'
filename = 'text.txt'
persian_list = []
punctuation_list = []
whitespace_list = []
ignored_list = []

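# Decode explicitly as UTF-8 instead of relying on the platform default.
for line in fileinput.input(filename, openhook=fileinput.hook_encoded('utf-8')):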
    for ch in line:
        if ch in persian_chars:
            persian_list.append(ch)
        elif ch in string.punctuation:
            punctuation_list.append(ch)
        elif ch in string.whitespace:
            whitespace_list.append(ch)
        else:
            ignored_list.append(ch)

total_persian, total_punctuation, total_whitespace = \
    map(len, [persian_list, punctuation_list, whitespace_list])
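
Note that string.punctuation holds only ASCII punctuation, so Persian marks such as the Arabic comma or question mark would land in the ignored list. One possible workaround, which is an assumption on my part and not part of the answer above, is to test the Unicode general category with unicodedata.category. The helper name is_punctuation is hypothetical.

```python
import string
import unicodedata

def is_punctuation(ch):
    """True if the Unicode general category of ch starts with 'P'."""
    return unicodedata.category(ch).startswith('P')

# Arabic comma, Arabic semicolon, Arabic question mark
persian_marks = ['\u060c', '\u061b', '\u061f']

in_ascii_table = [ch in string.punctuation for ch in persian_marks]
by_category = [is_punctuation(ch) for ch in persian_marks]

print(in_ascii_table)  # [False, False, False]
print(by_category)     # [True, True, True]
```

Some members of string.punctuation, for example '+' and '$', are classified as symbols (category 'S') rather than punctuation, so the two checks may need to be combined depending on what should count.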

Answer 2

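First, as a more Pythonic way of handling files, you are better off opening the file with a with statement, which closes the file at the end of the block.

Second, since you want to count the special characters in the text and store them separately, you can use a dictionary with the list names as keys and lists of the corresponding characters as values. Then use the len function to get each length.

Finally, to test whether a character is whitespace, you can check membership in string.whitespace.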

import string
TempSet = 'ابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی'
# Create each list up front so that append() never hits a missing key.
result_dict = {'LCh': [], 'LPunct': [], 'LSpc': []}
with open('text.txt', 'r', encoding='UTF8') as TextObj:
    ReadObj = TextObj.read()
    for ch in ReadObj:
        if ch in TempSet:
            result_dict['LCh'].append(ch)
        elif ch in string.punctuation:
            result_dict['LPunct'].append(ch)
        elif ch in string.whitespace:
            result_dict['LSpc'].append(ch)

TotalCh = len(result_dict['LCh'])
TotalPunct = len(result_dict['LPunct'])
TotalSpc = len(result_dict['LSpc'])
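
A possible variation on the dictionary approach is collections.defaultdict(list), which creates each list the first time its key is used. The sketch below works on an in-memory string so it can be run without text.txt; the function name classify, the key names and the sample text are all illustrative assumptions.

```python
import string
from collections import defaultdict

PERSIAN_LETTERS = 'ابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی'

def classify(text):
    """Group the characters of text into 'letters', 'punct' and 'space' lists."""
    groups = defaultdict(list)
    for ch in text:
        if ch in PERSIAN_LETTERS:
            groups['letters'].append(ch)
        elif ch in string.punctuation:
            groups['punct'].append(ch)
        elif ch.isspace():
            groups['space'].append(ch)
        # Anything else is ignored.
    return groups

groups = classify('سلام بابا!\n')
counts = {name: len(chars) for name, chars in groups.items()}
print(counts)
```

For the sample string this gives 8 letters, 1 punctuation mark and 2 whitespace characters (the space and the newline).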

Python 3: counting and extracting letters from text
