Question

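I wrote a script that iterates over my data and uses a regular expression to check for emoticons; whenever one is found, a counter is incremented. The count for each category should then be written to a list, e.g. category ne has 25 emoticons, category fr has 45, and so on. This is where it goes wrong. The result I get is: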

[1, 'ag', 2, 'dg', 3, 'dg', 4, 'fr', 5, 'fr', 6, 'fr', 7, 'fr', 8, 'hp', 9, 'hp', 10, 'hp', 11, 'hp', 12, 'hp', 13, 'hp', 14, 'hp', 15, 'hp', 16, 'hp', 17, 'hp', 18, 'hp', 19, 'hp', 20, 'hp', 21, 'hp', 22, 'hp', 23, 'hp', 24, 'hp', 25, 'ne', 26, 'ne', 27, 'ne', 28, 'ne', 29, 'ne', 30, 'ne', 31, 'ne', 32, 'ne', 33, 'ne', 34, 'ne', 35, 'ne', 36, 'ne', 37, 'ne', 38]

The fileids have the following form. One large folder contains 7 smaller folders (each folder is a category), and each category folder holds about 100 files:

data/ne/567.txt

The data in each .txt file is just a single sentence that looks like this:

I am so happy today :)

This is my script:

counter = 0
lijst = []  
for fileid in corpus.fileids():
    for sentence in corpus.sents(fileid):
        cat = str(fileid.split('/')[0])
        s = " ".join(sentence)    
        m = re.search('(:\)|:\(|:\s|:\D|:\o|:\@)+', s)
        if m is not None:
            counter +=1
            lijst += [counter] + [cat]

Answer 1

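Your loop appends the running counter together with the category on every match, and the counter is never reset, so the list grows by one pair per emoticon instead of holding one total per category. Keep a separate count for each category instead: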

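import collections
import re

# Raw string keeps the backslashes intact; `:o` and `:@` replace `:\o` and `:\@`
# because Python 3.7+ rejects `\o` as a bad escape.
EMOTICON = re.compile(r'(?::\)|:\(|:\s|:\D|:o|:@)+')

counts = collections.defaultdict(int)  # unseen categories start at 0
for fileid in corpus.fileids():
    cat = fileid.split('/')[0]  # the category is the first path component
    for sentence in corpus.sents(fileid):
        s = " ".join(sentence)
        # Add the number of emoticon runs in this sentence to the category total.
        counts[cat] += len(EMOTICON.findall(s))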
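
To try the idea without the corpus at hand, here is a minimal, self-contained sketch. It makes several assumptions that are not in the question: the `sample` dictionary stands in for `corpus.fileids()` and `corpus.sents()`, the function name `count_emoticons_by_category` is made up for illustration, and the pattern is a simplified one (a colon followed by one of `) ( o @ D`) rather than the original expression.

```python
import re
from collections import Counter

# Simplified, assumed emoticon pattern: a colon followed by one of ) ( o @ D.
EMOTICON = re.compile(r':[)(o@D]')


def count_emoticons_by_category(sentences_by_fileid):
    """Return a Counter mapping category -> number of emoticons found."""
    counts = Counter()
    for fileid, sentences in sentences_by_fileid.items():
        cat = fileid.split('/')[0]  # category is the first path component
        for sentence in sentences:
            counts[cat] += len(EMOTICON.findall(sentence))
    return counts


# Hypothetical sample data standing in for the corpus reader.
sample = {
    'ne/567.txt': ['I am so happy today :)'],
    'ne/568.txt': ['No emoticon here', 'Well :( and :)'],
    'fr/001.txt': ['Great :D'],
    'hp/002.txt': ['Nothing to see'],
}

counts = count_emoticons_by_category(sample)
lijst = sorted(counts.items())  # one (category, total) pair per category
print(lijst)  # [('fr', 1), ('hp', 0), ('ne', 3)]
```

Sorting `counts.items()` gives the one-entry-per-category list the question asks for, instead of one entry per match.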
