Question

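I am trying to write code that reads multiple CSV files from the same folder, tokenizes the sentences in the fulltext column, and then saves the results to separate txt files for further processing.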

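I have written code that does something similar, but it can only read one txt file at a time, process it, and write the output. I would like to scale it up to save time. The code is as follows: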

# -*- coding: utf-8 -*-
import jieba
import logging
from opencc import OpenCC
cc = OpenCC('s2t')

def main():

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

    # load stopwords set
    stopword_set = set()
    with open('/Users/PycharmProjects/pythonProject/gensim/stop.txt', 'r', encoding='utf-8') as stopwords:
        for stopword in stopwords:
            stopword_set.add(stopword.strip('\n'))

    # tokenize the chinese sentences
    output = open('cleantext.txt', 'w', encoding='utf-8')
    with open('/Users/PycharmProjects/pythonProject/raw material/e10.txt', 'r', encoding='utf-8') as content:
        for texts_num, line in enumerate(content):
            # segment the line into words with jieba
            words = jieba.cut(line.strip('\n'), cut_all=False)
            for word in words:
                # convert each word from Simplified to Traditional Chinese
                word = cc.convert(word)
                if word not in stopword_set:
                    # separate the kept words with a space
                    output.write(word + ' ')
            output.write('\n')

            if (texts_num + 1) % 10000 == 0:
                logging.info("finished %d lines of sentences" % (texts_num + 1))
    output.close()

if __name__ == '__main__':
    main()

The CSV files contain several columns: rowid, reason, region, fulltext. Only the fulltext column is needed for tokenization.
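
To make the goal concrete, here is a rough sketch of the structure I have in mind, not a finished solution. It assumes every CSV has a header row with a `fulltext` column and is UTF-8 encoded. The function name `process_csv_folder`, its parameters, and the one-txt-per-CSV naming are all hypothetical choices made for illustration. The tokenizer is passed in as a plain function so the sketch itself only needs the standard library.

```python
import csv
from pathlib import Path


def process_csv_folder(input_dir, output_dir, tokenize, stopwords=frozenset(),
                       column="fulltext"):
    """Tokenize one column of every CSV in input_dir; write one txt per CSV.

    `tokenize` is any callable that maps a string to an iterable of tokens.
    Returns the list of output paths that were written.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # sorted() only makes the processing order predictable
    for csv_path in sorted(input_dir.glob("*.csv")):
        out_path = output_dir / (csv_path.stem + ".txt")
        # newline="" is what the csv module expects for files it reads
        with open(csv_path, "r", encoding="utf-8", newline="") as src, \
                open(out_path, "w", encoding="utf-8") as dst:
            for row in csv.DictReader(src):
                # rows with a missing or empty column produce an empty line
                text = row.get(column) or ""
                tokens = [t for t in tokenize(text)
                          if t.strip() and t not in stopwords]
                dst.write(" ".join(tokens) + "\n")
        written.append(out_path)
    return written
```

With the setup from my script above, the call would presumably look something like `process_csv_folder('raw material', 'clean', lambda s: jieba.lcut(cc.convert(s)), stopword_set)`, where the folder names are placeholders. Each output line corresponds to one CSV row, with the kept tokens separated by spaces.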

I would love to hear any ideas you have to share. Thanks a lot, and have a nice day!

Reading multiple CSV files in Python and outputting txt files

0 Answers: