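I'm trying to write code that reads multiple CSV files from the same folder, tokenizes the sentences in the fulltext column, and finally saves the results to separate txt files for further processing.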
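I've written code that does something similar, but it can only read one txt file at a time, process it, and write the result. I'd like to scale it up to save time. The code is as follows: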
# -*- coding: utf-8 -*-
import jieba
import logging
from opencc import OpenCC

cc = OpenCC('s2t')


def main():
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

    # load stopwords set
    stopword_set = set()
    with open('/Users/PycharmProjects/pythonProject/gensim/stop.txt', 'r', encoding='utf-8') as stopwords:
        for stopword in stopwords:
            stopword_set.add(stopword.strip('\n'))

    # tokenize the chinese sentences
    with open('cleantext.txt', 'w', encoding='utf-8') as output, \
            open('/Users/PycharmProjects/pythonProject/raw material/e10.txt', 'r', encoding='utf-8') as content:
        for texts_num, line in enumerate(content):
            # segment the line into words with jieba; iterating over the raw
            # string would yield single characters instead of words
            words = jieba.cut(line.strip('\n'))
            for word in words:
                # convert Simplified to Traditional Chinese before the stopword check
                word = cc.convert(word)
                if word not in stopword_set:
                    # separate tokens with a space so they stay distinguishable
                    output.write(word + ' ')
            output.write('\n')
            if (texts_num + 1) % 10000 == 0:
                logging.info("finished %d lines of sentences" % (texts_num + 1))


if __name__ == '__main__':
    main()
Each CSV contains several columns: rowid, reason, region, fulltext. Only the fulltext column is needed for tokenization.
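To make the goal concrete, the multi-file version might look roughly like the sketch below. It is only an illustration under some assumptions: the CSVs are UTF-8 encoded with a header row that includes `fulltext`, and one txt file is written per CSV, with one line per row. The name `tokenize_csv_folder` and its parameters are hypothetical, and the tokenizer is passed in as a callable so that `jieba.cut` (or anything else) can be plugged in.

```python
import csv
from pathlib import Path


def tokenize_csv_folder(in_dir, out_dir, tokenize, stopwords=frozenset()):
    """Tokenize the 'fulltext' column of every CSV found in in_dir.

    Writes one <csv name>.txt per input file into out_dir and returns
    the list of output paths. `tokenize` is any callable that maps a
    string to an iterable of tokens (hypothetical interface).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # sorted() keeps the processing order stable between runs
    for csv_path in sorted(Path(in_dir).glob('*.csv')):
        txt_path = out_dir / (csv_path.stem + '.txt')
        # newline='' is what the csv module expects for its input files
        with open(csv_path, 'r', encoding='utf-8', newline='') as src, \
                open(txt_path, 'w', encoding='utf-8') as dst:
            for row in csv.DictReader(src):
                # the other columns (rowid, reason, region) are ignored
                text = row.get('fulltext') or ''
                tokens = [t for t in tokenize(text)
                          if t.strip() and t not in stopwords]
                dst.write(' '.join(tokens) + '\n')
        written.append(txt_path)
    return written
```

With the script above, a call could look like `tokenize_csv_folder('raw material', 'clean', jieba.cut, stopword_set)`; the OpenCC conversion could be folded in by passing `lambda s: jieba.cut(cc.convert(s))` as the tokenizer instead. Note that this converts each row before segmenting it, whereas the script converts each token afterwards, so the two may not always give identical results.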
I'd love to hear any ideas you might have. Thanks a lot, and have a great day!