This is what I have so far:
import re
import csv
outfile1 = open('test_output.csv', 'wt')
outfileWriter1 = csv.writer(outfile1, delimiter=',')
rawtext = open('rawtext.txt', 'r').read()
print(rawtext)
rawtext = rawtext.lower()
print(rawtext)
re.sub('[^A-Za-z0-9]+', '', rawtext)
print(rawtext)
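First, when I run this, the punctuation is not removed, so I'm wondering whether something is wrong with my expression?
Second, I'm trying to produce a .csv list of all the words, each flagged by whether it contained punctuation. For example, a text file that says "Hello! It's a nice day." would output: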
ID, PUNCTUATION, WORD
1, Y, hello
2, Y, its
3, N, a
4, N, nice
5, Y, day
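I know I can use .split() to split the words, but beyond that I have no idea how to go about this! Any help would be greatly appreciated.
Answer 0 (score: 0)
You could do something like this: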
from string import punctuation
import csv
strs = "Hello! It's a nice day."
# newline='' stops the csv module from writing blank rows on Windows
with open('abc.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerow(['ID', 'PUNCTUATION', 'WORD'])
    # map every punctuation code point to None so str.translate deletes it
    table = dict.fromkeys(map(ord, punctuation))
    # use enumerate to get the word as well as its 1-based index
    for i, word in enumerate(strs.split(), 1):
        # str.translate is faster than regex
        new_strs = word.translate(table)
        # if the new word is not equal to the original word then use 'Y'
        punc = 'Y' if new_strs != word else 'N'
        writer.writerow([i, punc, new_strs])
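As for the first part of the question: the expression itself matches, but re.sub returns a new string rather than changing rawtext in place, so its result has to be assigned to something. Note also that '[^A-Za-z0-9]+' removes whitespace too, which would glue all the words together. A minimal sketch, assuming you want to keep whitespace so .split() still works afterwards (strip_punctuation is just an illustrative helper name, not something from the question):

```python
import re

def strip_punctuation(text):
    """Lowercase text and drop everything except letters, digits and whitespace."""
    # re.sub returns a new string; Python strings are never modified in place,
    # so the result must be returned or assigned.
    # \s inside the negated class keeps spaces and newlines intact.
    return re.sub(r'[^a-z0-9\s]+', '', text.lower())

cleaned = strip_punctuation("Hello! It's a nice day.")
print(cleaned)  # hello its a nice day
```

In the original script the equivalent change would be rawtext = re.sub(..., rawtext).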
Answer 1 (score: 0)
Try this version:
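import string
import csv
header = ('ID', 'PUNCTUATION', 'WORD')
# newline='' stops the csv module from writing blank rows on Windows
with open('test_output.csv', 'w', newline='') as outf, open('rawtext.txt') as inf:
    outfileWriter1 = csv.DictWriter(outf, header, delimiter=',')
    outfileWriter1.writeheader()
    word_id = 0  # running word counter across all lines of the file
    for rawtext in inf:
        for word in rawtext.split():
            word_id += 1
            # fresh row per word, so the flag never leaks from an earlier word
            out = {'PUNCTUATION': 'N', 'ID': word_id}
            stripped = ''.join(i for i in word if i not in string.punctuation)
            # a shorter result means punctuation was removed
            if len(stripped) != len(word):
                out['PUNCTUATION'] = 'Y'
            out['WORD'] = stripped.lower()
            outfileWriter1.writerow(out)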
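To check the tagging logic against the table in the question without touching any files, the per-word step can be pulled into a small generator and the CSV written to an in-memory buffer. This is only a sketch: tag_words and to_csv are hypothetical names introduced here, and it assumes words are separated by whitespace and that "punctuation" means the characters in string.punctuation.

```python
import csv
import io
import string

# translation table that deletes every character in string.punctuation
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def tag_words(text):
    """Yield (id, flag, word) for each whitespace-separated word in text."""
    for word_id, word in enumerate(text.split(), 1):
        stripped = word.translate(_PUNCT_TABLE)
        # the word changed only if punctuation was removed from it
        flag = 'Y' if stripped != word else 'N'
        yield word_id, flag, stripped.lower()

def to_csv(text):
    """Return the tagged words as CSV text, header row included."""
    buf = io.StringIO(newline='')
    writer = csv.writer(buf)
    writer.writerow(['ID', 'PUNCTUATION', 'WORD'])
    writer.writerows(tag_words(text))
    return buf.getvalue()

print(to_csv("Hello! It's a nice day."))
```

Swapping the buffer for open('test_output.csv', 'w', newline='') gives the file output the question asks for.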