Question

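I am trying to check whether a CSV file contains special characters. The file has a single column and about 180,000 rows. Because it contains Korean, English, and Chinese, I added `가-힣`, `A-Z`, and `0-9`, but I don't know how to filter Chinese characters. Or is there a better way?

The special characters I am looking for are: ■, △, ?, and so on.

The special characters I do not mean are: units (e.g. ㎍, ㎥, ℃), (), ', and so on.

Searching on Stack Overflow, I found that many questions assume you specify the special characters to look for up front. In my case that is hard, because I have 180,000 records and I don't know which characters are in there. As far as I know, the file contains only three languages: Korean, English, and Chinese.

This is my code so far: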

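import csv
import re

file = "C:/kd/fields.csv"
# Anything that is not Hangul, an ASCII letter, or a digit (Chinese is not handled yet)
pattern = re.compile(r'[^가-힣A-Za-z0-9]')

with open("C:/count1.csv", 'w', encoding='cp949', newline='') as testfile:
    csv_writer = csv.writer(testfile)
    with open(file, 'r') as fi:
        for line in fi:
            text = line.rstrip()
            # Mask every character outside the allowed set and count the masks
            sub, count = pattern.subn('*', text)
            csv_writer.writerow([sub, count])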

Using import re:

import csv
import re

# Matches any character that is not Hangul, a lowercase ASCII letter, or a digit
regex = r'[^가-힣a-z0-9]'

file = "C:/kd/fields.csv"
with open("C:/specialcharacter.csv", 'w', encoding='cp949', newline='') as testfile:
    csv_writer = csv.writer(testfile)
    with open(file, 'r') as fi:
        for line in fi:
            search_target = line.rstrip()
            result = re.findall(regex, search_target)
            print("\n".join(result))

Answer 1

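I'm not sure why you want to avoid filtering Chinese characters when you are only looking for a few special characters. This library can filter Chinese.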
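One way to treat Chinese as ordinary text without an extra dependency is to add the CJK Unified Ideographs block (U+4E00 to U+9FFF) to the character class. The following is a minimal sketch, assuming that this single block covers the Chinese characters in the file; rarer characters live in extension blocks and would need additional ranges. The names `ALLOWED`, `SPECIAL`, and `find_special` are made up for illustration.

```python
import re

# Characters treated as ordinary text: Hangul syllables, CJK Unified
# Ideographs (assumed to cover the Chinese in the file), ASCII letters,
# digits, and whitespace. Everything else is reported as "special".
ALLOWED = r'가-힣\u4e00-\u9fffA-Za-z0-9\s'
SPECIAL = re.compile('[^' + ALLOWED + ']')


def find_special(text):
    """Return every character in text that falls outside the allowed set."""
    return SPECIAL.findall(text)


print(find_special('한국어 English 中文 123'))  # []
print(find_special('온도 ■ 中文 △ test'))       # ['■', '△']
```

With this class, units such as ℃ and punctuation such as ( ) are still reported, so they would have to be added to `ALLOWED` if they should be ignored.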

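Filter Chinese out of what remains after excluding Korean, English, and digits:

regex = "[^가-힣a-zA-Z0-9]"
result = re.findall(regex, search_target)

Then filter on either 1) a list of the special characters you are looking for, or 2) a list of the special characters you want to ignore.

Choose whichever fits your case better, so that you hit as few exceptions as possible and don't have to keep adding filters.

Turn the list into a regular expression.

Then run the regular expression over your 180,000 rows to filter them.

Keep updating the regular expression list until everything is filtered.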
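The refinement loop described above can be made less tedious by tallying every character that the current allow-list rejects, then reviewing the tally and moving harmless characters (units, brackets, quotes) into the list. This is a rough sketch: `tally_rejected`, `allowed_extra`, and the sample rows are illustrative assumptions, not anything taken from the question.

```python
import re
from collections import Counter


def tally_rejected(rows, allowed_extra=''):
    """Count characters outside the allow-list across all rows.

    allowed_extra holds additional literal characters to accept; it is
    escaped so that characters such as ( ) or ] are taken literally.
    """
    base = r'가-힣\u4e00-\u9fffA-Za-z0-9\s'
    pattern = re.compile('[^' + base + re.escape(allowed_extra) + ']')
    counts = Counter()
    for row in rows:
        counts.update(pattern.findall(row))
    return counts


rows = ['물 (5㎍) ■', '中文 ℃ △', "it's ■ ok"]

# First pass: see everything that is currently rejected.
first = tally_rejected(rows)
print(first.most_common())

# Second pass: accept the units and punctuation that turned out to be fine.
second = tally_rejected(rows, allowed_extra="㎍℃()'")
print(second.most_common())
```

After the second pass only ■ and △ remain in the tally for these sample rows, which is the point at which the allow-list can be considered complete for the data seen so far.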

Filter the special characters, count them, and write the result to another CSV.
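Putting the pieces together, the whole task (find the special characters, count them, write a second CSV) might look like the sketch below. It assumes UTF-8 files, because cp949, which the question uses for output, may not be able to represent every Chinese character. It also assumes a single-column input. The function name `count_special`, the three-column output layout, and the temporary demo files are all hypothetical.

```python
import csv
import os
import re
import tempfile

# Allow-list: Hangul, CJK ideographs, ASCII letters, digits, whitespace,
# plus the units and punctuation the question wants to keep.
SPECIAL = re.compile(r"[^가-힣\u4e00-\u9fffA-Za-z0-9\s㎍㎥℃()']")


def count_special(src_path, dst_path, encoding='utf-8'):
    """Write one output row per input row: text, special chars, count."""
    with open(src_path, 'r', encoding=encoding, newline='') as src, \
         open(dst_path, 'w', encoding=encoding, newline='') as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if not row:
                continue  # skip blank lines
            text = row[0]  # the file is assumed to have a single column
            found = SPECIAL.findall(text)
            writer.writerow([text, ''.join(found), len(found)])


# Small demonstration with a temporary file standing in for the real CSV.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'fields.csv')
    dst = os.path.join(tmp, 'specialcharacter.csv')
    with open(src, 'w', encoding='utf-8', newline='') as f:
        f.write('온도 25℃ ■\n中文 (test)\n가격 △ ■\n')
    count_special(src, dst)
    with open(dst, encoding='utf-8', newline='') as f:
        result = list(csv.reader(f))
    print(result)
```

Reading the file line by line keeps memory use flat, so 180,000 rows should not be a problem for this approach.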
