Question

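I have a huge text corpus (one entry per line), and I want to remove the special characters while keeping the spacing and structure of each string.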

hello? there A-Z-R_T(,**), world, welcome to python.
this **should? the next line#followed- by@ an#other %million^ %%like $this.

should become

hello there A Z R T world welcome to python
this should be the next line followed by another million like this

Answer 1

You can also use a regex with this pattern:

import re
a = '''hello? there A-Z-R_T(,**), world, welcome to python.
this **should? the next line#followed- by@ an#other %million^ %%like $this.'''

for k in a.split("\n"):
    print(re.sub(r"[^a-zA-Z0-9]+", ' ', k))
    # Or:
    # final = " ".join(re.findall(r"[a-zA-Z0-9]+", k))
    # print(final)

Output:

hello there A Z R T world welcome to python 
this should the next line followed by an other million like this

Edit:

Alternatively, you can store the cleaned lines in a list:

final = [re.sub(r"[^a-zA-Z0-9]+", ' ', k) for k in a.split("\n")]
print(final)

Output:

['hello there A Z R T world welcome to python ', 'this should the next line followed by an other million like this ']
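Since the question mentions a very large corpus, the same substitution could also be applied while streaming the input one line at a time. A minimal sketch, assuming a precompiled pattern; the names `clean_line` and `clean_lines` are made up for illustration and are not part of the answer above:

```python
import re

# Precompile once; same pattern as in the answer above
NON_ALNUM = re.compile(r"[^a-zA-Z0-9]+")

def clean_line(line):
    """Replace each run of non-alphanumeric characters with one space."""
    # strip() drops the space left where a line starts or ends with punctuation
    return NON_ALNUM.sub(" ", line).strip()

def clean_lines(lines):
    """Lazily clean any iterable of lines, e.g. an open file object."""
    for line in lines:
        yield clean_line(line)

sample = [
    "hello? there A-Z-R_T(,**), world, welcome to python.\n",
    "this **should? the next line#followed- by@ an#other %million^ %%like $this.\n",
]
for cleaned in clean_lines(sample):
    print(cleaned)
```

With a real file, the open file object can be passed to `clean_lines` directly, so the whole corpus never has to be held in memory at once.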

Answer 2

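I think nfn neil's answer is great, but I just want to add a simple regex that removes all non-word characters. Note that it treats the underscore as part of a word: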

print(re.sub(r'\W+', ' ', string))
>>> hello there A Z R_T world welcome to python

Answer 3

A more elegant solution would be

print(re.sub(r"\W+|_", " ", string))

>>> hello there A Z R T world welcome to python this should the next line followed by another million like this

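Here, re is the regex module in Python.

re.sub replaces every match of the pattern with a space, i.e. " ".

r'' marks the string literal as raw, so Python does not interpret backslashes (as in \n or \W) as escape sequences.

\W matches any non-word character, i.e. all special characters such as *&^%$, but not the underscore _.

+ matches one or more occurrences, similar to * (which matches zero or more).

| is a logical OR.

_ stands for the underscore.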
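Two details of this pattern are easier to see in a small experiment. Because `_` is a separate alternative, an underscore next to other punctuation produces two spaces, and because `\W` also matches the newline, a multi-line string is merged into one line, as in the output above. The following is a sketch of one possible variant that keeps line breaks; the name `LINE_SAFE` and the sample text are made up for illustration:

```python
import re

text = "A-Z-R_T(,**), world.\nnext_-line#here"

# Original pattern: "\n" is a non-word character, so the two lines are merged,
# and "_-" is replaced by two separate matches
print(repr(re.sub(r"\W+|_", " ", text)))

# Variant: treat "_" like any other special character, but leave "\n" alone
LINE_SAFE = re.compile(r"(?:[^\w\n]|_)+")
print(repr(LINE_SAFE.sub(" ", text)))
```

The variant may still leave a space at the end of a line that finishes with punctuation; calling `rstrip()` per line would be one way to remove it.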

Answer 4

Create a dictionary that maps each special character to None:

d = {c:None for c in special_characters}

Use the dictionary to build a translation table (for example with str.maketrans). Read the whole text into a variable and call str.translate on it.
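A sketch of how this could look in full. `special_characters` is not defined in the answer, so the example assumes `string.punctuation` as a stand-in. Note that mapping to None deletes the characters, which glues neighbouring words together, whereas mapping to a space keeps them apart:

```python
import string

# Assumption: ASCII punctuation stands in for the undefined special_characters
special_characters = string.punctuation

text = "hello? there A-Z-R_T(,**), world"

# Mapping to None deletes the characters outright
delete_table = str.maketrans({c: None for c in special_characters})
print(text.translate(delete_table))

# Mapping to a space keeps word boundaries; split/join then collapses
# repeated spaces (apply this per line, since split() also removes newlines)
space_table = str.maketrans({c: " " for c in special_characters})
print(" ".join(text.translate(space_table).split()))
```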

Answer 5

You can try

import re
sentence = '''hello? there A-Z-R_T(,**), world, welcome to python. this **should? the next line#followed- by@ an#other %million^ %%like $this.'''
res = re.sub('[!,*)@#%(&$_?.^]', '', sentence)
print(res)

Inside the square brackets of the pattern passed to re.sub, you can add whichever symbols you want to remove.
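If the list of symbols grows, the character class could be built from a plain string instead of being typed by hand. A minimal sketch: the names `symbols` and `pattern` are illustrative, the hyphen is added here only as an example of extending the set, and `re.escape` is used so that characters such as `^`, `]` or `-` are not misread inside the brackets:

```python
import re

# Characters to remove; the answer's set plus "-" as an example extension
symbols = "!,*)@#%(&$_?.^-"
pattern = re.compile("[" + re.escape(symbols) + "]")

sentence = "hello? there A-Z-R_T(,**), world, welcome to python."

# An empty replacement deletes the symbols, as in the answer above
print(pattern.sub("", sentence))
```

Passing `" "` instead of `""` as the replacement would keep words such as `A-Z` separated rather than joined.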

How do I remove special characters other than spaces from a file in Python?
