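Removing punctuation from text in Python

Time: 2017-03-02 03:49:48

Tags: python string mapreduce nlp special-characters

I am trying to get the tokens (words) from a text file and strip all punctuation from them. This is what I am trying: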

import re 

with open('hw.txt') as f:
    lines_after_254 = f.readlines()[254:]
    sent = [word for line in lines_after_254 for word in line.lower().split()]
    words = re.sub('[!#?,.:";]', '', sent)

I get the following error:

return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer

4 Answers:

Answer 0 (score: 1)

Your list is empty:

In [14]: with open('data', 'r') as f:
    ...:     l=f.readlines()[254:]
    ...:     

In [15]: l
Out[15]: []

Assuming you want a list of words, try this:

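import re

with open('data', 'r') as f:
    lines = [line.strip() for line in f]

# lines[:254] takes the first 254 lines; use lines[254:] for the lines after 254
sent = [w for line in lines[:254] for w in re.split(r'\s+', line)]

find = '[!#?,.:";]'
replace = ''

# apply re.sub to each word, since it expects a string and not a list
words = [re.sub(find, replace, word) for word in sent]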

As @Keerthana Prabhakaran pointed out, the re.sub call has been corrected.
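To illustrate the empty-list point above, here is a minimal sketch. The helper name words_from and the sample lines are made up for the illustration, and it assumes the same character class as in the question. Slicing a list past its end returns an empty list instead of raising, so a short file silently produces no words.

```python
import re

# Same character class as in the question.
PUNCT = re.compile(r'[!#?,.:";]')

def words_from(lines, start=0):
    """Return lower-cased, punctuation-stripped words from lines[start:].

    Hypothetical helper, written only for this illustration.
    """
    selected = lines[start:]  # slicing past the end gives [], not an error
    words = []
    for line in selected:
        for token in line.lower().split():
            cleaned = PUNCT.sub('', token)
            if cleaned:  # skip tokens that consisted only of punctuation
                words.append(cleaned)
    return words

sample = ["Hello, world!\n", "How are you?\n"]
print(words_from(sample))       # every line is used
print(words_from(sample, 254))  # start is past the end of the list
```

If the second call is what your script effectively does, checking len(lines) before slicing is probably the first thing to try.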

Answer 1 (score: 1)

re.sub works on a string, not on a list!

print(re.sub(pattern, '', sent))

should be

print([re.sub(pattern, '', s) for s in sent])
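A small self-contained sketch of the difference, using a made-up three-word list in place of the list built from the file. Passing the list itself raises TypeError, while applying re.sub to each element works.

```python
import re

pattern = r'[!#?,.:";]'
sent = ['hello,', 'world!', '"quoted"']  # stand-in for the list read from the file

# Passing the whole list fails, because re.sub expects a single string.
try:
    re.sub(pattern, '', sent)
    raised = False
except TypeError:
    raised = True

# Applying re.sub to each element of the list works.
cleaned = [re.sub(pattern, '', s) for s in sent]

print(raised)
print(cleaned)
```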

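Hope this helps!

Answer 2 (score: 1)

There are a few problems with your script. You are not tokenizing; you are splitting everything into single characters! Also, you only remove the special characters after everything has been split into characters.

A better approach is to read the input string, remove the special characters, and then tokenize the string.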

import re

# open the input text file and read it into a single string
with open('hw.txt') as f:
    text = f.read()
print(text)

# remove the special characters from the string
no_specials_string = re.sub('[!#?,.:";]', '', text)
print(no_specials_string)

# split the text and store the words in a list
words = no_specials_string.split()
print(words)

Alternatively, if you want to split into tokens first and then remove the special characters, you can do this:

import re

# open the input text file and read it into a single string
with open('hw.txt') as f:
    text = f.read()
print(text)

# split the text and store the words in a list
words = text.split()
print(words)

# remove the special characters from each word in words
new_words = [re.sub('[!#?,.:";]', '', word) for word in words]
print(new_words)
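The two orders are not always equivalent. The sketch below uses a made-up sentence in which some punctuation marks stand alone as tokens: stripping first and then splitting drops them entirely, while splitting first leaves empty strings behind that you may want to filter out.

```python
import re

PUNCT = r'[!#?,.:";]'
text = 'Hello , world ! Is this "it"?'  # made-up input with stand-alone punctuation

# Order 1: strip punctuation from the whole string, then split.
strip_then_split = re.sub(PUNCT, '', text).split()

# Order 2: split first, then strip punctuation from each token.
split_then_strip = [re.sub(PUNCT, '', tok) for tok in text.split()]

# Dropping the empty strings makes order 2 match order 1 for this input.
non_empty = [tok for tok in split_then_strip if tok]

print(strip_then_split)
print(split_then_strip)
print(non_empty)
```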

Answer 3 (score: 1)

Use the remove_puncts() function below:

import string
translator = str.maketrans('', '', string.punctuation)
def remove_puncts(input_string):
    return input_string.translate(translator)

Usage example:

input_string = """"YH&W^(*D)#IU*DEO)#brhtr<><}{|_}vrthyb,.,''fehsvhrr;[vrht":"]`~!@#$%svbrxs"""
remove_puncts(input_string)
'YHWDIUDEObrhtrvrthybfehsvhrrvrhtsvbrxs'
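Note that string.punctuation covers more characters than the class '[!#?,.:";]' from the question, so the two approaches can give different results. A minimal sketch with a made-up sentence (Python 3, since it relies on str.maketrans):

```python
import re
import string

translator = str.maketrans('', '', string.punctuation)
text = "Don't stop: it's a well-known fact!"  # made-up sample

# Removes every character in string.punctuation, including ' and -
via_translate = text.translate(translator)

# Removes only the characters listed in the question's character class
via_regex = re.sub(r'[!#?,.:";]', '', text)

print(via_translate)
print(via_regex)
```

Whether apostrophes and hyphens should go depends on what you count as a word, so pick the character set to match.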

Edit

Speed comparison

In the timing below, the translator method is faster than regular-expression substitution:

import re, string
from time import time

pattern = '[!#?,.:";]'
def regex_sub(input_string):
    return re.sub(pattern, '', input_string)

translator = str.maketrans('', '', string.punctuation)
def string_translator(input_string):
    return input_string.translate(translator)

input_string = """cwsx#?;.frvcdr"""
string_translator(input_string)
regex_sub(input_string)

passes = 1000000
t1 = time()
for i in range(passes):
    a = string_translator(input_string)

t2 = time()
for i in range(passes):
    a = regex_sub(input_string)

t3 = time()

string_translator_time = t2 - t1
regex_sub_time = t3 - t2

print(string_translator_time) # 1.341651439666748
print(regex_sub_time) # 3.44773268699646
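The comparison above is not quite like for like, since the translator removes all of string.punctuation while the regex removes only nine characters, and the pattern is passed as a string on every call. Here is a hedged sketch of a closer comparison, assuming Python 3: both functions remove the same character set, the regex is precompiled, and timeit does the timing. The number of passes is arbitrary and the results will vary by machine and Python version, so no timings are quoted.

```python
import re
import timeit

CHARS = '!#?,.:";'  # the same characters for both approaches

pattern = re.compile('[' + re.escape(CHARS) + ']')
translator = str.maketrans('', '', CHARS)

def regex_sub(input_string):
    # Precompiled pattern, so no per-call pattern lookup.
    return pattern.sub('', input_string)

def string_translator(input_string):
    return input_string.translate(translator)

input_string = 'cwsx#?;.frvcdr'

# Sanity check: both approaches must agree before timing them.
assert regex_sub(input_string) == string_translator(input_string)

passes = 100000
translator_time = timeit.timeit(lambda: string_translator(input_string), number=passes)
regex_time = timeit.timeit(lambda: regex_sub(input_string), number=passes)

print(translator_time)
print(regex_time)
```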