I'm trying to get tokens (words) from a text file and strip all punctuation from them. This is what I've tried:
import re
with open('hw.txt') as f:
    lines_after_254 = f.readlines()[254:]
sent = [word for line in lines_after_254 for word in line.lower().split()]
words = re.sub('[!#?,.:";]', '', sent)
I get the following error:
return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer
Answer 0 (score: 1)
Your list is empty:
In [14]: with open('data', 'r') as f:
    ...:     l = f.readlines()[254:]
    ...:
In [15]: l
Out[15]: []
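Assuming you want a list of words, try this:
import re

with open('data', 'r') as f:
    lines = [line.strip() for line in f]
# split each line on whitespace (the question slices with [254:]; use the slice your file needs)
sent = [w for line in lines[:254] for w in re.split(r'\s+', line)]
find = '[!#?,.:";]'
replace = ''
# re.sub works on one string at a time, so apply it to each word
words = [re.sub(find, replace, word) for word in sent]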
As @Keerthana Prabhakaran pointed out, the re.sub call has been corrected.
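The approach above can be sketched as a small self-contained snippet, assuming Python 3. The helper name `words_after` and the in-memory sample text are hypothetical, used only so the snippet runs without a file on disk. It also shows that slicing past the end of the file gives an empty word list rather than an error.

```python
import io
import re

PUNCT = re.compile(r'[!#?,.:";]')

def words_after(lines, skip):
    """Return lowercase, punctuation-stripped words from lines[skip:]."""
    words = []
    for line in lines[skip:]:
        for token in line.lower().split():
            cleaned = PUNCT.sub('', token)
            if cleaned:  # drop tokens that were punctuation only
                words.append(cleaned)
    return words

# Hypothetical stand-in for open('hw.txt'); any list of lines works.
sample = io.StringIO('Header line\nHello, world!\nIs this "it"?\n')
lines = sample.readlines()

print(words_after(lines, 1))    # skips the first line
print(words_after(lines, 254))  # slice starts past the end of the data
```

With a real file, `lines` would come from `f.readlines()` inside a `with open(...)` block, as in the question.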
Answer 1 (score: 1)
re.sub operates on a string, not a list!
print(re.sub(pattern, '', sent))
should be
print([re.sub(pattern, '', s) for s in sent])
Hope this helps!
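To make the point concrete, here is a minimal sketch assuming Python 3, with a hypothetical two-token `sent` list. Passing the whole list to `re.sub` raises `TypeError` (the exact message wording differs between Python versions), while substituting per element works.

```python
import re

pattern = '[!#?,.:";]'
sent = ['hello,', 'world!']  # hypothetical sample tokens

try:
    re.sub(pattern, '', sent)  # a list is not a string, so this raises
except TypeError as exc:
    print('TypeError:', exc)

# Apply the substitution to each string in the list instead.
cleaned = [re.sub(pattern, '', s) for s in sent]
print(cleaned)
```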
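Answer 2 (score: 1)
There are a couple of problems in your script. You aren't tokenizing; you're splitting everything into individual characters! On top of that, you only remove the special characters after everything has been split into characters.
A better approach is to read the input string, remove the special characters, and then tokenize the result.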
import re
# open the input text file and read it
with open('hw.txt') as f:
    string = f.read()
print(string)
# remove the special characters from the string that was read
no_specials_string = re.sub('[!#?,.:";]', '', string)
print(no_specials_string)
# split the text and store the words in a list
words = no_specials_string.split()
print(words)
Alternatively, if you want to split into tokens first and then remove the special characters, you can do this:
import re
# open the input text file and read it
with open('hw.txt') as f:
    string = f.read()
print(string)
# split the text and store the words in a list
words = string.split()
print(words)
# remove special characters from each word in words
new_words = [re.sub('[!#?,.:";]', '', word) for word in words]
print(new_words)
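The two orderings are not always equivalent. A small sketch, assuming Python 3 and a hypothetical input with free-standing punctuation, shows the difference: splitting first leaves empty strings where a token consisted only of punctuation.

```python
import re

pattern = '[!#?,.:";]'
text = 'hello , world !'  # hypothetical input with free-standing punctuation

# Strip first, then split: punctuation-only tokens vanish.
strip_then_split = re.sub(pattern, '', text).split()

# Split first, then strip: punctuation-only tokens become empty strings.
split_then_strip = [re.sub(pattern, '', w) for w in text.split()]

# Filtering out the empty strings makes the two results agree.
filtered = [w for w in split_then_strip if w]

print(strip_then_split)   # ['hello', 'world']
print(split_then_strip)   # ['hello', '', 'world', '']
print(filtered)           # ['hello', 'world']
```

If you split first, filtering out empty strings afterwards is probably what you want.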
Answer 3 (score: 1)
Use the remove_puncts() function below:
import string
translator = str.maketrans('', '', string.punctuation)
def remove_puncts(input_string):
    return input_string.translate(translator)
Usage example:
input_string = """"YH&W^(*D)#IU*DEO)#brhtr<><}{|_}vrthyb,.,''fehsvhrr;[vrht":"]`~!@#$%svbrxs"""
remove_puncts(input_string)
'YHWDIUDEObrhtrvrthybfehsvhrrvrhtsvbrxs'
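For the original task of tokenizing text, the same function can be applied per token. This is only a sketch, assuming Python 3 and a hypothetical sample line. Note that `string.punctuation` covers more characters than the question's `[!#?,.:";]` class, including apostrophes and parentheses, so contractions lose their apostrophe.

```python
import string

translator = str.maketrans('', '', string.punctuation)

def remove_puncts(input_string):
    return input_string.translate(translator)

line = "Don't panic: it's (mostly) harmless!"  # hypothetical sample line

# Lowercase, split on whitespace, strip punctuation from each token,
# then drop any token that ended up empty.
tokens = [remove_puncts(t) for t in line.lower().split()]
tokens = [t for t in tokens if t]

print(tokens)  # ['dont', 'panic', 'its', 'mostly', 'harmless']
```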
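Edit
Speed comparison
The translator approach is faster than regular-expression substitution (roughly 1.3 s versus 3.4 s for one million calls in the timing below):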
import re
import string
from time import time  # import the function; calling the time module itself raises TypeError

pattern = '[!#?,.:";]'
def regex_sub(input_string):
    return re.sub(pattern, '', input_string)

translator = str.maketrans('', '', string.punctuation)
def string_translator(input_string):
    return input_string.translate(translator)

input_string = """cwsx#?;.frvcdr"""
# call each function once before timing
string_translator(input_string)
regex_sub(input_string)

passes = 1000000
t1 = time()
for i in range(passes):
    a = string_translator(input_string)
t2 = time()
for i in range(passes):
    a = regex_sub(input_string)
t3 = time()

string_translator_time = t2 - t1
regex_sub_time = t3 - t2
print(string_translator_time) # 1.341651439666748
print(regex_sub_time) # 3.44773268699646
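The same comparison can also be sketched with the standard-library `timeit` module, which handles the timing loop for you. This is an illustrative variant, not the measurement above: it assumes Python 3, precompiles the pattern, and uses a smaller hypothetical `passes` value so it finishes quickly. The two functions also strip different character sets, so the numbers are only indicative and will vary by machine.

```python
import re
import string
import timeit

pattern = re.compile('[!#?,.:";]')
translator = str.maketrans('', '', string.punctuation)
input_string = 'cwsx#?;.frvcdr'

def regex_sub(s):
    return pattern.sub('', s)

def string_translator(s):
    return s.translate(translator)

passes = 100000  # assumption: fewer passes than the run above

# timeit.timeit calls the given function `number` times and
# returns the total elapsed time in seconds.
translate_time = timeit.timeit(lambda: string_translator(input_string), number=passes)
regex_time = timeit.timeit(lambda: regex_sub(input_string), number=passes)

print(translate_time)
print(regex_time)
```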