I am trying to apply a function to the values of my dictionary with the following command:
x_text = [clean_str(v) for k, v in answer.items()]
The function clean_str:
import re

def clean_str(string):
    # remove stopwords
    # string = ' '.join([word for word in string.split() if word not in cachedStopWords])
    # replace every character outside the allowed set with a space
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    # split contractions off into their own tokens
    string = re.sub(r"\'s", " \'s", string)
    string = re.sub(r"\'ve", " \'ve", string)
    string = re.sub(r"n\'t", " n\'t", string)
    string = re.sub(r"\'re", " \'re", string)
    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'ll", " \'ll", string)
    # pad punctuation with spaces
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " \( ", string)
    string = re.sub(r"\)", " \) ", string)
    string = re.sub(r"\?", " \? ", string)
    # collapse runs of whitespace
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip().lower()
But I get the following error:
  File "C:\ProgramData\Anaconda3\lib\re.py", line 191, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object
Below is an excerpt of the first 2 k, v pairs of my dictionary (answer):
In[45]:{k: answer[k] for k in list(answer)[:2]}
Out[45]:
{b'B00308CJ12': [b'Bulletproof Salesman (2008)'],
b'189138922X': [b'Classical Mechanics']}
Answer 0 (score: 0)
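Each value in your dictionary is a list of bytes objects, not a string. clean_str passes that list straight to re.sub, which accepts only a string or a bytes-like object, hence the TypeError. Passing a single bytes element would not work either, because the patterns in clean_str are str patterns and re does not allow a str pattern to be applied to bytes.
You should iterate over each list and use the decode() method to convert every bytes element to a string: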
x_text = [clean_str(i.decode()) for k, v in answer.items() for i in v]
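To check the fix in isolation, here is a minimal sketch built on the two entries shown in the question. The clean_str in it is a shortened stand-in for the original (only the first and last substitutions), and decoding as UTF-8 is an assumption about how the bytes were encoded; adjust the codec if your data uses a different one.

```python
import re

def clean_str(string):
    # Shortened stand-in for the question's clean_str (assumption:
    # only the character filter and whitespace collapse are kept).
    string = re.sub(r"[^A-Za-z0-9(),!?'`]", " ", string)
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip().lower()

# The two entries from the question's excerpt.
answer = {
    b'B00308CJ12': [b'Bulletproof Salesman (2008)'],
    b'189138922X': [b'Classical Mechanics'],
}

# Passing a whole value (a list) reproduces the TypeError.
try:
    clean_str(answer[b'B00308CJ12'])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Flatten the lists and decode each bytes element before cleaning.
# "utf-8" is an assumed encoding, not something the question states.
x_text = [
    clean_str(item.decode("utf-8"))
    for values in answer.values()
    for item in values
]

print(error_message)
print(x_text)
```

Because the comprehension flattens the lists, x_text holds one cleaned string per title rather than one per key. If you need to keep the link to the keys, build a dict of lists instead.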