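I have text strings that I am cleaning with the string functions below. Now I want to scale this up and apply it to a DataFrame. The problem is that it does not work on the DataFrame. I also tried applying it to a NumPy array, but the result was an empty string.
The DataFrame has a single column whose rows are strings like the line variable below: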
0
0 Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US...
1 Mozilla/5.0 (Windows NT 5.1; rv:2.0.1) Gecko/2...
2 Mozilla/5.0 (iPod; U; CPU iPhone OS 4_1 like M...
3 Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/201...
4 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT ...
import re
import string

line = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; handyCafeCln/3.3.21)"
re_print = re.compile('[^%s]' % re.escape(string.printable))
remove_digits = str.maketrans('', '', string.digits)
remove_punc = str.maketrans('', '', string.punctuation)
line = line.translate(remove_digits)
line = line.translate(remove_punc)
line = line.split()
Result:
['Mozilla', 'compatible', 'MSIE', 'Windows', 'NT', 'NET', 'CLR', 'handyCafeCln']
I tried packaging the same steps into a function, but I cannot apply it to the DataFrame and get the following error: 'Series' object has no attribute 'translate'
def clean_pairs(lines):
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    remove_digits = str.maketrans('', '', string.digits)
    remove_punc = str.maketrans('', '', string.punctuation)
    lines.translate(remove_digits)
    lines.translate(remove_punc)
    lines.split()

df.apply(clean_pairs)
Answer 0 (score: 1)
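The error message suggests that the function was handed a whole pandas Series rather than a single string. As a minimal sketch of why, the snippet below records the type of object that reaches the function under DataFrame.apply versus Series.apply. The helper record_type and the shortened sample string are hypothetical, introduced only for illustration.

```python
import pandas as pd

# Hypothetical one-column frame standing in for the asker's data.
df = pd.DataFrame(["Mozilla/4.0 (compatible; MSIE 7.0)"])

seen_types = []

def record_type(value):
    # Record what kind of object apply() hands to the function.
    seen_types.append(type(value).__name__)
    return value

df.apply(record_type)     # DataFrame.apply passes each column as a Series
df[0].apply(record_type)  # Series.apply passes each element of the column

print(seen_types)
```

A Series has no translate method of its own, so any function written for a plain str has to be applied to the column's elements instead of to the frame.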
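import re
import string
import pandas as pd

def clean_pairs(lines):
    # Compiled as in the question; not used by the steps below
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    remove_digits = str.maketrans('', '', string.digits)
    remove_punc = str.maketrans('', '', string.punctuation)
    lines = lines.translate(remove_digits)  # strip digits
    lines = lines.translate(remove_punc)    # strip punctuation
    lines = lines.split()                   # split on whitespace
    return lines

# line is the sample string defined in the question
df = pd.DataFrame([line])
# Apply to the column (a Series) so the function receives one string at a time
print(df[0].apply(clean_pairs))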
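As a possible alternative to a per-row Python function, the same cleaning can be sketched with the pandas string accessor, which applies str methods across the whole column. This is not part of the original answer; the names drop_chars and tokens are made up for this example, and merging digits and punctuation into one translation table is a simplification that should produce the same tokens.

```python
import string

import pandas as pd

# Sample frame; the single column is labelled 0, as in the question.
df = pd.DataFrame([
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; "
    ".NET CLR 2.0.50727; handyCafeCln/3.3.21)",
])

# One translation table that deletes both digits and punctuation.
drop_chars = str.maketrans("", "", string.digits + string.punctuation)

# .str.translate and .str.split operate on every element of the column.
tokens = df[0].str.translate(drop_chars).str.split()

print(tokens.iloc[0])
```

Whether this is noticeably faster than df[0].apply(clean_pairs) depends on the data and the pandas version, so it is worth timing both on a realistic sample.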