Question

For a single string, the code below removes the unicode characters and the new lines/carriage returns:

t = "We've\xe5\xcabeen invited to attend TEDxTeen, an independently organized TED event focused on encouraging youth to find \x89\xdb\xcfsimply irresistible\x89\xdb\x9d solutions to the complex issues we face every day.,"

t2 = t.decode('unicode_escape').encode('ascii', 'ignore').strip()
import sys
sys.stdout.write(t2.strip('\n\r'))

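But when I try to write a function in pandas to apply this to every cell of a column, it either fails with an attribute error or I get a warning that a value is trying to be set on a copy of a slice from a DataFrame.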

def clean_text(row):
    row= row["text"].decode('unicode_escape').encode('ascii', 'ignore')#.strip()
    import sys
    sys.stdout.write(row.strip('\n\r'))
    return row

Applied to my dataframe:

df["text"] = df.apply(clean_text, axis=1)

How can I apply this code to each element of a Series?

Answer 1

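The problem seems to be that you access and alter row['text'] and then return the row itself from the applied function. When you call apply on a DataFrame, the function is applied to each Series, so changing it to the following should help: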

import pandas as pd

df = pd.DataFrame([t for _ in range(5)], columns=['text'])

df 
                                                text
0  We've������been invited to attend TEDxTeen, an ind...
1  We've������been invited to attend TEDxTeen, an ind...
2  We've������been invited to attend TEDxTeen, an ind...
3  We've������been invited to attend TEDxTeen, an ind...
4  We've������been invited to attend TEDxTeen, an ind...

def clean_text(row):
    # return a list of the decoded cells in the Series instead
    return [r.decode('unicode_escape').encode('ascii', 'ignore') for r in row]

df['text'] = df.apply(clean_text)

df
                                                text
0  We'vebeen invited to attend TEDxTeen, an indep...
1  We'vebeen invited to attend TEDxTeen, an indep...
2  We'vebeen invited to attend TEDxTeen, an indep...
3  We'vebeen invited to attend TEDxTeen, an indep...
4  We'vebeen invited to attend TEDxTeen, an indep...
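
To make the point about what apply passes to the function more concrete, here is a minimal sketch on a toy frame (the column names, values, and the `record` helper are made up for illustration). It only records the label of each object pandas hands over: column labels by default, row labels with axis=1.

```python
import pandas as pd

# Toy frame; the column names and values are hypothetical.
df = pd.DataFrame({"text": ["a", "b"], "other": ["c", "d"]})

seen = []

def record(series):
    # Remember the label of whatever Series pandas passes in.
    seen.append(series.name)
    return series

df.apply(record)            # default axis=0: called with each column
columns_seen = list(seen)

seen.clear()
df.apply(record, axis=1)    # axis=1: called with each row
rows_seen = list(seen)

print(columns_seen)
print(rows_seen)
```

With a single-column frame and the default axis, the function therefore receives the whole 'text' column, which is why iterating over it inside clean_text works.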

Alternatively, you can use a lambda as below and apply it directly to just the text column:

df['text'] = df['text'].apply(lambda x: x.decode('unicode_escape').\
                                          encode('ascii', 'ignore').\
                                          strip())
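
The snippets in this thread rely on Python 2, where str has a decode method. As a hedged sketch of the same idea under Python 3 (an assumption, since the question does not state a version), the text is already unicode, so it is enough to encode to ASCII while ignoring errors and decode back. The `to_ascii` helper and the shortened sample string are illustrative only.

```python
import pandas as pd

# Shortened sample in the spirit of the question's string (hypothetical data).
raw = "We've\xe5\xcabeen invited \x89\xdb\xcfsimply irresistible\x89\xdb\x9d\r\n"
df = pd.DataFrame({"text": [raw, raw]})

def to_ascii(value):
    # Drop every character that has no ASCII encoding, turn the
    # resulting bytes back into str, then trim surrounding whitespace.
    return value.encode("ascii", "ignore").decode("ascii").strip()

# Series.apply calls the function once per cell of the column.
df["text"] = df["text"].apply(to_ascii)
print(df["text"].iloc[0])
```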

Answer 2

I actually can't reproduce your error: the following code runs for me without an error or warning.

df = pd.DataFrame([t,t,t],columns = ['text'])
df["text"] = df.apply(clean_text, axis=1)

If it helps, I think the more "pandas" way to approach this type of problem might be to use a regex with one of the Series.str methods:

df["text"] = df["text"].str.replace(r'[^\x00-\x7F]', '', regex=True)
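
One thing worth noting about the regex route: the character class only matches non-ASCII characters, so "\n" and "\r" survive it. A small sketch, assuming Python 3 and a pandas version whose str.replace accepts the regex keyword, with made-up sample values:

```python
import pandas as pd

# Hypothetical sample data: one value with a non-ASCII character and a
# trailing carriage return/newline, one value that is already clean.
df = pd.DataFrame({"text": ["caf\xe9 au lait\r\n", "plain ascii"]})

cleaned = (
    df["text"]
    .str.replace(r"[^\x00-\x7F]", "", regex=True)  # drop non-ASCII characters
    .str.strip()                                   # trim whitespace, including \r and \n
)
print(cleaned.tolist())
```

Chaining str.strip() here plays the role of the strip calls in the question's original code.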

Answer 3

Like this, where column_to_convert is the column you wish to convert:

series = df['column_to_convert']
df["text"] =  [s.encode('ascii', 'ignore').strip()
               for s in series.str.decode('unicode_escape')]
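
The same conversion can also be written entirely with the vectorized string accessor, without a list comprehension. This is only a sketch, assuming Python 3 and that the column already holds str values rather than bytes; the sample data is invented.

```python
import pandas as pd

# Hypothetical sample data for the column to convert.
df = pd.DataFrame({"column_to_convert": ["na\xefve text \n", "ok"]})

df["text"] = (
    df["column_to_convert"]
    .str.encode("ascii", errors="ignore")  # str -> bytes, dropping non-ASCII characters
    .str.decode("ascii")                   # bytes -> str
    .str.strip()                           # trim surrounding whitespace
)
print(df["text"].tolist())
```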

Removing unicode from text in pandas
