Question

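I am working with an Excel file in which several columns contain international currency symbols. The file also contains text in several languages.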

Example: Paying £40.50 doesn't make any sense for a one-hour parking. 
Example: Produkty są zbyt drogie (Polish)
Example: 15% de la population féminine n'obtient pas de bons emplois (French)

As a cleanup step, I run the following:

df = df.apply(lambda x: x.str.replace(r'\r', ' ', regex=True))
df = df.apply(lambda x: x.str.replace(r'\n', ' ', regex=True))
df = df.apply(lambda x: x.str.replace(r'\.+', '', regex=True))
df = df.apply(lambda x: x.str.replace('-', '', regex=False))
df = df.apply(lambda x: x.str.replace('&', '', regex=False))
df = df.apply(lambda x: x.str.replace(r"[\"\',]", '', regex=True))
df = df.apply(lambda x: x.str.replace(r'[%*]', '', regex=True), axis=1)

(If there is a more efficient way to do this, suggestions are more than welcome.)

In addition, I created a method to remove stop words:

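from nltk.corpus import stopwords

def cleanup(row):
    stops = set(stopwords.words('english'))
    # Keep only the words that are not English stop words.
    removedStopWords = " ".join([str(i) for i in row.lower().split() if i not in stops])
    return removedStopWords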

I apply this method to all columns of the data frame that contain text like the examples above:

df = df.applymap(self._row_cleaner)['ComplainColumns']

The biggest problem, however, is a UnicodeEncodeError. It is first raised at the pound sign:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 646: ordinal not in range(128)

I tried the following, but it does not work: df = df.apply(lambda x: x.unicode.replace(u'\xa3', ''))

The goal is to replace every non-alphabetic character with '' or ' '.

Answer 1

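If you want to remove every character other than word characters and whitespace, you can use replace with a regular expression. Note that \w covers [A-Za-z0-9_], and in Python 3 it also matches accented letters such as ą and é, so the Polish and French text is kept: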

df = df.replace(r'[^\w\s]', '', regex=True)
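
As a rough illustration of what this pattern does to the sentences from the question, here is a minimal sketch. The sample frame is made up for the example, with the column name taken from the question, and it assumes Python 3 with pandas installed.

```python
import pandas as pd

# Hypothetical sample frame built from the example sentences in the question.
df = pd.DataFrame({
    "ComplainColumns": [
        "Paying £40.50 doesn't make any sense for a one-hour parking.",
        "Produkty są zbyt drogie",
        "15% de la population féminine n'obtient pas de bons emplois",
    ]
})

# Remove every character that is neither a word character nor whitespace.
cleaned = df.replace(r"[^\w\s]", "", regex=True)

for text in cleaned["ComplainColumns"]:
    print(text)
# Paying 4050 doesnt make any sense for a onehour parking
# Produkty są zbyt drogie
# 15 de la population féminine nobtient pas de bons emplois
```

The currency symbol, punctuation, and percent sign are dropped in one pass, while the accented letters survive. If a removed character should become a space instead (for example the hyphen in "one-hour"), pass ' ' as the replacement value.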

The data frame may contain missing values, so you may need astype(str): your list comprehension calls .lower(), and NaN is a float, which has no such method.

df.astype(str).applymap(cleanup)
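
The missing-value handling can be sketched as follows. This is a variation rather than the exact call above: it assumes missing cells may be treated as empty text and uses fillna('') so that they do not turn into the literal string 'nan'. The small STOPS set and the sample frame are hypothetical stand-ins so the sketch runs without NLTK data; in the question the set comes from stopwords.words('english').

```python
import pandas as pd

# Hypothetical stand-in for set(stopwords.words('english')).
STOPS = {"a", "any", "for", "the", "is"}

def cleanup(cell):
    # Lower-case the cell and keep only the words that are not stop words.
    return " ".join(word for word in cell.lower().split() if word not in STOPS)

# Hypothetical frame with one missing value.
df = pd.DataFrame({"ComplainColumns": ["Paying 4050 for a parking", None]})

# Replace missing values with empty strings, then clean each cell column by column.
result = df.fillna("").apply(lambda col: col.map(cleanup))

print(result["ComplainColumns"].tolist())
# ['paying 4050 parking', '']
```

Building the stop-word set once outside the function, rather than on every call, also avoids repeating that work for each cell.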

Pandas dataframe: replacing international currency symbols