Finding and replacing bad characters in a pandas DataFrame

Time: 2020-01-31 17:50:02

Tags: python pandas unicode

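I'm having trouble getting rid of bad characters in a pandas DataFrame. This is an automated script that processes incoming data which has to be saved in cp1252, and I would like to handle any problem characters dynamically by parsing the error. I don't care what they are replaced with. I've tried a million variations and I'm stumped (this is Python 3, pandas 0.25).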

while True:
    try:
        print('saving')
        data.to_csv('total.csv', index=False, quoting=csv.QUOTE_ALL, encoding='cp1252')
        break
    except UnicodeEncodeError as e:
        print(e)
        badchar = re.search(r"character (.+?) in", str(e)).group(1)
        print('Found bad character, removing. . . ')
        uchar = u"{}".format(badchar)
        print(uchar)
        data = data.replace(uchar.encode('utf-8'), '')

This returns:

saving
'charmap' codec can't encode character '\u2264' in position 399: character maps to <undefined>
Found bad character, removing. . . 
'\u2264'
saving
'charmap' codec can't encode character '\u2264' in position 399: character maps to <undefined>
Found bad character, removing. . . 
'\u2264'
saving
'charmap' codec can't encode character '\u2264' in position 399: character maps to <undefined>
Found bad character, removing. . . 
'\u2264'
saving

I have tried many variations:

data = data.replace(uchar, '')

data = data.replace(uchar.encode('utf-8').decode('utf-8'), '') etc.

I have also tried u'\2264' and u'u\2264'.

I can't find the character in the DataFrame either. This returns nothing:

for col in data:
    if sum(data[col].astype(str).str.contains(u'\2264')) > 0:
        print(col)

Any help would be greatly appreciated, thanks!

1 Answer:

Answer 0: (score: 0)

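You have to use the replace function with a regular expression (without regex=True, DataFrame.replace only matches entire cell values, not substrings inside them): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html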

df.replace(to_replace=r'^ba.$', value='new', regex=True)
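
Applied to the question, a minimal sketch could look like the following. It does not parse the error message. It assumes it is acceptable to scan the string cells up front, collect every character that cp1252 cannot encode, and strip them all in one regex replace. The helper names `find_unencodable_chars` and `drop_unencodable_chars` and the sample frame are hypothetical, made up for illustration. Note that the text captured from the error message is the escaped form (the literal characters `'\u2264'`, quotes included) rather than the actual character, which is one likely reason the original loop never matched anything.

```python
import re

import pandas as pd


def find_unencodable_chars(df, encoding="cp1252"):
    """Collect the distinct characters in string cells that `encoding` cannot represent."""
    bad = set()
    for col in df.columns:
        for value in df[col]:
            if not isinstance(value, str):
                continue  # skip numbers, NaN, None, etc.
            for ch in set(value):
                try:
                    ch.encode(encoding)
                except UnicodeEncodeError:
                    bad.add(ch)
    return bad


def drop_unencodable_chars(df, encoding="cp1252", replacement=""):
    """Return a copy of df with every unencodable character replaced (hypothetical helper)."""
    bad = find_unencodable_chars(df, encoding)
    if not bad:
        return df.copy()
    # Build a character class such as "[≤]" and let pandas apply it as a regex,
    # so the match happens inside cell values rather than on whole cells.
    pattern = "[" + "".join(re.escape(ch) for ch in sorted(bad)) + "]"
    return df.replace(to_replace=pattern, value=replacement, regex=True)


# Made-up sample data: U+2264 (less-than-or-equal sign) is not part of cp1252.
data = pd.DataFrame({"a": ["x \u2264 y", "plain"], "b": [1, 2]})
cleaned = drop_unencodable_chars(data)
```

After this, `cleaned.to_csv('total.csv', index=False, quoting=csv.QUOTE_ALL, encoding='cp1252')` should no longer hit the encode error for these characters. If upgrading is an option, newer pandas releases (1.1 and later, as far as I know) also accept an `errors` argument in `to_csv`, for example `errors='replace'`, which avoids the cleanup step entirely.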