Question

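I have a large dataframe in pandas, and I am iterating through a single column (containing string cells) to clean the data. The data is very noisy, with a lot of HTML characters and C++-style unicode escapes (e.g. "some text here\u00a0and maybe some other text" or "\u2013").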

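I have already filtered out the HTML, but the unicode remains, and I would really like to get rid of it so the text is as readable as possible. My current idea is to convert the variable holding the string to Unicode format (e.g. u'\u00a0') and then convert it back to a string before reassigning it to the cell, somehow eliminating all of these codes. However, I have spent all day looking for something that does this conversion and have not found anything that works for me. What is an easy way to get rid of these substrings?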

I have tried:

u'some string' -> does not work, because I am working with a variable rather than a literal

string.encode('utf-8')

string.decode('utf-8')

Here is the code I am currently using:

''' #import stuff
import re
import pandas as pd

file_name = 'myfile.json'
df = pd.read_json(file_name)

for x in range(0,len(df['col'])):
    note = df.iloc[x]['col']
#BEGIN FILTERING OUT HTML
    pos1 = note.find('<')
    pos2 = note.find('>', pos1)
    while pos1 != -1 and pos2 != -1 :
        if '<' in note and note.find('>', pos1):
            note = note.replace(note[pos1:pos2+1], '')
            pos1 = note.find('<')
            pos2 = note.find('>', pos1)
    note = ' '.join(re.findall(r"[\w%-.']+", note))

#SOMETHING TO REMOVE UNICODE HERE

    df.at[x, 'col'] = note

#Continues on to save file
df.to_json('newfile.json', orient = 'records')

'''
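One possible way to fill in the "SOMETHING TO REMOVE UNICODE HERE" step is sketched below. It assumes Python 3, and it assumes the cells may contain either literal six-character escapes (a backslash followed by `u00a0`) or the already-decoded characters. The helper name `clean_unicode` and the small replacement table are hypothetical, chosen only for illustration. The final step drops every remaining non-ASCII character, which may be too aggressive if the notes contain accented letters.

```python
import re
import unicodedata

# Matches a literal backslash-u escape such as the six characters \u00a0.
_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

# Illustrative (assumed) table of punctuation to swap for ASCII look-alikes.
_REPLACEMENTS = {'\u2013': '-', '\u2014': '-', '\u2018': "'", '\u2019': "'"}


def clean_unicode(text):
    """Hypothetical helper: reduce a noisy string to readable ASCII."""
    # Turn literal escape sequences into the characters they stand for.
    text = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # NFKC folds compatibility characters (e.g. U+00A0 becomes a plain space).
    text = unicodedata.normalize('NFKC', text)
    # Swap common typographic punctuation for ASCII equivalents.
    for src, dst in _REPLACEMENTS.items():
        text = text.replace(src, dst)
    # Drop whatever is still outside ASCII.
    return text.encode('ascii', 'ignore').decode('ascii')
```

Inside the loop this could be called as `note = clean_unicode(note)`, or applied to the whole column at once with `df['col'] = df['col'].apply(clean_unicode)`.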

Filtering Unicode / converting Unicode to characters (Python)

0 answers: