Strings in pandas do not print correctly

Time: 2019-03-11 13:30:33

Tags: python string pandas encoding

I am using pandas to load a CSV file that contains Twitter messages:

corpus = pd.read_csv(data_path, encoding='utf-8')

Here is a sample of the data:

label,date,comment
0,20120528192215Z,"""i really don't understand your point.\xa0 It seems that you are mixing apples and oranges."""

When I try to print a comment, I get:

print(corpus.iloc[1]['comment'])
>> "i really don't understand your point.\xa0 It seems that you are mixing apples and oranges."

The \xa0 is still in the output. However, if I paste the string from the file into the code and print it, I get the correct output:

print("""i really don't understand your point.\xa0 It seems that you are mixing apples and oranges.""")
>> i really don't understand your point.  It seems that you are mixing apples and oranges.

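I would like to know why the two outputs differ, and whether there is a way to print the strings from pandas correctly. I assume there is a better solution than replacing each sequence by hand, because the data contains many other Unicode escapes such as \xe1, \u0111, \u01b0, \u1edd, etc.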

1 answer:

Answer 0 (score: 0)

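The input file that pandas loaded must be plain ASCII. If it contained a real non-breaking space encoded as UTF-8, the UTF-8 decoder would load those bytes correctly as the character U+00A0. Instead, the file holds the escape sequence \xa0 written out as four ASCII characters. pandas still loads the file without error, but the escaped \xa0 is read literally and is not converted to the Unicode non-breaking space you want.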
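To make the distinction concrete, here is a minimal sketch at the byte level. The variable names are only illustrative, and the short sample string stands in for the comment text from the question.

```python
# A real non-breaking space (U+00A0) is stored in UTF-8 as the two bytes C2 A0.
real_nbsp_bytes = "point.\xa0 It".encode("utf-8")

# The escape written out as text is four ASCII bytes: backslash, 'x', 'a', '0'.
literal_bytes = rb"point.\xa0 It"

decoded_real = real_nbsp_bytes.decode("utf-8")
decoded_literal = literal_bytes.decode("utf-8")

print(repr(decoded_real))     # contains the single character U+00A0
print(repr(decoded_literal))  # still contains a backslash followed by xa0
```

Decoding as UTF-8 only maps bytes to characters. It never interprets backslash escapes, so the second string comes back unchanged.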

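The reason it works when you copy and paste is that Python sees the escape inside a string literal in your source code and interprets it.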
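A small sketch of that difference, using a raw string to stand in for what pandas reads from the file (the names `pasted` and `from_file` are made up for this example):

```python
# Pasted into source code: Python's parser turns \xa0 into one character.
pasted = "point.\xa0 It"

# Read from the CSV: the same four characters arrive as data, not as source
# code, so nothing interprets them. A raw string reproduces that here.
from_file = r"point.\xa0 It"

print(len(pasted), len(from_file))
print(pasted == from_file)
```

The two strings have different lengths and do not compare equal, which is why `print` shows different results for them.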

import pandas as pd
data = {u"label": 0, u"date": u"20120528192215Z", u"comment": u"\"i really don't understand your point.\xa0 It seems that you are mixing apples and oranges.\""}
df = pd.DataFrame(index=[1], data=data)
df.to_csv("/tmp/corpusutf8.csv", index=False, encoding="utf-8")
pd.read_csv("/tmp/corpusutf8.csv")
                                             comment             date  label
0  "i really don't understand your point.  It see...  20120528192215Z      0
df['comment']
1    "i really don't understand your point.  It see...
Name: comment, dtype: object

file /tmp/corpusutf8.csv
/tmp/corpusutf8.csv: UTF-8 Unicode text

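If the CSV was written with \xa0 as literal text and is therefore plain ASCII, pandas loads those characters as they are, even though encoding="utf-8" is specified. ASCII is a subset of UTF-8, so the decoder has nothing to convert.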

cat /tmp/corpusascii.csv
label,date,comment
0,20120528192215Z,"""i really don't understand your point.\xa0 It seems that you are mixing apples and oranges."""
file !$
file /tmp/corpusascii.csv
/tmp/corpusascii.csv: ASCII text
df1 = pd.read_csv("/tmp/corpusascii.csv", encoding="utf-8")
df1
   label             date                                            comment
0      0  20120528192215Z  "i really don't understand your point.\xa0 It ...
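If the file cannot be regenerated as real UTF-8, one possible workaround is to convert the literal escapes after loading. The sketch below is an assumption about what the data needs, not something taken from the question: `decode_escapes` is a hypothetical helper, and it handles only the \xNN and \uNNNN forms listed in the question.

```python
import re

# Matches the literal escape forms mentioned in the question: \xNN and \uNNNN.
_ESCAPE = re.compile(r"\\x([0-9a-fA-F]{2})|\\u([0-9a-fA-F]{4})")


def decode_escapes(text):
    """Replace literal \\xNN and \\uNNNN sequences with the characters they name."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)


sample = r"i really don't understand your point.\xa0 It seems"
fixed = decode_escapes(sample)
print(repr(fixed))
print(decode_escapes(r"\u0111\u01b0\u1edd"))
```

With the DataFrame from the question, this could be applied as `corpus['comment'] = corpus['comment'].map(decode_escapes)`. A regex is used here rather than the `unicode_escape` codec because that codec can garble text that already contains real non-ASCII characters. This sketch does not handle an escaped backslash in front of the sequence, nor surrogate pairs written as two \u escapes.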