Question

I have a data file in which every line ends with \n\n: http://pan.baidu.com/s/1o6jq5q6
My system: win7 + python3.3 + R-3.0.3
In R:

sessionInfo()

[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C                                                   
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936

For Python, the console code page is set with: chcp 936

I can read the file in R.

read.table("test.pandas",sep=",",header=TRUE)

That is simple.

I can also read it in plain Python and get almost the same output.

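# Read the whole file as GBK text and drop the blank lines.
with open("g:\\test.pandas", "r", encoding="gbk") as f:
    lines = [line for line in f.read().splitlines() if line.strip() != ""]
# Print each remaining line with a running index, like a data frame row label.
for idx, line in enumerate(lines):
    print(str(idx) + "," + line)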

But when I read it with the Python module pandas,

import pandas as pd
pd.read_csv("test.pandas",sep=",",encoding="gbk")

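I found two problems in the output:
1) How to align the columns correctly (asked in my other post,
"how to set alignment in pandas in python with non-ANSI characters").
2) Every row of real data is followed by a row of NaN.

Can I improve my pandas code so that the output displays better in the console?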


Answer 1

When read with open('test.pandas', 'rb'), your file appears to contain '\r\n\r\n' as its line terminator. Python 3.3 seems to convert this to '\n\n', whereas Python 2.7 converts it to '\r\n' when the file is read with open('test.pandas', 'r', encoding='gbk').
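To check what a file really contains, it can help to compare its raw bytes with the decoded text. The sketch below does not use the asker's file: it writes a small made-up sample with '\r\n\r\n' terminators to a temporary file (the sample content and the temporary path are assumptions for illustration), then reads it back in binary mode and in text mode under Python 3.

```python
import os
import tempfile

# Made-up sample that mimics the terminators described above (an assumption,
# not the real contents of test.pandas).
sample = "name,sex\r\n\r\nA,M\r\n\r\n".encode("gbk")

with tempfile.NamedTemporaryFile(delete=False, suffix=".pandas") as tmp:
    tmp.write(sample)
    path = tmp.name

# Binary mode shows the bytes exactly as stored on disk.
with open(path, "rb") as raw_file:
    raw = raw_file.read()
print(repr(raw))

# Text mode in Python 3 applies universal-newline translation while decoding.
with open(path, "r", encoding="gbk") as text_file:
    text = text_file.read()
print(repr(text))

os.remove(path)
```

Looking at `repr()` of both results makes the difference visible: the carriage returns are present in the bytes but gone from the decoded text.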

pandas.read_csv does have a lineterminator parameter, but it only accepts a single-character terminator.
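As a small illustration of that parameter, here is a sketch on made-up in-memory data (not the asker's file) in which `~` stands in for the line break. A single character like this works, but a two-character sequence such as '\n\n' cannot be expressed through `lineterminator`.

```python
import pandas as pd
from io import StringIO

# Made-up sample: "~" plays the role of a one-character line terminator.
data = "a,b,c~1,2,3~4,5,6"

# lineterminator takes exactly one character and is used to split the records.
df = pd.read_csv(StringIO(data), lineterminator="~")
print(df)
```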

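What you can do is preprocess the file's contents before passing them to pandas.read_csv(). You can use StringIO to wrap the string buffer in a file-like interface, so that you don't need to write out a temporary file first.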

import pandas as pd
from io import StringIO

with open('test.pandas', 'r', encoding='gbk') as in_file:
    contents = in_file.read().replace('\n\n', '\n')

df = pd.read_csv(StringIO(contents))

(The output below is garbled because I don't have the GBK character set installed.)

>>> df[0:10]
          ??????? ???    ????????
0    HuangTianhui  ??  1948/05/28
1          ??????   ?  1952/03/27
2             ???   ?  1994/12/09
3        LuiChing   ?  1969/08/02
4            ????  ??  1982/03/01
5            ????  ??  1983/08/03
6      YangJiabao   ?  1988/08/25
7  ??????????????  ??  1979/07/10
8          ??????   ?  1949/10/20
9           ???»?   ?  1951/10/21

In Python 2.7, StringIO() lives in the module StringIO rather than io.
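If the blank lines have already been parsed into all-NaN rows, another option is to drop those rows after loading. This is only a sketch on made-up data, not the asker's file: `skip_blank_lines=False` is passed here purely to reproduce the NaN rows on pandas versions that would otherwise skip blank lines, and `dropna(how="all")` removes only the rows in which every column is NaN.

```python
import pandas as pd
from io import StringIO

# Made-up sample with a blank line after each record (an assumption for
# illustration; the names and values are invented).
contents = "name,sex\nA,M\n\nB,F\n\n"

# Keep blank lines so that they show up as all-NaN rows, as in the question.
raw_df = pd.read_csv(StringIO(contents), skip_blank_lines=False)

# Drop rows where every column is NaN, then renumber the index from 0.
clean_df = raw_df.dropna(how="all").reset_index(drop=True)
print(clean_df)
```

Rows that have a real value in at least one column are kept, so genuinely missing fields in otherwise valid records are not lost.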

How to read a file with "\n\n" line endings in the Python module pandas?
