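Update:
The answer by TomNash solves the problem as asked. However, using it on my actual problem led to issues with quoted column names, with missing data, and so on. To avoid this, I followed CJR's suggestion in the comments and simply pickled my DataFrame.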
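For reference, the pickle round trip mentioned in the update might look like the sketch below. It is a minimal illustration rather than the code actually used: the temporary directory, the file name tmp.pkl, and the variable dtypes_match are assumptions introduced here.

```python
import os
import tempfile

import pandas as pd

# Same sample frame as in the question: two string columns, one int column.
df1 = pd.DataFrame([['A', '100', 100], ['B', '200', 200]])

# Write to a temporary directory (an assumption, to avoid touching the
# working directory), then read the frame back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'tmp.pkl')
    df1.to_pickle(path)
    df2 = pd.read_pickle(path)

# Raises AssertionError if values, dtypes, or column labels differ.
pd.testing.assert_frame_equal(df1, df2)

# Explicit per-column dtype comparison.
dtypes_match = list(df1.dtypes) == list(df2.dtypes)
print(dtypes_match)
```

Pickle stores the frame together with its dtypes, so nothing has to be inferred on read. The trade-off is that the file is no longer human-readable, and pickle files should only be loaded from trusted sources.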
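Original question below:
I have a pandas DataFrame in memory. I want to be able to write it to a file (using to_csv) and then read the result into a new DataFrame using read_csv. I want the original DataFrame and the new "from file DataFrame" to have the same data types.
I tried to get this working by using the quoting and quotechar arguments of to_csv and read_csv. However, this does not seem to solve the problem.
I understand that the dtype argument of read_csv can be used to force data types, but that is not feasible in my use case (a large number of auto-generated files for regression testing).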
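For context, forcing a type with the dtype argument could look like the following sketch. The column names label, code, and value and the in-memory CSV text are hypothetical, chosen only to show why the mapping has to be maintained by hand for every file.

```python
import io

import pandas as pd

# Hypothetical CSV text; the column names are made up for illustration.
csv_text = (
    '"label","code","value"\n'
    '"A","100",100\n'
    '"B","200",200\n'
)

# Without dtype, read_csv infers "code" as an integer column.
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype, "code" is kept as strings; "value" is still inferred.
forced = pd.read_csv(io.StringIO(csv_text), dtype={'code': str})

print(inferred['code'].tolist())
print(forced['code'].tolist())
```

This works when the schema is known up front, which is exactly what is missing when many files are generated automatically.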
Full example below.
tmp.py:
import pandas as pd
from csv import QUOTE_NONNUMERIC
import sys
print('Python version information:')
print(sys.version)
print('Pandas version information:')
print(pd.__version__)
df1 = pd.DataFrame([['A', '100', 100], ['B', '200', 200]])
print('df1:')
print(df1.info())
df1.to_csv('tmp.csv', index=False, quoting=QUOTE_NONNUMERIC,
           quotechar='"')
df2 = pd.read_csv('tmp.csv', quoting=QUOTE_NONNUMERIC, quotechar='"')
print('df2:')
print(df2.info())
Output from running tmp.py:
Python version information:
3.7.3 (default, Jun 11 2019, 01:11:15)
[GCC 6.3.0 20170516]
Pandas version information:
0.24.2
df1:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
0 2 non-null object
1 2 non-null object
2 2 non-null int64
dtypes: int64(1), object(2)
memory usage: 128.0+ bytes
None
df2:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
0 2 non-null object
1 2 non-null float64
2 2 non-null float64
dtypes: float64(2), object(1)
memory usage: 128.0+ bytes
None
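For column 1, the dtype is object for df1, whereas it is float64 for df2. For column 2, the type is int64 for df1, whereas it is float64 for df2. As stated by the csv module, csv.QUOTE_NONNUMERIC "instructs the reader to convert all non-quoted fields to type float." The contents of tmp.csv are below. Note that the second column is quoted, so I would expect read_csv to give me an object.
tmp.csv: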
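To check the quoting directly, the file can be regenerated and its raw lines printed with a sketch like the one below. It assumes a temporary directory instead of the working directory, and raw_lines is a name introduced here for illustration.

```python
import os
import tempfile
from csv import QUOTE_NONNUMERIC

import pandas as pd

# Same sample frame as in tmp.py.
df1 = pd.DataFrame([['A', '100', 100], ['B', '200', 200]])

# Write the CSV with the same arguments as tmp.py, then read it back as
# plain text so no parsing or type inference is involved.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'tmp.csv')
    df1.to_csv(path, index=False, quoting=QUOTE_NONNUMERIC, quotechar='"')
    with open(path) as handle:
        raw_lines = handle.read().splitlines()

# One header line followed by one line per row.
for line in raw_lines:
    print(line)
```

Reading the file as text shows which fields were quoted on write, independently of how read_csv later interprets them.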
Answer 0 (score: 3)
Try using QUOTE_NONE on the read; it keeps the data types the same between write and read.
Using the original dataset with int64:
import pandas as pd
from csv import QUOTE_NONNUMERIC, QUOTE_NONE
import sys
print('Python version information:')
print(sys.version)
print('Pandas version information:')
print(pd.__version__)
df1 = pd.DataFrame([['A', '100', 100], ['B', '200', 200]])
print('df1:')
print(df1.info())
df1.to_csv('tmp.csv', index=False, quoting=QUOTE_NONNUMERIC, quotechar='"')
df2 = pd.read_csv('tmp.csv', quoting=QUOTE_NONE).replace('"','', regex=True)
print('df2:')
print(df2.info())
Result:
Python version information:
3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)]
Pandas version information:
0.24.2
df1:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
0 2 non-null object
1 2 non-null object
2 2 non-null int64
dtypes: int64(1), object(2)
memory usage: 128.0+ bytes
None
df2:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
0 2 non-null object
1 2 non-null object
2 2 non-null int64
dtypes: int64(1), object(2)
memory usage: 128.0+ bytes
None
With float64 values in the input:
import pandas as pd
from csv import QUOTE_NONNUMERIC, QUOTE_NONE
import sys
print('Python version information:')
print(sys.version)
print('Pandas version information:')
print(pd.__version__)
df1 = pd.DataFrame([['A', '100', 100.1], ['B', '200', 200.2]])
print('df1:')
print(df1.info())
df1.to_csv('tmp.csv', index=False, quoting=QUOTE_NONNUMERIC, quotechar='"')
df2 = pd.read_csv('tmp.csv', quoting=QUOTE_NONE).replace('"','', regex=True)
print('df2:')
print(df2.info())
Result:
Python version information:
3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)]
Pandas version information:
0.24.2
df1:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
0 2 non-null object
1 2 non-null object
2 2 non-null float64
dtypes: float64(1), object(2)
memory usage: 128.0+ bytes
None
df2:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
0 2 non-null object
1 2 non-null object
2 2 non-null float64
dtypes: float64(1), object(2)
memory usage: 128.0+ bytes
None