Question

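I have a recurring problem with saving large numbers to csv in Python. The numbers are millisecond epoch timestamps, which I cannot convert or truncate; they have to be saved in this format. Because the columns with the millisecond timestamps also contain some NaN values, pandas automatically casts them to float (see the documentation under "Support for integer NA").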

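I cannot seem to avoid this behaviour, so my question is: how can I save these numbers as integer values, i.e. without a decimal point or trailing zeros, when using df.to_csv? I have columns of floats with different precision in the same dataframe and I do not want to lose that information. Using the float_format parameter of to_csv seems to apply the same format to all float columns in my dataframe.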

An example:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'a':[1.25, 2.54], 'b':[1424380449437, 1425510731187]})
>>> df['b'].dtype
Out[1]: dtype('int64')
>>> df.loc[2] = np.nan
>>> df
Out[1]: 
       a             b
0   1.25  1.424380e+12
1   2.54  1.425511e+12
2    NaN           NaN
>>> df['b'].dtype
dtype('float64')
>>> df.to_csv('test.csv')
>>> with open ('test.csv') as f:
...     for line in f:
...         print(line)
,a,b
0,1.25,1.42438044944e+12
1,2.54,1.42551073119e+12
2,,

As you can see, I lost the precision of the last two digits of my epoch timestamps.
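A minimal sketch of the cast described above, assuming only pandas and NumPy; the variable names are illustrative. It suggests that the timestamps themselves survive the conversion to float64 (integers up to 2**53 are exactly representable), so the digits appear to be lost when the floats are formatted as text rather than when they are stored.

```python
import numpy as np
import pandas as pd

timestamps = [1424380449437, 1425510731187]

# Without missing values pandas keeps the integers as int64.
as_int = pd.Series(timestamps)

# A single NaN forces the whole column to float64.
as_float = pd.Series(timestamps + [np.nan])

# float64 represents every integer up to 2**53 exactly, so converting
# the non-missing values back to int recovers the original timestamps.
recovered = [int(v) for v in as_float.dropna()]

print(as_int.dtype, as_float.dtype, recovered)
```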

Answer 1

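While DataFrame.to_csv has no parameter to change the format of individual columns, DataFrame.to_string does. It is a bit cumbersome and may be a problem for very large DataFrames, but you can use it to produce a properly formatted string and then write that string to a file (as suggested in this answer to a similar question). The formatters parameter of to_string takes, for example, a dictionary of functions to format individual columns. In your case, you could write your own custom formatter for the "b" column and leave the defaults for the other columns. This formatter might look something like this:

def printInt(b):
    if pd.isnull(b):
        return "NaN"
    else:
        return "{:d}".format(int(b))

Now you can use it to produce your string:

df.to_string(formatters={"b": printInt}, na_rep="NaN")

which gives:

' a b\n0 1.25 1424380449437\n1 2.54 1425510731187\n2 NaN NaN'

You can see that there is still a problem: this is not comma-separated, and to_string has no parameter to set a custom delimiter. This can easily be fixed with a regular expression:

import re
re.sub("[ \t]+(NaN)?", ",",
       df.to_string(formatters={"b": printInt}, na_rep="NaN"))

which gives:

',a,b\n0,1.25,1424380449437\n1,2.54,1425510731187\n2,,'

This can now be written to a file:

with open("/tmp/test.csv", "w") as f:
    print(re.sub("[ \t]+(NaN)?", ",",
                 df.to_string(formatters={"b": printInt}, na_rep="NaN")),
          file=f)

resulting in what you wanted:

,a,b
0,1.25,1424380449437
1,2.54,1425510731187
2,,

If you want to keep the NaN in the csv file, you only need to change the regular expression:

with open("/tmp/test.csv", "w") as f:
    print(re.sub("[ \t]+", ",",
                 df.to_string(formatters={"b": printInt}, na_rep="NaN")),
          file=f)

which will give:

,a,b
0,1.25,1424380449437
1,2.54,1425510731187
2,NaN,NaN

If your DataFrame contains strings with whitespace, a robust solution is not that easy. You could insert another character in front of every value to mark the start of the next entry. If all strings contain only single spaces, you could use an additional space, for example. This changes the code to:

import pandas as pd
import numpy as np
import re

df = pd.DataFrame({'a a':[1.25, 2.54], 'b':[1424380449437, 1425510731187]})
df.loc[2] = np.nan

# Each formatter prepends a space so that entries are separated by at
# least two whitespace characters, while column names contain only one.
def printInt(b):
    if pd.isnull(b):
        return " NaN"
    else:
        return " {:d}".format(int(b))

def printFloat(a):
    if pd.isnull(a):
        return " NaN"
    else:
        return " {}".format(a)

# Only runs of two or more whitespace characters become commas.
with open("/tmp/test.csv", "w") as f:
    print(re.sub("[ \t][ \t]+", ",",
                 df.to_string(formatters={"a a": printFloat, "b": printInt},
                              na_rep="NaN", col_space=2)),
          file=f)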
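As a possible alternative to post-processing the output of to_string, the timestamp column could be turned into strings before calling to_csv, which sidesteps the whitespace problem entirely. This is only a sketch and not part of the answer above: the helper name format_int_or_blank is hypothetical, and writing to an in-memory buffer is just a convenience for inspecting the result.

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame({"a a": [1.25, 2.54, np.nan],
                   "b": [1424380449437, 1425510731187, np.nan]})

def format_int_or_blank(value):
    """Hypothetical helper: integer text for numbers, empty text for NaN."""
    return "" if pd.isnull(value) else "{:d}".format(int(value))

# Work on a copy so the numeric column in df stays untouched.
out = df.copy()
out["b"] = out["b"].map(format_int_or_blank)

# to_csv writes the pre-formatted strings as they are; the float column
# "a a" keeps its default formatting.
buffer = io.StringIO()
out.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file path to to_csv instead of the buffer would write the same text to disk. The trade-off is that column "b" in the copy holds strings rather than numbers, so this step belongs right before export.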

Answer 2

Maybe this could work:

import threading

def func(inp, res):
    # Append the characters of inp, as a list, to the shared result list.
    res.append(list(inp))
    return res

train_set = ['c++', 'python']
result = []

# Start one thread per input string, then wait for all of them to finish.
threads = [threading.Thread(target=func, args=(s, result)) for s in train_set]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(result)

Your output should look like this (I'm on a Mac):

Answer 3

I had the same problem with large numbers. This is the correct approach for Excel files: df = "\t" + df
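A rough sketch of what this one-liner seems to intend, assuming the values are first converted to text; the intermediate names are illustrative. The idea appears to be that a leading tab makes a spreadsheet program treat the cell as text instead of reformatting the number, but that behaviour is an assumption here and is not verified by the code.

```python
import io

import pandas as pd

df = pd.DataFrame({"b": [1424380449437, 1425510731187]})

# Convert every cell to text and put a tab character in front of it.
as_text = df.astype(str).apply(lambda col: col.map(lambda v: "\t" + v))

buffer = io.StringIO()
as_text.to_csv(buffer)
csv_text = buffer.getvalue()
print(repr(csv_text))
```

Note that any other program reading the file will see the tab as part of the value, so this only suits files meant to be opened in a spreadsheet.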

Python pandas: large floats with to_csv
