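Setting a string dtype on a pandas.DataFrame (not file-based)

Date: 2016-09-20 18:00:37

Tags: python pandas numpy

I'm having trouble using the pandas.DataFrame constructor together with its dtype argument. I want to keep the string values, but the following snippet always converts to a numeric type and then produces NaNs.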

from __future__ import unicode_literals
from __future__ import print_function


import numpy as np
import pandas as pd


def main():
    columns = ['great', 'good', 'average', 'bad', 'horrible']
    # minimal example, dates are coming (as strings) from some
    # non-file source.
    example_data = {
        'alice': ['', '', '', '2016-05-24', ''],
        'bob': ['', '2015-01-02', '', '', '2012-09-15'],
        'eve': ['2011-12-31', '', '1998-08-13', '', ''],
    }

    # first pass, yields dataframe full of NaNs
    df = pd.DataFrame(data=example_data, index=example_data.keys(),
        columns=columns, dtype=str) #or string, 'str', 'string', 'object'
    print(df.dtypes)
    print(df)
    print()

    # based on https://github.com/pydata/pandas/blob/master/pandas/core/frame.py
    # and https://github.com/pydata/pandas/blob/37f95cef85834207db0930e863341efb285e38a2/pandas/types/common.py
    # we're ultimately feeding dtype to numpy's dtype, so let's just use that:
    #     (using np.dtype('S10') and converting to str doesn't work either)
    df = pd.DataFrame(data=example_data, index=example_data.keys(),
        columns=columns, dtype=np.dtype('U'))
    print(df.dtypes)
    print(df) # still full of NaNs... =(



if __name__ == '__main__':
    main()

Which dtype values will preserve the strings in the DataFrame?

For reference:

$ python --version
2.7.12
$ pip2 list | grep pandas
pandas (0.18.1)
$ pip2 list | grep numpy
numpy (1.11.1)

2 answers:

Answer 0: (score: 1)

For the specific case in the OP, you can use the DataFrame.from_dict() constructor (see also the Alternate Constructors section of the DataFrame documentation).

from __future__ import unicode_literals
from __future__ import print_function

import pandas as pd

columns = ['great', 'good', 'average', 'bad', 'horrible']
example_data = {
    'alice': ['', '', '', '2016-05-24', ''],
    'bob': ['', '2015-01-02', '', '', '2012-09-15'],
    'eve': ['2011-12-31', '', '1998-08-13', '', ''],
}
df = pd.DataFrame.from_dict(example_data, orient='index')
df.columns = columns

print(df.dtypes)
# great       object
# good        object
# average     object
# bad         object
# horrible    object
# dtype: object

print(df)
#             great        good     average         bad    horrible
# bob                2015-01-02                          2012-09-15
# eve    2011-12-31              1998-08-13                        
# alice                                      2016-05-24     

You can even pass dtype=str to DataFrame.from_dict(), although it is not necessary in this example.
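
As a minimal sketch of that variant, the call below passes dtype=str explicitly. It also uses the columns keyword of from_dict, which is an assumption about the installed version (it is available from pandas 0.23 onward, so not in the 0.18.1 release mentioned in the question); on older versions, assign df.columns afterwards as in the snippet above.

```python
import pandas as pd

columns = ['great', 'good', 'average', 'bad', 'horrible']
example_data = {
    'alice': ['', '', '', '2016-05-24', ''],
    'bob': ['', '2015-01-02', '', '', '2012-09-15'],
    'eve': ['2011-12-31', '', '1998-08-13', '', ''],
}

# orient='index' treats each dict key as a row label.
# Assumption: the `columns` keyword requires pandas >= 0.23.
df = pd.DataFrame.from_dict(example_data, orient='index',
                            dtype=str, columns=columns)

print(df.loc['bob', 'good'])    # a date string, not NaN
print(df.isna().any().any())    # whether any cell became NaN
```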

Edit: The DataFrame constructor interprets a dictionary as a collection of columns:

print(pd.DataFrame(example_data))

#         alice         bob         eve
# 0                          2011-12-31
# 1              2015-01-02            
# 2                          1998-08-13
# 3  2016-05-24                        
# 4              2012-09-15            

(I dropped data=, because data is the first parameter in the function signature.) Your code confuses rows and columns:

print(pd.DataFrame(example_data, index=example_data.keys(), columns=columns))

#       great good average  bad horrible
# alice   NaN  NaN     NaN  NaN      NaN
# bob     NaN  NaN     NaN  NaN      NaN
# eve     NaN  NaN     NaN  NaN      NaN   

(Although I am not entirely sure how it ended up giving you a DataFrame of NaNs.)
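
One likely explanation, offered as a sketch and not as a statement about the internals of that pandas release: when data is a dict, the columns argument selects dictionary keys by label instead of renaming them. A label that is not a key yields a column with no data, which is filled with NaN. The two-element column list below is made up for the illustration.

```python
import pandas as pd

example_data = {
    'alice': ['', '', '', '2016-05-24', ''],
    'bob': ['', '2015-01-02', '', '', '2012-09-15'],
    'eve': ['2011-12-31', '', '1998-08-13', '', ''],
}

# 'alice' is a key of the dict, 'great' is not.
df = pd.DataFrame(example_data, columns=['alice', 'great'])

print(df['alice'].tolist())      # the original strings
print(df['great'].isna().all())  # no matching key, so this column is all NaN
```

In the question's code none of the five requested column labels is a key of the dict, so every column ends up empty regardless of the dtype that was passed.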

The correct call would be:
print(pd.DataFrame(example_data, columns=example_data.keys(), index=columns))

#                alice         bob         eve
# great                             2011-12-31
# good                  2015-01-02            
# average                           1998-08-13
# bad       2016-05-24                        
# horrible              2012-09-15   

There is actually no need to specify the column names; they are already taken from the dictionary:

print(pd.DataFrame(example_data, index=columns))

#                alice         bob         eve
# great                             2011-12-31
# good                  2015-01-02            
# average                           1998-08-13
# bad       2016-05-24                        
# horrible              2012-09-15                     

What you want is actually the transpose, so you can simply take the transpose:

print(pd.DataFrame(data=example_data, index=columns).T)

#             great        good     average         bad    horrible
# alice                                      2016-05-24            
# bob                2015-01-02                          2012-09-15
# eve    2011-12-31              1998-08-13               
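
To check that the two approaches in this answer agree, the sketch below builds the frame both ways and compares labels and cell values. The variable names are illustrative only.

```python
import pandas as pd

columns = ['great', 'good', 'average', 'bad', 'horrible']
example_data = {
    'alice': ['', '', '', '2016-05-24', ''],
    'bob': ['', '2015-01-02', '', '', '2012-09-15'],
    'eve': ['2011-12-31', '', '1998-08-13', '', ''],
}

# Approach 1: dict keys become rows directly.
via_from_dict = pd.DataFrame.from_dict(example_data, orient='index')
via_from_dict.columns = columns

# Approach 2: dict keys become columns, then transpose.
via_transpose = pd.DataFrame(example_data, index=columns).T

# Align the row order before comparing, since older Python versions
# do not guarantee dict ordering.
via_from_dict = via_from_dict.loc[via_transpose.index]

same = via_from_dict.values.tolist() == via_transpose.values.tolist()
print(same)
```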

Answer 1: (score: 0)

This is not a proper answer, but while you wait for others to respond, I noticed that everything works fine with the read_csv function.

So if you put your data in a .csv file named myData.csv, like this:

great,good,average,bad,horrible
alice,,,,2016-05-24,
bob,,2015-01-02,,,2012-09-15
eve,2011-12-31,,1998-08-13,,

and do

df = pd.read_csv('blablah/myData.csv')

it will keep the strings as they are!

If needed, you can represent the empty values in the csv file as spaces, or as any other character or marker.
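
One caveat worth checking: by default read_csv parses empty fields as NaN, so the empty cells would not come back as empty strings. The sketch below assumes the same CSV content, read from an in-memory buffer instead of a file on disk, and uses keep_default_na=False to turn off that conversion. Because the header has one field fewer than the data rows, pandas uses the first column as the index.

```python
import io

import pandas as pd

csv_text = """great,good,average,bad,horrible
alice,,,,2016-05-24,
bob,,2015-01-02,,,2012-09-15
eve,2011-12-31,,1998-08-13,,
"""

# keep_default_na=False stops pandas from treating '' as a missing value.
df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(df.loc['bob', 'good'])
print(df.isna().any().any())
```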