Why do some values in a pandas DataFrame change from strings to integers after I manually add a column to the DataFrame?

Asked: 2018-06-14 06:13:19

Tags: python python-3.x pandas

I have created a pandas DataFrame from a CSV file, like this:

import pandas as pd
data = pd.read_csv('dataset.csv')

It has a column named CLASS. These are the values contained in CLASS:

from collections import Counter
CLASS = data['CLASS']
Counter(CLASS)
Out [1]: Counter({'1': 60783, '2': 37313, '3': 2564, '4': 959, ' ': 346, 'D': 27})

Now I manually add a column to the DataFrame and save it to a new CSV:

data['DURATION'] = DURATION
data.to_csv('new_dataset.csv')

Then, when I open the new CSV and check the values in CLASS, some of them have become integers!

dataset = pd.read_csv('new_dataset.csv')
CLASS = dataset['CLASS']
Counter(CLASS)
Out [1]: Counter({' ': 346,
                  1: 48418,
                  '1': 12365,
                  2: 16189,
                  '2': 21124,
                  3: 848,
                  '3': 1716,
                  4: 81,
                  '4': 878,
                  'D': 43})

Why does this happen? It causes a problem, because I can no longer plot CLASS or make a histogram of it, which I was able to do before:

import matplotlib.pyplot as plt
plt.plot(CLASS)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-158-b6bafcfd7ad5> in <module>()
      1 import matplotlib.pyplot as plt
----> 2 plt.plot(CLASS)

c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\pyplot.py in plot(*args, **kwargs)
   3356                       mplDeprecation)
   3357     try:
-> 3358         ret = ax.plot(*args, **kwargs)
   3359     finally:
   3360         ax._hold = washold

c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\__init__.py in inner(ax, *args, **kwargs)
   1853                         "the Matplotlib list!)" % (label_namer, func.__name__),
   1854                         RuntimeWarning, stacklevel=2)
-> 1855             return func(ax, *args, **kwargs)
   1856 
   1857         inner.__doc__ = _add_data_doc(inner.__doc__,

c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\axes\_axes.py in plot(self, *args, **kwargs)
   1526 
   1527         for line in self._get_lines(*args, **kwargs):
-> 1528             self.add_line(line)
   1529             lines.append(line)
   1530 

c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\axes\_base.py in add_line(self, line)
   1930             line.set_clip_path(self.patch)
   1931 
-> 1932         self._update_line_limits(line)
   1933         if not line.get_label():
   1934             line.set_label('_line%d' % len(self.lines))

c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\axes\_base.py in _update_line_limits(self, line)
   1952         Figures out the data limit of the given line, updating self.dataLim.
   1953         """
-> 1954         path = line.get_path()
   1955         if path.vertices.size == 0:
   1956             return

c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\lines.py in get_path(self)
    949         """
    950         if self._invalidy or self._invalidx:
--> 951             self.recache()
    952         return self._path
    953 

c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\lines.py in recache(self, always)
    655         if always or self._invalidy:
    656             yconv = self.convert_yunits(self._yorig)
--> 657             y = _to_unmasked_float_array(yconv).ravel()
    658         else:
    659             y = self._y

c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\cbook\__init__.py in _to_unmasked_float_array(x)
   2048         return np.ma.asarray(x, float).filled(np.nan)
   2049     else:
-> 2050         return np.asarray(x, float)
   2051 
   2052 

c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\numpy\core\numeric.py in asarray(a, dtype, order)
    490 
    491     """
--> 492     return array(a, dtype, copy=False, order=order)
    493 
    494 

ValueError: could not convert string to float: 


Edit: adding the first 20 rows of the 2 relevant columns from the dataset:

DURATION    CLASS
10          1
14          1
-1          1
-1          1
0           1
-1          1
14          2
8           2
-1          1
14          3
-1          3
-1          
-1          4
-1          4
-1          3
8           1
-1          2
-1          2
-1          1

Edit 2: output of print(dataset['CLASS'].value_counts()):

import pandas as pd
dataset = pd.read_csv('dataset.csv', dtype={'CLASS': str})
print(dataset['CLASS'].value_counts())

1    48418
2    21124
2    16189
1    12365
3     1716
4      878
3      848
       346
4       81
D       43
Name: CLASS, dtype: int64

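Edit 3: the blank elements are not what breaks plotting. With the original data, the plot shows the blank value as a point on the x-axis:

[Figure: plot of the original data, with the blank value appearing as a point on the x-axis]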

1 Answer:

Answer 0 (score: 0)

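Pandas tries to detect the data type of each column, but this sometimes fails, as you noticed. You can force the data type of a column in read_csv like this: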

dataset = pd.read_csv('new_dataset.csv', dtype={'CLASS': str})
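
One plausible cause, though not confirmed for this dataset: the pandas documentation for the low_memory option of read_csv notes that the file is processed internally in chunks, which can result in mixed type inference within a single column. Passing dtype avoids that inference. The following is a minimal sketch of the idea, assuming a small made-up CSV held in memory (csv_text is a hypothetical stand-in, not the real new_dataset.csv). It also shows astype(str) as a way to normalize a column that has already been loaded with mixed values.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical miniature stand-in for new_dataset.csv (not the asker's real data).
csv_text = "DURATION,CLASS\n10,1\n14,2\n-1, \n8,D\n-1,1\n"

# Force CLASS to be parsed as strings so no value is inferred as an integer.
dataset = pd.read_csv(io.StringIO(csv_text), dtype={"CLASS": str})
counts = Counter(dataset["CLASS"])
print(counts)

# For a column that is already in memory with mixed int/str values,
# astype(str) converts every element to a string.
mixed = pd.Series([1, "1", 2, "2", " ", "D"], dtype=object)
normalized = mixed.astype(str)
print(Counter(normalized))
```

After either step, the integer 1 and the string '1' are counted under the same key, so Counter and value_counts no longer report them as separate classes.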