I have created a pandas DataFrame from a CSV file, like this:
import pandas as pd
data = pd.read_csv('dataset.csv')
It has a column named CLASS. These are the values in CLASS:
from collections import Counter
CLASS = data['CLASS']
Counter(CLASS)
Out [1]: Counter({'1': 60783, '2': 37313, '3': 2564, '4': 959, ' ': 346, 'D': 27})
Now I manually add a column to the DataFrame and save it to a new CSV:
data['DURATION'] = DURATION
data.to_csv('new_dataset.csv')
Then, when I open the new CSV and check the values in CLASS, some of them have become integers!
dataset = pd.read_csv('new_dataset.csv')
CLASS = dataset['CLASS']
Counter(CLASS)
Out [1]: Counter({' ': 346,
1: 48418,
'1': 12365,
2: 16189,
'2': 21124,
3: 848,
'3': 1716,
4: 81,
'4': 878,
'D': 43})
Why does this happen? It causes a problem because I can no longer plot CLASS or make a histogram of it, whereas I could before:
import matplotlib.pyplot as plt
plt.plot(CLASS)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-158-b6bafcfd7ad5> in <module>()
1 import matplotlib.pyplot as plt
----> 2 plt.plot(CLASS)
c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\pyplot.py in plot(*args, **kwargs)
3356 mplDeprecation)
3357 try:
-> 3358 ret = ax.plot(*args, **kwargs)
3359 finally:
3360 ax._hold = washold
c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\__init__.py in inner(ax, *args, **kwargs)
1853 "the Matplotlib list!)" % (label_namer, func.__name__),
1854 RuntimeWarning, stacklevel=2)
-> 1855 return func(ax, *args, **kwargs)
1856
1857 inner.__doc__ = _add_data_doc(inner.__doc__,
c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\axes\_axes.py in plot(self, *args, **kwargs)
1526
1527 for line in self._get_lines(*args, **kwargs):
-> 1528 self.add_line(line)
1529 lines.append(line)
1530
c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\axes\_base.py in add_line(self, line)
1930 line.set_clip_path(self.patch)
1931
-> 1932 self._update_line_limits(line)
1933 if not line.get_label():
1934 line.set_label('_line%d' % len(self.lines))
c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\axes\_base.py in _update_line_limits(self, line)
1952 Figures out the data limit of the given line, updating self.dataLim.
1953 """
-> 1954 path = line.get_path()
1955 if path.vertices.size == 0:
1956 return
c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\lines.py in get_path(self)
949 """
950 if self._invalidy or self._invalidx:
--> 951 self.recache()
952 return self._path
953
c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\lines.py in recache(self, always)
655 if always or self._invalidy:
656 yconv = self.convert_yunits(self._yorig)
--> 657 y = _to_unmasked_float_array(yconv).ravel()
658 else:
659 y = self._y
c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\matplotlib\cbook\__init__.py in _to_unmasked_float_array(x)
2048 return np.ma.asarray(x, float).filled(np.nan)
2049 else:
-> 2050 return np.asarray(x, float)
2051
2052
c:\users\h473\appdata\local\programs\python\python35\lib\site-packages\numpy\core\numeric.py in asarray(a, dtype, order)
490
491 """
--> 492 return array(a, dtype, copy=False, order=order)
493
494
ValueError: could not convert string to float:
Edit: Adding the first 20 rows of the 2 relevant columns from the dataset:
DURATION CLASS
10 1
14 1
-1 1
-1 1
0 1
-1 1
14 2
8 2
-1 1
14 3
-1 3
-1
-1 4
-1 4
-1 3
8 1
-1 2
-1 2
-1 1
Edit 2: Output of print(dataset['CLASS'].value_counts()):
import pandas as pd
dataset = pd.read_csv('dataset.csv', dtype={'CLASS': str})
print(dataset['CLASS'].value_counts())
1 48418
2 21124
2 16189
1 12365
3 1716
4 878
3 848
346
4 81
D 43
Name: CLASS, dtype: int64
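Edit 3: The blank elements are not a problem for plotting, as shown in the figure below, where the original data shows a blank point on the x-axis: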
Answer 0 (score: 0)
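Pandas tries to detect the data type of each column, but sometimes it fails, as you noticed. You can force the data type of a column in read_csv like this: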
dataset = pd.read_csv('new_dataset.csv', dtype={'CLASS': str})
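To check that forcing the dtype has the intended effect, here is a minimal sketch. It assumes a small in-memory CSV (the hypothetical `csv_text` below) standing in for new_dataset.csv, with a CLASS column that mixes digits, a blank, and 'D' as in the question. It is an illustration, not a reproduction of the original data.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical stand-in for new_dataset.csv (not the asker's real data):
# CLASS mixes digit-like values, a single blank, and the letter 'D'.
csv_text = "DURATION,CLASS\n10,1\n14,2\n-1, \n8,D\n-1,3\n-1,1\n"

# Force CLASS to be parsed as strings so '1' is never turned into the integer 1.
dataset = pd.read_csv(io.StringIO(csv_text), dtype={'CLASS': str})

# Count the categories; all keys should now be strings.
counts = Counter(dataset['CLASS'])
print(counts)

# Verify that no value in the column was parsed as a number.
all_strings = all(isinstance(v, str) for v in dataset['CLASS'])
print(all_strings)
```

With every value stored as a string, '1' and 1 no longer show up as separate categories. For a histogram of the categories, one option is to plot the counts instead of the raw column, for example with dataset['CLASS'].value_counts().plot(kind='bar').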