If I read just one chunk of the CSV, I get the following data structure:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (2015-11-01 00:00:00, 4980770) to (2016-06-01 00:00:00, 8850573)
Data columns (total 5 columns):
CHANNEL 100000 non-null category
MCC 92660 non-null category
DOMESTIC_FLAG 100000 non-null category
AMOUNT 100000 non-null float32
CNT 100000 non-null uint8
dtypes: category(3), float32(1), uint8(1)
memory usage: 1.9+ MB
If I read the whole CSV and concatenate the chunks as above, I get the following structure:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 30345312 entries, (2015-11-01 00:00:00, 4980770) to (2015-08-01 00:00:00, 88838)
Data columns (total 5 columns):
CHANNEL object
MCC float64
DOMESTIC_FLAG category
AMOUNT float32
CNT uint8
dtypes: category(1), float32(1), float64(1), object(1), uint8(1)
memory usage: 784.6+ MB
Why are the categorical variables changed to object / float64? How can I avoid this type change, especially the change to float64?
The process function used in the concatenation code only does some cleaning and type assignment.
Answer 0 (score: 1)
Consider the following example DataFrames:
In [93]: df1
Out[93]:
A B
0 a a
1 b b
2 c c
3 a a
In [94]: df2
Out[94]:
A B
0 b b
1 c c
2 d d
3 e e
In [95]: df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
A 4 non-null object
B 4 non-null category
dtypes: category(1), object(1)
memory usage: 140.0+ bytes
In [96]: df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
A 4 non-null object
B 4 non-null category
dtypes: category(1), object(1)
memory usage: 148.0+ bytes
Note: the two DFs have different categories:
In [97]: df1.B.cat.categories
Out[97]: Index(['a', 'b', 'c'], dtype='object')
In [98]: df2.B.cat.categories
Out[98]: Index(['b', 'c', 'd', 'e'], dtype='object')
When we concatenate them, pandas does not merge the categories; it creates an object column instead:
In [99]: m = pd.concat([df1, df2])
In [100]: m.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8 entries, 0 to 3
Data columns (total 2 columns):
A 8 non-null object
B 8 non-null object
dtypes: object(2)
memory usage: 192.0+ bytes
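The same fallback would also account for the float64 column in the question. The following is a minimal sketch, not the asker's code: `chunk1` and `chunk2` are hypothetical stand-ins for two processed chunks, and the values are made up. It assumes that when the categories differ, pandas falls back to the dtype of the underlying category values, so string categories come back as a plain (non-categorical) column and numeric categories that contain missing values come back as float64.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for two processed chunks (illustrative values only).
chunk1 = pd.DataFrame(
    {"CHANNEL": ["web", "pos"], "MCC": [5411.0, np.nan]}
).astype("category")
chunk2 = pd.DataFrame(
    {"CHANNEL": ["atm", "pos"], "MCC": [7011.0, 5411.0]}
).astype("category")

# Each chunk is categorical, but the category sets differ between chunks.
print(chunk1.dtypes)
print(chunk2.dtypes)

# Concatenating drops the categorical dtype for both columns:
# CHANNEL falls back to a plain string column, MCC to float64.
merged = pd.concat([chunk1, chunk2], ignore_index=True)
print(merged.dtypes)
```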
But if we concatenate two DFs that have the same categories, everything works as expected:
In [102]: m = pd.concat([df1.sample(frac=.5), df1.sample(frac=.5)])
In [103]: m
Out[103]:
A B
3 a a
0 a a
3 a a
2 c c
In [104]: m.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 3 to 2
Data columns (total 2 columns):
A 4 non-null object
B 4 non-null category
dtypes: category(1), object(1)
memory usage: 92.0+ bytes
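Since identical categories survive concatenation, one way to avoid the type change is to give every frame the same category set before calling `pd.concat`. The sketch below is one possible approach under that assumption, not necessarily the only one: it rebuilds `df1` and `df2` from the example above, uses `pandas.api.types.union_categoricals` to collect the combined categories, and re-casts column `B` in each frame with `cat.set_categories`.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Rebuild the two example frames: A is a plain column, B is categorical.
df1 = pd.DataFrame({"A": list("abca")})
df1["B"] = df1["A"].astype("category")
df2 = pd.DataFrame({"A": list("bcde")})
df2["B"] = df2["A"].astype("category")

# Collect the union of the categories used by both frames.
all_categories = union_categoricals([df1["B"], df2["B"]]).categories

# Re-cast B in each frame so that both share exactly the same categories.
frames = [
    df.assign(B=df["B"].cat.set_categories(all_categories))
    for df in (df1, df2)
]

# With identical categories, concat keeps the categorical dtype for B.
m = pd.concat(frames, ignore_index=True)
print(m.dtypes)
```

When reading a large CSV in chunks, the full set of categories is usually not known until all chunks have been seen, so this may require either a first pass to collect the categories or a category list that is known in advance.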