Question

I have a CSV file that I read with pandas, and I want to split the DataFrame into chunks based on the values in a given column:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

list_of_classes=[]
# Reading file
fileName = 'Training.csv'
df       = pd.read_csv(fileName)
classID  = df.iloc[:,-2]
len(classID)
df.iloc[0,-2]
for i in range(len(classID)):
    print(classID[i])
    if classID[i] not in list_of_classes:
        list_of_classes.append(classID[i])


for i in range(len(df)):
  ...............................

Update

Suppose the DataFrame looks like this:

........................................
Feature0  Feature1  Feature2  Feature3  ......... classID lastColum 


 190       565     35474  0.336283   2.973684       255         0   
 311       984    113199  0.316057   3.163987       155         0   
 310       984     94197  0.315041   3.174194      1005         0   
 280       984    116359  0.284553   3.514286       255        18   
 249       984    107482  0.253049   3.951807      1005         0   
 283       984    132343  0.287602   3.477032       155         0   
 213       984     88244  0.216463   4.619718       255         0   
 839       984    203139  0.852642   1.172825       255         0   
 376       984    105133  0.382114   2.617021      1005         0   
 324       984    129209  0.329268   3.037037      1005         0

In this example, the result I want is 3 DataFrames, each containing only one classID: 155, 1005, or 255. My question is: is there a better way to do this?

Answer 1

Split into 3 separate CSV files:

df.groupby('classID') \
  .apply(lambda x: x.to_csv(r'c:/temp/{}.csv'.format(x.name), index=False))
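
A plain loop over the groups should give the same result without relying on `apply` purely for its side effects. This is a minimal sketch, assuming a small stand-in DataFrame and a temporary output directory; both are illustrative and not part of the original data:

```python
import os
import tempfile

import pandas as pd

# Stand-in data; the real frame comes from Training.csv in the question.
df = pd.DataFrame({
    "Feature0": [190, 311, 310, 280],
    "classID": [255, 155, 1005, 255],
    "lastColum": [0, 0, 0, 18],
})

# Hypothetical output location; replace with the directory you need.
out_dir = tempfile.mkdtemp()

written = {}
for class_id, group in df.groupby("classID"):
    # One file per class, named after the class ID
    path = os.path.join(out_dir, "{}.csv".format(class_id))
    group.to_csv(path, index=False)
    written[class_id] = path
```

The `written` dictionary is only there to keep track of which file belongs to which class; it can be dropped if the paths are not needed afterwards.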

Build a dictionary of the split DataFrames:

In [210]: dfs = {g:x for g,x in df.groupby('classID')}

In [211]: dfs.keys()
Out[211]: dict_keys([155, 255, 1005])

In [212]: dfs[155]
Out[212]:
   Feature0  Feature1  Feature2  Feature3  classID  lastColum
1       311       984    113199  0.316057      155          0
5       283       984    132343  0.287602      155          0

In [213]: dfs[255]
Out[213]:
   Feature0  Feature1  Feature2  Feature3  classID  lastColum
0       190       565     35474  0.336283      255          0
3       280       984    116359  0.284553      255         18
6       213       984     88244  0.216463      255          0
7       839       984    203139  0.852642      255          0

In [214]: dfs[1005]
Out[214]:
   Feature0  Feature1  Feature2  Feature3  classID  lastColum
2       310       984     94197  0.315041     1005          0
4       249       984    107482  0.253049     1005          0
8       376       984    105133  0.382114     1005          0
9       324       984    129209  0.329268     1005          0
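
If only one class is needed at a time, the GroupBy object can also be queried directly with `get_group`, without building the whole dictionary first. A small sketch with stand-in data (the values mirror a few rows from the example in the question):

```python
import pandas as pd

# Stand-in data with the same column names as the question's example.
df = pd.DataFrame({
    "Feature0": [190, 311, 310, 280, 249, 283],
    "classID": [255, 155, 1005, 255, 1005, 155],
    "lastColum": [0, 0, 0, 18, 0, 0],
})

grouped = df.groupby("classID")

# Pull out a single class; the original row index is preserved.
df_155 = grouped.get_group(155)

# Number of rows per class, as a plain dictionary.
sizes = grouped.size().to_dict()
```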

Answer 2

Here is an example of how to do this:

import pandas as pd

df = pd.DataFrame({'A': list('abcdef'), 'part': [1, 1, 1, 2, 2, 2]})

parts = df.part.unique()

for part in parts:
    print(df.loc[df.part == part])

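So the key is to get all the unique parts by calling unique() on the Series you want to split on.

After that, you can loop over those parts and perform whatever operation you need on each one.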
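
For instance, the loop can collect each part into a dictionary while running some per-part computation. This is a sketch using the same toy frame as above; the row count is only a placeholder for whatever operation is actually needed:

```python
import pandas as pd

df = pd.DataFrame({'A': list('abcdef'), 'part': [1, 1, 1, 2, 2, 2]})

parts = {}
for part in df['part'].unique():
    # The boolean mask selects the rows belonging to this part
    subset = df.loc[df['part'] == part]
    parts[part] = subset
    print(part, len(subset))
```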
