GroupBy method changes the data type

Date: 2019-06-20 20:37:39

Tags: pandas pandas-groupby

Using Python 3 with Anaconda, I imported pandas and os in IPython. I have a very large CSV file. After calling read_csv on the file, I tried to use .groupby() on two columns, but that changed the data type from DataFrame to DataFrameGroupBy, and I can no longer run DataFrame methods on it.

I can't think of anything else to try. My only pandas experience is the little I got from Codecademy, and my code seemed to work there.


I expected that running band_gaps.info() would give me information about the DataFrame. Instead, it gives me an error. When I check the type of band_gaps, it is no longer a DataFrame but a DataFrameGroupBy.
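
Roughly, what I ran looks like the sketch below (the file name and column names are placeholders, not my actual data):

    import pandas as pd

    # Read the large CSV into a DataFrame (the file name here is a placeholder)
    band_gaps = pd.read_csv("data.csv")

    # Grouping by two columns gives back a DataFrameGroupBy, not a DataFrame,
    # so DataFrame methods such as .info() no longer work on band_gaps
    band_gaps = band_gaps.groupby(["column1", "column2"])

    band_gaps.info()  # raises AttributeError: 'DataFrameGroupBy' object has no attribute 'info'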

1 answer:

Answer 0 (score: 0):

If you look at the pandas groupby documentation, you will see that it returns a DataFrameGroupBy or SeriesGroupBy object, depending on whether you called .groupby on a DataFrame or on a Series. So the behavior you are observing is not surprising.
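
A minimal sketch of that distinction (the DataFrame and column names below are made up for illustration):

    import pandas as pd

    df = pd.DataFrame({"a": [1, 1, 2], "b": [10, 20, 30]})

    print(type(df.groupby("a")))            # DataFrameGroupBy: .groupby on a DataFrame
    print(type(df["b"].groupby(df["a"])))   # SeriesGroupBy: .groupby on a Series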

More importantly, why does pandas do this? Well, in your case you are grouping a bunch of rows together. Pandas can keep some representation of the grouped DataFrame, but it cannot do anything else with it (i.e. return it to you as another DataFrame) until you apply an aggregation function such as .sum or .count. An aggregation function takes each group of rows and defines some way of collapsing that group into a single row. Try applying one of these aggregation functions to your DataFrameGroupBy and see what happens.

For example:

df.groupby('column1').mean()

will return a DataFrame representing the mean of each column, after grouping all the rows by column1.

df.groupby('column1')['column2'].sum()

will return a Series of the sum of the values in column2, after grouping by column1. Note that

df.groupby('column1').sum()['column2']

is also possible, but in that case you are aggregating over all of the columns and only then selecting the one you are interested in, which is slower than slicing before aggregating.
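
To make this concrete, here is a small self-contained sketch you could run (the data and column names are invented for illustration; substitute your own):

    import pandas as pd

    df = pd.DataFrame({
        "material": ["Si", "Si", "GaAs", "GaAs"],
        "phase":    ["bulk", "film", "bulk", "film"],
        "band_gap": [1.10, 1.20, 1.42, 1.45],
    })

    grouped = df.groupby(["material", "phase"])  # DataFrameGroupBy: no .info() available here
    result = grouped.mean()                      # aggregation collapses each group to a single row
    print(type(result))                          # <class 'pandas.core.frame.DataFrame'>
    result.info()                                # DataFrame methods work again

After the aggregation, result is an ordinary DataFrame again (indexed by the two grouping columns), so methods like .info() work on it as you expected.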