PySpark reduceByKey aggregation after collect_list on a column

Date: 2017-11-23 09:43:20

Tags: python apache-spark lambda pyspark

I want to aggregate the following example by state, after the values have been collected with collect_list.


My code:

import operator
states = sc.parallelize(["TX", "TX", "CA", "TX", "CA"])
states.map(lambda x: (x, 1)).reduceByKey(operator.add).collect()
# printed output: [('TX', 3), ('CA', 2)]

What I want is:

from pyspark import SparkContext,SparkConf
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import collect_list
import operator
conf = SparkConf().setMaster("local")
conf = conf.setAppName("test")
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)
rdd = sc.parallelize([('20170901',['TX','TX','CA','TX']), ('20170902', ['TX','CA','CA']), ('20170902',['TX']) ])
df = spark.createDataFrame(rdd, ["datatime", "actionlist"])
df = df.groupBy("datatime").agg(collect_list("actionlist").alias("actionlist"))

rdd = df.select("actionlist").rdd.map(lambda x:(x,1))#.reduceByKey(operator.add)
print (rdd.take(2))
#printed output: [(Row(actionlist=[['TX', 'CA', 'CA'], ['TX']]), 1), (Row(actionlist=[['TX', 'TX', 'CA', 'TX']]), 1)]
#for next step, it should look like:
#[Row(actionlist=[('TX', 1), ('CA', 1), ('CA', 1), ('TX', 1)]), Row(actionlist=[('TX', 1), ('TX', 1), ('CA', 1), ('TX', 1)])]

I think the first step is to flatten the collect_list result. I have tried:

udf(lambda x: list(chain.from_iterable(x)), StringType())
udf(lambda items: list(chain.from_iterable(itertools.repeat(x, 1) if isinstance(x, str) else x for x in items)))
udf(lambda l: [item for sublist in l for item in sublist])

But no luck. The next step would be to make KV pairs and do the reduce, and I have been stuck here for a while. Can any Spark expert help with the logic? Thanks for your help!
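As a sanity check outside Spark, the flattening step those lambdas are attempting can be verified on plain Python lists first (a minimal sketch using the example data from the question):

```python
from itertools import chain

# One collect_list result from the question: a list of sublists of states
nested = [['TX', 'CA', 'CA'], ['TX']]

# Both forms produce the same flat list of states:
flat_chain = list(chain.from_iterable(nested))
flat_comp = [item for sublist in nested for item in sublist]

print(flat_chain)  # ['TX', 'CA', 'CA', 'TX']
```

Note that if this is wrapped in a udf, the return type should be ArrayType(StringType()), not StringType() as in the first attempt above.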

2 Answers:

Answer 0 (score: 2)

You can achieve it with reduce and Counter inside a udf. I tried it my own way; hope this helps.

>>> from functools import reduce
>>> from collections import Counter
>>> from pyspark.sql.types import *
>>> from pyspark.sql import functions as F
>>> rdd = sc.parallelize([('20170901',['TX','TX','CA','TX']), ('20170902', ['TX','CA','CA']), ('20170902',['TX']) ])
>>> df = spark.createDataFrame(rdd, ["datatime", "actionlist"])
>>> df = df.groupBy("datatime").agg(F.collect_list("actionlist").alias("actionlist"))
>>> def someudf(row):
...     value = reduce(lambda x, y: x + y, row)
...     return Counter(value).most_common()
...
>>> schema = ArrayType(StructType([
...     StructField("char", StringType(), False),
...     StructField("count", IntegerType(), False)]))

>>> udf1 = F.udf(someudf,schema)
>>> df.select('datatime',udf1(df.actionlist)).show(2,False)
+--------+-------------------+
|datatime|someudf(actionlist)|
+--------+-------------------+
|20170902|[[TX,2], [CA,2]]   |
|20170901|[[TX,3], [CA,1]]   |
+--------+-------------------+
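The udf's logic can be exercised on the driver without Spark at all; here it is applied by hand to the grouped '20170902' value from the example:

```python
from functools import reduce
from collections import Counter

# The grouped value for '20170902': two collect_list entries
row = [['TX', 'CA', 'CA'], ['TX']]

value = reduce(lambda x, y: x + y, row)  # concatenate the sublists
result = Counter(value).most_common()    # count occurrences of each state

print(result)  # [('TX', 2), ('CA', 2)]
```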

Answer 1 (score: 2)

You can simply use combineByKey():
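The code block for this answer did not survive; what follows is a hypothetical sketch of the combineByKey() approach, not the original author's code. These are the three functions Spark would call per key (names are my own); they are exercised by hand here so the sketch runs without a cluster. With a SparkContext, the call would be rdd.combineByKey(to_counter, merge_value, merge_combiners).

```python
from collections import Counter

def to_counter(actions):            # createCombiner: first value seen for a key
    return Counter(actions)

def merge_value(counter, actions):  # mergeValue: fold another actionlist into a combiner
    counter.update(actions)
    return counter

def merge_combiners(c1, c2):        # mergeCombiners: combine partial results across partitions
    c1.update(c2)
    return c1

# Exercised by hand on the two '20170902' values from the question:
combiner = to_counter(['TX', 'CA', 'CA'])
combiner = merge_value(combiner, ['TX'])
print(dict(combiner))  # {'TX': 2, 'CA': 2}
```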