How to use join with many conditions in pyspark?

Time: 2017-08-22 08:25:36

Tags: python apache-spark spark-dataframe

I can use a DataFrame join statement with a single condition (in pyspark), but if I try to add multiple conditions, it fails.

Code:

   The error for the statement 
   summary2 = summary.join(county_prop, ["category_id", (summary.bucket)==9], how = "leftouter")

   ERROR : TypeError: 'Column' object is not callable

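Joining on a plain list of column names works. However, if I add another condition to the list, such as summary.bucket == 9, it fails with the error shown above. Please help me fix this.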

   from pyspark.sql.types import StructType, StructField, StringType

   # Every column is declared as StringType, so the sample values are given as strings.
   schema = StructType([StructField("category", StringType()), StructField("category_id", StringType()), StructField("bucket", StringType()), StructField("prop_count", StringType()), StructField("event_count", StringType()), StructField("accum_prop_count",StringType())])
   bucket_summary = sqlContext.createDataFrame([],schema)

   temp_county_prop = sqlContext.createDataFrame([("nation","nation","1","222","444","555"),("nation","state","2","222","444","555")],schema)
   bucket_summary = bucket_summary.unionAll(temp_county_prop)
   county_prop = sqlContext.createDataFrame([("nation","state","2","121","221","551")],schema)

Edit:

Adding a full working example.

   cond = [bucket_summary.bucket == county_prop.bucket, bucket_summary.bucket == 2]

I want to join on the category_id and bucket columns, and replace the values in bucket_summary with those from county_prop.

1. It works if I write out the whole statement with column expressions, and it also works if I list the conditions as column names, like ["category_id", "bucket"].

2. But if I use a combination of both, like cond = ["bucket", bucket_summary.category_id == "state"]

   bucket_summary2 = bucket_summary.join(county_prop, cond, how="leftouter")

It does not work. What could be going wrong with statement 2?

1 answer:

Answer 0: (score: 2)

e.g.

df1.join(df2, on=[df1['age'] == df2['age'], df1['sex'] == df2['sex']], how='left_outer')

But in your case, (summary.bucket)==9 should not appear as a join condition.

Update:

In the join condition, you can use either a list of Column join expressions or a list of Column / column_name values.