DataFrame join on multiple columns with some conditions on the columns in pyspark

Asked: 2018-05-25 11:15:44

Tags: python apache-spark pyspark apache-spark-sql

df = sqlContext.sql("""
    select d1.a, d1.b, d1.c as aaa, d2.d, d2.e, d2.f, d2.g, d2.h, d2.i,
           d2.j as length, '{1}' as month_end
    from df1 d1
    join df2 d2
      on concat(substr(upper(trim(d1.a)),0,d1.j),' ') = substr(upper(trim(d2.j)),0,(d2.j+1))
     and upper(trim(d1.c)) = upper(trim(d2.f))
    where length(upper(trim(d2.i))) > d2.j
      and length(upper(trim(d1.a))) = (d1.j+3)
""".format(dataBase, month_end))

Can someone help me convert the above SQL join into a DataFrame join instead?

Attempt:

joinDf = df1.join(df2,on=[(concat(substring(upper(trim(df1["a"])),0,df1["j"]),' ')) == substring(upper(trim(df2["j"])),0,(df2["j"]+1)) and upper(trim(df1["c"])) == upper(trim(df2["f"]))])

(without the select part)

Getting the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/cloudera/parcels/CDH-5.10.2-1.cdh5.10.2.p2667.3017/lib/spark/python/pyspark/sql/functions.py", line 1180, in substring
    return Column(sc._jvm.functions.substring(_to_java_column(str), pos, len))
  File "/opt/cloudera/parcels/CDH-5.10.2-1.cdh5.10.2.p2667.3017/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 798, in __call__
  File "/opt/cloudera/parcels/CDH-5.10.2-1.cdh5.10.2.p2667.3017/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 785, in _get_args
  File "/opt/cloudera/parcels/CDH-5.10.2-1.cdh5.10.2.p2667.3017/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_collections.py", line 512, in convert
TypeError: 'Column' object is not callable

1 Answer:

Answer 0 (score: 0)

You cannot take functions meant for plain types (such as string) and apply them to a Column. (substring, upper, trim, etc. need to be replaced.)

You need to implement your own UDF or use the functions from the pyspark.sql.functions module: http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#module-pyspark.sql.functions