I have two dataframes, brand_name and poi_name.
Dataframe 1 (brand_name):
+-------------+
|brand_stop[0]|
+-------------+
|TOASTMASTERS |
|USBORNE |
|ARBONNE |
|USBORNE |
|ARBONNE |
|ACADEMY |
|ARBONNE |
|USBORNE |
|USBORNE |
|PILLAR |
+-------------+
Dataframe 2 (poi_name):
+---------------------------------------+
|Name |
+---------------------------------------+
|TOASTMASTERS DISTRICT 48 |
|USBORNE BOOKS AND MORE |
|ARBONNE |
|USBORNE BOOKS AT HOME |
|ARBONNE |
|ACADEMY, LTD. |
|ARBONNE |
|USBORNE BOOKS AT HOME |
|USBORNE BOOKS & MORE |
|PILLAR TO POST HOME INSPECTION SERVICES|
+---------------------------------------+
I want to check whether each string in the brand_stop column of dataframe 1 is present in the Name column of dataframe 2. The match should be done row by row, and every successfully matched record should be stored in a new column.
I tried filtering the dataframes with a join:
from pyspark.sql.functions import udf, col
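In plain Python terms (a sketch using the sample values above, outside Spark), the row-wise check being described is:

```python
brand_stop = ["TOASTMASTERS", "USBORNE", "ARBONNE"]
names = ["TOASTMASTERS DISTRICT 48", "USBORNE BOOKS AND MORE", "ARBONNE"]

# Pair the two columns row by row and test substring containment.
matches = [brand in name for brand, name in zip(brand_stop, names)]
print(matches)  # [True, True, True]
```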
from pyspark.sql.types import BooleanType
contains = udf(lambda s, q: q in s, BooleanType())
like_with_python_udf = (poi_names.join(brand_names1)
.where(contains(col("Name"), col("brand_stop[0]")))
.select(col("Name")))
like_with_python_udf.show()
but it throws this error:
"AnalysisException: u'Detected cartesian product for INNER join between logical plans'"
I am new to PySpark. Please help me.
Thanks
Answer 0 (score: 1)
The Scala code is as follows:
val d1 = Array(("TOASTMASTERS"),("USBORNE"),("ARBONNE"),("USBORNE"),("ARBONNE"),("ACADEMY"),("ARBONNE"),("USBORNE"),("USBORNE"),("PILLAR"))
val rdd1 = sc.parallelize(d1)
val df1 = rdd1.toDF("brand_stop")
val d2 = Array(("TOASTMASTERS DISTRICT 48"),("USBORNE BOOKS AND MORE"),("ARBONNE"),("USBORNE BOOKS AT HOME"),("ARBONNE"),("ACADEMY, LTD."),("ARBONNE"),("USBORNE BOOKS AT HOME"),("USBORNE BOOKS & MORE"),("PILLAR TO POST HOME INSPECTION SERVICES"))
val rdd2 =sc.parallelize(d2)
val df2 = rdd2.toDF("names")
// true if the name contains the brand as a substring
def matchFunc(s1: String, s2: String): Boolean = s2.contains(s1)
val contains = udf(matchFunc _)
val like_with_python_udf = (df1.join(df2).where(contains(col("brand_stop"), col("names"))).select(col("brand_stop"), col("names")))
like_with_python_udf.show()
Python code:
from pyspark.sql import Row
from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType
schema1 = Row("brand_stop")
schema2 = Row("names")
df1 = sc.parallelize([
schema1("TOASTMASTERS"),
schema1("USBORNE"),
schema1("ARBONNE")
]).toDF()
df2 = sc.parallelize([
schema2("TOASTMASTERS DISTRICT 48"),
schema2("USBORNE BOOKS AND MORE"),
schema2("ARBONNE"),
schema2("ACADEMY, LTD."),
schema2("PILLAR TO POST HOME INSPECTION SERVICES")
]).toDF()
# check that the brand is a substring of the name (same argument order as the Scala version)
contains = udf(lambda brand, name: brand in name, BooleanType())
like_with_python_udf = (df1.join(df2)
.where(contains(col("brand_stop"), col("names")))
.select(col("brand_stop"), col("names")))
like_with_python_udf.show()
I got the output:
+------------+
|  brand_stop|
+------------+
|TOASTMASTERS|
|     USBORNE|
|     ARBONNE|
+------------+
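Note that df1.join(df2) without a join condition forms the full cross product of the two dataframes before the where filter runs. The same filtering logic, sketched in plain Python with this answer's sample data:

```python
from itertools import product

brand_stop = ["TOASTMASTERS", "USBORNE", "ARBONNE"]
names = [
    "TOASTMASTERS DISTRICT 48",
    "USBORNE BOOKS AND MORE",
    "ARBONNE",
    "ACADEMY, LTD.",
    "PILLAR TO POST HOME INSPECTION SERVICES",
]

# Every (brand, name) pair is formed first, then filtered by containment,
# which is what join(...).where(contains(...)) does in Spark.
matched = [(b, n) for b, n in product(brand_stop, names) if b in n]
for b, n in matched:
    print(b, "->", n)
```

This is why Spark reports a cartesian product for the join in the question: the containment predicate is applied after the unconditioned join, not as a join condition.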
Answer 1 (score: 1)
"The match should be done row by row"
In that case you have to add some form of index and join on it:
from pyspark.sql.types import *
def index(df):
    schema = StructType(df.schema.fields + [StructField("_idx", LongType())])
    rdd = df.rdd.zipWithIndex().map(lambda x: x[0] + (x[1], ))
    return rdd.toDF(schema)
brand_name = spark.createDataFrame(["TOASTMASTERS", "USBORNE"], "string").toDF("brand_stop")
poi_name = spark.createDataFrame(["TOASTMASTERS DISTRICT 48", "USBORNE BOOKS AND MORE"], "string").toDF("poi_name")
index(brand_name).join(index(poi_name), ["_idx"]).selectExpr("*", "poi_name rlike brand_stop").show()
# +----+------------+--------------------+-------------------------+
# |_idx| brand_stop| poi_name|poi_name RLIKE brand_stop|
# +----+------------+--------------------+-------------------------+
# | 0|TOASTMASTERS|TOASTMASTERS DIST...| true|
# | 1| USBORNE|USBORNE BOOKS AND...| true|
# +----+------------+--------------------+-------------------------+
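One caveat (an editorial addition, not part of the original answer): rlike treats brand_stop as a regular expression, so a brand containing a regex metacharacter can match where it should not. A literal substring test (Python's in, or Spark's Column.contains) does not have this problem. A minimal illustration with a hypothetical brand string:

```python
import re

brand = "A.B"        # hypothetical brand containing the regex metacharacter '.'
name = "AXB STORES"

print(bool(re.search(brand, name)))  # True  - '.' matches any character
print(brand in name)                 # False - literal substring test
```

In Spark the literal test can be written as col("poi_name").contains(col("brand_stop")) instead of the rlike expression.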