How to implement NOT IN for two DataFrames with different structures in Apache Spark

Asked: 2015-11-11 13:52:19

Tags: java sql apache-spark apache-spark-sql

I am using Apache Spark in a Java application. I have two DataFrames, df1 and df2. df1 contains Rows with email, firstName and lastName; df2 contains Rows with email only.

I want to create a DataFrame df3 that contains all the rows of df1 whose email does not appear in df2.

Is there a way to do this with Apache Spark? I tried converting df1 and df2 to JavaRDD<String> via toJavaRDD(), mapping each to just the emails, and then using subtract, but I don't know how to map the resulting JavaRDD back onto df1 to get a DataFrame.

Basically, I need all the rows of df1 whose email is not in df2.

// All customers, including the columns df3 should keep.
DataFrame customers = sqlContext.cassandraSql(
        "SELECT email, first_name, last_name FROM customer");

// Emails of the customers who bought the given product.
DataFrame customersWhoOrderedTheProduct = sqlContext.cassandraSql(
        "SELECT email FROM customer_bought_product " +
        "WHERE product_id = '" + productId + "'");

// The buyers' emails as a plain RDD of strings.
JavaRDD<String> customersBoughtEmail =
        customersWhoOrderedTheProduct.toJavaRDD().map(row -> row.getString(0));

// Emails that occur in customers but not among the buyers,
// collected to the driver.
List<String> notBoughtEmails = customers.javaRDD()
        .map(row -> row.getString(0))
        .subtract(customersBoughtEmail)
        .collect();
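
One way to finish this approach is to filter customers by the collected email list. A minimal sketch, assuming notBoughtEmails stays small (it is collected to the driver and inlined into the filter expression) and Spark 1.5+, where Column.isin is available:

import static org.apache.spark.sql.functions.col;

// df3: the customers whose email survived the subtract. Only practical
// for small lists, since the values are inlined into the predicate.
DataFrame df3 = customers.filter(
        col("email").isin(notBoughtEmails.toArray()));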

2 Answers:

Answer 0 (score: 6):

Spark 2.0.0+

You can use NOT IN directly.
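
For example, a minimal sketch in Java (matching the question), assuming Spark 2.0+, where DataFrame has become Dataset<Row> and spark is an existing SparkSession; the view names are illustrative:

customers.createOrReplaceTempView("customers");
customersWhoOrderedTheProduct.createOrReplaceTempView("customers_who_ordered");

// NOT IN subqueries in WHERE are supported from Spark 2.0.0 onwards.
Dataset<Row> customersWhoHaventOrdered = spark.sql(
        "SELECT * FROM customers " +
        "WHERE email NOT IN (SELECT email FROM customers_who_ordered)");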

Spark < 2.0.0

It can be expressed with a left outer join plus a filter: join on email, keep only the rows where the right-hand side is null, then drop the helper column. (Aliasing the right-hand email first avoids an ambiguous column reference after the join.)

val customers = sc.parallelize(Seq(
  ("john@example.com", "John", "Doe"),
  ("jane@example.com", "Jane", "Doe")
)).toDF("email", "first_name", "last_name")

val customersWhoOrderedTheProduct = sc.parallelize(Seq(
  Tuple1("jane@example.com")
)).toDF("email")

val customersWhoHaventOrderedTheProduct = customers.join(
    customersWhoOrderedTheProduct.select($"email".alias("email_")),
    $"email" === $"email_", "leftouter")
 .where($"email_".isNull).drop("email_")

customersWhoHaventOrderedTheProduct.show

// +----------------+----------+---------+
// |           email|first_name|last_name|
// +----------------+----------+---------+
// |john@example.com|      John|      Doe|
// +----------------+----------+---------+

The equivalent raw SQL:

customers.registerTempTable("customers")
customersWhoOrderedTheProduct.registerTempTable(
  "customersWhoOrderedTheProduct")

val query = """SELECT c.* FROM customers c LEFT OUTER JOIN  
                 customersWhoOrderedTheProduct o
               ON c.email = o.email
               WHERE o.email IS NULL"""

sqlContext.sql(query).show

// +----------------+----------+---------+
// |           email|first_name|last_name|
// +----------------+----------+---------+
// |john@example.com|      John|      Doe|
// +----------------+----------+---------+

Answer 1 (score: 2):

I did it in Python; as an aside, I would suggest you use integers as keys instead of strings.

from pyspark.sql.types import *

samples = sc.parallelize([
    ("abonsanto@fakemail.com", "Alberto", "Bonsanto"), ("mbonsanto@fakemail.com", "Miguel", "Bonsanto"),
    ("stranger@fakemail.com", "Stranger", "Weirdo"), ("dbonsanto@fakemail.com", "Dakota", "Bonsanto")
])

keys = sc.parallelize(
    [("abonsanto@fakemail.com",), ("mbonsanto@fakemail.com",), ("dbonsanto@fakemail.com",)]
)

complex_schema = StructType([
    StructField("email", StringType(), True),
    StructField("first_name", StringType(), True),
    StructField("last_name", StringType(), True)
])

simple_schema = StructType([
    StructField("email", StringType(), True)
])

df1 = sqlContext.createDataFrame(samples, complex_schema)
df2 = sqlContext.createDataFrame(keys, simple_schema)

df1.show()
df2.show()

# Keep the rows of df1 whose email has no match in df2. Note that show()
# returns None, so call it separately instead of assigning its result.
df3 = df1.join(df2, df1.email == df2.email, "left_outer").where(df2.email.isNull())
df3.show()