How to fill in one DataFrame using values from another DataFrame

Asked: 2020-09-12 19:44:34

Tags: pyspark apache-spark-sql

I have two DataFrames: one df holds the correctly formatted, unique values, and the other df contains bad values. How can I repair the df with the bad values by looking them up in the other DataFrame?

Example df with the correct, unique values:

+----------------------------------------+--------------+
|company_id                              |company_name  |
+----------------------------------------+--------------+
|8f642dc67fccf861548dfe1c761ce22f795e91f0|Muebles       |
|cbf1c8b09cd5b549416d49d220a40cbd317f952e|MiPasajefy    |
+----------------------------------------+--------------+

Example df with bad values:

+----------------------------------------+------------+
|company_id                              |company_name|
+----------------------------------------+------------+
|*******                                 |MiPasajefy  |
|cbf1c8b09cd5b549416d49d220a40cbd317f952e|NaN         |
|NaN                                     |MiPasajefy  |
+----------------------------------------+------------+

The columns company_id and company_name are the key columns, so the repaired df should end up with the correct values:

+----------------------------------------+------------+
|company_id                              |company_name|
+----------------------------------------+------------+
|cbf1c8b09cd5b549416d49d220a40cbd317f952e|MiPasajefy  |
|cbf1c8b09cd5b549416d49d220a40cbd317f952e|MiPasajefy  |
|cbf1c8b09cd5b549416d49d220a40cbd317f952e|MiPasajefy  |
+----------------------------------------+------------+

1 Answer:

Answer 0 (score: 1):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Reference DataFrame with the correct, unique values
    df = spark.createDataFrame(
        [
            ("8f642dc67fccf861548dfe1c761ce22f795e91f0", "Muebles"),
            ("cbf1c8b09cd5b549416d49d220a40cbd317f952e", "MiPasajefy"),
        ],
        ("company_id", "company_name"),
    )

    # DataFrame with the bad values
    df2 = spark.createDataFrame(
        [
            ("*******", "MiPasajefy"),
            ("cbf1c8b09cd5b549416d49d220a40cbd317f952e", "NaN"),
            ("NaN", "MiPasajefy"),
        ],
        ("company_id", "company_name"),
    )

    df.createOrReplaceTempView("A")
    df2.createOrReplaceTempView("B")

    # Left-join each bad row to the reference table on whichever key column
    # still matches, then keep the (correct) columns from the reference side
    spark.sql(
        "select a.Company_name, a.company_id "
        "from B b left join A a "
        "on (a.company_id = b.company_id or a.Company_name = b.Company_name)"
    ).show(truncate=False)

+------------+----------------------------------------+
|Company_name|company_id                              |
+------------+----------------------------------------+
|MiPasajefy  |cbf1c8b09cd5b549416d49d220a40cbd317f952e|
|MiPasajefy  |cbf1c8b09cd5b549416d49d220a40cbd317f952e|
|MiPasajefy  |cbf1c8b09cd5b549416d49d220a40cbd317f952e|
+------------+----------------------------------------+
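The logic of the accepted answer boils down to a lookup: for each bad row, find the reference row that matches on whichever key column is still valid, and take both values from the reference side. For readers without a running Spark session, here is a minimal pure-Python sketch of that idea (plain lists of tuples standing in for the DataFrames; the names `reference`, `bad_rows`, and `repair` are illustrative, not part of the answer's code):

```python
# Reference rows: the correct, unique (company_id, company_name) pairs
reference = [
    ("8f642dc67fccf861548dfe1c761ce22f795e91f0", "Muebles"),
    ("cbf1c8b09cd5b549416d49d220a40cbd317f952e", "MiPasajefy"),
]

# Bad rows: either key may be corrupted ("NaN", asterisks, etc.)
bad_rows = [
    ("*******", "MiPasajefy"),
    ("cbf1c8b09cd5b549416d49d220a40cbd317f952e", "NaN"),
    ("NaN", "MiPasajefy"),
]

def repair(row, reference):
    """Return the reference pair matching the row on either key, else None."""
    company_id, company_name = row
    for ref_id, ref_name in reference:
        if ref_id == company_id or ref_name == company_name:
            return (ref_id, ref_name)
    return None  # no reference row matched: the row cannot be repaired

repaired = [repair(r, reference) for r in bad_rows]
for row in repaired:
    print(row)
```

Like the SQL left join, a row that matches nothing stays unresolved (`None` here, NULL columns in Spark). Note that this sketch, like the `or` join condition in the answer, assumes each bad row has at least one uncorrupted key, and with an `or` condition Spark cannot use an equi-join, so on large tables this degrades to a much slower join strategy.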