Spark loses data from a dataframe after saving

Time: 2020-10-05 23:46:09

Tags: python apache-spark pyspark apache-spark-sql pyspark-dataframes

I am seeing some strange behavior while working with Spark dataframes. This is Spark version 2.4.0 (edited; I previously thought it was 2.6).

Update:

I read data from a csv-type file in Spark and ran a query for one specific record, which was found. Then I saved that dataframe to parquet and lost some of the data, but not all of it.

Here is my code:

import os

from pyspark.sql.types import *
# in readschema I provide a StructType object with the right schema
import readschema as rs
# spark is the SparkSession from the pyspark shell / job context

os.environ['PYTHONIOENCODING'] = "UTF-8"
READ_ENCODING = "iso-8859-1"



delim = "\t"
null_value = ""
schema = rs.FILE
file = "hdfs://path/to/file"

df = spark.read.schema(schema) \
            .options(header="TRUE") \
            .option("encoding", READ_ENCODING) \
            .option("mode", "DROPMALFORMED") \
            .option("delimiter", delim) \
            .option("nullValue", null_value) \
            .csv(file)
df.show()
# looks good, no problems in the schema visible
# (otherwise the DROPMALFORMED option would drop a lot, i guess)

If I run df.show(), everything looks fine. Then I run my query, save the dataframe, read it back in, and get confused.

# i mask the actual data below
# if there is any inconsistency between key and value it is
# probably due to the masking, i hope :)

>>> def execute_spark_sql(df, query, view_name="TB_TEMP"):
...     df.createOrReplaceTempView(view_name)
...     return spark.sql(query)
>>> execute_spark_sql(df, "Select cell_id, cell_name from TB_TEMP where cell_name='key'").show()
+--------------+---------+
|       cell_id|cell_name|
+--------------+---------+
|xxxxxxxxxxxxxx| key     |
+--------------+---------+
>>> df.write.mode("overwrite").format("parquet") \
...     .option("header", "true").save("/path/to/debug/df")
>>> new_df = spark.read.parquet("/path/to/debug/df")
>>> new_df.show()
# output of show() looks good, there is sufficient data to not get suspicious


>>> execute_spark_sql(new_df, "Select cell_id, cell_name from TB_TEMP where cell_name='key'").show()
+-------+---------+
|cell_id|cell_name|
+-------+---------+
+-------+---------+

# but another key is found:

>>> execute_spark_sql(new_df, "Select cell_id, cell_name from TB_TEMP where cell_name='key2'").show()
+--------------+---------+
|       cell_id|cell_name|
+--------------+---------+
|zzzzzzzzzzzz  |  key2   |
+--------------+---------+

>>> execute_spark_sql(df, "Select cell_id, cell_name from TB_TEMP where cell_name='key2'").show()
+--------------+---------+
|       cell_id|cell_name|
+--------------+---------+
|zzzzzzzzzzzz  |  key2   |
+--------------+---------+

So I now think that the problem I originally blamed on the joins (described in the old version of this question below) is actually data getting lost when the dataframe is saved.

Which data goes missing seems random to me, but apparently not to the machine, because it is always the same keys that get lost. I also lose rows when running df.distinct(): it doesn't just remove duplicate entries, it drops all entries for certain keys.

It actually loses the same keys:


>>> dist_df = df.distinct()
>>> dist_df.show()
# here the output again looks unsuspicious

>>> execute_spark_sql(dist_df, "Select cell_id, cell_name from TB_TEMP where cell_name='key2'").show()
+--------------+---------+
|       cell_id|cell_name|
+--------------+---------+
|zzzzzzzzzzz   |  key2   |
+--------------+---------+

>>> execute_spark_sql(dist_df, "Select cell_id, cell_name from TB_TEMP where cell_name='key'").show()
+-------+---------+
|cell_id|cell_name|
+-------+---------+
+-------+---------+

So there is some connection, but I really don't understand it. Does Spark somehow run distinct() on the dataframe before saving to parquet? (I have actually also tested saving to csv, and I lose data there too.)
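For what it's worth, this is roughly how I would try to narrow down which rows disappear; just a sketch using the masked names from above (df and new_df as in the snippet, exceptAll should be available on 2.4):

# sketch: compare the dataframe before and after the parquet round trip
print("rows before save:", df.count())
print("rows after reload:", new_df.count())

# rows that are in the original dataframe but missing from the reloaded parquet
missing = df.exceptAll(new_df)
missing.select("cell_id", "cell_name").show(20, truncate=False)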

Thanks and BR, teaVeloper


Old version (kept for completeness):

So I am reading about 12 csv files and joining them into one dataframe. In the first CSV file I read there is some specific data with the key value "key" and a value in column A, e.g.

Selecting from the dataframe where key = 'key' would return:

| key   | A     |
|-------|-------|
| "key" | value |

After that, I join all the other CSV files onto it on the key column.

That works fine; I print the result of my query after every join and it is fine. I do some other transformations on the dataframe, run some window functions, and also join on another key. Still, whenever I query, it returns the correct value.

Then I save the dataframe to a parquet file, right after printing the query result, and it is still fine. But in my parquet file this particular query results in:

| key   | A    |
|-------|------|
| "key" | null |

I don't lose all the data, only some of it, and not really at random: on every run it is always the same keys that go missing, while other keys from the same file show no such behavior. I can't really see a relationship between the keys that get lost, the csv files they come from, and the ones that are unaffected.

What I also noticed is that for specific files, especially the first one, which holds the key/value pairs that go missing, when I run dataframe.distinct() I lose the same data or even more.
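One more thing I could still check here (again only a sketch with the masked names, assuming df is the dataframe read from that first file) is whether the affected key values carry hidden whitespace or stray characters, since I read with iso-8859-1 and a custom nullValue:

from pyspark.sql import functions as F

# sketch: inspect the raw key values for hidden whitespace / odd characters
df.select("key",
          F.length("key").alias("len"),
          F.trim(F.col("key")).alias("trimmed")) \
  .filter(F.col("trimmed") == "key") \
  .show(truncate=False)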

I hope my description is clear.

Here is some code:


# i changed some domain specific parts of the code, i hope i didn't break it through this 
# process

def attributes(left_dataframe, right_dataframe):
    # split the column names into those both dataframes share and those unique to each side
    left_columns = left_dataframe.columns
    right_columns = right_dataframe.columns
    common_attributes = list(frozenset(left_columns).intersection(right_columns))
    right_attributes = list(set(right_columns) - set(common_attributes))
    left_attributes = list(set(left_columns) - set(common_attributes))
    return {"intersect": common_attributes, "right": right_attributes, "left": left_attributes}

def join_querie_builder(left, right, left_name, right_name):
    # build a LEFT JOIN on "key", coalescing shared columns with CASE WHEN
    atts = attributes(left, right)
    quer = "SELECT left.key as key "
    for att in atts["intersect"]:
        if att not in ["key"]:
            quer = quer + ", CASE WHEN left." + att + " is NULL then right." + \
                   att + " else left." + att + " END as " + att
    for att in atts["right"]:
        quer = quer + ", right." + att
    for att in atts["left"]:
        if att not in {"key"}:
            quer = quer + ", left." + att
    quer = quer + " FROM " + left_name + " as left LEFT JOIN " + \
                           right_name + " AS right ON left.key = right.key"
    return quer

# schemaSwitcher is imported from a separate code file (as rs for read_schema), which holds
# all the proper schema definitions for each of the files
READ_ENCODING = "iso-8859-1"

df_switcher_detail = {}
for id, file in enumerate(files):
    filename = file.split('/')[-1].split(".")[0]
    schema = schemaSwitcher.get(filename, rs.DEFAULT)
    if file.endswith(".gz"):
        log_info("working on " + file + " ...")
        if file.endswith("txt.gz"):
            delim = "\t"
            null_value = ""
        else:
            delim = ";"
            null_value = " "

        df = spark.read.schema(schema) \
            .options(header="TRUE") \
            .option("encoding", READ_ENCODING) \
            .option("mode", "DROPMALFORMED") \
            .option("delimiter", delim) \
            .option("nullValue", null_value) \
            .csv(file)
        # do some transformations
        df_switcher_detail[filename] = df

def join_to_left(left_df, switcher, right_name):
    left_t_name = "TB_RAN"
    right_t_name = "T_FILE"
    log_info("joining " + name)
    temp = switcher[right_name]
    left_df.createOrReplaceTempView(left_t_name)
    temp.createOrReplaceTempView(right_t_name)
    sparkQuery = join_querie_builder(left_df, temp, left_t_name, right_t_name)
    return spark.sql(sparkQuery)

for name in listData:
    ran_detail = join_to_left(ran_detail, df_switcher_detail, name)

# some more transformations happen, but they don't break anything
# it would be too verbose, i guess, to write them all out
# but there are also window functions that are used to create sub_ids for non-unique rows
# regarding the key (across all columns they are unique); a rough sketch follows below
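# a hypothetical sketch of what those window functions roughly look like
# (the real partition/order columns are masked; sub_id is created per key):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("key").orderBy("A")
ran_detail = ran_detail.withColumn("sub_id", F.row_number().over(w))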

# execute_spark_sql is just a wrapper around spark.sql which creates a temp view first
# and returns a new dataframe (same helper as in the update above)

execute_spark_sql(ran_detail, "Select A, B, C, D, key from TB_TEMP where key='key'").show()
# this executions shows the values properly
ran_detail.repartition(4).write.mode("overwrite").format("parquet") \
         .option("header", "true").save("hdfs://path/to/file/")

The file is saved correctly and contains most of the data, but if I then go into the pyspark console, for example, read that file back in and run the same query as before saving, I get a row with the key but NULL in column A, which is the column that matters to me.

I am really puzzled by this behavior.

Thanks and BR, teaVeloper

0 Answers:

No answers