I'm seeing some strange behavior with Spark DataFrames. This is Spark version 2.4.0 (I had previously edited this to say 2.6, which was wrong).
Update:
I read the data from a "csv"-type file in Spark, ran a query for a particular record, and found it. Then I saved that DataFrame to parquet and lost some of the data. But not all of it.
Here is my code:
import os
from pyspark.sql.types import *
# in readschema I provide a StructType object with the right schema
import readschema as rs

os.environ['PYTHONIOENCODING'] = "UTF-8"
READ_ENCODING = "iso-8859-1"
delim = "\t"
null_value = ""
schema = rs.FILE
file = "hdfs://path/to/file"
df = spark.read.schema(schema) \
    .option("header", "true") \
    .option("encoding", READ_ENCODING) \
    .option("mode", "DROPMALFORMED") \
    .option("delimiter", delim) \
    .option("nullValue", null_value) \
    .csv(file)
df.show()
# looks good, no visible problems with the schema
# (otherwise the DROPMALFORMED option would drop a lot, I guess)
If I run df.show(), everything looks fine. Then I run the query, save the DataFrame, read it back, and get confused.
# i mask the actual data
# if there is inconsistency in key, value probably due to
# masking, i hope :)
>>> def execute_spark_sql(df, query, view_name="TB_TEMP"):
...     df.createOrReplaceTempView(view_name)
...     return spark.sql(query)
>>> execute_spark_sql(df, "Select cell_id, cell_name from TB_TEMP where cell_name='key'").show()
+--------------+---------+
| cell_id|cell_name|
+--------------+---------+
|xxxxxxxxxxxxxx| key |
+--------------+---------+
>>> df.write.mode("overwrite").format("parquet") \
... .option("header", "true").save("/path/to/debug/df")
>>> new_df = spark.read.parquet("/path/to/debug/df")
>>> new_df.show()
# output of show() looks good, there is sufficient data to not get suspicious
>>> execute_spark_sql(new_df, "Select cell_id, cell_name from TB_TEMP where cell_name='key'").show()
+-------+---------+
|cell_id|cell_name|
+-------+---------+
+-------+---------+
# but another key is found:
>>> execute_spark_sql(new_df, "Select cell_id, cell_name from TB_TEMP where cell_name='key2'").show()
+--------------+---------+
| cell_id|cell_name|
+--------------+---------+
|zzzzzzzzzzzz | key2 |
+--------------+---------+
>>> execute_spark_sql(df, "Select cell_id, cell_name from TB_TEMP where cell_name='key2'").show()
+--------------+---------+
| cell_id|cell_name|
+--------------+---------+
|zzzzzzzzzzzz | key2 |
+--------------+---------+
So I now think that the problem I describe in the original post below, which I assumed happened after joining the data, is actually this data loss when saving the DataFrame.
The data that goes missing looks random to me, but apparently not to the machine, since it is always the same keys that are lost. When I run df.distinct() I also lose rows, and not just duplicate entries; all entries for certain keys disappear.
It actually loses the same keys:
>>> dist_df = df.distinct()
>>> dist_df.show()
# here the output again looks unsuspicious
>>> execute_spark_sql(dist_df, "Select cell_id, cell_name from TB_TEMP where cell_name='key2'").show()
+--------------+---------+
| cell_id|cell_name|
+--------------+---------+
|zzzzzzzzzzz | key2 |
+--------------+---------+
>>> execute_spark_sql(dist_df, "Select cell_id, cell_name from TB_TEMP where cell_name='key'").show()
+-------+---------+
|cell_id|cell_name|
+-------+---------+
+-------+---------+
So there is a connection, but I really don't understand it. Does Spark somehow run distinct() on the DataFrame before saving it to parquet? (I have actually also tested saving to CSV, and I lose data there as well.)
Thanks and BR, teaVeloper
Old version (kept for completeness):
So I am reading about 12 CSV files and joining them into one DataFrame. In the first CSV file I read, there is some specific data in column A with the key value "key". For example,
selecting key and A from the DataFrame where key = 'key' will return
| key   | A     |
|-------|-------|
| "key" | value |
After that, I join all the other CSV files on the key column.
It works well; after every join I print the query result, and it is fine. I do some other transformations on the DataFrame, run some window functions, and also join on another key. Still, whenever I query, the correct values are returned.
Then I save the DataFrame to a parquet file, right after printing the query result, which is still fine. But in the parquet file, this particular query returns
| key   | A    |
|-------|------|
| "key" | null |
I don't lose all the data, just some of it, and not really at random, since on every run it is always the same keys that are lost, while other keys in the same file don't show this behavior. I can't see any clear relation among the lost keys, or between them and the CSV files they come from.
What I also noticed: for specific files, especially the first one, which holds the lost key-value pairs, when I run dataframe.distinct() I lose the same data or even more.
I hope my description is clear?
Here is some code:
# i changed some domain specific parts of the code, i hope i didn't break it through this
# process
def attributes(left_dataframe, right_dataframe):
    left_columns = left_dataframe.columns
    right_columns = right_dataframe.columns
    common_attributes = list(frozenset(left_columns).intersection(right_columns))
    right_attributes = list(set(right_columns) - set(common_attributes))
    left_attributes = list(set(left_columns) - set(common_attributes))
    return {"intersect": common_attributes, "right": right_attributes, "left": left_attributes}
def join_querie_builder(left, right, left_name, right_name):
    atts = attributes(left, right)
    quer = "SELECT left.key as key "
    for att in atts["intersect"]:
        if att not in ["key"]:
            quer = quer + ", CASE WHEN left." + att + " is NULL then right." + \
                att + " else left." + att + " END as " + att
    for att in atts["right"]:
        quer = quer + ", right." + att
    for att in atts["left"]:
        if att not in {"key"}:
            quer = quer + ", left." + att
    quer = quer + " FROM " + left_name + " as left LEFT JOIN " + \
        right_name + " AS right ON left.key = right.key"
    return quer
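For illustration, here is a self-contained, pure-Python sketch of what join_querie_builder generates, using simple stand-ins with a .columns attribute instead of real DataFrames. I sort the column sets so the output order is deterministic; the original code relies on Python's set ordering instead:

```python
from types import SimpleNamespace

def attributes(left_dataframe, right_dataframe):
    left_columns = left_dataframe.columns
    right_columns = right_dataframe.columns
    # sorted() here only makes the generated SQL reproducible
    common = sorted(set(left_columns) & set(right_columns))
    right_only = sorted(set(right_columns) - set(common))
    left_only = sorted(set(left_columns) - set(common))
    return {"intersect": common, "right": right_only, "left": left_only}

def join_querie_builder(left, right, left_name, right_name):
    atts = attributes(left, right)
    quer = "SELECT left.key as key "
    for att in atts["intersect"]:
        if att != "key":
            # prefer the left value, fall back to the right one
            quer += (", CASE WHEN left." + att + " is NULL then right." +
                     att + " else left." + att + " END as " + att)
    for att in atts["right"]:
        quer += ", right." + att
    for att in atts["left"]:
        if att != "key":
            quer += ", left." + att
    quer += (" FROM " + left_name + " as left LEFT JOIN " +
             right_name + " AS right ON left.key = right.key")
    return quer

left = SimpleNamespace(columns=["key", "A", "B"])
right = SimpleNamespace(columns=["key", "A", "C"])
print(join_querie_builder(left, right, "TB_RAN", "T_FILE"))
```

The shared column A becomes a COALESCE-style CASE expression, while B and C pass through from their respective sides.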
# schemaSwitcher is imported from a separate code file (imported as rs, for read_schema),
# which holds the proper schema definitions for each of the files
READ_ENCODING = "iso-8859-1"
for id, file in enumerate(files):
    filename = file.split('/')[-1].split(".")[0]
    schema = schemaSwitcher.get(filename, rs.DEFAULT)
    if file.endswith(".gz"):
        log_info("working on " + file + " ...")
        if file.endswith("txt.gz"):
            delim = "\t"
            null_value = ""
        else:
            delim = ";"
            null_value = " "
        df = spark.read.schema(schema) \
            .option("header", "true") \
            .option("encoding", READ_ENCODING) \
            .option("mode", "DROPMALFORMED") \
            .option("delimiter", delim) \
            .option("nullValue", null_value) \
            .csv(file)
        # do some transformations
        df_switcher_detail[filename] = df
def join_to_left(left_df, switcher, right_name):
    left_t_name = "TB_RAN"
    right_t_name = "T_FILE"
    log_info("joining " + right_name)
    temp = switcher[right_name]
    left_df.createOrReplaceTempView(left_t_name)
    temp.createOrReplaceTempView(right_t_name)
    sparkQuery = join_querie_builder(left_df, temp, left_t_name, right_t_name)
    return spark.sql(sparkQuery)
for name in listData:
    ran_detail = join_to_left(ran_detail, df_switcher_detail, name)
# some more transformations happen, but they don't break anything
# it would be too verbose, I guess, to write them all out
# but there are also window functions used to create sub_ids for rows that are
# non-unique regarding the key (across all columns they are unique)
# execute_spark_sql is just a wrapper around spark.sql, which creates a temp view, etc.,
# and returns a new dataframe
execute_spark_sql(ran_detail, "Select A, B, C, D, key from TB_TEMP where key='key'").show()
# this execution shows the values properly
ran_detail.repartition(4).write.mode("overwrite").format("parquet") \
    .option("header", "true").save("hdfs://path/to/file/")
The file is saved correctly and contains most of the data, but if I go into the pyspark console, for example, read the file back and run the same query as before saving, I get a row containing the key and NULL in column A, which is the column that matters to me.
I am really puzzled by this behavior.
Thanks and BR, teaVeloper