Add new rows to a PySpark DataFrame

Asked: 2018-10-07 05:03:13

Tags: python apache-spark pyspark

I'm very new to PySpark, but familiar with pandas. I have a PySpark DataFrame:

from pyspark.sql import SparkSession

# instantiate Spark
spark = SparkSession.builder.getOrCreate()

# make some test data
columns = ['id', 'dogs', 'cats']
vals = [
     (1, 2, 0),
     (2, 0, 1)
]

# create DataFrame
df = spark.createDataFrame(vals, columns)

I would like to add a new row (4, 5, 7) so that the output is:

df.show()
+---+----+----+
| id|dogs|cats|
+---+----+----+
|  1|   2|   0|
|  2|   0|   1|
|  4|   5|   7|
+---+----+----+

3 Answers:

Answer 0 (score: 1)

As thebluephantom said, union is the way to go. I'm just answering your question to give you a PySpark example:

columns = ['id', 'dogs', 'cats']
vals = [(1, 2, 0), (2, 0, 1)]

df = spark.createDataFrame(vals, columns)

# build a one-row DataFrame with the same columns and union it onto df
newRow = spark.createDataFrame([(4, 5, 7)], columns)
appended = df.union(newRow)
appended.show()

Also have a look at the Databricks FAQ: https://docs.databricks.com/spark/latest/faq/append-a-row-to-rdd-or-dataframe.html
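
If the new row's columns might be in a different order than the existing DataFrame, unionByName (available since Spark 2.3) matches columns by name rather than by position. A minimal sketch, reusing df and the session from above:

# columns are given in a different order; unionByName lines them up by name
newRow2 = spark.createDataFrame([(7, 5, 4)], ['cats', 'dogs', 'id'])
appended2 = df.unionByName(newRow2)
appended2.show()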

Answer 1 (score: 0)

From something I did using union, here is a partial block of code; you will of course need to adapt it to your own situation (this example is in Scala):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// start from an empty DataFrame with the target single-column schema
val dummySchema = StructType(
  StructField("phrase", StringType, true) :: Nil)
var dfPostsNGrams2 = spark.createDataFrame(sc.emptyRDD[Row], dummySchema)
// i_grams_Cols and dfPostsNGrams come from my own data; adapt them to yours
for (i <- i_grams_Cols) {
  val nameCol = col(i)
  dfPostsNGrams2 = dfPostsNGrams2.union(dfPostsNGrams.select(explode(nameCol).as("phrase")).toDF)
}
Union of a DataFrame with itself is the way to go.
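
For readers working in Python, here is a rough PySpark sketch of the same pattern; i_grams_cols (a list of array-typed column names) and df_posts_ngrams are hypothetical placeholders standing in for the author's own data:

from pyspark.sql.functions import col, explode
from pyspark.sql.types import StringType, StructField, StructType

# start from an empty single-column DataFrame with the target schema
dummy_schema = StructType([StructField("phrase", StringType(), True)])
df_phrases = spark.createDataFrame([], dummy_schema)

# i_grams_cols and df_posts_ngrams are placeholders for your own data
for c in i_grams_cols:
    df_phrases = df_phrases.union(
        df_posts_ngrams.select(explode(col(c)).alias("phrase"))
    )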

Answer 2 (score: 0)

Another alternative is to use the partitioned parquet format and add an extra parquet file for each DataFrame you want to append. This way you can create (hundreds, thousands, millions of) parquet files, and Spark will read them as a union when you read the directory later.

This example uses pyarrow.

Note that I also show how to write a single, unpartitioned parquet file (example.parquet), in case you already know where you want to put it.

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

headers=['A', 'B', 'C']

row1 = ['a1', 'b1', 'c1']
row2 = ['a2', 'b2', 'c2']

df1 = pd.DataFrame([row1], columns=headers)
df2 = pd.DataFrame([row2], columns=headers)

# DataFrame.append was removed in pandas 2.x; concat is the equivalent here
df3 = pd.concat([df1, df2], ignore_index=True)


table = pa.Table.from_pandas(df3)

pq.write_table(table, 'example.parquet', flavor='spark')
pq.write_to_dataset(table, root_path="test_part_file", partition_cols=['B', 'C'], flavor='spark')

# Adding a new partition (B=b3/C=c3)


row3 = ['a3', 'b3', 'c3']
df4 = pd.DataFrame([row3], columns=headers)

table2 = pa.Table.from_pandas(df4)
pq.write_to_dataset(table2, root_path="test_part_file", partition_cols=['B', 'C'], flavor='spark')

# Add another parquet file to the B=b2/C=c2 partition
# Note this does not overwrite existing partitions, it just appends a new .parquet file.
# If files already exist, then you will get a union result of the two (or multiple) files when you read the partition
row5 = ['a5', 'b2', 'c2']
df5 = pd.DataFrame([row5], columns=headers)
table3 = pa.Table.from_pandas(df5)
pq.write_to_dataset(table3, root_path="test_part_file", partition_cols=['B', 'C'], flavor='spark')

Reading the output afterwards:

from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .appName("testing parquet read")
         .getOrCreate())

df_spark = spark.read.parquet('test_part_file')
df_spark.show(25, False)

You should see something like this:

+---+---+---+
|A  |B  |C  |
+---+---+---+
|a5 |b2 |c2 |
|a2 |b2 |c2 |
|a1 |b1 |c1 |
|a3 |b3 |c3 |
+---+---+---+

If you run the same thing again end to end, you should see duplicates like this (since all of the previous parquet files are still there, and Spark unions them):

+---+---+---+
|A  |B  |C  |
+---+---+---+
|a2 |b2 |c2 |
|a5 |b2 |c2 |
|a5 |b2 |c2 |
|a2 |b2 |c2 |
|a1 |b1 |c1 |
|a1 |b1 |c1 |
|a3 |b3 |c3 |
|a3 |b3 |c3 |
+---+---+---+
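
For completeness, the same append-by-writing idea can also be expressed natively in PySpark without pyarrow; a minimal sketch, assuming the same test_part_file directory and B/C partition columns as above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("append example").getOrCreate()

# write a one-row DataFrame as an extra parquet file under the existing partition layout
new_rows = spark.createDataFrame([('a6', 'b2', 'c2')], ['A', 'B', 'C'])
new_rows.write.mode("append").partitionBy("B", "C").parquet("test_part_file")

# reading the directory again now includes the appended row
spark.read.parquet("test_part_file").show()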