I have a Spark DataFrame that looks like this:
Columns 1 to 4 contain strings, and the fifth column contains a list of strings. These are actually paths to CSV files that I wish to read as Spark DataFrames, but I cannot find any way to read them. Here is a simplified version with only one column plus the column containing the list of paths:
# ------------------------------------
# - column 1 - ... -    column 5     -
# ------------------------------------
# -   ...    - ... - Array of paths  -
# ------------------------------------
from pyspark.sql import SparkSession, Row

spark = SparkSession \
    .builder \
    .appName('test') \
    .getOrCreate()

simpleRDD = spark.sparkContext.parallelize(range(10))
simpleRDD = simpleRDD.map(
    lambda x: Row(**{'a': x,
                     'paths': ['{}_{}.csv'.format(y**2, y + 1) for y in range(x + 1)]}))

simpleDF = spark.createDataFrame(simpleRDD)
print(simpleDF.head(5))
This gives:
[Row(a=0, paths=['0_1.csv']),
Row(a=1, paths=['0_1.csv', '1_2.csv']),
Row(a=2, paths=['0_1.csv', '1_2.csv', '4_3.csv']),
Row(a=3, paths=['0_1.csv', '1_2.csv', '4_3.csv', '9_4.csv']),
Row(a=4, paths=['0_1.csv', '1_2.csv', '4_3.csv', '9_4.csv', '16_5.csv'])]
I would then like to do something like this:

... but this, of course, does not work.
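For illustration only, here is a hypothetical sketch of that kind of attempt and why it fails: the SparkSession exists only on the driver, so spark.read cannot be called from inside a transformation that runs on the executors.

# HYPOTHETICAL attempt, for illustration only: the SparkSession (and
# therefore spark.read) lives on the driver, so referencing it inside
# a map over the rows fails when the closure is shipped to the executors
dfs = simpleDF.rdd.map(lambda row: [spark.read.csv(p) for p in row.paths])
dfs.collect()  # raises an error instead of returning nested DataFrames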
Answer 0 (score: 0)
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .appName('test') \
    .getOrCreate()

inp = [['a', 'b', 'c', 'd', ['abc\t1.txt', 'abc\t2.txt', 'abc\t3.txt', 'abc\t4.txt', 'abc\t5.txt']],
       ['f', 'g', 'h', 'i', ['def\t1.txt', 'def\t2.txt', 'def\t3.txt', 'def\t4.txt', 'def\t5.txt']],
       ['k', 'l', 'm', 'n', ['ghi\t1.txt', 'ghi\t2.txt', 'ghi\t3.txt', 'ghi\t4.txt', 'ghi\t5.txt']]]
inp_data = spark.sparkContext.parallelize(inp)

## Defining the schema
schema = StructType([StructField('field1', StringType(), True),
                     StructField('field2', StringType(), True),
                     StructField('field3', StringType(), True),
                     StructField('field4', StringType(), True),
                     StructField('field5', ArrayType(StringType(), True))])

## Create the DataFrame
dataframe = spark.createDataFrame(inp_data, schema)
dataframe.createOrReplaceTempView("dataframe")
dataframe.select("field5").filter("field1 = 'a'").show()
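If the goal is then to read the CSVs behind those paths, a minimal sketch, assuming the dataframe defined above and that you want the path strings back on the driver, is to explode the array column into one path per row and collect the distinct values:

from pyspark.sql.functions import explode

# one row per path via explode, then pull the distinct paths to the driver
paths = [row.path for row in
         dataframe.select(explode('field5').alias('path')).distinct().collect()]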
Answer 1 (score: 0)
I'm not sure how you intend to store the DataFrames once you've read them in from their paths, but if it's a matter of accessing the values in the paths column of your DataFrame, you can use the .collect() method to return your DataFrame as a list of Row objects (just like an RDD).

Each Row object has an .asDict() method that converts it to a Python dictionary. Once you're there, you can access the values by indexing the dictionary with its keys.

Assuming you're happy to store the returned DataFrames in a dictionary, you could try the following:
# collect the DataFrame into a list of Rows
rows = simpleDF.collect()

# collect all the values in your `paths` column
# (note that this will return a list of lists)
paths = map(lambda row: row.asDict().get('paths'), rows)

# flatten the list of lists
paths_flat = [path for path_list in paths for path in path_list]

# get the unique set of paths
paths_unique = list(set(paths_flat))

# instantiate an empty dictionary in which to collect the DataFrames
dfs_dict = {}
for path in paths_unique:
    dfs_dict[path] = spark.read.csv(path)
Your dfs_dict will now contain all of your DataFrames. To get the DataFrame for a particular path, you can access it using that path as the dictionary key:

dfs_dict[path]
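If all of the files share the same schema, there is also a one-call alternative: spark.read.csv accepts a list of paths, so the unique paths can be read into a single combined DataFrame. A minimal sketch, reusing paths_unique from above:

# assumes every CSV shares the same schema; reads all unique paths at once
combined_df = spark.read.csv(paths_unique)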