Spark: read CSV files from a list of paths in a DataFrame row

Date: 2018-11-09 17:25:10

Tags: python-3.x apache-spark pyspark

I have a Spark DataFrame with five columns.


Columns 1 through 4 contain strings, and the fifth column contains a list of strings, which are actually paths to CSV files that I want to read as Spark DataFrames. I cannot find a way to read them. Here is a simplified version, containing only one column plus the column with the list of paths:

# --------------------------------------
# - column 1 -  ...  -    column 5     -
# --------------------------------------
# -   ...    -  ...  - Array of paths  -

I generate it like this:

from pyspark.sql import SparkSession, Row

spark = SparkSession \
        .builder \
        .appName('test') \
        .getOrCreate()

simpleRDD = spark.sparkContext.parallelize(range(10))
simpleRDD = simpleRDD.map(lambda x: Row(**{'a':x,'paths':['{}_{}.csv'.format(y**2,y+1) for y in range(x+1)]}))

simpleDF = spark.createDataFrame(simpleRDD)
print(simpleDF.head(5))

This gives:

[Row(a=0, paths=['0_1.csv']),  
 Row(a=1, paths=['0_1.csv', '1_2.csv']),  
 Row(a=2, paths=['0_1.csv', '1_2.csv', '4_3.csv']),  
 Row(a=3, paths=['0_1.csv', '1_2.csv', '4_3.csv', '9_4.csv']),  
 Row(a=4, paths=['0_1.csv', '1_2.csv', '4_3.csv', '9_4.csv', '16_5.csv'])]

I would then like to read the CSV files at those paths into Spark DataFrames, but none of my attempts work.

2 answers:

Answer 0 (score: 0)

from pyspark.sql import SparkSession, Row

from pyspark.sql.types import *

spark = SparkSession \
        .builder \
        .appName('test') \
        .getOrCreate()

inp = [['a','b','c','d',['abc\t1.txt','abc\t2.txt','abc\t3.txt','abc\t4.txt','abc\t5.txt']],
       ['f','g','h','i',['def\t1.txt','def\t2.txt','def\t3.txt','def\t4.txt','def\t5.txt']],
       ['k','l','m','n',['ghi\t1.txt','ghi\t2.txt','ghi\t3.txt','ghi\t4.txt','ghi\t5.txt']]]

inp_data = spark.sparkContext.parallelize(inp)

## Define the schema

schema = StructType([StructField('field1',StringType(),True),
                      StructField('field2',StringType(),True),
                      StructField('field3',StringType(),True),
                      StructField('field4',StringType(),True),
                      StructField('field5',ArrayType(StringType(),True))
                     ])

## Create the DataFrame

dataframe = spark.createDataFrame(inp_data, schema)
dataframe.createOrReplaceTempView("dataframe")
dataframe.select("field5").filter("field1 = 'a'").show()
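
To go from the field5 array to actual DataFrames, note that spark.read.csv accepts a list of paths, so you can pull the array out of a row and read every file in it with one call. A minimal sketch, assuming the files exist and share a common schema (variable names here are illustrative):

# first() returns a Row; index 0 holds the field5 array of paths
paths = dataframe.filter("field1 = 'a'").select("field5").first()[0]

# spark.read.csv accepts a list of paths and reads all of the
# files into a single DataFrame
csv_df = spark.read.csv(paths)
csv_df.show()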

Answer 1 (score: 0)

I am not sure how you intend to store the DataFrames once you have read them in from their paths, but if it is a matter of accessing the values in the paths column of your DataFrame, you can use the .collect() method to return your DataFrame as a list of Row objects (just like an RDD).

Each Row object has an .asDict() method that converts it to a Python dictionary. Once you are there, you can access the values by indexing the dictionary with its keys.
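
For instance, a quick illustration using one of the rows from the question:

from pyspark.sql import Row

row = Row(a=1, paths=['0_1.csv', '1_2.csv'])
d = row.asDict()     # {'a': 1, 'paths': ['0_1.csv', '1_2.csv']}
print(d['paths'])    # ['0_1.csv', '1_2.csv']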

Assuming you are happy to store the returned DataFrames in a dictionary, you could try the following:

# collect the DataFrame into a list of Rows
rows = simpleDF.collect()

# collect all the values in your `paths` column
# (note that this will return a list of lists)
paths = map(lambda row: row.asDict().get('paths'), rows)

# flatten the list of lists
paths_flat = [path for path_list in paths for path in path_list]

# get the unique set of paths
paths_unique = list(set(paths_flat))

# instantiate an empty dictionary in which to collect DataFrames
dfs_dict = {}
for path in paths_unique:
    dfs_dict[path] = spark.read.csv(path)

Your dfs_dict will now contain all of your DataFrames. To get the DataFrame for a particular path, access it with the path as the dictionary key:

df_0_1 = dfs_dict['0_1.csv']
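
Side note: if you would rather end up with one DataFrame instead of a dictionary of them, spark.read.csv also accepts the whole list of paths at once. A sketch, assuming all the CSV files share the same columns:

# read every unique path into a single DataFrame
# (assumes the CSV files share a common schema)
all_data = spark.read.csv(paths_unique)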