I have a dataframe read from a CSV file, like this:
df1 =
category   value   Reference value
count      1       1
n_timer    20      40,20
frames     54      56
timer      8       3,6,7
pdf        99      100,101,22
zip        10      10,11,12
But it reads the columns as long and string types, while I want array type (of LongType), so that I can intersect these columns and get the output.
I want to read the dataframe like this:
category   value   Reference value
count      [1]     [1]
n_timer    [20]    [40,20]
frames     [54]    [56]
timer      [8]     [3,6,7]
pdf        [99]    [100,101,22]
zip        [10]    [10,11,12]
Please suggest some solutions.
Answer 0 (score: -2)
# Check the code below
from pyspark.sql import SparkSession
from pyspark.sql.functions import split

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [("count", "1", "1"), ("n_timer", "20", "40,20"), ("frames", "54", "56"),
     ("timer", "8", "3,6,7"), ("pdf", "99", "100,101,22"), ("zip", "10", "10,11,12")],
    ["category", "value", "Reference_value"])
df1.show()

# Split the comma-separated strings and cast each element to long
df1 = df1.withColumn("Reference_value", split("Reference_value", r",\s*").cast("array<long>"))
df1 = df1.withColumn("value", split("value", r",\s*").cast("array<long>"))
df1.show()
Input df1 =
+--------+-----+---------------+
|category|value|Reference_value|
+--------+-----+---------------+
| count| 1| 1|
| n_timer| 20| 40,20|
| frames| 54| 56|
| timer| 8| 3,6,7|
| pdf| 99| 100,101,22|
| zip| 10| 10,11,12|
+--------+-----+---------------+
Output df1 =
+--------+-----+---------------+
|category|value|Reference_value|
+--------+-----+---------------+
| count| [1]| [1]|
| n_timer| [20]| [40, 20]|
| frames| [54]| [56]|
| timer| [8]| [3, 6, 7]|
| pdf| [99]| [100, 101, 22]|
| zip| [10]| [10, 11, 12]|
+--------+-----+---------------+
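
Since the question says the data actually comes from a CSV file, the same split/cast can be applied right after spark.read.csv. A minimal sketch, assuming a header row and a hypothetical file path data.csv; array_intersect (available since Spark 2.4) then gives the intersection the question asks about:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, array_intersect

spark = SparkSession.builder.getOrCreate()

# "data.csv" and header=True are assumptions about the source file
df = spark.read.csv("data.csv", header=True)

# Same transformation as above, applied to the CSV-backed dataframe
df = df.withColumn("value", split("value", r",\s*").cast("array<long>"))
df = df.withColumn("Reference value", split("Reference value", r",\s*").cast("array<long>"))

# Intersect the two array columns (Spark >= 2.4)
df = df.withColumn("intersection", array_intersect("value", "Reference value"))
df.show()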
Answer 1 (score: -2)
Write an encoder class with the value and reference columns declared as array types.
How to do it in Java:
Dataset<sample> sampleDim = sqlContext.read().csv(filePath).as(Encoders.bean(sample.class));
You can try the same approach in Python.
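
For reference, a rough Python counterpart of that idea is to pass an explicit schema to spark.read.csv. Note, however, that Spark's CSV source only accepts atomic types in the schema, so the array columns still have to be produced with a split/cast afterwards, as in Answer 0. The schema and the path data.csv below are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# CSV schemas may only contain atomic types, so read everything as strings
schema = StructType([
    StructField("category", StringType()),
    StructField("value", StringType()),
    StructField("Reference value", StringType()),
])
df = spark.read.csv("data.csv", header=True, schema=schema)

# Then convert the comma-separated strings to array<long>
for c in ["value", "Reference value"]:
    df = df.withColumn(c, split(c, r",\s*").cast("array<long>"))
df.show()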