Question

I want to read a JSON or XML file in PySpark. My file is split across multiple lines.

rdd = sc.textFile(json or xml)

Input

{
"employees":
[
 {
 "firstName":"John",
 "lastName":"Doe"
},
 {
"firstName":"Anna"
}
  ]
}

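The input is spread across multiple lines.

Expected output: {"employees":[{"firstName":"John",......]}

How can I get the complete file in a single line using PySpark?

Please help me, I am a beginner.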

Answer 1

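If your data is not laid out on one line as textFile expects, use wholeTextFiles. This gives you the entire content of each file, so you can parse it into whatever format you want.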
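A minimal sketch of that idea, assuming the files hold JSON. wholeTextFiles yields (path, content) pairs, so each content string can be handed to a parser as a whole. The names parse_json_file and load_json_files, and the sample path, are hypothetical and only serve the illustration.

```python
import json


def parse_json_file(pair):
    """Parse one (path, content) pair, the shape wholeTextFiles yields."""
    path, content = pair
    return path, json.loads(content)


def load_json_files(sc, path):
    """sc is an existing SparkContext; path may be a file, a folder or a glob."""
    return sc.wholeTextFiles(path).map(parse_json_file)


# Local check on a hand-written pair, no Spark needed.
sample = (
    "file:/tmp/employees.json",  # hypothetical path
    '{\n"employees":\n[\n {\n "firstName":"John",\n "lastName":"Doe"\n }\n]\n}',
)
path, parsed = parse_json_file(sample)
```

With a running context, load_json_files(sc, '/home/folder_with_text_files/*').collect() would return one (path, parsed object) pair per file. Each file is read into memory in one piece, so this fits small and medium files better than very large ones.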

Answer 2

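There are 3 ways (I invented the 3rd one; the first two are standard built-in Spark functions). The solutions here are in PySpark:

textFile, wholeTextFiles, and a labeled textFile (key = file, value = one line from the file; this is a hybrid of the two given methods of parsing files).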

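1.) textFile

Input: rdd = sc.textFile('/home/folder_with_text_files/input_file')

Output: an array in which each entry holds one line of the file, i.e. [line1, line2, ...]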

2.) wholeTextFiles

Input: rdd = sc.wholeTextFiles('/home/folder_with_text_files/*')

Output: an array of tuples; the first item is the "key" holding the file path, and the second item contains the entire contents of one file, i.e.

[(u'file:/home/folder_with_text_files/', u'file1_contents'), (u'file:/home/folder_with_text_files/', u'file2_contents'), ...]

3.) "Labeled" textFile

Input:

import glob
from pyspark import SparkContext
from pyspark.sql import SQLContext

SparkContext.stop(sc)  # stop the context the shell already created
sc = SparkContext("local", "example")  # if running locally
sqlContext = SQLContext(sc)

Data_File = '/home/folder_with_text_files'  # folder that holds the input files
Spark_Full = sc.emptyRDD()
for filename in glob.glob(Data_File + "/*"):
    # key every line by its file name; f=filename binds the current value
    Spark_Full += sc.textFile(filename).keyBy(lambda x, f=filename: f)

Output: an array in which each entry is a tuple with the file name as key and one line of the file as value. (Technically, with this method you can also use a key other than the actual file path, perhaps a hash representation to save memory.) I.e.

[('/home/folder_with_text_files/file1.txt', 'file1_contents_line1'), ('/home/folder_with_text_files/file1.txt', 'file1_contents_line2'), ('/home/folder_with_text_files/file1.txt', 'file1_contents_line3'), ('/home/folder_with_text_files/file2.txt', 'file2_contents_line1'), ...]

You can also recombine the lines of each file into a list:

Spark_Full.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()

[('/home/folder_with_text_files/file1.txt', ['file1_contents_line1', 'file1_contents_line2','file1_contents_line3']), ('/home/folder_with_text_files/file2.txt', ['file2_contents_line1'])]

Or recombine each entire file back into a single string. In this example the result is the same as what you get from wholeTextFiles, but with the string "file:" stripped from the file path.
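Recombining each file into a single string could look roughly like the sketch below. It is not the answer's own code: join_lines and recombine_files are hypothetical helpers, and joining with a newline is an assumption chosen to restore the original line breaks. Note that groupByKey does not promise to keep the lines of a file in their original order.

```python
def join_lines(pair, sep="\n"):
    """Collapse one (filename, iterable_of_lines) pair into (filename, text)."""
    filename, lines = pair
    return filename, sep.join(lines)


def recombine_files(keyed_rdd):
    """keyed_rdd is an RDD of (filename, line) pairs, such as Spark_Full."""
    return keyed_rdd.groupByKey().map(join_lines)


# Local check of the helper on plain Python data, no Spark needed.
grouped = (
    '/home/folder_with_text_files/file1.txt',
    ['file1_contents_line1', 'file1_contents_line2'],
)
name, text = join_lines(grouped)
```

With a running context, recombine_files(Spark_Full).collect() would return one (filename, text) pair per file.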

Answer 3

This is how you would do it in Scala:

val rdd = sc.wholeTextFiles("hdfs://nameservice1/user/me/test.txt")
rdd.collect.foreach(t=>println(t._2))

Answer 4

"How to read a whole [HDFS] file in one string [in Spark, to use as sql]":

e.g.

// Put file to hdfs from edge-node's shell...

hdfs dfs -put <filename>

// Within spark-shell...

// 1. Load file as one string
val f = sc.wholeTextFiles("hdfs:///user/<username>/<filename>")
val hql = f.take(1)(0)._2

// 2. Use string as sql/hql
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val results = hiveContext.sql(hql)

Answer 5

The Python way:

rdd = spark.sparkContext.wholeTextFiles("hdfs://nameservice1/user/me/test.txt")
json = rdd.collect()[0][1]
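Once the whole file is available as one string, the single-line form asked for in the question can be produced by parsing the text and serializing it again. This is a minimal sketch that assumes the content is valid JSON; to_single_line is a hypothetical helper, and the sample text stands in for the string collected above.

```python
import json


def to_single_line(text):
    """Parse multi-line JSON text and re-serialize it without line breaks."""
    return json.dumps(json.loads(text), separators=(",", ":"))


# Stand-in for the file content returned by wholeTextFiles.
multi_line = '{\n"employees":\n[\n {\n "firstName":"John",\n "lastName":"Doe"\n }\n]\n}'
one_line = to_single_line(multi_line)
print(one_line)  # {"employees":[{"firstName":"John","lastName":"Doe"}]}
```

Binding the file content to a name other than json, for example text, keeps the json module usable in the same session.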

How to read a whole file in one string
