pyspark - Read files with custom delimiter to RDD?

Posted: 2017-07-12 11:44:46

Tags: apache-spark pyspark rdd bigdata

I am a newbie in pyspark, and I'm trying to read a file and merge groups of RDD rows into single rows.

Assuming that I have the following text file:

A1 B1 C1
A2 B2 C2 D3
A3 X1 YY1
DELIMITER_ROW
Z1 B1 C1 Z4
X2 V2 XC2 D3
DELIMITER_ROW
T1 R1
M2 MB2 NC2
S3 BB1 
AQ3 Q1 P1

Now, I want to combine all rows that appear in each section (between DELIMITER_ROW lines) into one row, and return a list of these merged rows.

I want to create this kind of list:

[[A1 B1 C1 A2 B2 C2 D3 A3 X1 YY1]
 [Z1 B1 C1 Z4 X2 V2 XC2 D3]
 [T1 R1 M2 MB2 NC2 S3 BB1 AQ3 Q1 P1]]

How can it be done in pyspark using RDDs?

For now I know how to read the file and filter out the delimiter rows:

sc.textFile(pathToFile).filter(lambda line: "DELIMITER_ROW" not in line).collect()

but I don't know how to reduce/merge/combine/group the rows in each section into one row.

Thanks.

1 Answer:

Answer 0 (score: 3)

Instead of reading the file and then splitting it, you can use hadoopConfiguration.set to set the delimiter that separates the records before reading, and then split each record into its rows.
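A minimal sketch of that approach, assuming pathToFile is defined as in the question. textinputformat.record.delimiter is the standard Hadoop TextInputFormat setting, and reaching the Hadoop configuration through sc._jsc is a common PySpark workaround rather than a public API:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Make Hadoop's TextInputFormat split records on DELIMITER_ROW
# instead of on newlines.
sc._jsc.hadoopConfiguration().set("textinputformat.record.delimiter",
                                  "DELIMITER_ROW\n")

sections = (sc.textFile(pathToFile)
              .map(lambda record: record.split())  # one record per section -> token list
              .filter(lambda tokens: tokens))      # drop any empty trailing record

print(sections.collect())
# [['A1', 'B1', 'C1', 'A2', 'B2', 'C2', 'D3', 'A3', 'X1', 'YY1'],
#  ['Z1', 'B1', 'C1', 'Z4', 'X2', 'V2', 'XC2', 'D3'],
#  ['T1', 'R1', 'M2', 'MB2', 'NC2', 'S3', 'BB1', 'AQ3', 'Q1', 'P1']]

Note that the setting lives on the SparkContext's Hadoop configuration, so it affects every later textFile call; sc.newAPIHadoopFile accepts a per-read conf dict if you need to keep reads isolated.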

Hope this helps!