Question

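I am a beginner in Spark. Please help me with a solution.

The CSV file contains text in the form of key:value pairs separated by commas. In some rows, keys (or columns) may be missing.

I have loaded this file into a single column of a dataframe. I want to split that column so that the keys become columns and their associated values become the data. When a column is missing from a row, I want to add it anyway and fill it with dummy data.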

Dataframe

 +----------------------------------------------------------------+
 |   _c0                                                          |
 +----------------------------------------------------------------+
 |name:Pradnya,IP:100.0.0.4, college: SDM, year:2018              |
 |name:Ram, IP:100.10.10.5, college: BVB, semester:IV, year:2018  |
 +----------------------------------------------------------------+

I want the output in this form:

  +-----------+-------------+-----------+-----------+-------+
  |  name     |  IP         | College   |  Semester | year  |
  +-----------+-------------+-----------+-----------+-------+
  |  Pradnya  |100.0.0.4    |  SDM      |  null     | 2018  |
  +-----------+-------------+-----------+-----------+-------+
  |  Ram      | 100.10.10.5 | BVB       | IV        |2018   |
  +-----------+-------------+-----------+-----------+-------+

Thanks.

Answer 1

PySpark does not recognize key:value pairs on its own. One workaround is to convert the file to JSON format and then read the JSON file. Contents of raw.txt:

name:Pradnya,IP:100.0.0.4, college: SDM, year:2018
name:Ram, IP:100.10.10.5, college: BVB, semester:IV, year:2018

The following code creates the JSON file:

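import json
with open('raw.txt') as infile, open('raw.json', 'w') as outfile:
  # strip whitespace (including the trailing newline) around each key and value; skip blank lines
  json.dump([{k.strip(): v.strip() for k, v in (p.split(':', 1) for p in l.split(','))} for l in infile if l.strip()], outfile)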

Now you can create the PySpark dataframe with the following code:

df = spark.read.format('json').load('raw.json')
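
Spark's JSON reader treats each line as a separate record by default, so writing one JSON object per line (the JSON Lines layout) is a common alternative to dumping a single array. Below is a minimal sketch of that variant, assuming the same raw.txt layout as above; `parse_line` and `to_json_lines` are hypothetical helper names, not part of any library.

```python
import json

def parse_line(line):
    """Turn 'k1:v1, k2:v2' into a dict, stripping spaces around keys and values."""
    pairs = (item.split(':', 1) for item in line.split(','))
    return {key.strip(): value.strip() for key, value in pairs}

def to_json_lines(lines):
    """Serialize each non-blank input line as one JSON object per output line."""
    return '\n'.join(json.dumps(parse_line(l)) for l in lines if l.strip())

# The two sample rows from raw.txt, including their trailing newlines
sample = [
    'name:Pradnya,IP:100.0.0.4, college: SDM, year:2018\n',
    'name:Ram, IP:100.10.10.5, college: BVB, semester:IV, year:2018\n',
]
json_lines = to_json_lines(sample)
```

Writing `json_lines` to a file and loading it with spark.read.json should give one row per line, with null for any key that a line does not contain.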

Answer 2

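If you know all the field names, and neither the keys nor the values contain embedded delimiters, you can use the RDD's map function to convert the key/value lines into Row objects.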

from pyspark.sql import Row

# assumes a SparkSession named `spark` is already defined
sc = spark.sparkContext

# initialize the RDD
rdd = sc.textFile("key-value-file")

# define a list of all field names
columns = ['name', 'IP', 'College', 'Semester', 'year']

# build a Row object from one line
def setRow(x):
    # convert the line into key/value pairs; strip spaces and lowercase the key `k`
    # (str.lower() replaces `from string import lower`, which exists only in Python 2)
    z = dict((k.strip().lower(), v.strip()) for e in x.split(',') for k, v in [e.split(':', 1)])
    # make sure every column is present in the Row object, with None for missing keys
    return Row(**dict((c.lower(), z.get(c.lower())) for c in columns))

# map lines to Row objects and then convert the result to a dataframe
# note: the output below has alphabetically sorted columns, as Spark 2.x produces;
# Spark 3.0+ no longer sorts Row fields built from named arguments
rdd.map(setRow).toDF().show()
#+-------+-----------+-------+--------+----+
#|college|         ip|   name|semester|year|
#+-------+-----------+-------+--------+----+
#|    SDM|  100.0.0.4|Pradnya|    null|2018|
#|    BVB|100.10.10.5|    Ram|      IV|2018|
#+-------+-----------+-------+--------+----+
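
If the field names are not known ahead of time, one option is to collect them from the data first and then fill the gaps. The following is a plain-Python sketch of that idea on the two sample lines (no Spark involved); `parse_line`, `discover_columns`, and `normalize` are hypothetical helper names, and the sketch still assumes that keys and values contain no embedded delimiters.

```python
def parse_line(line):
    """Parse 'k1:v1, k2:v2' into a dict with lowercased, stripped keys."""
    pairs = (item.split(':', 1) for item in line.split(','))
    return {key.strip().lower(): value.strip() for key, value in pairs}

def discover_columns(records):
    """Collect every key seen in any record, in first-seen order."""
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    return columns

def normalize(record, columns):
    """Return values in column order, using None where a key is missing."""
    return [record.get(column) for column in columns]

lines = [
    'name:Pradnya,IP:100.0.0.4, college: SDM, year:2018',
    'name:Ram, IP:100.10.10.5, college: BVB, semester:IV, year:2018',
]
records = [parse_line(line) for line in lines]
columns = discover_columns(records)
table = [normalize(record, columns) for record in records]
```

Here `columns` plays the role of the hard-coded list in the answer above, and each entry of `table` could be turned into a Row in the same way.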

Read a CSV file containing key:value pairs into PySpark so that each key becomes a column and each value becomes its data
