Create a pyspark dataframe from lists?

Date: 2019-11-22 16:35:00

Tags: pyspark

So I have a list of headers, e.g.

Headers=["col1", "col2", "col3"]

and a list of rows

Body=[ ["val1", "val2", "val3"], ["val1", "val2", "val3"] ]

where val1 corresponds to the value that should go under col1.

If I try createDataFrame(data=Body), I get the error: can't infer schema for type str.

Is it possible to load lists like these into a pyspark dataframe?

I tried appending the header to the body, e.g.

body.append(header), and then using the createDataFrame function, but that raises this error:

field _22: Can not merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.LongType'>

(so at least one column apparently holds a mix of floats and ints across rows, which schema inference cannot merge)

Here is the full code I use to generate the body and header:

Basically, I use openpyxl to read the Excel file, skipping the first x rows etc., and only reading sheets that have certain column names.

After generating the body and header I want to read them directly into Spark.

We had a contractor write it out as a CSV and then read that back with Spark, but putting it straight into Spark seems to make more sense.

For now I want the columns to all be strings.

from openpyxl import load_workbook

excel_file = "/dbfs/{}".format(path)
wb = load_workbook(excel_file, read_only=True)
sheet_names = wb.sheetnames
sheets = spark.read.option("multiline", "true").format("json").load(configPath)
if dataFrameContainsColumn(sheets, "sheetNames"):
  config_sheets = jsonReader(configFilePath, "sheetNames")
else:
  config_sheets = []
skip_rows = -1
# get a list of the required columns
required_fields_list = jsonReader(configFilePath, "requiredColumns")

for worksheet_name in sheet_names:
  count = 0
  sheet_count = 0
  second_break = False
  # assign the sheet with this name to the worksheet object
  worksheet = wb[worksheet_name]

  # create empty header and body lists for each sheet
  header = []
  body = []
  # for each row in the sheet, append the cells to the header or the body
  for i, row in enumerate(worksheet.iter_rows()):
    # if the row index is one past skip_rows, read that row in as the header
    if i == skip_rows + 1:
      header.append([cell.value for cell in row])
    elif i > skip_rows + 1:
      count = count + 1
      if count == 1:
        # on the first body row, clean up the header first
        header = header[0]
        header = [w.replace(' ', '_') for w in header]
        header = [w.replace('.', '') for w in header]
        # skip this sheet if any required column is missing
        if not all(elem in header for elem in required_fields_list):
          second_break = True
          break
        # this row is data too, so keep it
        body.append([cell.value for cell in row])
      else:
        sheet_count = sheet_count + 1
        body.append([cell.value for cell in row])
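For reference, the step that then fails for each sheet looks like this (a sketch; spark here is the SparkSession that Databricks provides):

df = spark.createDataFrame(data=body)
# -> raises the can't-infer-schema-for-type-str error described above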

1 Answer:

Answer 0 (score: 0)

There are several ways to create a dataframe from a list. You can check them out here.

  1. Let Spark infer the schema
list_of_persons = [('Arike', 28, 78.6), ('Bob', 32, 45.32), ('Corry', 65, 98.47)]
df = sc.parallelize(list_of_persons).toDF(['name', 'age', 'score'])
df.printSchema()
df.show()

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- score: double (nullable = true)

+-----+---+-----+
| name|age|score|
+-----+---+-----+
|Arike| 28| 78.6|
|  Bob| 32|45.32|
|Corry| 65|98.47|
+-----+---+-----+
  2. Specify the types with a map transformation
from pyspark.sql import Row

list_of_persons = [('Arike', 28, 78.6), ('Bob', 32, 45.32), ('Corry', 65, 98.47)]
rdd = sc.parallelize(list_of_persons)
person = rdd.map(lambda x: Row(name=x[0], age=int(x[1]), score=float(x[2])))
schemaPeople = sqlContext.createDataFrame(person)

schemaPeople.printSchema()
schemaPeople.show()

root
 |-- age: long (nullable = true) 
 |-- name: string (nullable = true)
 |-- score: double (nullable = true)

+---+-----+-----+
|age| name|score|
+---+-----+-----+
| 28|Arike| 78.6|
| 32|  Bob|45.32|
| 65|Corry|98.47|
+---+-----+-----+
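  3. Pass an explicit schema

For your case, where you already have Headers and Body and want every column read as a string, you can skip inference entirely by building the schema yourself. A minimal sketch (spark is assumed to be an existing SparkSession):

from pyspark.sql.types import StructType, StructField, StringType

Headers = ["col1", "col2", "col3"]
Body = [["val1", "val2", "val3"], ["val1", "val2", "val3"]]

# one StringType field per header name
schema = StructType([StructField(name, StringType(), True) for name in Headers])
df = spark.createDataFrame(Body, schema=schema)

df.printSchema()
df.show()

root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col3: string (nullable = true)

+----+----+----+
|col1|col2|col3|
+----+----+----+
|val1|val2|val3|
|val1|val2|val3|
+----+----+----+

Because every field is declared as StringType, nothing has to be inferred or merged, so neither of the errors you hit can occur; just keep the header row out of the body.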