So I have a list of headers, e.g.

Headers = ["col1", "col2", "col3"]

and a list of rows,

Body = [["val1", "val2", "val3"], ["val1", "val2", "val3"]]

where val1 corresponds to the value that should end up under col1.
If I try createDataFrame(data=Body) I get the error can't infer schema type for str.
Is it possible to load lists like these into a PySpark dataframe?
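For reference, this is roughly the call I'm making (a minimal sketch, assuming an active SparkSession named spark):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

Headers = ["col1", "col2", "col3"]
Body = [["val1", "val2", "val3"], ["val1", "val2", "val3"]]

# this is the call that raises the schema-inference error for me
df = spark.createDataFrame(data=Body)
```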
I tried appending the header to the body, e.g.
body.append(header), and then using the create dataframe function, but that throws this error:

field _22: Can not merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.LongType'>

Here is the full code I use to generate the body and header.
Basically I use openpyxl to read an excel file, skipping the first x rows etc. and only reading sheets that have certain column names.
Once the body and header are generated I want to read them straight into Spark.
We have a contractor who writes them out as a csv and then reads that back in with Spark, but putting them into Spark directly seems to make more sense.
For now I want the columns to all be strings (see the sketch after the code):
```python
import csv
from os import sys
from openpyxl import load_workbook

# path, configPath, configFilePath and the helpers dataFrameContainsColumn
# and jsonReader are defined elsewhere in our codebase
excel_file = "/dbfs/{}".format(path)
wb = load_workbook(excel_file, read_only=True)
sheet_names = wb.get_sheet_names()

sheets = spark.read.option("multiline", "true").format("json").load(configPath)
if dataFrameContainsColumn(sheets, "sheetNames"):
    config_sheets = jsonReader(configFilePath, "sheetNames")
else:
    config_sheets = []

skip_rows = -1
# get a list of the required columns
required_fields_list = jsonReader(configFilePath, "requiredColumns")

for worksheet_name in sheet_names:
    count = 0
    sheet_count = 0
    second_break = False
    worksheet = wb.get_sheet_by_name(worksheet_name)
    # create empty header and body lists for each sheet
    header = []
    body = []
    # for each row in the sheet, append the cells to the header or body
    for i, row in enumerate(worksheet.iter_rows()):
        # the first row after the skipped rows is read in as the header
        if i == skip_rows + 1:
            header.append([cell.value for cell in row])
        elif i > skip_rows + 1:
            count = count + 1
            if count == 1:
                header = header[0]
                header = [w.replace(' ', '_') for w in header]
                header = [w.replace('.', '') for w in header]
                # skip the sheet if any required column is missing
                if not all(elem in header for elem in required_fields_list):
                    second_break = True
                    break
            else:
                count = 2
            sheet_count = sheet_count + 1
            body.append([cell.value for cell in row])
```
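After the loop, this is roughly what I'd like to do with each sheet (a sketch, with everything cast to string since that's all I need for now):

```python
# sketch: feed the parsed sheet straight into Spark, forcing every cell
# to a string so type inference never has to merge conflicting types
if not second_break:
    rows = [[None if v is None else str(v) for v in row] for row in body]
    sheet_df = spark.createDataFrame(rows, schema=header)
```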
Answer 0 (score: 0)
There are a couple of ways to create a dataframe from a list. You can check them out here.
```python
# assumes an existing SparkContext `sc` and an active SparkSession
list_of_persons = [('Arike', 28, 78.6), ('Bob', 32, 45.32), ('Corry', 65, 98.47)]

# let Spark infer the column types and pass the column names to toDF
df = sc.parallelize(list_of_persons).toDF(['name', 'age', 'score'])
df.printSchema()
df.show()
```

```
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- score: double (nullable = true)

+-----+---+-----+
| name|age|score|
+-----+---+-----+
|Arike| 28| 78.6|
|  Bob| 32|45.32|
|Corry| 65|98.47|
+-----+---+-----+
```
```python
from pyspark.sql import Row

list_of_persons = [('Arike', 28, 78.6), ('Bob', 32, 45.32), ('Corry', 65, 98.47)]
rdd = sc.parallelize(list_of_persons)

# build Row objects so the field names and types are explicit
person = rdd.map(lambda x: Row(name=x[0], age=int(x[1]), score=float(x[2])))
schemaPeople = sqlContext.createDataFrame(person)
schemaPeople.printSchema()
schemaPeople.show()
```

```
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
 |-- score: double (nullable = true)

+---+-----+-----+
|age| name|score|
+---+-----+-----+
| 28|Arike| 78.6|
| 32|  Bob|45.32|
| 65|Corry|98.47|
+---+-----+-----+
```
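Since you say you want every column read in as a string, a third option is to pass an explicit schema so Spark never has to infer (or merge) types at all. A minimal sketch, assuming an active SparkSession named spark and the header/body lists from your question:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

header = ["col1", "col2", "col3"]
body = [["val1", "val2", "val3"], ["val1", "val2", "val3"]]

# one nullable StringType field per header entry
schema = StructType([StructField(name, StringType(), True) for name in header])

# cast every cell to str first, since openpyxl also returns ints/floats/None
rows = [[None if v is None else str(v) for v in row] for row in body]
df = spark.createDataFrame(rows, schema=schema)
df.printSchema()
df.show()
```

With an explicit schema there is nothing to infer, so the DoubleType/LongType merge error you saw (caused by mixed types across rows, or by appending the header row to the body) can't occur.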