The data in SampleCSV2.csv is as follows:
AAA|25|IT|50.5
BBB|28|Comp|100.5
I am running into a problem when creating a DataFrame with specific data types in PySpark:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType, StringType, DateType, TimestampType
def structSchema(cols):
    # Build a StructType from "name:type" descriptors such as "Age:int"
    field = []
    for c in cols:
        fieldType = c[c.find(":")+1:]
        fieldName = c[:c.find(":")]
        print(fieldType, "P", fieldName)
        if fieldType == "int":
            field.append(StructField(fieldName, IntegerType(), True))
        elif fieldType == "double":
            field.append(StructField(fieldName, DoubleType(), True))
        else:
            field.append(StructField(fieldName, StringType(), True))
    return StructType(field)
conf = SparkConf().setAppName('OutputGenerator')
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
sqlContext = SQLContext(sc)
file = r'C:\Users\Desktop\SampleCSV2.csv'
delim = '|'
#This part works well
cols = ['Name:string','Age:string','Dept:string','Salary:string']
rdd = sc.textFile(file).map(lambda x: x.split(delim))
cols = structSchema(cols)
df1 = sqlContext.createDataFrame(rdd, schema=cols)
print(df1.toPandas()) #works fine
#Getting an error while executing the part below
cols = ['Name:string','Age:int','Dept:string','Salary:double']
rdd = sc.textFile(file).map(lambda x: x.split(delim))
cols = structSchema(cols)
df2 = sqlContext.createDataFrame(rdd, schema=cols)
print(df2.toPandas()) #Getting Error
print("Done...")
The error occurs at the line print(df2.toPandas()).
Please help me define the schema and load the CSV into a DataFrame with specific data types. The schema definition works when every column is StringType(), but it fails as soon as I declare types such as Integer or Double.
Any help is much appreciated.
Answer 0 (score: 0)
TL;DR Use the CSV reader:

spark.read.schema(cols).options(delimiter="|", header="false").csv(file)
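A minimal end-to-end sketch of that approach, assuming Spark 2.x with a SparkSession (the builder app name is illustrative) and reusing the structSchema helper from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("OutputGenerator").getOrCreate()

schema = structSchema(['Name:string', 'Age:int', 'Dept:string', 'Salary:double'])
df = spark.read.schema(schema) \
    .options(delimiter="|", header="false") \
    .csv(r'C:\Users\Desktop\SampleCSV2.csv')
df.printSchema()
print(df.toPandas())  # Age is int, Salary is double; the reader parsed them

Here the reader itself parses each field into the declared type, so no manual casting is needed.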
Your code doesn't work because the schema doesn't match the data: sc.textFile(file).map(lambda x: x.split(delim)) produces lists of strings, while the schema declares Age as an integer and Salary as a double. You have to cast the values to the declared types yourself:

sc.textFile(file).map(lambda x: x.split(delim)).map(
    lambda x: (x[0], int(x[1]), x[2], float(x[3]))
)
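Put together with the question's own sqlContext and helper, the casting route looks like this (a sketch under the same assumptions as the original script):

cols = structSchema(['Name:string', 'Age:int', 'Dept:string', 'Salary:double'])
rdd = sc.textFile(file).map(lambda x: x.split(delim)) \
    .map(lambda x: (x[0], int(x[1]), x[2], float(x[3])))
df2 = sqlContext.createDataFrame(rdd, schema=cols)
print(df2.toPandas())  # succeeds now that each value matches its declared type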