Problem creating a DataFrame with specific data types in PySpark

Time: 2017-09-04 08:57:33

Tags: apache-spark pyspark

The data in SampleCSV2.csv looks like this:

AAA|25|IT|50.5
BBB|28|Comp|100.5

I am having trouble creating a DataFrame with specific data types in PySpark:

from pyspark import SparkContext, SparkConf, SQLContext
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType, StringType, DateType, TimestampType
from pyspark.sql.types import * 

def structSchema(cols):
    field = []
    for c in cols:
        fieldType = c[c.find(":")+1:]
        fieldName = c[:c.find(":")]
        print(fieldType,"P",fieldName)
        if fieldType == "int":
            field.append(StructField(fieldName, IntegerType(), True))
        elif fieldType == "double":
            field.append(StructField(fieldName, DoubleType(), True))
        else:
            field.append(StructField(fieldName, StringType(), True))
    return StructType(field)

conf = SparkConf().setAppName('OutputGenerator')
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
sqlContext = SQLContext(sc)
file = r'C:\Users\Desktop\SampleCSV2.csv'

delim = '|'

#This part works well
cols = ['Name:string','Age:string','Dept:string','Salary:string']
rdd = sc.textFile(file).map(lambda x: x.split(delim))
cols = structSchema(cols)
df1 = sqlContext.createDataFrame(rdd, schema=cols)
print(df1.toPandas()) #works fine


#Getting error while executing below part
cols = ['Name:string','Age:int','Dept:string','Salary:double']
rdd = sc.textFile(file).map(lambda x: x.split(delim))
cols = structSchema(cols)
df2 = sqlContext.createDataFrame(rdd, schema=cols)
print(df2.toPandas())  #Getting Error

print("Done...")

The error occurs at the line print(df2.toPandas()). Please help me define the schema and load the CSV into a DataFrame with specific data types. The schema definition works when all columns are StringType(), but it fails as soon as I declare types such as Integer or Double.

Any help is much appreciated.

1 Answer:

Answer 0 (score: 0)

TL;DR Use the csv reader:

spark.read.schema(cols).options(delimiter="|", header="false").csv(file)

Your code doesn't work because the schema doesn't match the data: sc.textFile followed by split gives you strings for every field, so you have to cast the values to the declared types:

rdd = sc.textFile(file).map(lambda x: x.split(delim)).map(lambda x: (
    x[0], int(x[1]), x[2], float(x[3])
))
