I have the PySpark code below. I am reading JSON data from a REST API and trying to load it with PySpark, but I cannot get the data into a Spark DataFrame. Can someone help?
```python
import urllib
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField('dropoff_latitude', StringType(), True),
    StructField('dropoff_longitude', StringType(), True),
    StructField('extra', StringType(), True),
    StructField('fare_amount', StringType(), True),
    StructField('improvement_surcharge', StringType(), True),
    StructField('lpep_dropoff_datetime', StringType(), True),
    StructField('mta_tax', StringType(), True),
    StructField('passenger_count', StringType(), True),
    StructField('payment_type', StringType(), True),
    StructField('pickup_latitude', StringType(), True),
    StructField('ratecodeid', StringType(), True),
    StructField('tip_amount', StringType(), True),
    StructField('tolls_amount', StringType(), True),
    StructField('total_amount', StringType(), True),
    StructField('trip_distance', StringType(), True),
    StructField('trip_type', StringType(), True),
    StructField('vendorid', StringType(), True)
])

url = 'https://data.cityofnewyork.us/resource/pqfs-mqru.json'
data = urllib.request.urlopen(url).read().decode('utf-8')
rdd = sc.parallelize(data)
df = spark.createDataFrame(rdd, schema)
df.show()
```
**The error message is: TypeError: StructType can not accept object '[' in type <class 'str'>**
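The `'['` in that message is a hint about what went wrong: `data` is one long string, and iterating over a string yields single characters, so `sc.parallelize(data)` hands Spark one character per row. The sketch below shows the same effect in plain Python, without Spark. The `payload` value is a made-up stand-in for the API response, not real data from the endpoint.

```python
import json

# Hypothetical stand-in for the decoded API response (not real data).
payload = '[{"vendorid": "2", "fare_amount": "8.5"}]'

# Iterating over a string yields single characters, so parallelizing the raw
# string would produce rows such as '[', '{', '"', ... instead of records.
first_elements = list(payload)[:3]
print(first_elements)

# Parsing the text first gives a list of dicts, one dict per record.
records = json.loads(payload)
print(type(records).__name__, type(records[0]).__name__)
```

A parsed list of dicts is the shape that can be matched against the field names of a `StructType`.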
**I have been able to do this with a Dataset in Scala, but I do not understand why it does not work in Python:**
```scala
import spark.implicits._

// Load data from the NYC Taxi REST API for the 2016 Green Taxi Trip Data
val url = "https://data.cityofnewyork.us/resource/pqfs-mqru.json"
val result = scala.io.Source.fromURL(url).mkString

// Create a DataFrame from the JSON data
val taxiDF = spark.read.json(Seq(result).toDS)

// Display the DataFrame containing the trip data
taxiDF.show()
```
Answer 0 (score: 0)
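Posting this for anyone else who runs into the same problem. This is the code that worked for me. Calling `.json()` on the `requests.get` response returns a list of records, which `createDataFrame` accepts.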
```python
import requests
from pyspark.sql.types import StructType, StructField, StringType

# Every column is read as a nullable string.
schema = StructType([
    StructField('dropoff_latitude', StringType(), True),
    StructField('dropoff_longitude', StringType(), True),
    StructField('extra', StringType(), True),
    StructField('fare_amount', StringType(), True),
    StructField('improvement_surcharge', StringType(), True),
    StructField('lpep_dropoff_datetime', StringType(), True),
    StructField('mta_tax', StringType(), True),
    StructField('passenger_count', StringType(), True),
    StructField('payment_type', StringType(), True),
    StructField('pickup_latitude', StringType(), True),
    StructField('ratecodeid', StringType(), True),
    StructField('tip_amount', StringType(), True),
    StructField('tolls_amount', StringType(), True),
    StructField('total_amount', StringType(), True),
    StructField('trip_distance', StringType(), True),
    StructField('trip_type', StringType(), True),
    StructField('vendorid', StringType(), True)
])

url = 'https://data.cityofnewyork.us/resource/pqfs-mqru.json'
r = requests.get(url)

# r.json() parses the response body into a list of dicts, one per trip.
data_json = r.json()

# `spark` is the SparkSession provided by the notebook or shell.
df = spark.createDataFrame(data_json, schema)

# display() is a notebook helper (for example in Databricks); use df.show() elsewhere.
display(df)
```
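As an alternative that mirrors the Scala version in the question, `spark.read.json` in PySpark also accepts an RDD of JSON strings. The following is a minimal sketch under some assumptions: `spark` is an existing `SparkSession`, the response body is a JSON array of objects, and the helper names `json_array_to_records` and `read_json_payload` are made up for illustration. The schema is inferred by Spark here rather than declared.

```python
import json


def json_array_to_records(text):
    """Split a JSON array payload into one JSON string per record."""
    return [json.dumps(record) for record in json.loads(text)]


def read_json_payload(spark, text):
    """Build a DataFrame from a JSON array payload (hypothetical helper).

    Assumes `spark` is an existing SparkSession. Each RDD element is one
    complete JSON object, so Spark parses it as one row.
    """
    rdd = spark.sparkContext.parallelize(json_array_to_records(text))
    return spark.read.json(rdd)


# Pure-Python part, using a made-up sample payload (not real data):
sample = '[{"a": "1"}, {"a": "2"}]'
print(json_array_to_records(sample))
```

With a live session this could be called as `read_json_payload(spark, data)`, where `data` is the decoded response text from the question. The important difference from the failing code is that the RDD holds whole JSON documents rather than the characters of a single string.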