How do I change the data type of a nested column in PySpark? For example, how do I change the data type of value from string to int?
Reference: how to change a Dataframe column from String type to Double type in pyspark
{
"x": "12",
"y": {
"p": {
"name": "abc",
"value": "10"
},
"q": {
"name": "pqr",
"value": "20"
}
}
}
Answer 0 (score: 2)
You can read the JSON data using
from pyspark import SQLContext
sqlContext = SQLContext(sc)  # sc is an existing SparkContext
data_df = sqlContext.read.json("data.json", multiLine=True)
data_df.printSchema()
Output
root
|-- x: long (nullable = true)
|-- y: struct (nullable = true)
| |-- p: struct (nullable = true)
| | |-- name: string (nullable = true)
| | |-- value: long (nullable = true)
| |-- q: struct (nullable = true)
| | |-- name: string (nullable = true)
| | |-- value: long (nullable = true)
Now you can access the data from the y column
data_df.select("y.p.name")
data_df.select("y.p.value")
Output
abc, 10
OK, the solution is to add a new nested column with the correct schema and drop the column with the wrong schema
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType
df3 = spark.read.json("data.json", multiLine=True)
# Create the correct schema from the old one
c = df3.schema['y'].jsonValue()
c['name'] = 'z'
c['type']['fields'][0]['type']['fields'][1]['type'] = 'long'
c['type']['fields'][1]['type']['fields'][1]['type'] = 'long'
y_schema = StructType.fromJson(c['type'])
# Define a UDF to populate the new column. Rows are immutable, so you
# have to build the value from scratch.
def foo(row):
    d = row.asDict()
    y = {}
    y["p"] = {}
    y["p"]["name"] = d["p"]["name"]
    y["p"]["value"] = int(d["p"]["value"])
    y["q"] = {}
    y["q"]["name"] = d["q"]["name"]
    # Read q's own value (the earlier version copied p's value here)
    y["q"]["value"] = int(d["q"]["value"])
    return y
map_foo = udf(foo, y_schema)
# Add the new column
df3_new = df3.withColumn("z", map_foo("y"))
# Drop the old column
df4 = df3_new.drop("y")
df4.printSchema()
Output
root
|-- x: long (nullable = true)
|-- z: struct (nullable = true)
| |-- p: struct (nullable = true)
| | |-- name: string (nullable = true)
| | |-- value: long (nullable = true)
| |-- q: struct (nullable = true)
| | |-- name: string (nullable = true)
| | |-- value: long (nullable = true)
df4.show()
Output
+---+-------------------+
|  x|                  z|
+---+-------------------+
| 12|[[abc,10],[pqr,20]]|
+---+-------------------+
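For a schema this small, the UDF may not be needed: Spark can cast a struct column to another struct type, provided the target has the same number of fields and each field is castable. The following is a minimal sketch, assuming the `y` layout from the question. `struct_ddl`, `Y_SPEC`, `Y_TYPE` and `cast_y_values` are hypothetical names introduced here, and the pyspark import is deferred so that the type-string builder can be run on its own.

```python
def struct_ddl(spec):
    """Render a nested dict of {field: type-or-dict} as a Spark DDL struct type string."""
    parts = []
    for name, typ in spec.items():
        # Nested dicts become nested structs; plain strings are used as-is.
        rendered = struct_ddl(typ) if isinstance(typ, dict) else typ
        parts.append(f"{name}:{rendered}")
    return "struct<" + ",".join(parts) + ">"


# Assumed target layout for column y: same fields as the question, value as int.
Y_SPEC = {
    "p": {"name": "string", "value": "int"},
    "q": {"name": "string", "value": "int"},
}
Y_TYPE = struct_ddl(Y_SPEC)


def cast_y_values(df):
    """Return df with the whole y struct cast to Y_TYPE (hypothetical helper)."""
    # Deferred import: only needed when a live DataFrame is passed in.
    from pyspark.sql import functions as F
    return df.withColumn("y", F.col("y").cast(Y_TYPE))
```

With the DataFrame from the answer, `cast_y_values(df3).printSchema()` should then report `value` as `integer` under both `p` and `q`, without adding and dropping a column.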
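Answer 1 (score: 0)
Using arbitrary variable names may seem simple, but it is problematic and goes against PEP8. When working with numbers, I suggest avoiding the common names used when iterating over these structures... i.e. value.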
import json
with open('random.json') as json_file:
    data = json.load(json_file)
for k, v in data.items():
    if k == 'y':
        for key, item in v.items():
            item['value'] = float(item['value'])
print(type(data['y']['p']['value']))
print(type(data['y']['q']['value']))
# mac → python3 make_float.py
# <class 'float'>
# <class 'float'>
json_data = json.dumps(data, indent=4, sort_keys=True)
with open('random.json', 'w') as json_file:
    json_file.write(json_data)
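The same walk can be tried without touching a file, and with `int` as the question asked rather than `float`. This is a minimal sketch, assuming every child of `y` carries a `value` string; `RAW` and `convert_values` are hypothetical names, and `RAW` simply repeats the sample JSON from the question.

```python
import json

# The sample document from the question, inlined so no file is needed.
RAW = (
    '{"x": "12", "y": {"p": {"name": "abc", "value": "10"}, '
    '"q": {"name": "pqr", "value": "20"}}}'
)


def convert_values(data, cast=int):
    """Cast the 'value' entry of every child of data['y'] in place and return data."""
    for child in data["y"].values():
        child["value"] = cast(child["value"])
    return data


data = convert_values(json.loads(RAW))
print(type(data["y"]["p"]["value"]), type(data["y"]["q"]["value"]))
```

Passing `cast=float` gives the behaviour of the script above. Note that this only changes the JSON data itself and does not alter the schema of a DataFrame that has already been loaded.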