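I have multiple JSON files that I want to use to create a Spark DataFrame. When testing with a subset, loading the files gives me rows containing the raw JSON structures rather than the parsed fields. This is what I am doing: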
df = spark.read.json('gutenberg/test')
df.show()
+--------------------+--------------------+--------------------+
|                   1|                  10|                   5|
+--------------------+--------------------+--------------------+
|                null|[WrappedArray(),W...|                null|
|                null|                null|[WrappedArray(Uni...|
|[WrappedArray(Jef...|                null|                null|
+--------------------+--------------------+--------------------+
When I check the DataFrame's schema, the fields appear to be there, but I cannot access them:
df.printSchema()
root
 |-- 1: struct (nullable = true)
 |    |-- author: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- formaturi: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- language: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- rights: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- subject: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- title: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- txt: string (nullable = true)
 |-- 10: struct (nullable = true)
 |    |-- author: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- formaturi: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- language: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- rights: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- subject: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- title: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- txt: string (nullable = true)
 |-- 5: struct (nullable = true)
 |    |-- author: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- formaturi: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- language: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- rights: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- subject: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- title: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- txt: string (nullable = true)
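I keep getting errors when I try to access the information, so any help would be great.
Specifically, I want to create a new DataFrame whose columns are ('author', 'formaturi', 'language', 'rights', 'subject', 'title', 'txt').
I am using PySpark 2.2.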
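Answer 0 (score: 0)
Since I don't know what the JSON files look like, I am assuming they are newline-delimited JSON (one JSON object per line). In that case this should work: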
def _construct_key(previous_key, separator, new_key):
    # Join the parent key and the child key; top-level keys have no parent.
    if previous_key:
        return "{}{}{}".format(previous_key, separator, new_key)
    else:
        return new_key

def flatt(nested_dict, separator="_", root_keys_to_ignore=set()):
    # Flatten nested dicts, lists and sets into a single-level dict.
    assert isinstance(nested_dict, dict)
    assert isinstance(separator, str)
    flattened_dict = dict()

    def _flatten(object_, key):
        if isinstance(object_, dict):
            for object_key in object_:
                # Skip ignored keys, but only at the root level.
                if not (not key and object_key in root_keys_to_ignore):
                    _flatten(object_[object_key],
                             _construct_key(key, separator, object_key))
        elif isinstance(object_, list) or isinstance(object_, set):
            # List positions become part of the key, e.g. author_0.
            for index, item in enumerate(object_):
                _flatten(item, _construct_key(key, separator, index))
        else:
            flattened_dict[key] = object_

    _flatten(nested_dict, None)
    return flattened_dict

def flatten(_json):
    # Convert the Row (recursively) to a dict, then flatten it.
    return flatt(_json.asDict(True))

# `spark` is the SparkSession, as in the question.
df = spark.read.json('gutenberg/test',
                     primitivesAsString=True,
                     allowComments=True,
                     allowUnquotedFieldNames=True,
                     allowNumericLeadingZero=True,
                     allowBackslashEscapingAnyCharacter=True,
                     mode='DROPMALFORMED') \
    .rdd.map(flatten).toDF()
df.show()
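To make the flattening step easier to follow, here is a minimal sketch in plain Python that does not need Spark. The function name `flatten_dict` and the sample record are made up for illustration; only the shape of the record (a numeric top-level key holding `author`, `language` and `txt`) is taken from the schema in the question.

```python
def flatten_dict(nested, separator="_"):
    """Flatten nested dicts and lists into one dict with joined keys."""
    flat = {}

    def _join(parent, child):
        # Top-level keys have no parent, so they are used as-is.
        if parent is None:
            return child
        return "{}{}{}".format(parent, separator, child)

    def _walk(value, key):
        if isinstance(value, dict):
            for child_key, child in value.items():
                _walk(child, _join(key, child_key))
        elif isinstance(value, (list, tuple, set)):
            # The position in the sequence becomes part of the key.
            for index, item in enumerate(value):
                _walk(item, _join(key, index))
        else:
            flat[key] = value

    _walk(nested, None)
    return flat


# Hypothetical record shaped like one row of the question's schema.
record = {"1": {"author": ["Some Author"], "language": ["en"], "txt": "sample text"}}

flat = flatten_dict(record)
print(flat)
# {'1_author_0': 'Some Author', '1_language_0': 'en', '1_txt': 'sample text'}
```

Note that every flattened key still carries the top-level id as a prefix (`1_`, `10_`, `5_`), so rows that come from different ids do not share column names. Getting the plain `author`, `title`, ... columns the question asks for would likely need an extra step that strips or separates that prefix.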