Question

I am reading a DataFrame into Spark and want to apply a function that extracts data from a column containing a list of dicts.

When I read the file and print its dtypes, I get the following:

dfCredits = spark.read.option("header","true").option("delimiter",";").csv(folder+'credits.csv',inferSchema =True).drop('_c0')
print(dfCredits.dtypes)
#[('cast', 'string'), ('crew', 'string'), ('id', 'string')]

This is dfCredits:

+--------------------+--------------------+-----+
|                cast|                crew|   id|
+--------------------+--------------------+-----+
|[{'cast_id': 14, ...|"[{'credit_id': '...|  862|
|[{'cast_id': 1, '...|[{'credit_id': '5...| 8844|
|[{'cast_id': 2, '...|[{'credit_id': '5...|15602|
|"[{'cast_id': 1, ...|[{'credit_id': '5...|31357|
|[{'cast_id': 1, '...|[{'credit_id': '5...|11862|
|"[{'cast_id': 25,...|"[{'credit_id': '...|  949|
|[{'cast_id': 1, '...|[{'credit_id': '5...|11860|
|[{'cast_id': 2, '...|[{'credit_id': '5...|45325|
|[{'cast_id': 1, '...|[{'credit_id': '5...| 9091|
|[{'cast_id': 1, '...|[{'credit_id': '5...|  710|
|"[{'cast_id': 1, ...|[{'credit_id': '5...| 9087|

This is the function I want to apply:

def getDirector(x):
    if type(x) == str:
        x = eval(x)
    for crew in x:
        if crew.get('job') == 'Director':
            return crew.get('name')
    return None

Then I create the udf:

getDirUDF = udf(lambda x: getDirector(x),StringType())

and apply the function:

dfCredits.select('id','cast',getDirUDF('cast').alias('director'))

I get the following error:

AttributeError: 'str' object has no attribute 'get'

This seems to be caused by trying to access an object that is not a dict. However, if I add an exception handler, every row ends up in it.

In addition, when I try to check the type of the individual elements in the column, I get the following:

getDirUDF = udf(lambda x: type(x))
dfCredits.select('id','cast',getDirUDF('cast').alias('typeCast'))

+-----+--------------------+--------------------+
|   id|                cast|            typeCast|
+-----+--------------------+--------------------+
|  862|[{'cast_id': 14, ...|net.razorvine.pic...|
| 8844|[{'cast_id': 1, '...|net.razorvine.pic...|
|15602|[{'cast_id': 2, '...|net.razorvine.pic...|
|31357|"[{'cast_id': 1, ...|net.razorvine.pic...|
|11862|[{'cast_id': 1, '...|net.razorvine.pic...|
|  949|"[{'cast_id': 25,...|net.razorvine.pic...|

I would like to know what net.razorvine.pic... (full name) is and how to work with it.
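A note on the type check, offered as a likely explanation rather than a confirmed diagnosis: `net.razorvine.pickle` is the package of Pyrolite, the library Spark uses on the JVM side to unpickle Python objects. The UDF above returns a Python `type` object instead of a string, and the JVM probably cannot convert it, so the column shows a Pyrolite placeholder rather than the real type. Returning the type's name as a plain string should avoid this. The helper name `type_name` below is hypothetical, and the Spark usage is left in comments because it assumes the `udf` and `StringType` imports from the question.

```python
def type_name(value):
    """Return the Python type of `value` as a plain string, e.g. 'str'."""
    return type(value).__name__


# Hypothetical Spark usage, assuming `udf` and `StringType` are imported
# as in the question and `dfCredits` is the DataFrame shown there:
# typeUDF = udf(type_name, StringType())
# dfCredits.select('id', 'cast', typeUDF('cast').alias('typeCast'))
```

Since the schema reports every column as `string`, this check would be expected to report `str` for non-null cells.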

pyspark read csv net.razorvine.pickle column

0 answers:
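A minimal sketch of one way the extraction could be made more robust, under several assumptions that the question does not confirm: Python 3 is in use, the cells that start with `"` in the table kept their CSV quotes (so evaluating them once yields another string, which would explain the `'str' object has no attribute 'get'` error), and the director is stored in the `crew` column rather than `cast`. The name `get_director` is hypothetical. `ast.literal_eval` is swapped in for `eval` because it only accepts Python literals.

```python
import ast


def get_director(raw):
    """Return the name of the first entry whose job is 'Director', or None."""
    value = raw
    # A cell that kept its CSV quotes evaluates to another string first,
    # so unwrap at most twice before giving up.
    for _ in range(2):
        if not isinstance(value, str):
            break
        try:
            value = ast.literal_eval(value.strip())
        except (ValueError, SyntaxError):
            # Truncated or otherwise malformed cell: treat it as missing.
            return None
    if not isinstance(value, list):
        return None
    for member in value:
        if isinstance(member, dict) and member.get('job') == 'Director':
            return member.get('name')
    return None


# Hypothetical Spark usage, assuming `udf` and `StringType` are imported
# as in the question:
# getDirUDF = udf(get_director, StringType())
# dfCredits.select('id', getDirUDF('crew').alias('director'))
```

If many rows come back as `None`, the CSV itself may be splitting cells on the `;` delimiter or on embedded quotes, in which case the read options would need attention before any parsing in a UDF.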