How to match a string against the field names of an RDD

Time: 2017-10-17 14:54:42

Tags: python pyspark

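In PySpark 2.0.1, I need to check whether a particular name (say, client) appears among the column names of my RDD, and produce an error message if my data has no such field. Could you suggest some syntax along the lines of the following?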

field = 'client'
if field not in df.schema.fields:
  print('field: ', field, "is not available")

1 Answer:

Answer 0: (score: 1)

For an RDD

spark.version
# u'2.2.0'

# make some dummy data:
rdd = sc.parallelize([[u'mailid', u'age', u'address'], [u'satya', u'23', u'Mumbai'], [u'abc', u'27', u'Goa']])  # first element is the header
header = rdd.first()
header
# [u'mailid', u'age', u'address']

field = 'client'
if field not in header:
  print('field: '+ field + " is not available")
# field: client is not available
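
If several names need checking at once, the same membership test can be wrapped in a small helper. This is only a sketch: `missing_fields` is a hypothetical name, not part of PySpark, and the plain list below stands in for the result of `rdd.first()` so the snippet runs without a Spark session.

```python
def missing_fields(header, required):
    """Return the names in `required` that do not appear in `header`, in order."""
    available = set(header)  # set lookup keeps each membership test cheap
    return [name for name in required if name not in available]

# Stand-in for header = rdd.first() from the example above (assumption: no Spark here)
header = [u'mailid', u'age', u'address']

missing = missing_fields(header, ['client', 'age', 'city'])
if missing:
    print('fields not available: ' + ', '.join(missing))
# fields not available: client, city
```

Note that the comparison is exact, so `'Age'` would be reported as missing when the header contains `'age'`.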

For a DataFrame

# using the rdd defined above
# remove first line from data and use it as header:
df = rdd.filter(lambda row : row != header).toDF(header)
df.show()
# +------+---+-------+ 
# |mailid|age|address| 
# +------+---+-------+
# | satya| 23| Mumbai|
# |   abc| 27|    Goa|
# +------+---+-------+

header_df = df.schema.names
header_df
# ['mailid', 'age', 'address']

field = 'client'
if field not in header_df:
  print('field: '+ field + " is not available")
# field: client is not available
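
Since the question asks for an error rather than a printed message, one option is to raise an exception when the name is absent. The sketch below is an illustration, not the answerer's code: `require_field` is a hypothetical helper, and the list is a stand-in for `df.schema.names` (or, equivalently, `df.columns`) so that it runs without Spark.

```python
def require_field(names, field):
    """Raise ValueError if `field` is not one of the column names in `names`."""
    if field not in names:
        raise ValueError(
            "field '{}' is not available; found: {}".format(field, ', '.join(names))
        )

# Stand-in for header_df = df.schema.names (assumption: no Spark session here)
header_df = ['mailid', 'age', 'address']

require_field(header_df, 'age')  # present, so nothing happens

try:
    require_field(header_df, 'client')
except ValueError as err:
    message = str(err)
    print(message)
# field 'client' is not available; found: mailid, age, address
```

In a real job the `try`/`except` would usually be dropped so that the exception stops the run before any further processing.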