I have a Python RDD:
rddstats = rddstats.filter(lambda x : len(x) == NB_LINE or len(x) == NB2_LINE)
I created a DataFrame from this RDD:
logsDF = sqlContext.createDataFrame(rddstats,schema=["column1","column2","column3","column4","column5","column6","column7"])
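I want to test the two columns column6 and column7: if column6 exists in the DataFrame and is not null, I should return a DataFrame containing the column6 values; otherwise, I should return a DataFrame containing the column7 values.
Here is my small piece of code: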
logsDF = sqlContext.createDataFrame(rddstats,schema=["column1","column2","column3","column4","column5","column6","column7"])
if (logsDF['column6'] in rddstats and logsDF['column6'].isNotNull):
    logsDF.select("column1","column2","column3","column4","column5","column6")
else:
    logsz84statsDF.select("column1","column2","column3","column4","column5","column7")
Is this syntax correct, and am I allowed to write it like this in Python?
Answer 0 (score: 2)
if (logsDF['column6'] in rddstats and logsDF['column6'].isNotNull)
I'm fairly sure this will throw a KeyError if column6 does not exist.
You can do the following instead:
if 'column6' in logsDF.columns:
    if logsDF['column6'].notnull().any():
        logsDF.select("column1","column2","column3","column4","column5","column6")
    else:
        logsz84statsDF.select("column1","column2","column3","column4","column5","column7")
else:
    logsz84statsDF.select("column1","column2","column3","column4","column5","column7")
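First check whether column6 is among the columns of logsDF. If it is, check whether any() of its values is not null.
If column6 does not exist, or it exists but all of its values are null, use column7.
Edit, following up on my own comment: since Python does not evaluate the second condition when the first one is False, you can write: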
if 'column6' in logsDF.columns and logsDF['column6'].notnull().any():
    logsDF.select("column1","column2","column3","column4","column5","column6")
else:
    logsz84statsDF.select("column1","column2","column3","column4","column5","column7")
As long as 'column6' in logsDF.columns comes first, logsDF['column6'] is never evaluated when column6 is missing, so no KeyError is thrown.
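The short-circuit behavior can be checked without any DataFrame at all. The following is only a minimal sketch: the dict `table` and the helper `has_non_null` are hypothetical stand-ins, not part of the original code, with a plain dict playing the role of a table whose column6 is missing.

```python
# Hypothetical stand-in for a table: column name -> list of values.
table = {"column1": [1, 2], "column7": ["a", None]}

def has_non_null(table, name):
    """Return True if column `name` exists and holds at least one non-None value."""
    # `and` short-circuits: table[name] is only evaluated when the
    # membership test on the left is True, so no KeyError is raised.
    return name in table and any(v is not None for v in table[name])

# column6 is absent from the dict, so the fallback column is chosen.
chosen = "column6" if has_non_null(table, "column6") else "column7"
print(chosen)  # column7
```

Swapping the two operands of `and` would evaluate `table[name]` first and raise KeyError for the missing key, which is why the membership test has to come first.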
Answer 1 (score: 1)
if set(['A','C']).issubset(df.columns):
    df['sum'] = df['A'] + df['C']
set([]) can also be written with curly braces:
if {'A', 'C'}.issubset(df.columns):
For a discussion of the curly-brace syntax, see this question.
Alternatively, you can use a list comprehension, as in:
if all([item in df.columns for item in ['A','C']]):
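Both checks can be tried on a plain list of names. This is a small sketch under assumptions: `columns` is a hypothetical stand-in for df.columns, and the helpers `has_all` and `missing` are illustrative names that do not appear in the answer above.

```python
# Hypothetical stand-in for df.columns; any iterable of names behaves the same.
columns = ["A", "B", "C"]

def has_all(columns, required):
    """True when every name in `required` is present in `columns`."""
    return set(required).issubset(columns)

def missing(columns, required):
    """Names in `required` that are absent from `columns`, sorted."""
    return sorted(set(required) - set(columns))

print(has_all(columns, ["A", "C"]))   # True
print(has_all(columns, ["A", "D"]))   # False
print(missing(columns, ["A", "D"]))   # ['D']
# The all() form gives the same answer and needs no list brackets.
print(all(name in columns for name in ["A", "C"]))  # True
```

The set difference is handy when the goal is to report which columns are absent rather than only whether all of them are present.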
Answer 2 (score: 0)
I think this might be faster:
import pandas as pd

if 'column_name' not in df.columns:
    do_something
# elif, so the lookup below only runs when the column exists
elif len([x for x in df['column_name'].unique() if pd.isna(x)]) > 0:
    do_something_else