Question

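I have a PySpark DataFrame with many columns, and I want to select the columns whose names contain a certain string, plus some other columns. For example: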

df.columns = ['hello_world','hello_country','hello_everyone','byebye','ciao','index']

I want to select the columns that contain 'hello' as well as the column named 'index', so the result would be:

['hello_world','hello_country','hello_everyone','index']

I would like something like df.select('hello*','index')

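Thanks in advance :)

Edit:

I found a quick way to solve the problem, so I answered my own question, Q&A style. If anyone sees my solution and can offer a better one, I would appreciate it.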

Answer 1

I found a quick and elegant way:

selected = [s for s in df.columns if 'hello' in s]+['index']
df.select(selected)

With this solution I can add any other columns I want without editing the for loop that Ali AzG suggested.
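
The df.select('hello*','index') form asked for in the question can be approximated by expanding shell-style wildcards against df.columns before calling select. The following is a minimal sketch, not a PySpark feature: the helper name expand_columns is hypothetical, and it relies only on the standard-library fnmatch module. Note that 'hello*' matches names that start with 'hello', whereas the list comprehension above matches 'hello' anywhere in the name ('*hello*' would be the wildcard equivalent).

```python
from fnmatch import fnmatchcase


def expand_columns(columns, *patterns):
    """Return the names in `columns` that match any shell-style pattern.

    Hypothetical helper, not a PySpark API. Results follow the order of
    `patterns`, then the order of `columns`, with duplicates dropped.
    """
    selected = []
    for pattern in patterns:
        for name in columns:
            # fnmatchcase does a case-sensitive wildcard match
            if fnmatchcase(name, pattern) and name not in selected:
                selected.append(name)
    return selected


columns = ['hello_world', 'hello_country', 'hello_everyone', 'byebye', 'ciao', 'index']
selected = expand_columns(columns, 'hello*', 'index')

# With a real DataFrame this would be used as:
# df.select(expand_columns(df.columns, 'hello*', 'index'))
```

Because the expansion happens on plain Python strings, the helper can be checked without a Spark session, and the resulting list is passed to df.select exactly like the list built in the answer above.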

Answer 2

You can also try the colRegex function, introduced in Spark 2.3, which lets you specify the column name as a regular expression.

Hope this helps.

Regards,

Neeraj
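
As a rough illustration of the colRegex suggestion, the sketch below wraps the call in two small functions. The names prefix_regex and select_with_prefix are hypothetical, df is assumed to be a pyspark.sql.DataFrame on Spark 2.3 or later, and the prefix is assumed to contain no regex metacharacters. colRegex expects the regular expression to be wrapped in backticks.

```python
def prefix_regex(prefix):
    # colRegex expects the regular expression wrapped in backticks.
    # Assumption: `prefix` contains no regex metacharacters.
    return "`" + prefix + ".*`"


def select_with_prefix(df, prefix, *extra):
    # Assumption: df is a pyspark.sql.DataFrame (Spark 2.3 or later).
    # df.colRegex returns a Column that expands to every matching column;
    # the extra names are passed through to select unchanged.
    return df.select(df.colRegex(prefix_regex(prefix)), *extra)


# Usage with a real DataFrame:
# select_with_prefix(df, 'hello', 'index')
```

Unlike the list-comprehension approach, the matching here is resolved by Spark rather than in Python, so the pattern follows Java regular expression syntax.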

Answer 3

This sample code does what you want:

hello_cols = []

for name in df.columns:
  # keep the column named exactly 'index' plus any column containing 'hello'
  if name == 'index' or 'hello' in name:
    hello_cols.append(name)

df.select(*hello_cols)