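Select only the numeric/string column names from a Spark DataFrame in PySpark

Time: 2017-05-19 09:43:25

Tags: python apache-spark pyspark

I have a Spark DataFrame in PySpark (2.1.0), and I want to get the names of only the numeric columns, or only the string columns.

For example, this is the schema of my DF: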

root
 |-- Gender: string (nullable = true)
 |-- SeniorCitizen: string (nullable = true)
 |-- MonthlyCharges: double (nullable = true)
 |-- TotalCharges: double (nullable = true)
 |-- Churn: string (nullable = true)

And this is what I need:

num_cols = ['MonthlyCharges', 'TotalCharges']
str_cols = ['Gender', 'SeniorCitizen', 'Churn']

How can I do this? Thanks!

3 answers:

Answer 0 (score: 11)

df.dtypes is a list of (columnName, type) tuples, so a simple filter on the type string is enough.
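
As a minimal sketch of such a filter (not necessarily the answerer's exact code): the `dtypes` list below is written by hand to mirror what `df.dtypes` would return for the schema in the question, and the `numeric_names` set is an assumption about which Spark type names should count as numeric.

```python
# Hand-written stand-in for df.dtypes, mirroring the schema in the question.
# With a real DataFrame you would write: dtypes = df.dtypes
dtypes = [
    ("Gender", "string"),
    ("SeniorCitizen", "string"),
    ("MonthlyCharges", "double"),
    ("TotalCharges", "double"),
    ("Churn", "string"),
]

# Assumption: these type names are the ones treated as numeric here.
numeric_names = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def is_numeric(dtype):
    # Decimal columns show up with precision and scale, e.g. "decimal(10,2)",
    # so they are matched by prefix instead of by exact name.
    return dtype in numeric_names or dtype.startswith("decimal")


str_cols = [name for name, dtype in dtypes if dtype == "string"]
num_cols = [name for name, dtype in dtypes if is_numeric(dtype)]

print(num_cols)
print(str_cols)
```

Exact name matching is used for the integer types because a bare prefix test such as `startswith("int")` would also match interval types.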

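Answer 1 (score: 3)

PySpark provides a rich API around schema types. As @DanieldePaula mentioned, you can access the fields' metadata through df.schema.fields.

Here is another approach, based on checking the type of each field: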

from pyspark.sql.types import StringType, DoubleType

df = spark.createDataFrame([
  [1, 2.3, "t1"],
  [2, 5.3, "t2"],
  [3, 2.1, "t3"],
  [4, 1.5, "t4"]
], ["cola", "colb", "colc"])

# get string
str_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StringType)]
# ['colc']

# or double
dbl_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, DoubleType)]
# ['colb']

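Answer 2 (score: 0)

You can do what zlidme suggested to get only the string (categorical) columns. To extend that answer, take a look at the example below. It gives you all numeric (continuous) columns in a list called continuousCols, all categorical columns in a list called categoricalCols, and all columns in a list called allCols.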

import pandas as pd

# assumes an active SparkSession named `spark`, as in the pyspark shell
data = {'mylongint': [0, 1, 2],
        'shoes': ['blue', 'green', 'yellow'],
        'hous': ['furnitur', 'roof', 'foundation'],
        'C': [1, 0, 0]}

play_df = pd.DataFrame(data)
play_ddf = spark.createDataFrame(play_df)

# store all column names in a list
allCols = [item[0] for item in play_ddf.dtypes]

# store the names of all categorical (string) columns in a list
categoricalCols = [item[0] for item in play_ddf.dtypes if item[1].startswith('string')]

# store the names of all continuous (bigint) columns in a list
continuousCols = [item[0] for item in play_ddf.dtypes if item[1].startswith('bigint')]

print(len(allCols), ' - ', len(continuousCols), ' - ', len(categoricalCols))

This gives the counts 4 - 2 - 2: four columns in total, two continuous and two categorical.