Replacing ',' with '.' in PySpark

Asked: 2017-07-11 10:21:47

Tags: python apache-spark dataframe casting pyspark

I have a DataFrame with some attributes that looks like this:

+-------+-------+
| Atr1  | Atr2  |
+-------+-------+
|  3,06 |  4,08 |
|  3,03 |  4,08 |
|  3,06 |  4,08 |
|  3,06 |  4,08 |
|  3,06 |  4,08 |
|  ...  |  ...  |
+-------+-------+

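As you can see, the values of Atr1 and Atr2 are numbers that contain a ',' character. This is because I loaded the data from a CSV file in which the decimal part of DoubleType numbers is separated by ','.

When I load the data into the DataFrame, the values are read as strings, so I cast those attributes from String to DoubleType: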

from pyspark.sql.types import DoubleType

df = df.withColumn("Atr1", df["Atr1"].cast(DoubleType()))
df = df.withColumn("Atr2", df["Atr2"].cast(DoubleType()))

But when I do this, the values are converted to null:

+-------+-------+
| Atr1  | Atr2  |
+-------+-------+
|  null |  null |
|  null |  null |
|  null |  null |
|  null |  null |
|  null |  null |
|  ...  |  ...  |
+-------+-------+

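I guess the reason is that DoubleType decimals must be separated by '.' rather than ','. I am not able to edit the CSV file, so I want to replace the ',' characters in the DataFrame with '.' and then apply the cast to DoubleType.

How can I do this?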

4 answers:

Answer 0 (score: 5)

You can solve this simply with a user-defined function.

Edit: a more compact solution, as suggested in the comments.

from pyspark.sql import Row
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType, StringType

data = [Row(Atr1="3,06", Atr2="4,08"),
        Row(Atr1="3,06", Atr2="4,08"),
        Row(Atr1="3,06", Atr2="4,08")]

df = sqlContext.createDataFrame(data)

# Create a user-defined function that replaces ',' with '.'
comma_to_dot = udf(lambda x: x.replace(",", "."), StringType())

# Parentheses let the method chain span several lines
out = (df
   .withColumn("Atr1", comma_to_dot(col("Atr1")).cast(DoubleType()))
   .withColumn("Atr2", comma_to_dot(col("Atr2")).cast(DoubleType())))

##############################################################
out.show()

+----+----+
|Atr1|Atr2|
+----+----+
|3.06|4.08|
|3.06|4.08|
|3.06|4.08|
+----+----+

##############################################################

out.printSchema()

root
 |-- Atr1: double (nullable = true)
 |-- Atr2: double (nullable = true)

Answer 1 (score: 1)

Suppose you have:

sdf.show()
+-------+-------+
|   Atr1|   Atr2|
+-------+-------+
|  3,06 |  4,08 |
|  3,03 |  4,08 |
|  3,06 |  4,08 |
|  3,06 |  4,08 |
|  3,06 |  4,08 |
+-------+-------+

Then the following code produces the desired result:

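from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Replace ',' with '.' and parse the result as a Python float
strToDouble = udf(lambda x: float(x.replace(",",".")), DoubleType())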

sdf = sdf.withColumn("Atr1", strToDouble(sdf['Atr1']))
sdf = sdf.withColumn("Atr2", strToDouble(sdf['Atr2']))

sdf.show()
+----+----+
|Atr1|Atr2|
+----+----+
|3.06|4.08|
|3.03|4.08|
|3.06|4.08|
|3.06|4.08|
|3.06|4.08|
+----+----+
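Note that the lambda above raises an exception inside the UDF if a value is null or is not a number. A minimal null-safe sketch follows; the names `parse_decimal_comma` and `to_double_columns` are hypothetical helpers introduced here for illustration, not part of PySpark, and returning null for unparsable input is an assumption about the desired behaviour.

```python
def parse_decimal_comma(value):
    """Parse a string such as ' 3,06 ' into a float; return None if that fails."""
    if value is None:
        return None
    try:
        return float(value.strip().replace(",", "."))
    except ValueError:
        return None


def to_double_columns(sdf, column_names):
    """Return sdf with each named string column converted to DoubleType."""
    # Imported here so the parsing helper above can be used without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    to_double = udf(parse_decimal_comma, DoubleType())
    for name in column_names:
        sdf = sdf.withColumn(name, to_double(sdf[name]))
    return sdf
```

It could be called as `sdf = to_double_columns(sdf, ["Atr1", "Atr2"])`.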

Answer 2 (score: 0)

You can also do this with SQL alone (the example below is in Scala).

val df = sc.parallelize(Array(
      ("3,06", "4,08"),
      ("3,06", "4,08"),
      ("3,06", "4,08"),
      ("3,06", "4,08"),
      ("3,06", "4,08"),
      ("3,06", "4,08"),
      ("3,06", "4,08"),
      ("3,06", "4,08")
      )).toDF("a", "b")

df.registerTempTable("test")

val doubleDF = sqlContext.sql("select cast(trim(regexp_replace( a , ',' , '.')) as double) as a from test ")

doubleDF.show
+----+
|   a|
+----+
|3.06|
|3.06|
|3.06|
|3.06|
|3.06|
|3.06|
|3.06|
|3.06|
+----+

doubleDF.printSchema
root
 |-- a: double (nullable = true)
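Since the question is about PySpark, the same SQL expression can probably be reused from Python through `DataFrame.selectExpr`. The sketch below assumes `df` is a PySpark DataFrame; `double_cast_expr` and `select_as_doubles` are hypothetical helper names, and the backticks are only there to quote the column name.

```python
def double_cast_expr(column_name):
    """Build the SQL expression from the answer above for one column."""
    return "cast(trim(regexp_replace(`{0}`, ',', '.')) as double) as `{0}`".format(column_name)


def select_as_doubles(df, column_names):
    """Select the named columns of a PySpark DataFrame, each cast to double."""
    # selectExpr returns only the columns listed here; other columns are dropped.
    return df.selectExpr(*[double_cast_expr(name) for name in column_names])
```

For the question's data this would be called as `select_as_doubles(df, ["Atr1", "Atr2"])`. No UDF is involved, so the replacement runs inside Spark's SQL engine rather than in Python.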

Answer 3 (score: 0)

Is it possible to pass the column name as a parameter to the col() function in the example code? Something like this:

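from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType, StringType

# Create a user-defined function that replaces ',' with '.'
comma_to_dot = udf(lambda x: x.replace(",", "."), StringType())

col_name1 = "Atr1"
col_name2 = "Atr2"

# Parentheses let the method chain span several lines
out = (df
   .withColumn(col_name1, comma_to_dot(col(col_name1)).cast(DoubleType()))
   .withColumn(col_name2, comma_to_dot(col(col_name2)).cast(DoubleType())))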
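`col()` and `withColumn()` both take the column name as an ordinary Python string, so a variable holding the name works. Building on that, a list of names can be processed in a loop. This is only a sketch: `transform_columns` is a hypothetical helper, and `transform` is assumed to be any function that maps a Column to a Column.

```python
def transform_columns(df, column_names, transform):
    """Apply transform to each named column and return the resulting DataFrame."""
    for name in column_names:
        # df[name] looks the column up by its string name.
        df = df.withColumn(name, transform(df[name]))
    return df
```

A call might look like `out = transform_columns(df, [col_name1, col_name2], lambda c: my_udf(c).cast(DoubleType()))`, where `my_udf` stands for whichever UDF replaces the comma.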