I have a Spark DataFrame containing string data that I want to map to numeric data, as shown below (simplified version):
+--------------------+-------+----------+-------------------------+
|     participantUUID|001_Age|002_Gender|003_Where did you grow up|
+--------------------+-------+----------+-------------------------+
|010A0550-4324-490...|     23|    Female|                In a town|
|031C5411-FE42-429...|     56|      Male|                In a town|
|038688FF-B5DA-484...|     32|    Female|                In a town|
|05F8E1AF-AFDD-441...|     54|    Female|          Multiple places|
|068B213C-3303-41E...|     23|    Female|                In a town|
|11A9A444-3E93-468...|     39|    Female|                In a town|
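There are many columns, so rather than writing out the mapping for each column by hand, I want to apply it across the whole DataFrame, one column at a time.
The mapping from strings to numbers varies by column. For example, for one column the strings "Poor", "Fair", "Good", "Very good" would score 1, 2, 3, 4; for another column the scores might be 4, 3, 2, 1. So I want to write a UDF that takes the column header and the string value as arguments, and then apply it over the DataFrame's columns with foldLeft, as follows: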
val calculateScore = udf((columnName: String, answerText: String) => (columnName, answerText) match {
  case ("002_Gender", "Female") => 0
  case ("002_Gender", "Male") => 1
  case ("002_Gender", "Other") => 2
  case ("003_Where did you grow up", "In a village") => 0
  case ("003_Where did you grow up", "In a town") => 1
  case ("003_Where did you grow up", "Multiple places") => 2
  case _ => -1
})
val columnNames = Seq("001_Age", "002_Gender", "003_Where did you grow up")
val newDF: DataFrame = columnNames.foldLeft(baseDF)(
  (baseDF, c) =>
    baseDF.withColumn(c.concat("_numeric"), calculateScore(baseDF(c), baseDF(c)))
)
However, this does not return the correct result: every value comes out as -1, which means the UDF is not matching correctly:
+--------------------+----------------+----------+------------------+-------------------------+---------------------------------+
|     participantUUID|assessmentNumber|002_Gender|002_Gender_numeric|003_Where did you grow up|003_Where did you grow up_numeric|
+--------------------+----------------+----------+------------------+-------------------------+---------------------------------+
|010A0550-4324-490...|               0|    Female|                -1|                In a town|                               -1|
|031C5411-FE42-429...|               0|      Male|                -1|                In a town|                               -1|
|038688FF-B5DA-484...|               0|    Female|                -1|                In a town|                               -1|
|05F8E1AF-AFDD-441...|               0|    Female|                -1|          Multiple places|                               -1|
|068B213C-3303-41E...|               0|    Female|                -1|                In a town|                               -1|
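I think this is down to how the calculateScore udf is called: it should take the column name as a string, plus the answer text, and return an Int, evaluated row by row over the column. In other words, the foldLeft statement has the form:
val newDF: DataFrame = columnNames.foldLeft[DataFrame](baseDF)(
  (acc, c) =>
    acc.withColumn(c, col(c))
)
so calculateScore(baseDF(c), baseDF(c)) should return an object of type Column, but something has clearly gone wrong.
Any ideas would be much appreciated, thanks!
NB: I have looked at "Apply UDF to multiple columns in Spark Dataframe", but I don't like the idea of using a var DF, since to my mind it goes against the principles of immutable programming in Scala!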
Answer 0 (score: 0)
You are passing exactly the same argument to the UDF twice, so the column value arrives as both parameters and falls through to the default case _. You need to pass lit(c) as the first argument.
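A minimal sketch of that fix, written for spark-shell or a Scala script. The two-row DataFrame is toy data standing in for the question's baseDF (an assumption), and the fold's accumulator is renamed to acc so that it no longer shadows baseDF. The only behavioural change from the question's code is lit(c) as the first argument.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, udf}

// Local session for the sketch; in spark-shell a session named `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("lit-column-name").getOrCreate()
import spark.implicits._

// Toy data standing in for the question's DataFrame (an assumption).
val baseDF = Seq(("Female", "In a town"), ("Male", "Multiple places"))
  .toDF("002_Gender", "003_Where did you grow up")

// Two-argument UDF, unchanged from the question.
val calculateScore = udf((columnName: String, answerText: String) =>
  (columnName, answerText) match {
    case ("002_Gender", "Female") => 0
    case ("002_Gender", "Male") => 1
    case ("002_Gender", "Other") => 2
    case ("003_Where did you grow up", "In a village") => 0
    case ("003_Where did you grow up", "In a town") => 1
    case ("003_Where did you grow up", "Multiple places") => 2
    case _ => -1
  })

val columnNames = Seq("002_Gender", "003_Where did you grow up")

// lit(c) wraps the column *name* as a constant Column;
// acc(c) supplies that column's value for each row.
val newDF = columnNames.foldLeft(baseDF) { (acc, c) =>
  acc.withColumn(s"${c}_numeric", calculateScore(lit(c), acc(c)))
}

newDF.show(false)
```

With baseDF(c) in both positions, the UDF sees pairs such as ("Female", "Female"), which match none of the cases; with lit(c) it sees ("002_Gender", "Female") and so on.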
Answer 1 (score: 0)
// Assumes a spark-shell session, where spark.implicits._ (needed for toDF) is already in scope.
import org.apache.spark.sql.functions.udf

val baseDF = Seq(("Female", "In a town"), ("Male", "Multiple places")).toDF("002_Gender", "003_Where did you grow up")
baseDF.show
+----------+-------------------------+
|002_Gender|003_Where did you grow up|
+----------+-------------------------+
|    Female|                In a town|
|      Male|          Multiple places|
+----------+-------------------------+
// Curried variant: the column name is captured by the closure,
// so the UDF itself only takes the answer text.
def calculateScore(columnName: String) = udf((answerText: String) => (columnName, answerText) match {
  case ("002_Gender", "Female") => 0
  case ("002_Gender", "Male") => 1
  case ("002_Gender", "Other") => 2
  case ("003_Where did you grow up", "In a village") => 0
  case ("003_Where did you grow up", "In a town") => 1
  case ("003_Where did you grow up", "Multiple places") => 2
  case _ => -1
})
val columnNames = Seq("002_Gender", "003_Where did you grow up")
// The accumulator is named acc so that it does not shadow the outer baseDF.
val newDF = columnNames.foldLeft(baseDF)(
  (acc, c) =>
    acc.withColumn(c.concat("_numeric"), calculateScore(c)(acc(c)))
)
newDF.show
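Since the question mentions many columns, each with its own mapping, the pattern match could also be written as data. The following is only a sketch in plain Scala (no Spark needed): scoreTable and lookupScore are hypothetical names introduced here, not part of either answer, and the table holds just the two columns from the question.

```scala
// Hypothetical lookup table: column name -> (answer text -> score).
val scoreTable: Map[String, Map[String, Int]] = Map(
  "002_Gender" -> Map("Female" -> 0, "Male" -> 1, "Other" -> 2),
  "003_Where did you grow up" -> Map(
    "In a village" -> 0,
    "In a town" -> 1,
    "Multiple places" -> 2
  )
)

// Same contract as the pattern match: an unknown column or answer yields -1.
def lookupScore(columnName: String, answerText: String): Int =
  scoreTable.get(columnName).flatMap(_.get(answerText)).getOrElse(-1)

// Column name plus answer text resolves to the expected score...
println(lookupScore("002_Gender", "Male"))
// ...whereas passing the answer text in both positions, as in the question, falls back to -1.
println(lookupScore("Male", "Male"))
```

Under the same assumptions, the body of the curried UDF above could then become lookupScore(columnName, answerText), leaving the foldLeft unchanged.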