Using a UDF

Time: 2018-10-26 16:17:19

Tags: python apache-spark pyspark apache-spark-sql

I am trying to create a new column from another column in Apache Spark.

The data (greatly abbreviated) looks like

Date    Day_of_Week
2018-05-26T00:00:00.000+0000    5
2018-05-05T00:00:00.000+0000    6

and should end up looking like

Date    Day_of_Week    Weekday
2018-05-26T00:00:00.000+0000    5    Thursday
2018-05-05T00:00:00.000+0000    6    Friday

I tried the suggestions from the manual (https://docs.databricks.com/spark/latest/spark-sql/udf-python.html#register-the-function-as-a-udf), from How to pass a constant value to Python UDF?, and from PySpark add a column to a DataFrame from a TimeStampType column,

which led to:

def int2day (day_int):
  if day_int == 1:
    return 'Sunday'
  elif day_int == 2:
    return 'Monday'
  elif day_int == 3:
    return 'Tuesday'
  elif day_int == 4:
    return 'Wednesday'
  elif day_int == 5:
    return 'Thursday'
  elif day_int == 6:
    return 'Friday'
  elif day_int == 7:
    return 'Saturday'
  else:
    return 'FAIL'

spark.udf.register("day", int2day, IntegerType())
df2 = df.withColumn("Day", day("Day_of_Week"))

and gave a very long error:

SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 1 times, most recent failure: Lost task 0.0 in stage 7.0 (TID 8, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 262, in main
    process()
  File "/databricks/spark/python/pyspark/worker.py", line 257, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/databricks/spark/python/pyspark/serializers.py", line 325, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/databricks/spark/python/pyspark/serializers.py", line 141, in dump_stream
    self._write_with_length(obj, stream)
  File "/databricks/spark/python/pyspark/serializers.py", line 151, in _write_with_length
    serialized = self.dumps(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 556, in dumps
    return pickle.dumps(obj, protocol)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed

I don't see how to apply How to pass a constant value to Python UDF? here, because their example is much simpler (just true or false).

I also tried using map, as in PySpark add a column to a DataFrame from a TimeStampType column.

But

df3 = df2.withColumn("weekday", map(lambda x: int2day, col("Date"))) just says TypeError: argument 2 to map() must support iteration, yet I thought col does support iteration.

I have read every example I can find online. I don't see how to apply those other questions to my case.

How can I add a new column that is a function of another column?

1 Answer:

Answer 0 (score: 1):

You don't need a UDF at all to accomplish what you're trying to do here. You can leverage the built-in pyspark date_format function to extract the name of each day of the week given a date in a column.

import pyspark.sql.functions as func

df = df.withColumn("day_of_week", func.date_format(func.col("Date"), "EEEE"))
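
For example, end to end (a runnable sketch, assuming an active SparkSession named spark; the DataFrame construction below is only a stand-in for the question's data):

import pyspark.sql.functions as func

# Illustrative stand-in for the question's sample rows
df = spark.createDataFrame(
    [("2018-05-26T00:00:00.000+0000", 5), ("2018-05-05T00:00:00.000+0000", 6)],
    ["Date", "Day_of_Week"],
)

# "EEEE" is the full day-name pattern; Spark casts the string to a timestamp
# before formatting (if the implicit cast fails for this timestamp layout,
# parse explicitly with func.to_timestamp first)
df = df.withColumn("day_of_week", func.date_format(func.col("Date"), "EEEE"))
df.show(truncate=False)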

The result is a new column added to your dataframe called day_of_week, which will show Sunday, Monday, Tuesday, etc., based on the value in the Date column.
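
As an aside, if you ever do need the UDF route: int2day returns strings, so registering it with IntegerType() declares the wrong return type, and spark.udf.register makes the function available inside SQL strings; for DataFrame code you want the callable that pyspark.sql.functions.udf gives you. A sketch (not required for this problem):

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Declare StringType to match the strings int2day actually returns
int2day_udf = udf(int2day, StringType())
df2 = df.withColumn("Weekday", int2day_udf(col("Day_of_Week")))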