A function that operates on a Spark column passed as a parameter

Asked: 2016-07-16 09:51:10

Tags: python apache-spark pyspark

EDIT: In the end I figured it out myself. I had been using the column function inside select(), which is why it didn't work. I've added my solution to the original question as comments, in case it's useful to someone else.

I'm taking an online course in which I'm supposed to write the following function:

# TODO: Replace <FILL IN> with appropriate code

# Note that you shouldn't use any RDD operations or need to create custom user defined functions (udfs) to accomplish this task

from pyspark.sql.functions import regexp_replace, trim, col, lower

def removePunctuation(column):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.

    Note:
        Only spaces, letters, and numbers should be retained.  Other characters should be
        eliminated (e.g. it's becomes its).  Leading and trailing spaces should be removed after
        punctuation is removed.

    Args:
        column (Column): A Column containing a sentence.

    Returns:
        Column: A Column named 'sentence' with clean-up operations applied.
    """

    # EDIT: MY SOLUTION
    # column = lower(column)
    # column = regexp_replace(column, r'([^a-z\d\s])+', r'')
    # return trim(column).alias('sentence')

    return <FILL IN>

sentenceDF = sqlContext.createDataFrame([('Hi, you!',),
                                         (' No under_score!',),
                                         (' *      Remove punctuation then spaces  * ',)], ['sentence'])
sentenceDF.show(truncate=False)
(sentenceDF
 .select(removePunctuation(col('sentence')))
 .show(truncate=False))
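For reference, the commented-out solution applies the same cleanup to every row of the column. The per-row string semantics of that regex can be checked with plain Python's re module (this only illustrates what the regex does; the actual function builds a Column expression and never touches strings directly):

```python
import re

def remove_punctuation_str(s):
    # Mirrors lower -> regexp_replace(r'([^a-z\d\s])+', '') -> trim
    s = s.lower()
    s = re.sub(r'([^a-z\d\s])+', '', s)
    return s.strip()

for s in ['Hi, you!', ' No under_score!',
          ' *      Remove punctuation then spaces  * ']:
    print(remove_punctuation_str(s))
# hi you
# no underscore
# remove punctuation then spaces
```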

I had already written code that produces the desired output when run on the DataFrame itself:

# Lower case
# (distinct variable name so the imported lower() function isn't shadowed)
lowered = sentenceDF.select(lower(col('sentence')).alias('lower'))
lowered.show()

# Remove punctuation
cleaned = lowered.select(regexp_replace(col('lower'), r'([^a-z\d\s])+', r'').alias('cleaned'))
cleaned.show()

# Trim
sentenceDF = cleaned.select(trim(col('cleaned')).alias('sentence'))
sentenceDF.show(truncate=False)

I just don't know how to implement this code inside my function, because the function doesn't run on the DataFrame, only on the given column. I tried different approaches; one was to create a new DataFrame from the column input

[...]
df = sqlContext.createDataFrame(column, ['sentence'])
[...]

inside the function, but it didn't work: TypeError: Column is not iterable. Other attempts tried to run the operations directly on the column inside the function, and they always ended in TypeError: 'Column' object is not callable.
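The first error comes from createDataFrame expecting row data (something iterable), while a Column is just an unevaluated expression; pyspark's Column raises exactly this error when something tries to iterate it. A minimal stand-in (plain Python, not the real pyspark class) reproduces the failure mode:

```python
class FakeColumn:
    """Hypothetical stand-in for pyspark.sql.Column: an unevaluated
    expression, not a container of row values."""
    def __iter__(self):
        # pyspark.sql.Column raises the same TypeError on iteration
        raise TypeError("Column is not iterable")

try:
    list(FakeColumn())   # roughly what createDataFrame attempts
except TypeError as exc:
    print(exc)           # Column is not iterable
```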

I only started using (Py)Spark a few days ago and still have conceptual questions about how to work with rows and columns. I'd really appreciate any help with this problem.

1 answer:

Answer 0 (score: 0)

You can do this in one line:

return re.sub(r'[^a-z0-9\s]', '', text.lower()).strip()
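Note that this answer operates on a plain Python string (`text`), not a Spark Column, so inside Spark it would have to run as a UDF, which the exercise forbids. Checking the string version on its own (the function name `clean_text` is just for this illustration):

```python
import re

def clean_text(text):
    # String-level version of the answer: lowercase, drop everything
    # except letters, digits, and whitespace, then trim.
    return re.sub(r'[^a-z0-9\s]', '', text.lower()).strip()

print(clean_text('Hi, you!'))  # hi you
```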