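EDIT: In the end I figured it out myself. I kept using select() on the column inside the function, which is why it did not work. I added my solution to the original question as comments, in case it is useful to someone else.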
I am taking an online course in which I am supposed to write the following function:
# TODO: Replace <FILL IN> with appropriate code
# Note that you shouldn't use any RDD operations or need to create custom user defined functions (udfs) to accomplish this task
from pyspark.sql.functions import regexp_replace, trim, col, lower
def removePunctuation(column):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.

    Note:
        Only spaces, letters, and numbers should be retained. Other characters should be
        eliminated (e.g. it's becomes its). Leading and trailing spaces should be removed after
        punctuation is removed.

    Args:
        column (Column): A Column containing a sentence.

    Returns:
        Column: A Column named 'sentence' with clean-up operations applied.
    """
    # EDIT: MY SOLUTION
    # column = lower(column)
    # column = regexp_replace(column, r'([^a-z\d\s])+', r'')
    # return trim(column).alias('sentence')
    return <FILL IN>
sentenceDF = sqlContext.createDataFrame([('Hi, you!',),
                                         (' No under_score!',),
                                         (' * Remove punctuation then spaces * ',)], ['sentence'])
sentenceDF.show(truncate=False)
(sentenceDF
 .select(removePunctuation(col('sentence')))
 .show(truncate=False))
I have already written code that produces the desired output when operating on the DataFrame itself:
# Lower case (stored as `lowered` so the imported lower() function is not shadowed)
lowered = sentenceDF.select(lower(col('sentence')).alias('lower'))
lowered.show()
# Remove Punctuation
cleaned = lowered.select(regexp_replace(col('lower'), r'([^a-z\d\s])+', r'').alias('cleaned'))
cleaned.show()
# Trim
sentenceDF = cleaned.select(trim(col('cleaned')).alias('sentence'))
sentenceDF.show(truncate=False)
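I just don't know how to implement this code in my function, since it operates not on a DataFrame but only on the given column. I tried different approaches. One was to create a new DataFrame from the input column inside the function using
[...]
df = sqlContext.createDataFrame(column, ['sentence'])
[...]
but that fails with TypeError: Column is not iterable. The other approaches tried to operate directly on the column inside the function and always resulted in TypeError: 'Column' object is not callable.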
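I started using (Py)Spark a few days ago and still have conceptual problems with how to work with only rows and columns. I would really appreciate any help with my current problem.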
Answer 0: (score: 0)
You can do this in one line, for a plain Python string text (requires import re):
return re.sub(r'[^a-z0-9\s]', '', text.lower().strip()).strip()
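As a quick check of that one-liner, here is a small sketch that wraps it in a function and runs it on the three sample sentences from the question. The function name remove_punctuation_text is hypothetical. Note that this works on Python str values only; applying it to a DataFrame column would require a UDF or an RDD map, which the exercise explicitly rules out.

```python
import re


def remove_punctuation_text(text):
    """Lower-case, drop everything except a-z, 0-9 and whitespace, then strip."""
    return re.sub(r'[^a-z0-9\s]', '', text.lower().strip()).strip()


samples = ['Hi, you!', ' No under_score!', ' * Remove punctuation then spaces * ']
cleaned = [remove_punctuation_text(s) for s in samples]
print(cleaned)
```

The second strip() matters: removing the asterisks in the third sentence exposes new leading and trailing spaces that the first strip() could not see.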