Question

I have an assignment to build a function, removePunctuation, that removes punctuation and should pass this test:

# TEST Capitalization and punctuation (4b)
testPunctDF = sqlContext.createDataFrame([(" The Elephant's 4 cats. ",)])
testPunctDF.show()
Test.assertEquals(testPunctDF.select(removePunctuation(col('_1'))).first()[0],
                  'the elephants 4 cats',
                  'incorrect definition for removePunctuation function')

This is what I have managed to write so far:

def removePunctuation(column):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.

    Note:
        Only spaces, letters, and numbers should be retained.  Other characters should be
        eliminated (e.g. it's becomes its).  Leading and trailing spaces should be removed after
        punctuation is removed.

    Args:
        column (Column): A Column containing a sentence.

    Returns:
        Column: A Column named 'sentence' with clean-up operations applied.
    """

    return lower(trim(regexp_replace("column_name", "[\W_]+"," "))).alias("sentence");

But I still can't get regexp_replace to work with the alias "sentence". I get this error:

AnalysisException: u"cannot resolve 'sentence' given input columns: [_1];"

Answer 1

I would try:

stringWithPunctuation.translate(None, string.punctuation)

It runs in C under the hood, so it is about as efficient as it gets!
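The two-argument translate(None, ...) call above is the Python 2 form of str.translate. As a rough sketch, assuming Python 3 and a plain Python string rather than a Spark Column, the same idea could look like this (strip_punctuation and _DELETE_PUNCT are hypothetical names chosen for illustration):

```python
import string

# Translation table that deletes every ASCII punctuation character.
_DELETE_PUNCT = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Remove ASCII punctuation, trim surrounding whitespace, and lowercase."""
    return text.translate(_DELETE_PUNCT).strip().lower()

print(strip_punctuation(" The Elephant's 4 cats. "))  # the elephants 4 cats
```

string.punctuation only covers ASCII punctuation, so other symbols would be left in place. Applying this to a DataFrame column would presumably need a UDF, which gives up the built-in column functions.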

Your attempt:

return lower(trim(regexp_replace("column_name", "[\W_]+"," "))).alias("sentence");

doesn't seem to use the parameter column anywhere, which would explain the error.
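One possible fix, sketched under a few assumptions: pass the column argument itself to regexp_replace, and delete the unwanted characters instead of replacing them with a space. Replacing with a space would turn "Elephant's" into "elephant s", which the test does not expect. PUNCT_PATTERN and remove_punctuation_py are hypothetical names; the plain-Python helper is only there to sanity-check the pattern without a Spark session.

```python
import re

# Hypothetical pattern: anything that is not an ASCII letter, digit, or whitespace.
PUNCT_PATTERN = r"[^a-zA-Z0-9\s]"

def remove_punctuation_py(text):
    """Plain-Python stand-in used to sanity-check the pattern without Spark."""
    return re.sub(PUNCT_PATTERN, "", text).strip().lower()

def removePunctuation(column):
    """Same clean-up expressed on a Spark Column; returns a Column named 'sentence'."""
    # Imported here so the plain-Python check above runs without PySpark installed.
    from pyspark.sql.functions import lower, regexp_replace, trim
    # Pass the Column argument itself, not a string literal naming some other column.
    return lower(trim(regexp_replace(column, PUNCT_PATTERN, ""))).alias("sentence")

print(remove_punctuation_py(" The Elephant's 4 cats. "))  # the elephants 4 cats
```

The two versions are not exactly equivalent: Python's strip() removes any surrounding whitespace, while Spark's trim removes spaces, so they should agree on the test input but may differ on tabs or newlines.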

Answer 2

Surprisingly, I could only pass a Column object to regexp_replace(), not a column name.

How can I get the name of a Column, or change its existing name?
