I have some strange characters in one of my Spark DataFrame columns and I want to remove them. When I select that particular column and do a .show(), I see the following:
|Dominant technology firm seeks ambitious, assertive, confident, headstrong salesperson to lead our organization into the next era! If you are ready to thrive in a highly competitive environment, this is the job for you. ¥ Superior oral and written communication skills¥ Extensive experience with negotiating and closing sales ¥ Outspoken ¥ Thrives in competitive environment¥ Self-reliant and able to succeed in an independent setting ¥ Manage portfolio of clients ¥ Aggressively close sales to exceed quarterly quotas ¥ Deliver expertise to clients as needed ¥ Lead the company into new markets|
The character you see is ¥.
I wrote the following code to remove it from the 'description' column of the DataFrame:
from pyspark.sql.functions import udf
charReplace=udf(lambda x: x.replace('¥',''))
train_cleaned=train_triLabel.withColumn('description',charReplace('description'))
train_cleaned.show(2,truncate=False)
However, it throws an error:
File "/Users/i854319/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/Users/i854319/spark/python/pyspark/sql/functions.py", line 1563, in <lambda>
func = lambda _, it: map(lambda x: returnType.toInternal(f(*x)), it)
File "<ipython-input-32-864efe6f3257>", line 3, in <lambda>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
However, when I do this on a test string, the replace method recognizes the character:
s='hello ¥'
print s
hello ¥
s.replace('¥','')
Out[37]: 'hello '
Any idea where I'm going wrong?
Answer 0 (score: 3)
Use a Unicode literal:
charReplace = udf(lambda x: x.replace(u'¥',''))
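This is a Python 2 string issue rather than a Spark one. Spark hands the UDF unicode objects, while '¥' written without the u prefix is a UTF-8 byte string (b'\xc2\xa5'). Calling unicode.replace() with a byte-string argument makes Python 2 implicitly decode that argument with the ascii codec, which fails on byte 0xc2 in position 0, exactly the error in the traceback. The standalone test worked because s there was also a byte string, so no decoding took place.

As an alternative, the built-in regexp_replace column function avoids Python string handling entirely. A minimal sketch, assuming the column is named 'description' as in the question:

from pyspark.sql.functions import regexp_replace

# regexp_replace runs in the JVM, so no Python UDF or
# byte-string/unicode mixing is involved
train_cleaned = train_triLabel.withColumn(
    'description', regexp_replace('description', u'¥', ''))
train_cleaned.show(2, truncate=False)

Since ¥ is not a regex metacharacter it is matched literally, and regexp_replace is typically faster than an equivalent Python UDF because the rows never have to be serialized out to a Python worker.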