Question

I need to remove non-printable characters from an RDD.

Sample data is below:

"@TSX•","None"
"@MJU•","None"

Expected output

@TSX,None
@MJU,None

I tried the code below, but it does not work:

sqlContext.read.option("sep", ","). \
                option("encoding", "ISO-8859-1"). \
                option("mode", "PERMISSIVE").csv(<path>).rdd.map(lambda s: s.replace("\xe2",""))

Answer 1

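You can read the file with the textFile function of sparkContext and use string.printable to remove all special characters from the strings.

import string
# input_path and output_path are placeholders for the CSV file path and the output directory
sc.textFile(input_path)\
    .map(lambda x: ','.join([''.join(e for e in y if e in string.printable).strip('\"') for y in x.split(',')]))\
    .saveAsTextFile(output_path)

Explanation

For your input line "@TSX•","None":

for y in x.split(',') splits the line into ["@TSX•", "None"], where y is each element of that array during iteration.
for e in y if e in string.printable checks whether each character in y is printable; the printable characters are joined to form a string of printable characters.
.strip('\"') removes the leading and trailing quotes from the printable string.
Finally, the list of strings is converted to a comma-separated string.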

I hope the explanation is clear.
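The per-line logic of the lambda above can be tried without a Spark cluster. The following is a minimal sketch, assuming Python 3; the function name clean_line is hypothetical and not part of the original answer.

```python
import string


def clean_line(line):
    """Keep only characters found in string.printable in each
    comma-separated field, then strip surrounding double quotes."""
    fields = line.split(',')
    cleaned = [
        ''.join(ch for ch in field if ch in string.printable).strip('"')
        for field in fields
    ]
    return ','.join(cleaned)


print(clean_line('"@TSX•","None"'))  # @TSX,None
```

In the Spark job this function could be passed as .map(clean_line) in place of the lambda. Note that a plain split(',') also splits on commas that occur inside quoted fields, so this approach only suits data without embedded commas.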

Answer 2

One option is to filter the text using string.printable:

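import string

def keep_printable(value):
    # Keep only the characters that appear in string.printable.
    return ''.join(ch for ch in value if ch in string.printable)

# Each element of .rdd is a Row, so the fields are cleaned one by one.
sqlContext.read\
    .option("sep", ",")\
    .option("encoding", "ISO-8859-1")\
    .option("mode", "PERMISSIVE")\
    .csv(<path>)\
    .rdd\
    .map(lambda row: [keep_printable(v) if isinstance(v, str) else v for v in row])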

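Example

import string
rdd = sc.parallelize(["TSX•,None", "MJU•,None", "!@#ABC,*()XYZ"])
# ''.join is needed on Python 3, where filter returns an iterator instead of a string
print(rdd.map(lambda s: ''.join(filter(lambda x: x in string.printable, s))).collect())
# ['TSX,None', 'MJU,None', '!@#ABC,*()XYZ']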
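The same filter can be checked locally without a SparkContext. This is a minimal sketch, assuming Python 3; strip_non_printable is a hypothetical helper name, and the frozenset is only an optional way to make the membership test a set lookup.

```python
import string

# Set of allowed characters, built once for fast membership checks.
PRINTABLE = frozenset(string.printable)


def strip_non_printable(text):
    """Return text with every character outside string.printable removed."""
    return ''.join(filter(PRINTABLE.__contains__, text))


samples = ["TSX•,None", "MJU•,None", "!@#ABC,*()XYZ"]
print([strip_non_printable(s) for s in samples])
# ['TSX,None', 'MJU,None', '!@#ABC,*()XYZ']
```

Keep in mind that string.printable also contains the whitespace characters tab, newline, carriage return, vertical tab, and form feed, so those are kept by this filter.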

Reference

Stripping non printable characters from a string in python

How to remove non-printable characters from an RDD using PySpark
