Question

I have to import a CSV file into Spark and convert it first into a DataFrame (DF) and then into an RDD.

First, I import the complete CSV file as a DF:

stopwords_df = (
    sqlc
    .read
    .format('csv')
    .option('header', True)
    .option('delimiter', ';')
    .option('encoding', 'latin1')
    .load('/mnt/sparkdata/stopwords.csv', schema = stopSchema)
    .repartition(72)
)

Then I select only the relevant words and convert them into a set:

stopwords_set = set(
    stopwords_df
    .filter(f.col('retain').isNull())
    .rdd
    .map(lambda x: x[0].encode('latin1')) # the [0] is to extract strings from Rows
    .collect()
)

I have messed up the encoding somewhere and I don't know how to fix it.

If I "show" the DF, the Latin characters are displayed correctly (sperò):

stopwords_df.show(100, truncate = False)

+--------------+--------+------+----------+------+
|word          |language|type  |doubletype|retain|
+--------------+--------+------+----------+------+
|informava     |IT      |verbo |null      |null  |
|sperò         |IT      |verbo |null      |null  |
|four          |EN      |null  |null      |null  |

But this does not happen when I display the RDD:

(
    stopwords_df
    .filter(f.col('word') == r'sperò')
    .rdd
    .first()
)

Row(word=u'sper\xf2', language=u'IT', type=u'verbo', doubletype=None, retain=None)

Using UTF-8 as the encoding makes it even worse:

+--------------+--------+------+----------+------+
|word          |language|type  |doubletype|retain|
+--------------+--------+------+----------+------+
|thanks        |EN      |saluto|null      |null  |
|fossero       |IT      |verbo |null      |null  |
|sper�         |IT      |verbo |null      |null  |

Can you suggest how to solve this?

Answer 1

Looking at this line:

Row(word=u'sper\xf2', ...)

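The u'...' prefix in that output shows a Python 2 unicode string, and \xf2 is simply how its repr escapes ò (U+00F2), so the value in the Row is already decoded correctly.

So when you encode it as latin1, the text becomes a byte string in which ò is the single byte \xf2.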
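The relationship between the escaped form, the latin1 bytes, and the "�" in the UTF-8 table can be checked without Spark. The following is a minimal sketch in plain Python 3 (where str is already unicode); it assumes the CSV file was written in latin1, which is what the question's working show() output suggests. The variable names are illustrative only.

```python
# Plain Python 3 sketch; no Spark required.
word = 'sperò'

# '\xf2' is just the escaped spelling of ò (U+00F2): both literals are the same string.
same_string = (word == 'sper\xf2')

# Encoding turns text into bytes; latin1 uses one byte for ò, UTF-8 uses two.
latin1_bytes = word.encode('latin1')
utf8_bytes = word.encode('utf-8')

# Decoding latin1 bytes as UTF-8 fails on the lone 0xF2 byte; with
# errors='replace' it becomes U+FFFD, the replacement character.
misread = latin1_bytes.decode('utf-8', errors='replace')

# Decoding with the codec the bytes were written in restores the word.
roundtrip = latin1_bytes.decode('latin1')

# ascii() keeps the output printable on any terminal encoding.
print(same_string, latin1_bytes, utf8_bytes, ascii(misread), ascii(roundtrip))
```

The Row shown by .first() therefore holds the right word; only its printed representation is escaped.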

Why not collect the words without the latin1 encoding?

stopwords_set = set(
    stopwords_df
    .filter(f.col('retain').isNull())
    .rdd
    .map(lambda x: x[0]) # extract the word from each Row, without re-encoding
    .collect()
)
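To see why keeping the words as text matters for later lookups, here is a small sketch that runs without a cluster. The namedtuple is a hypothetical stand-in for pyspark.sql.Row (both can be indexed like tuples), and the sample rows are copied from the table in the question; it is written for Python 3.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row, so the sketch runs without Spark.
Row = namedtuple('Row', ['word', 'language', 'type', 'doubletype', 'retain'])

# Roughly what .collect() might return for the filtered DataFrame.
collected = [
    Row('informava', 'IT', 'verbo', None, None),
    Row('sperò', 'IT', 'verbo', None, None),
    Row('four', 'EN', None, None, None),
]

# Keep the words as text: take the first field and do not call .encode().
stopwords_set = {row[0] for row in collected}

# A text lookup matches; a latin1-encoded byte string is a different value and does not.
text_hit = 'sperò' in stopwords_set
bytes_hit = 'sperò'.encode('latin1') in stopwords_set

print(text_hit, bytes_hit)
```

If the tokens you compare against the set are unicode strings, the set should hold unicode strings too; encoding only one side makes accented words silently fail to match.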

Let me know if this helps. Thanks.

PySpark DF and RDD encoding of Latin characters
