Question

I ran the following pyspark script to do a word count:

import re
inputRDD=sc.textFile("concatfile")
cleanRDD=inputRDD.map(lambda x:re.sub('[^0-9a-zA-Z ]+',"",x.upper())).flatMap(lambda x: x.split()).map(lambda x:(x,1))
reduceRDD=cleanRDD.reduceByKey(lambda x,y:x+y)
reverseKVRDD=reduceRDD.map(lambda x:(x[1],x[0]))
sortRDD=reverseKVRDD.sortByKey(ascending=False)

When I dump cleanRDD, I see that the key-value pairs have the following format:

[(u'THIS', 1), (u'IS', 1), (u'LINE', 1), (u'1', 1), (u'THIS', 1), (u'IS', 1), (u'LINE', 1), (u'2', 1), (u'THIS', 1), (u'IS', 1), (u'LINE', 1), (u'3', 1)]

What does this u type mean? Can I convert it to a normal string type?

Answer 1

In Python 2, the 'u' prefix means unicode. All the values returned by the program are in unicode format.
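To see that the prefix is only part of how the value is displayed, here is a small sketch in plain Python (no Spark needed). The names `pair`, `word`, `same_text` and `length` are made up for illustration and are not from the original script.

```python
# A pair shaped like the ones in cleanRDD; `pair` is an illustrative name.
pair = (u'THIS', 1)

# The u prefix belongs to how Python 2 displays a unicode object,
# not to the contents of the string itself.
word = pair[0]
same_text = (word == 'THIS')  # ASCII text compares equal with or without the prefix
length = len(word)            # the prefix does not count as a character

print(word)       # prints the bare word, without a leading u
print(same_text)
print(length)
```

So for plain ASCII words like these, the values can usually be used as they are, and a conversion is only needed when some other code insists on a byte string.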

You can use the encode method to convert unicode to a normal string.

text = u'sample text'
print type(text)

# Output (Python 2)
<type 'unicode'>

text = text.encode('utf-8')
print type(text)

# Output (Python 2)
<type 'str'>
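Applied to the word count from the question, the same idea might look like the sketch below. It uses a plain list in place of the RDD so it runs without Spark. The helper `to_plain_str` and the sample `lines` are assumptions for illustration, not part of the original script.

```python
import re
import sys

def clean_line(line):
    # Same cleaning rule as the question: upper-case, keep only letters, digits and spaces.
    return re.sub('[^0-9a-zA-Z ]+', '', line.upper())

def to_plain_str(word):
    # Hypothetical helper: on Python 2, encode unicode to a byte str;
    # on Python 3, str is already unicode text, so it is returned unchanged.
    if sys.version_info[0] == 2 and not isinstance(word, str):
        return word.encode('utf-8')
    return word

# Stand-in for the lines that sc.textFile would yield (sample data, not the real file).
lines = [u'This is line 1', u'This is line 2!']

# Split each cleaned line into words and pair every word with a count of 1.
pairs = [(to_plain_str(w), 1) for line in lines for w in clean_line(line).split()]
print(pairs)
```

In the pyspark script itself, the equivalent step would presumably be one extra `.map(lambda x: x.encode('utf-8'))` between the `flatMap` and the final `map` on Python 2. I have not run that against the original file, so treat it as a suggestion.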

The map function returns elements of unicode type, which is why every key is displayed with the u prefix.
