Question

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 261: ordinal not in range(128)

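The error above is what I get after building a new table. I actually used the following command: table3 = u' '.join((table1, table2)).encode('utf-8').strip() but it does not work. Below I provide my code and output for each RDD.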

Code that creates the first RDD

table1=sc.textFile('inventory').map(lambda line:next(csv.reader([line]))).map(lambda fields:((fields[0],fields[8],fields[10]),1))

Actual output of the first RDD

[(('BibNum', 'ItemCollection', 'ItemLocation'), 1),
(('3011076', 'ncrdr', 'qna'), 1),
 (('2248846', 'nycomic', 'lcy'), 1)]

Code that creates the second RDD

table2=sc.textFile('checkouts').map(lambda line:next(csv.reader([line]))).map(lambda fields:((fields[0],fields[3],fields[5]),1))

Actual output of the second RDD

[(('BibNum', 'ItemCollection', 'CheckoutDateTime'), 1),
(('1842225', 'namys', '05/23/2005 03:20:00 PM'), 1), 
(('1928264', 'ncpic', '12/14/2005 05:56:00 PM'), 1),
(('1982511', 'ncvidnf', '08/11/2005 01:52:00 PM'), 1),
(('2026467', 'nacd', '10/19/2005 07:47:00 PM'), 1)]

Finally, I tried to join table1 and table2 with table3 = u' '.join((table1, table2)).encode('utf-8').strip(), but it did not work. If you have any insight into this error, please let me know.

Answer 1

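Let me see if I understand what you need. You have two RDDs and you want to join them on two values. First, remove the header (the first row) from each RDD, then define the join key, and then perform the join. I'll use a slightly less efficient but easier-to-understand way of removing the header.
(You can find a more efficient method here: Remove first element in RDD without using filter function)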
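For reference, the more efficient approach is commonly written with `mapPartitionsWithIndex`, which skips the first record of the first partition instead of indexing every row. The following is only a sketch: `drop_header` is a hypothetical helper name, and it assumes the header is the first record of partition 0, which may not hold if the input is split across several files that each carry their own header.

```python
from itertools import islice


def drop_header(partition_index, rows):
    """Skip the first row of partition 0; pass all other partitions through."""
    if partition_index == 0:
        return islice(rows, 1, None)
    return rows


# Hypothetical usage on an existing RDD (assumes the header is the first
# record of the first partition):
#   table1_no_header = table1.mapPartitionsWithIndex(drop_header)

# Local check without Spark: only partition 0 loses its first row.
first = list(drop_header(0, iter(["header", "a", "b"])))
other = list(drop_header(1, iter(["c", "d"])))
print(first, other)
```

The code below keeps the simpler `zipWithIndex` version.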

rdd1 = sc.parallelize([(('BibNum', 'ItemCollection', 'ItemLocation'), 1),
(('3011076', 'ncrdr', 'qna'), 1),
 (('2248846', 'nycomic', 'lcy'), 1),
                       (('1928264', 'ncpic', '12/14/2005 05:56:00 PM'), 1)])

rdd2 = sc.parallelize([(('BibNum', 'ItemCollection', 'CheckoutDateTime'), 1),
(('1842225', 'namys', '05/23/2005 03:20:00 PM'), 1), 
(('1928264', 'ncpic', '12/14/2005 05:56:00 PM'), 1),
(('1982511', 'ncvidnf', '08/11/2005 01:52:00 PM'), 1),
(('2026467', 'nacd', '10/19/2005 07:47:00 PM'), 1)])

rdd1_for_join = rdd1.zipWithIndex().filter(lambda x: x[1] != 0)\
.map(lambda x: ( (x[0][0][0], x[0][0][1]), x[0] ))

rdd2_for_join = rdd2.zipWithIndex().filter(lambda x: x[1] != 0)\
.map(lambda x: ( (x[0][0][0], x[0][0][1]), x[0] ))

print(rdd1_for_join.join(rdd2_for_join).collect())

[(('1928264', 'ncpic'), ((('1928264', 'ncpic', '12/14/2005 05:56:00 PM'), 1), (('1928264', 'ncpic', '12/14/2005 05:56:00 PM'), 1)))]
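To see why only one record comes back, the same three steps (drop the header, key each record by its first two fields, inner join on that key) can be mimicked in plain Python. This is a rough illustration of the join semantics, not how Spark implements it; `key_by_first_two` and `inner_join` are hypothetical helper names, and the data is the sample data from the code above.

```python
from collections import defaultdict


def key_by_first_two(records):
    """Drop the header record and key the rest by (BibNum, ItemCollection)."""
    return [((rec[0][0], rec[0][1]), rec) for rec in records[1:]]


def inner_join(left, right):
    """Pair every left value with every right value that shares its key."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [(key, (lv, rv))
            for key, lv in left
            for rv in right_by_key.get(key, [])]


inventory = [
    (('BibNum', 'ItemCollection', 'ItemLocation'), 1),
    (('3011076', 'ncrdr', 'qna'), 1),
    (('2248846', 'nycomic', 'lcy'), 1),
    (('1928264', 'ncpic', '12/14/2005 05:56:00 PM'), 1),
]
checkouts = [
    (('BibNum', 'ItemCollection', 'CheckoutDateTime'), 1),
    (('1842225', 'namys', '05/23/2005 03:20:00 PM'), 1),
    (('1928264', 'ncpic', '12/14/2005 05:56:00 PM'), 1),
    (('1982511', 'ncvidnf', '08/11/2005 01:52:00 PM'), 1),
    (('2026467', 'nacd', '10/19/2005 07:47:00 PM'), 1),
]

joined = inner_join(key_by_first_two(inventory), key_by_first_two(checkouts))
print(joined)
```

Only the key ('1928264', 'ncpic') appears on both sides, so the inner join yields a single pair; every other record is dropped.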

PySpark: I get a UnicodeEncodeError when trying to join two RDDs
