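Getting rid of unicode errors

Asked: 2016-08-14 20:21:40

Tags: python unicode encoding networkx

I have the following code, which tries to print the edge list of a graph. The edges appear to be iterated over correctly, but I want to check that all edges are included before passing the graph on to a function for further processing.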

def mapper_network(self, _, info):
    info[0] = info[0].encode('utf-8')
    for i in range(len(info[1])):
        info[1][i] = str(info[1][i])
    l_lst = len(info[1])
    packed = [(info[0], l) for l in info[1]] #each pair of nodes (edge)
    weight = [1 /float(l_lst)] #each edge weight
    G = nx.Graph()
    for i in range(len(packed)):
        edge_from = packed[i][0]
        edge_to = packed[i][1]
        #edge_to = unicodedata.normalize("NFKD", edge_to).encode('utf-8', 'ignore')
        edge_to = edge_to.encode("utf-8")
        weight = weight
        G.add_edge(edge_from, edge_to, weight=weight)
    #print G.size()  #yes, this works :)
    G_edgelist = []
    G_edgelist = G_edgelist.append(nx.generate_edgelist(G).next())
    print G_edgelist

With this code, I get the following error:

Traceback (most recent call last):
  File "MRQ7_trevor_2.py", line 160, in <module>
    MRMostUsedWord2.run()
  File "/tmp/MRQ7_trevor_2.vagrant.20160814.201259.655269/job_local_dir/1/mapper/27/mrjob.tar.gz/mrjob/job.py", line 433, in run
    mr_job.execute()
  File "/tmp/MRQ7_trevor_2.vagrant.20160814.201259.655269/job_local_dir/1/mapper/27/mrjob.tar.gz/mrjob/job.py", line 442, in execute
    self.run_mapper(self.options.step_num)
  File "/tmp/MRQ7_trevor_2.vagrant.20160814.201259.655269/job_local_dir/1/mapper/27/mrjob.tar.gz/mrjob/job.py", line 507, in run_mapper
    for out_key, out_value in mapper(key, value) or ():
  File "MRQ7_trevor_2.py", line 91, in mapper_network
    G_edgelist = G_edgelist.append(nx.generate_edgelist(G).next())
  File "/home/vagrant/anaconda/lib/python2.7/site-packages/networkx/readwrite/edgelist.py", line 114, in generate_edgelist
    yield delimiter.join(map(make_str,e))
  File "/home/vagrant/anaconda/lib/python2.7/site-packages/networkx/utils/misc.py", line 82, in make_str
    return unicode(str(x), 'unicode-escape')
UnicodeDecodeError: 'unicodeescape' codec can't decode byte 0x5c in position 0: \ at end of string

With the following modification

edge_to = unicodedata.normalize("NFKD", edge_to).encode('utf-8', 'ignore')

I get

edge_to = unicodedata.normalize("NFKD", edge_to).encode('utf-8', 'ignore')
TypeError: must be unicode, not str

How can I get rid of the unicode errors? This has been quite troublesome, and I would really appreciate your help. Thanks!!

1 Answer:

Answer 0 (score: 0)

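I highly recommend reading this article on unicode. It gives a good explanation of unicode versus str in Python 2.

As for your specific problem: when you call unicodedata.normalize("NFKD", edge_to), edge_to must be a unicode string. It is not, because you turned it into a str on this line: info[1][i] = str(info[1][i]). Here is a quick test: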

import unicodedata

edge_to = u'edge'  # this is unicode
edge_to = unicodedata.normalize("NFKD", edge_to).encode('utf-8', 'ignore')
print edge_to  # prints 'edge' as expected

edge_to = 'edge'  # this is a byte string (str), not unicode
edge_to = unicodedata.normalize("NFKD", edge_to).encode('utf-8', 'ignore')  # raises TypeError: must be unicode, not str
print edge_to  # never reached

You can fix the problem by converting edge_to to unicode before normalizing it.
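One way that conversion might look is sketched below. The helper names to_text and normalize_edge are hypothetical (they are not part of your code or of any library), and the sketch assumes your byte strings are UTF-8 encoded. It is written so that it should run on both Python 2 and Python 3.

```python
import unicodedata

# `unicode` on Python 2, `str` on Python 3
text_type = type(u'')


def to_text(value, encoding='utf-8'):
    """Hypothetical helper: return value as a text (unicode) string."""
    if isinstance(value, bytes):      # Python 2 `str` is an alias of `bytes`
        return value.decode(encoding)
    if isinstance(value, text_type):  # already text, nothing to do
        return value
    return text_type(value)           # e.g. an int node id


def normalize_edge(value):
    """Hypothetical helper: NFKD-normalize a label, then encode it as UTF-8 bytes."""
    text = to_text(value)
    return unicodedata.normalize("NFKD", text).encode('utf-8', 'ignore')


# Both a byte string and a unicode string are now accepted.
from_bytes = normalize_edge(b'edge')
from_text = normalize_edge(u'edge')
```

With something like this in place, the call to unicodedata.normalize always receives a unicode string, whichever type the label had when it arrived.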

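By the way, the encoding and decoding throughout this code block is somewhat confusing. Think carefully about where you want your strings to be unicode and where you want them to be bytes. You probably don't need this much encoding, decoding, and normalizing.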
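A common way to keep this manageable is to decode once when the data comes in, work only with unicode inside the function, and encode once when writing output. The following is a rough sketch of that idea, not a drop-in replacement for your mapper: as_text, build_edges and format_edge are hypothetical names, UTF-8 input is assumed, and the weight is assumed to be a single float rather than a one-element list.

```python
text_type = type(u'')  # `unicode` on Python 2, `str` on Python 3


def as_text(value, encoding='utf-8'):
    """Hypothetical helper: decode bytes, pass text through, stringify the rest."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value if isinstance(value, text_type) else text_type(value)


def build_edges(node, neighbors):
    """Return (source, target, weight) tuples whose labels are all text."""
    source = as_text(node)                      # decode at the input boundary
    targets = [as_text(n) for n in neighbors]
    if not targets:
        return []
    weight = 1.0 / len(targets)                 # same weight for every edge
    return [(source, target, weight) for target in targets]


def format_edge(edge, encoding='utf-8'):
    """Encode one edge to bytes only at the output boundary."""
    source, target, weight = edge
    line = u'{0} {1} {2}'.format(source, target, weight)
    return line.encode(encoding)


edges = build_edges(b'a', [u'b', 3])
lines = [format_edge(edge) for edge in edges]
```

If the tuples are meant for networkx, passing them to G.add_weighted_edges_from(edges) should build the same weighted graph without any per-edge encode calls.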