Question

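I have been looking for a solution to an error I keep getting, but I have not found (or understood) one. Basically, if I use string functions (translate, strip, etc.), I get a Unicode error ('ascii' codec can't encode character 'x' in position y: ordinal not in range(128)). When I try to process the text with Beautiful Soup instead, I don't get the Unicode error, but the difficulty (I should say my unfamiliarity) is very high for me. Here is an excerpt of the code I have: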

...

import urllib2,sys
import re
import os
import urllib
import string
import time
from BeautifulSoup import BeautifulSoup,NavigableString, SoupStrainer
from string import maketrans
import codecs

trantab=string.maketrans(",",";") 
...

                html5 = urllib2.urlopen(address5).read()
                time.sleep(1.5)

                soup5 = BeautifulSoup(html5)

                for company in iter(soup5.findAll(height="20px")):
                    stream = ""
                    count_detail = 1
                    for tag in iter(company.findAll('td')):
                        if count_detail > 1:
                            stream = stream + string.translate(str(tag.text),trantab)
                            if count_detail < 4 :
                                stream=stream+","
                        count_detail = count_detail + 1
                    print str(storenum)+","+branch_name_address+","+ stream

...

This script runs for a while, then bombs on this line:

stream = stream + string.translate(str(tag.text),trantab)

Basically, I am just trying to replace commas with semicolons in the fields I am processing.

I also tried using string.strip to remove embedded blanks/whitespace, but I get a similar error.

How can I do the same thing (replace commas with semicolons and strip whitespace) using Beautiful Soup?

Or, if I stick with the string functions, is there code that resolves those pesky Unicode errors?

Answer 1

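You are mixing str objects with unicode objects, which makes the Python interpreter coerce one into the other. Coercion between str and unicode requires an encoding, which is assumed to be ascii by default. When that assumption does not hold, you get this kind of error. Here, tag.text is a unicode object, and str(tag.text) tries to encode it as ASCII, which fails on the first non-ASCII character.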
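As a minimal sketch of what the coercion does, the snippet below performs the ASCII encode explicitly instead of letting str() do it implicitly (as Python 2 does). The sample text is a hypothetical stand-in for tag.text, not data from the question.

```python
# Hypothetical cell text containing one non-ASCII character (e with acute accent).
text = u'Caf\xe9, Main St'

try:
    # In Python 2, str(text) performs this ASCII encode behind the scenes.
    text.encode('ascii')
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# The message names the ascii codec and ends with "ordinal not in range(128)".
print(error_message)
```

The script in the question only fails "after a while" because the encode succeeds for every cell that happens to be pure ASCII.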

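The general solution is not to mix str with unicode: use unicode wherever possible, and convert anything explicitly with string.decode('utf8', 'strict') (str to unicode) and unicode_string.encode('utf8', 'strict') (unicode to str). UTF-8 is just an example here; use whatever encoding your data is actually in.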
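A short sketch of that explicit conversion, assuming the input really is UTF-8 (the variable names and the sample bytes are made up for illustration): decode bytes once on the way in, work on unicode in the middle, and encode once on the way out.

```python
# Hypothetical raw bytes, as they might arrive from a file or a socket.
raw_bytes = u'Caf\xe9, Main St'.encode('utf8')

# Byte string -> unicode at the input boundary.
text = raw_bytes.decode('utf8', 'strict')

# All processing happens on unicode, with unicode literals.
cleaned = text.replace(u',', u';')

# unicode -> byte string at the output boundary.
output_bytes = cleaned.encode('utf8', 'strict')
```

With 'strict', invalid input raises an error immediately at the boundary instead of surfacing later in the middle of the processing.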

In this case, replace

stream = stream + string.translate(str(tag.text),trantab)

with

stream = stream + tag.text.replace(u',', u';')
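The same idea covers the whitespace part of the question. The sketch below uses a hypothetical helper, clean_cell, and a hand-written list of unicode strings standing in for the tag.text values of the td cells; joining with u',' replaces the manual comma counter from the question.

```python
def clean_cell(text):
    """Trim surrounding whitespace and turn commas into semicolons."""
    return text.strip().replace(u',', u';')

# Hypothetical cell texts, standing in for tag.text of each <td>.
cells = [u'  Caf\xe9, Downtown ', u'12 Main St, Suite 4', u' 555-0100 ']

# Everything stays unicode while the row is assembled.
stream = u','.join(clean_cell(cell) for cell in cells)

# Encode once, only when the row is written out.
line = stream.encode('utf8')
```

Note that strip() only trims the ends of each string; whitespace inside a cell is left alone.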

Python string processing, Unicode & Beautiful Soup
