Question

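I'm a bit tired, but here goes:

I'm doing HTML scraping with BeautifulSoup in Python 2.6.5 on an Ubuntu box.

The reason for Python 2.6.5: BeautifulSoup works poorly under 3.1.

I'm trying to run the following code: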

# dataretriveal from html files from DETHERM
# -*- coding: utf-8 -*-

import sys,os,re,csv
from BeautifulSoup import BeautifulSoup


sys.path.insert(0, os.getcwd())

raw_data = open('download.php.html','r')
soup = BeautifulSoup(raw_data)

for numdiv in soup.findAll('div', {"id" : "sec"}):
    currenttable = numdiv.find('table',{"class" : "data"})
    if currenttable:
        numrow=0
        numcol=0
        data_list=[]
        for row in currenttable.findAll('td', {"class" : "dataHead"}):
            numrow=numrow+1
        for ncol in currenttable.findAll('th', {"class" : "dataHead"}):
            numcol=numcol+1
        for col in currenttable.findAll('td'):
            col2 = ''.join(col.findAll(text=True))
        if col2.index('±'):
        col2=col2[:col2.index('±')]
            print(col2.encode("utf-8"))
        ref=numdiv.find('a')
        niceref=''.join(ref.findAll(text=True))

Now, because of the ± sign, I get the following error when trying to run the code:

python code.py

Traceback (most recent call last):
  File "detherm-wtest.py", line 25, in <module>
    if col2.index('±'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

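How do I solve this? Putting a u in front, so that '±' -> u'±', results in:

Traceback (most recent call last):
  File "detherm-wtest.py", line 25, in <module>
    if col2.index(u'±'):
ValueError: substring not found

The code file is currently encoded as UTF-8.

Thanks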

Answer 1

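A byte string like "±" (in Python 2.x) is encoded in the source file's encoding, which might not be what you want. If col2 is really a Unicode object, you should use u"±" instead, as you already tried. As you may know, somestring.index raises an exception if it finds no occurrence, whereas somestring.find returns -1. Therefore, this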

    if col2.index('±'):
        col2=col2[:col2.index('±')] # this is not indented correctly in the question BTW
        print(col2.encode("utf-8"))

should be

    if u'±' in col2:
        col2=col2[:col2.index(u'±')]
        print(col2.encode("utf-8"))

so that the if statement doesn't cause an exception.
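
To make the difference between `in`, `find` and `index` concrete, here is a minimal sketch. The names `strip_uncertainty` and `PLUS_MINUS` are made up for illustration and are not part of the original script, and the sample strings are invented. It also shows where the 0xc2 in the first error message comes from. Note as well that `if col2.index(...)` would treat a match at position 0 as false, which the `in` test avoids.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; strip_uncertainty and PLUS_MINUS are hypothetical names,
# not part of the original script.

PLUS_MINUS = u'\u00b1'  # the plus-minus sign as a Unicode string


def strip_uncertainty(text):
    """Return the part of a Unicode string before the first plus-minus sign.

    The string is returned unchanged when it contains no such sign.
    """
    if PLUS_MINUS in text:  # a membership test never raises
        return text[:text.index(PLUS_MINUS)]
    return text


# In UTF-8 the sign is the two bytes 0xc2 0xb1, which is where the
# 0xc2 in the UnicodeDecodeError comes from.
utf8_bytes = PLUS_MINUS.encode('utf-8')

# find() signals "not found" with -1, while index() raises ValueError.
missing_position = u'42'.find(PLUS_MINUS)
try:
    u'42'.index(PLUS_MINUS)
    index_raised = False
except ValueError:
    index_raised = True
```

The helper expects Unicode input, which is what `''.join(col.findAll(text=True))` should yield if BeautifulSoup decoded the page; encode to UTF-8 only when printing or writing the result.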

Special character usage in Python 2.6
