Question

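I'm trying to write code that uses Beautiful Soup to extract numbers from a URL and then sum them, but I keep getting this error:

expected string or buffer

I think it has something to do with the regular expression, but I can't pinpoint the problem.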

import re
import urllib

from BeautifulSoup import *
htm1 = urllib.urlopen('https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/comments_42.html').read()
soup = BeautifulSoup(htm1)
tags = soup('span')

for tag in tags:
    y = re.findall ('([0-9]+)',tag.txt)

print sum(y)

Answer 1

I'd suggest using bs4 instead of BeautifulSoup (the old version). You also need to change this line:

y = re.findall ('([0-9]+)',tag)

to something like this:

y = re.findall ('([0-9]+)',tag.text)
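
The reason this matters: re.findall only accepts a string (or bytes) as its second argument, so handing it None or a tag object raises the TypeError from the question. As far as I know, Beautiful Soup treats an unknown attribute such as tag.txt as a search for a child <txt> tag and returns None when there isn't one, which would explain the error. A small sketch using only the standard library (the sample string is made up, and the exact wording of the message depends on your Python version):

```python
import re

pattern = '([0-9]+)'

# A plain string works: findall returns the digit runs as strings.
matches = re.findall(pattern, 'comments: 97')
print(matches)

# Anything that is not a string (None here) raises TypeError.
raised = False
try:
    re.findall(pattern, None)
except TypeError as exc:
    raised = True
    print('TypeError: {}'.format(exc))
```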

See if this gets you a bit further:

total = 0  # running sum (named total so it doesn't shadow the built-in sum)
for tag in tags:
    y = re.findall('([0-9]+)', tag.text)  # get the digits from the tag's text
    if y:  # y is a list; skip tags that contain no number
        print(y[0])  # print the first match
        total += int(y[0])  # convert it to an integer and add it to the total

print('the sum is: {}'.format(total))
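
If you do move to bs4, here is a rough sketch of the same idea, assuming Python 3 with bs4 installed. The names sum_numbers and sum_spans are my own, and the built-in 'html.parser' is just one possible parser choice. Unlike the loop above, it adds up every run of digits in each span rather than only the first one.

```python
import re


def sum_numbers(texts):
    """Add up every run of digits found in the given strings."""
    total = 0
    for text in texts:
        for number in re.findall(r'[0-9]+', text):
            total += int(number)
    return total


def sum_spans(html):
    """Sum the numbers inside all <span> tags of an HTML document."""
    # Imported here so sum_numbers can be used without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    # get_text() always returns a string, so re.findall is safe here.
    return sum_numbers(span.get_text() for span in soup.find_all('span'))
```

You would call sum_spans with the page source you downloaded (the htm1 variable in the question) and print the result.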

"expected string or buffer" error using Beautiful Soup
