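I am trying to write code that uses Beautiful Soup to extract the numbers from a URL and then sum them, but I keep getting an error like this:
expected string or buffer
I think it has something to do with the regular expression, but I can't pinpoint the problem.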
import re
import urllib
from BeautifulSoup import *
htm1 = urllib.urlopen('https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/comments_42.html').read()
soup = BeautifulSoup(htm1)
tags = soup('span')
for tag in tags:
    y = re.findall ('([0-9]+)',tag.txt)
print sum(y)
Answer 0 (score: 1)
I suggest using bs4 instead of BeautifulSoup (the old version). You also need to change this line:
y = re.findall ('([0-9]+)',tag.txt)
to something like this:
y = re.findall ('([0-9]+)',tag.text)
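As a quick check of why the original call fails: re.findall only accepts a string (or bytes) as its second argument. My assumption is that the value you passed was not a string, for example None, which is what Beautiful Soup hands back when an attribute name such as txt is treated as a search for a child tag that does not exist. The sketch below reproduces that without any network access. The helper name try_findall and the sample text '97' are made up for illustration, and the exact wording of the error differs between Python versions.

```python
import re

PATTERN = r'[0-9]+'


def try_findall(value):
    """Return the list of matches, or the TypeError raised for a non-string input."""
    try:
        return re.findall(PATTERN, value)
    except TypeError as exc:
        return exc


# A non-string argument (assumed here to be None) is rejected by the re module.
bad = try_findall(None)

# A string argument, which is what tag.text gives you, works and
# yields a list of strings rather than integers.
good = try_findall('97')  # hypothetical span text

print(type(bad).__name__)
print(good)
```

Note that the matches come back as strings, so they still have to be converted with int() before they can be added up.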
See if this gets you any further:
total = 0  # running total; named total to avoid shadowing the built-in sum()
for tag in tags:
    y = re.findall('([0-9]+)', tag.text)  # get the text from the tag
    if y:  # y is a list of strings; skip tags that contain no digits
        print(y[0])  # print the first number found in this tag
        total += int(y[0])  # convert it to an integer and add it to the total
print('the sum is: {}'.format(total))
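For reference, here is a minimal end-to-end sketch of the same idea with bs4 on Python 3. It assumes the beautifulsoup4 package is installed, and it parses a small made-up HTML string (SAMPLE_HTML) instead of downloading the real page, so it runs offline. The function name sum_span_numbers is my own, and unlike the loop above it adds every number found in a span, not just the first one.

```python
import re

from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
SAMPLE_HTML = """
<table>
<tr><td>Alice</td><td><span class="comments">97</span></td></tr>
<tr><td>Bob</td><td><span class="comments">3</span></td></tr>
<tr><td>Carol</td><td><span class="comments">no count</span></td></tr>
</table>
"""


def sum_span_numbers(html):
    """Add up every run of digits found in the text of each <span> tag."""
    soup = BeautifulSoup(html, 'html.parser')
    total = 0
    for tag in soup.find_all('span'):
        # get_text() always returns a string, so re.findall accepts it.
        for number in re.findall(r'[0-9]+', tag.get_text()):
            total += int(number)
    return total


print(sum_span_numbers(SAMPLE_HTML))
```

To run it against the real page, you would pass the downloaded HTML to sum_span_numbers in place of SAMPLE_HTML.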