Question

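I'm trying to scrape data from a Chinese website. I've found where the data sits in the HTML, but I need help extracting the text. Here is what I have so far: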

from bs4 import BeautifulSoup
import requests

page = 'http://sbj.speiyou.com/search/index/subject:/grade:12/gtype:time'
r = requests.get(page)

r.encoding = 'utf-8'
soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly to avoid the bs4 warning

div = soup.find('div', class_='pagination mtop40')

The data I'm looking for is the 16 in 1/16.

Answer 1

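Using a regular expression on div.text is one option. The following pattern matches one or more digits, followed by a forward slash, followed by one or more digits.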

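import re

# One or more digits, a forward slash, then one or more digits (e.g. "1/16")
pattern = re.compile(r'\d+/\d+')
matches = pattern.search(div.text)
num = matches.group(0)  # num == '1/16' here
print(num.split('/')[1])  # the part after the slash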

Or

import re

# Capture the digits after the slash in a group
pattern = re.compile(r'\d+/(\d+)')
matches = pattern.search(div.text)
print(matches.group(1))  # group(1) is the captured part; group(0) would be the whole match
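Both snippets above assume the div was found and the pattern matched; if either is missing, they raise an AttributeError. Below is a minimal sketch of a more defensive version. The helper name extract_total_pages and the sample footer text are hypothetical, made up for illustration; the real site's footer wording may differ.

```python
import re

# "current/total", e.g. "1/16"; the second group captures the total.
PAGE_COUNT = re.compile(r'(\d+)\s*/\s*(\d+)')


def extract_total_pages(text):
    """Return the total page count found in text, or None if there is none."""
    match = PAGE_COUNT.search(text or '')
    if match is None:
        return None
    return int(match.group(2))


# Hypothetical footer text standing in for div.text.
sample = '上一页 1/16 下一页'
print(extract_total_pages(sample))
```

With the code from the question, this could be called as extract_total_pages(div.text if div else None), so a missing div yields None instead of an exception.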
