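I am trying to scrape a website. The data I want is not contained in a div or a class; it is passed to a push call on a JavaScript variable. I want to search for "average180.push([new Date('" and then grab the characters that immediately follow it. For example, I want to take the characters enclosed in the quotes (in this case, the date) and add them to a list, then grab the text that follows between the commas (the price value) and add it to a second list. Once I have both lists, I can zip them together to build my data table.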
What I have so far
Answer 0 (score: 0)
You can use the re module to parse the arguments:
import re
from pprint import pprint

import requests
from bs4 import BeautifulSoup

url = "http://services.runescape.com/m=itemdb_rs/Raw_shark/viewitem?obj=383"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")

return_value = []
# The data lives in inline <script> tags, so scan each one.
for s in soup.find_all('script'):
    # Capture the quoted date and the two numbers that follow it.
    matches = re.findall(r"average180\.push\(\[new Date\('(.*?)'\).*?(\d+).*?(\d+)", s.text)
    return_value.extend(matches)

pprint(return_value)
Output:
[('2018/01/13', '1584', '1389'),
('2018/01/14', '1530', '1396'),
('2018/01/15', '1512', '1402'),
('2018/01/16', '1501', '1408'),
('2018/01/17', '1489', '1414'),
('2018/01/18', '1483', '1420'),
('2018/01/19', '1487', '1427'),
('2018/01/20', '1511', '1435'),
('2018/01/21', '1516', '1443'),
('2018/01/22', '1517', '1449'),
('2018/01/23', '1529', '1456'),
('2018/01/24', '1527', '1463'),
('2018/01/25', '1524', '1470'),
('2018/01/26', '1498', '1477'),
('2018/01/27', '1491', '1484'),
...etc.
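To get from these tuples to the two lists and the zipped table the question asks for, something like the following might work. It is only a sketch: SAMPLE_SCRIPT is a made-up stand-in for the page's inline script (the exact layout of the real push lines is an assumption inferred from the regex and output above), parse_pushes is a hypothetical helper name, and treating the first number as the price is also an assumption.

```python
import re
from datetime import datetime

# Hypothetical stand-in for the inline <script> text; the real page may differ.
SAMPLE_SCRIPT = """
average180.push([new Date('2018/01/13'), 1584, 1389]);
average180.push([new Date('2018/01/14'), 1530, 1396]);
average180.push([new Date('2018/01/15'), 1512, 1402]);
"""

# Same idea as the regex above: the quoted date, then the two numbers after it.
PUSH_PATTERN = re.compile(
    r"average180\.push\(\[new Date\('(.*?)'\).*?(\d+).*?(\d+)"
)


def parse_pushes(script_text):
    """Return (date, first_value, second_value) rows with proper types."""
    rows = []
    for date_str, first, second in PUSH_PATTERN.findall(script_text):
        day = datetime.strptime(date_str, "%Y/%m/%d").date()
        rows.append((day, int(first), int(second)))
    return rows


rows = parse_pushes(SAMPLE_SCRIPT)

# Split into the two lists from the question (assuming the first number is the price).
dates = [row[0] for row in rows]
prices = [row[1] for row in rows]

# Zip them back together into a simple data table.
table = list(zip(dates, prices))
print(table)
```

Converting the strings to date and int up front keeps later sorting or arithmetic from silently working on text. If you only need the raw strings, you can skip the conversion and zip the tuple fields directly.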