Trying to get information from a website that is not inside a class but is a variable in a script

Time: 2018-07-04 23:25:08

Tags: python web-scraping beautifulsoup

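I'm trying to scrape a website. The data I want is not contained in a div or a class; it is passed to a push call inside a script. I want to be able to search for "average180.push([new Date('" and then grab the characters that immediately follow it. For example, I want to take the characters enclosed in the quotes (in this case, the date) and append them to a list, then grab the text that follows between the commas (the price value) and append it to a second list. Once I have those two lists, I can zip them together to create my data table.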

What I have so far


1 answer:

Answer 0 (score: 0)

You can use the re module to parse the arguments:

import requests
from bs4 import BeautifulSoup
import re
from pprint import pprint

url = "http://services.runescape.com/m=itemdb_rs/Raw_shark/viewitem?obj=383"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")

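# Raw string avoids invalid-escape warnings; the dot is escaped so it matches literally.
pattern = re.compile(r"average180\.push\(\[new Date\('(.*?)'\).*?(\d+).*?(\d+)")
return_value = []
for s in soup.find_all('script'):
    # findall returns one tuple of captured groups per match
    return_value.extend(pattern.findall(s.text))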

pprint(return_value)

Output:

[('2018/01/13', '1584', '1389'),
 ('2018/01/14', '1530', '1396'),
 ('2018/01/15', '1512', '1402'),
 ('2018/01/16', '1501', '1408'),
 ('2018/01/17', '1489', '1414'),
 ('2018/01/18', '1483', '1420'),
 ('2018/01/19', '1487', '1427'),
 ('2018/01/20', '1511', '1435'),
 ('2018/01/21', '1516', '1443'),
 ('2018/01/22', '1517', '1449'),
 ('2018/01/23', '1529', '1456'),
 ('2018/01/24', '1527', '1463'),
 ('2018/01/25', '1524', '1470'),
 ('2018/01/26', '1498', '1477'),
 ('2018/01/27', '1491', '1484'),
...etc.
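
To get from those tuples to the two lists and the zipped table the question asks for, something like the following might work. This is only a sketch: `sample_script` is a made-up stand-in for the page's inline script (its exact layout is assumed from the output above), `parse_average180` is a hypothetical helper name, and treating the first number as the price is an assumption based on the question.

```python
import re
from datetime import datetime

# Hypothetical sample of the inline script text; the real page layout is assumed.
sample_script = """
average180.push([new Date('2018/01/13'), 1584, 1389]);
average180.push([new Date('2018/01/14'), 1530, 1396]);
average180.push([new Date('2018/01/15'), 1512, 1402]);
"""

# Same idea as the answer's pattern: capture the date and the two numbers after it.
PATTERN = re.compile(r"average180\.push\(\[new Date\('(.*?)'\).*?(\d+).*?(\d+)")


def parse_average180(script_text):
    """Return (dates, prices) lists parsed from average180.push(...) calls."""
    dates, prices = [], []
    for date_str, first_value, _second_value in PATTERN.findall(script_text):
        # Convert the captured strings into proper date and int values.
        dates.append(datetime.strptime(date_str, "%Y/%m/%d").date())
        prices.append(int(first_value))  # assumed to be the price
    return dates, prices


dates, prices = parse_average180(sample_script)
table = list(zip(dates, prices))  # one (date, price) row per push call
print(table)
```

With the real page, you would pass each script tag's text (`s.text` in the answer's loop) to the helper instead of `sample_script`.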