Question

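I recently started learning Python, and one of my first projects was to scrape updates from my son's classroom web page and send me a notification whenever the site is updated. That turned out to be an easy project, so I wanted to expand on it and write a script that automatically checks whether any of our lotto numbers hit. Unfortunately, I can't figure out how to get the data from the website. Here is one of my attempts from last night.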

from bs4 import BeautifulSoup
import urllib.request

webpage = "http://www.masslottery.com/games/lottery/large-winningnumbers.html"

websource = urllib.request.urlopen(webpage)
soup = BeautifulSoup(websource.read(), "html.parser")

span = soup.find("span", {"id": "winning_num_0"})
print(span)

Output is here...
<span id="winning_num_0"></span>

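The output above is also what I see if I use the web browser's "View Source". When I use the browser's "Inspect Element", I can see the winning numbers in the inspect element panel. Unfortunately, I'm not even sure how or where the browser is getting the data. Is it loading from another page, or is a script loading it in the background? I thought the following tutorial would help me, but I wasn't able to get the data using similar commands.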

http://zevross.com/blog/2014/05/16/using-the-python-library-beautifulsoup-to-extract-data-from-a-webpage-applied-to-world-cup-rankings/

Any help is appreciated. Thanks.

Answer 1

If you look closely at the page source (I just used curl), you can see this block:

<script type="text/javascript">
    // <![CDATA[
    var dataPath = '../../';
    var json_filename = 'data/json/games/lottery/recent.json';
    var games = new Array();
    var sessions = new Array();
    // ]]>
</script>

recent.json sticks out like a sore thumb (I actually missed the dataPath part at first).
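If you'd rather not assemble the URL by hand, here is a minimal sketch of how the two variables in that script block could be combined. It assumes the page joins dataPath and json_filename and resolves the result relative to the page's own URL, which I have not verified against the site's JavaScript. The helper name find_json_url is made up for this example, and the script block is pasted in as a string so the sketch runs without a network connection.

```python
import re
from urllib.parse import urljoin

PAGE_URL = "http://www.masslottery.com/games/lottery/large-winningnumbers.html"

# The two relevant lines from the script block quoted above,
# embedded here so the sketch runs offline.
SCRIPT_BLOCK = """
var dataPath = '../../';
var json_filename = 'data/json/games/lottery/recent.json';
"""


def find_json_url(page_url, html):
    """Hypothetical helper: build the absolute URL of the JSON file.

    Assumption: the page concatenates dataPath and json_filename and
    resolves the result relative to its own URL.
    """
    def js_string(name):
        # Match lines of the form: var name = 'value';
        match = re.search(r"var\s+" + re.escape(name) + r"\s*=\s*'([^']*)'", html)
        if match is None:
            raise ValueError(f"variable {name!r} not found in page source")
        return match.group(1)

    relative_path = js_string("dataPath") + js_string("json_filename")
    return urljoin(page_url, relative_path)


json_url = find_json_url(PAGE_URL, SCRIPT_BLOCK)
print(json_url)
```

In a real script you would pass in the HTML you already downloaded with urllib.request instead of the pasted string.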

After a bit of experimenting, I came up with this:

curl http://www.masslottery.com/data/json/games/lottery/recent.json

As lari points out in the comments, this is much easier than scraping the HTML. It's this easy, in fact:

import json
import urllib.request
from pprint import pprint

websource = urllib.request.urlopen('http://www.masslottery.com/data/json/games/lottery/recent.json')
data = json.loads(websource.read().decode())
pprint(data)

data is now a dict, and you can do whatever dict-like things you want with it. Good luck ;)
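Once you have found where the drawn numbers live inside data (the pprint output will show you), checking a ticket could look roughly like the sketch below. I am not assuming anything about the layout of recent.json here: the function name matching_numbers and both number lists are invented for illustration, so swap in the real values from data.

```python
def matching_numbers(my_numbers, winning_numbers):
    """Return the numbers that appear on both the ticket and the draw, sorted."""
    return sorted(set(my_numbers) & set(winning_numbers))


# Hypothetical values for illustration only; in a real script the drawn
# numbers would come from the data dict loaded above.
my_ticket = [4, 8, 15, 16, 23, 42]
drawn = [8, 11, 23, 30, 42, 47]

hits = matching_numbers(my_ticket, drawn)
print(f"{len(hits)} matching number(s): {hits}")
```

Using sets keeps the comparison independent of the order the numbers are listed in. If the JSON stores the numbers as strings, convert them with int() first so they compare equal to your own.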
