Question

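For a project, I need to scrape data from another website, and I've run into a problem.

When I view the page source in the browser, everything I want is in a table, so it looks easy to scrape. But when I run my script, part of the source doesn't show up.

Here is my code. I tried different things. At first I sent no headers, then I added some, but it made no difference.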

# import libraries
import requests
from bs4 import BeautifulSoup

# specify the url
quote_page = 'http://www.airpl.org/Pollens/pollinariums-sentinelles'

# query the website, sending a browser-like User-Agent header
# (headers must be passed to requests.get; setting them on the response has no effect)
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(quote_page, headers=headers)
print(response.text)

# parse the html using Beautiful Soup and store it in the variable `soup`
soup = BeautifulSoup(response.text, 'html.parser')

# write the parsed html to a UTF-8 text file
with open('allergene.txt', 'w', encoding='utf-8') as f:
    f.write(str(soup))

What I'm looking for on the site is the HTML after "Herbacée", which looks like this:

<p class="level1">

      <img src="/static/img/state-0.png" alt="pas d'émission" class="state">

    Herbacee
  </p>

Do you know what's wrong?

Thanks for your help, and happy new year everyone :)

Answer 1

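This page uses JavaScript to render the table, so the table is not in the HTML that requests downloads. The page that actually contains the table is: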

http://www.alertepollens.org/gardens/garden/1/state/

You can find this URL in Chrome DevTools > Network.
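As a rough sketch of how the underlying page could be scraped directly, the snippet below parses `<p class="level1">` blocks and reads the state from the `alt` text of the nested `<img class="state">`. It assumes the markup served at that URL matches the snippet in the question, which has not been verified here. The names `Level1Parser`, `parse_states` and `fetch_states` are made up for illustration. It uses only the standard library so it runs without extra installs; the same extraction could be done with BeautifulSoup, for example via `soup.select('p.level1')`.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# URL given in the answer; the markup handled below is assumed to match
# the snippet shown in the question.
STATE_URL = "http://www.alertepollens.org/gardens/garden/1/state/"


class Level1Parser(HTMLParser):
    """Collect (name, state) pairs from <p class="level1"> blocks."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished (name, state) pairs
        self._in_p = False    # True while inside a <p class="level1">
        self._state = None    # alt text of the nested state <img>
        self._text = []       # text fragments seen inside the current <p>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "p" and "level1" in classes:
            self._in_p, self._state, self._text = True, None, []
        elif tag == "img" and self._in_p and "state" in classes:
            self._state = attrs.get("alt")

    def handle_data(self, data):
        if self._in_p:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Collapse the surrounding whitespace and newlines into one name.
            name = " ".join("".join(self._text).split())
            self.rows.append((name, self._state))
            self._in_p = False


def parse_states(html):
    """Return a list of (name, state) tuples found in the given HTML."""
    parser = Level1Parser()
    parser.feed(html)
    parser.close()
    return parser.rows


def fetch_states(url=STATE_URL):
    """Download the page (network access required) and parse it."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return parse_states(resp.read().decode(charset, errors="replace"))
```

Calling `fetch_states()` would request the live URL and return the parsed pairs; `parse_states(html)` can be tried offline on a saved copy of the page. If the real markup differs from the question's snippet, the class names checked in `handle_starttag` would need adjusting.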

HTML in the browser does not match the data scraped in Python
