I'm having trouble scraping a website using BeautifulSoup4 and Python3. I'm using dryscrape to get the HTML since it requires JavaScript to be enabled in order to be shown (but as far as I know it's never used in the page itself).
This is my code:
from bs4 import BeautifulSoup
import dryscrape
productUrl = "https://www.mercadona.es/detall_producte.php?id=32009"
session = dryscrape.Session()
session.visit(productUrl)
response = session.body()
soup = BeautifulSoup(response, "lxml")
container1 = soup.find("div","contenido").find("dl").find_all("dt")
container3 = soup.find("div","contenido").find_all("td")
Now I want to read the contents of container3, but:
type(container3)
returns:
bs4.element.ResultSet
which is the same as type(container1), but its length is 0!
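This behavior is easy to reproduce on a toy snippet (hypothetical HTML, not the real page): find_all() never raises when nothing matches, it just returns an empty ResultSet, so type() alone can't tell you whether the tags were actually found.

```python
from bs4 import BeautifulSoup

# Miniature stand-in for the page: a "contenido" div with a <dl>
# but no <td> elements at all.
html = '<div class="contenido"><dl><dt>Name</dt></dl></div>'
soup = BeautifulSoup(html, 'html.parser')

dts = soup.find('div', 'contenido').find_all('dt')
tds = soup.find('div', 'contenido').find_all('td')

print(type(dts) == type(tds))  # True: both are bs4.element.ResultSet
print(len(dts), len(tds))      # 1 0: only len() reveals the miss
```

So an empty ResultSet here means the `<td>` tags simply are not present in the HTML dryscrape returned.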
So I wanted to know what I was getting in container3 before looking for my <td> tag, so I wrote it to a file.
container3 = soup.find("div", "contenido")
with open("soup_file.html", "w") as soup_file:
    soup_file.write(container3.prettify())
And, here is the link to that file: https://pastebin.com/xc22fefJ
It gets all messed up just before the table I want to scrape. I can't understand why; looking at the page source in Firefox, everything looks fine.
Answer 0 (score: 0)
Here is the updated solution:
import requests

url = 'https://www.mercadona.es/detall_producte.php?id=32009'
rh = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9',
'Connection': 'keep-alive',
'Host': 'www.mercadona.es',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}
s = requests.Session()
r = s.get(url, headers = rh)
The response to this gives you the message Please enable JavaScript to view the page content.
However, it also contains the necessary hidden data that the browser sends using
JavaScript, which can be seen in the Network tab of the developer tools.
TS015fc057_id: 3
TS015fc057_cr: a57705c08e49ba7d51954bea1cc9bfce:jlnk:l8MH0eul:1700810263
TS015fc057_76: 0
TS015fc057_86: 0
TS015fc057_md: 1
TS015fc057_rf: 0
TS015fc057_ct: 0
TS015fc057_pd: 0
Of these, the second one (the long string) is generated by JavaScript. We can use a library like js2py
to run that code, which will return the required string to pass along in the request.
import re
import js2py

soup = BeautifulSoup(r.content, 'lxml')
script = soup.find_all('script')[1].text
js_code = re.search(r'.*(function challenge.*crc;).*', script, re.DOTALL).groups()[0] + '} challenge();'
js_code = js_code.replace('document.forms[0].elements[1].value=', 'return ')
hidden_inputs = soup.find_all('input')
hidden_inputs[1]['value'] = js2py.eval_js(js_code)
fd = {i['name']: i['value'] for i in hidden_inputs}
rh = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Referer': 'https://www.mercadona.es/detall_producte.php?id=32009',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9',
'Connection': 'keep-alive',
'Content-Length': '188',
'Content-Type': 'application/x-www-form-urlencoded',
'Cache-Control': 'max-age=0',
'Host': 'www.mercadona.es',
'Origin': 'https://www.mercadona.es',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}
# NOTE: the next one is a POST request, as opposed to the GET request sent before
r = s.post(url, headers = rh, data = fd)
soup = BeautifulSoup(r.content, 'lxml')
Here is the result:
>>> len(soup.find('div', 'contenido').find_all('td'))
70
>>> len(soup.find('div', 'contenido').find('dl').find_all('dt'))
8
EDIT:
Apparently, the JavaScript code only needs to be run once. The resulting data can be used for multiple requests, like this:
for i in range(32007, 32011):
    r = s.post(url[:-5] + str(i), headers = rh, data = fd)
    soup = BeautifulSoup(r.content, 'lxml')
    print(soup.find_all('dd')[1].text)
Result:
Manzana y plátano 120 g
Manzana y plátano 720g (6x120) g
Fresa Plátano 120 g
Fresa Plátano 720g (6x120g)