Question

I am trying to access this site to get information: https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=B&b=1&page=0

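I tried reusing code that works on other sites, but it only leaves an empty text file instead of filling it with data as it does for the other sites. Here is my code: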

import urllib
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import json
import time
outfile = open('/Users/Luca/Desktop/test/farm_data.text','w')
my_list = list()

site = "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=A&b=1&page=0"
my_list.append(site)
site = "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=B&b=1&page=0"
my_list.append(site)
site = "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=C&b=1&page=0"
my_list.append(site)


for item in my_list:
    time.sleep(5)
    html = urlopen(item)
    bsObj = BeautifulSoup(html.read(), "html.parser")
    nameList = bsObj.prettify().split('.')
    count = 0
    for name in nameList:
        print(name[2:])
        outfile.write(name[2:] + ',' + item + '\n')

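I tried breaking it into smaller parts and working from there. I have used this code on the following site, for example, and it works there: https://www.mtggoldfish.com/price/Aether+Revolt/Heart+of+Kiran#online

Why does it work for some sites and not others? Many thanks.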

Answer 1

The site in question probably does not allow web scraping, which is why you get:

HTTPError: HTTP Error 403: Forbidden
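
If you want to confirm that this is what happens before changing anything else, a minimal sketch along the following lines may help. It uses only the standard library; the helper name `status_of` is made up for illustration. It returns the status code instead of letting the exception stop the script.

```python
import urllib.error
import urllib.request


def status_of(url):
    """Return the HTTP status code the server sends back for url."""
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # HTTPError carries the status code of the refused request, e.g. 403.
        return err.code
```

A result of 403 from `status_of(site)` would suggest the server is refusing the request itself, rather than the parsing step going wrong.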

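You can spoof your user agent so that the request looks like it comes from a browser. Below is an example of doing this with the excellent requests module: you pass a User-Agent header when making the request.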

import requests
from bs4 import BeautifulSoup

url = "https://farm.ewg.org/addrsearch.php?stab2=NY&fullname=A&b=1&page=0"
# Send a browser-like User-Agent so the server does not reject the request
html = requests.get(url, headers={'User-Agent' : 'Mozilla/5.0'}).text
bsObj = BeautifulSoup(html, "html.parser")
print(bsObj)

Output:

<!DOCTYPE doctype html>    
<html class="no-js" lang="en" prefix="og: http://ogp.me/ns#" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://ogp.me/ns/fb#">
<head>
<meta charset="utf-8"/>
.
.
.

You can now plug this code into your loop.
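
As a rough sketch of what that loop could look like: the version below swaps requests for the standard library's `urllib.request.Request`, which also accepts a `headers` argument, so it runs without extra packages. The helper names `fetch_html` and `save_pages` are invented for this example, and writing one line of HTML per output row is only an assumption about the format you want; the returned HTML can just as well be passed to BeautifulSoup first.

```python
import time
import urllib.request

# Assumption: a generic browser-like User-Agent string is enough for this site.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_html(url):
    """Download url with a browser-like User-Agent and return the decoded body."""
    request = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(request, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def save_pages(urls, out_path, delay=5):
    """Fetch each URL in turn and write its non-empty lines to out_path."""
    with open(out_path, "w", encoding="utf-8") as outfile:
        for url in urls:
            html = fetch_html(url)
            for line in html.splitlines():
                if line.strip():
                    outfile.write(line.strip() + "," + url + "\n")
            time.sleep(delay)  # pause between requests, as in the question
    # Leaving the with-block closes the file, so buffered data reaches disk.
```

With the question's list, this would be called as `save_pages(my_list, '/Users/Luca/Desktop/test/farm_data.text')`. Opening the file in a `with` block also guarantees it is closed even if one of the requests raises.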
