I am trying to fetch a Zillow URL. The page I get in the browser is different from what I get when I request it from code. Details are below.
CURL
curl 'http://www.zillow.com/homes/KY_rb/' -H 'Host: www.zillow.com' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Accept-Language: en-US,en;q=0.5' --compressed -H 'Referer: http://www.zillow.com/homes/fsbo/featured_sort/47.368594,-68.686523,28.110749,-124.936523_rect/3_zm/' -H 'Cookie: JSESSIONID=D9BF4E280B16431893C3A11A8FC3F825; abtest=3|DO8RElLJuj2felZqqw; zguid=23|%24b42a26dc-8387-4086-b000-cc49ddfbc450; search=6|1480915840720%7Crect%3D47.368594%252C-68.686523%252C28.110749%252C-124.936523%26zm%3D3%26disp%3Dmap%26mdm%3Dauto%26p%3D1%26sort%3Dfeatured%26z%3D1%26lt%3Dfsbo%26fs%3D1%26fr%3D0%26mmm%3D1%26rs%3D0%26ah%3D0%26singlestory%3D0%09%01%09%09%09%092%090%09US_%09; F5P=3005270026.0.0000; _ga=GA1.2.1136269898.1478324471; _gat=1; __gads=ID=3f2f3e2d6e19b149:T=1478323799:S=ALNI_Mava6ZGjT_MrRhAVG7ndewcDCN60A; ipe_s=fbc57b01-3937-f803-5da1-5c4887cc949d; _bizo_bzid=aa621351-3627-408d-8838-440c1bd3f163; _bizo_cksm=EE838E07FF3AF15E; ipe.29115.pageViewedCount=1; _bizo_np_stats=14%3D1028%2C' -H 'Connection: keep-alive' -H 'Upgrade-Insecure-Requests: 1'
curl returns the correct result.
Fetch.py
import requests
from bs4 import BeautifulSoup
from time import sleep
import xmltodict
state = 'KY'
url = 'http://www.zillow.com/homes/' + state + '_rb/'
property_urls = []
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
    # header values must be strings; an int here is rejected by recent requests versions
    'upgrade-insecure-requests': '1',
    'accept-language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
}
try:
    session = requests.session()
    r = session.get(url, headers=headers, timeout=5)
    sleep(2)
    html = r.text
    soup = BeautifulSoup(html, 'lxml')
    print(html)
except requests.ConnectionError as e:
    print("OOPS!! Connection Error. Make sure you are connected to Internet. Technical Details given below.\n")
    print(str(e))
except requests.Timeout as e:
    print("OOPS!! Timeout Error")
    print(str(e))
except requests.RequestException as e:
    print("OOPS!! General Error")
    print(str(e))
except KeyboardInterrupt:
    print("Someone closed the program")
finally:
    print("Total Properties = " + str(len(property_urls)))
    try:
        # file to store state based URLs; the with block closes it on exit
        with open(state + '_file.txt', 'a+') as state_file:
            state_file.write("\n".join(property_urls))
    except Exception as ex:
        print("Unable to store records in file. Technical details below.\n")
        # report the exception caught by this handler (ex, not e)
        print(str(ex))
Answer 0 (score: 0)
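I'm not sure what "different data" means here (it could mean anything: slightly different, completely different, and so on). Your curl command uses --compressed, which in effect sends the request header Accept-Encoding: deflate, gzip. Try adding that header in your Python code.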
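The suggestion above could be tried roughly as follows. This is only a sketch: it reuses the URL and headers from Fetch.py, adds the Accept-Encoding value mentioned in this answer, and uses a hypothetical helper named headers_to_send (not part of the question's code) that prepares the request without sending it, so the outgoing headers can be compared with the -H flags of the curl command.

```python
import requests

# URL and headers copied from the question's Fetch.py, plus the
# Accept-Encoding value suggested in this answer.
URL = 'http://www.zillow.com/homes/KY_rb/'
HEADERS = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
    'upgrade-insecure-requests': '1',  # header values must be strings
    'accept-language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'deflate, gzip',
}


def headers_to_send(url, headers):
    """Hypothetical helper: return the headers a requests session would
    send for a GET to `url`, without making any network call."""
    session = requests.Session()
    request = requests.Request('GET', url, headers=headers)
    # prepare_request merges the session's default headers with ours
    prepared = session.prepare_request(request)
    return dict(prepared.headers)


if __name__ == '__main__':
    # Print the merged headers so they can be compared with the curl call
    for name, value in sorted(headers_to_send(URL, HEADERS).items()):
        print(name + ': ' + value)
```

Note that requests normally sends its own default Accept-Encoding header, so this change alone may not alter the response. Printing the prepared headers is a quick way to see what still differs from the curl call, for example the Cookie and Referer headers, which the curl command sends and Fetch.py does not.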