Scraping text from a web page using requests.post()

Date: 2019-05-19 18:40:58

Tags: python web-scraping python-requests lxml

I want to scrape text from a real estate listings web page. I can do this when I know the URL in advance, but I cannot perform a zip code search and then scrape the search results.

# I know the URL, and I can scrape data from the page successfully
from lxml import html
import requests
url = 'https://www.mlslistings.com/Search/Result/6b1a2c4f-3976-43d8-94a7-5742859f26f1/1' # this URL is the page that follows a zip code search on the 'mlslistings.com' homepage
page = requests.get(url)
tree = html.fromstring(page.content)
address_raw = list(map(str, tree.xpath('//a[@class="search-nav-link"]//text()'))) # returns addresses found on listings page
# I want to do the zip code search on the homepage, and scrape the page that follows, but this time get an empty list
url = 'https://www.mlslistings.com/'
data = {'transactionType': 'buy', 'listing_status': 'Active', 'searchTextType': '', 'searchText': '94618','__RequestVerificationToken': 'CfDJ8K_Ve2wchEZEvUasrULD6jPUmwSLRaolrWoc10T8tMJD8LVSE2c4zMKhNIRwuuwzLZPPsypcZzWaXTHX7Unk1NtVdtAIqIY8AL0DThPMv3xwVMhrzC8UumhLGSXh00oaDHDreGBlWXB2NmRAJi3MbqE'}
post = requests.post(url, data=data)
tree = html.fromstring(post.content)
address_raw = list(map(str, tree.xpath('//a[@class="search-nav-link"]//text()'))) # returns empty list! why?

2 answers:

Answer 0 (score: 1)

You probably need to use the correct RequestVerificationToken, which can be obtained by first requesting the homepage.

The following shows how it can be extracted using BeautifulSoup (feel free to use your own method). In addition, you need to submit the POST request to the correct URL.

from bs4 import BeautifulSoup
from lxml import html
import requests

sess = requests.Session()
home_page = sess.get('https://www.mlslistings.com/')
soup = BeautifulSoup(home_page.content, "html.parser")
rvt = soup.find("input", attrs={"name" : "__RequestVerificationToken"})['value']
data = {'transactionType': 'buy', 'listing_status': 'Active', 'searchTextType': '', 'searchText': '94618','__RequestVerificationToken': rvt}
search_results = sess.post("https://www.mlslistings.com/Search/ResultPost", data=data)
tree = html.fromstring(search_results.content)
address_raw = list(map(str, tree.xpath('//a[@class="search-nav-link"]//text()'))) # returns addresses found on listings page

print(address_raw)

This gives you the following addresses:

['5351 Belgrave Pl, Oakland, CA, 94618', '86 Starview Dr, Oakland, CA, 94618', '1864 Grand View Drive, Oakland, CA, 94618', '5316 Miles Ave, Oakland, CA, 94618', '280 Caldecott Ln, Oakland, CA, 94618', '6273 Brookside Ave, Oakland, CA, 94618', '50 Elrod Ave, Oakland, CA, 94618', '5969 Keith Avenue, Oakland, CA, 94618', '6 Starview Dr, Oakland, CA, 94618', '375 62nd St, Oakland, CA, 94618', '5200 Masonic Ave, Oakland, CA, 94618', '49 Starview, Oakland, CA, 94618', '4863 Harbord Dr, Oakland, CA, 94618', '5200 Cochrane Ave, Oakland, CA, 94618', '6167 Acacia Ave, Oakland, CA, 94618', '5543 Claremont Ave, Oakland, CA, 94618', '5283 Broadway Ter, Oakland, CA, 94618', '0 Sheridan Rd, Oakland, CA, 94618']
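The token lookup in the script above boils down to reading the value attribute of a hidden form input. On a minimal, hypothetical snippet (not the real mlslistings.com markup) it behaves like this:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the homepage form markup.
html = '<form><input type="hidden" name="__RequestVerificationToken" value="tok-42"></form>'

soup = BeautifulSoup(html, "html.parser")
# find() locates the input by its name attribute; ["value"] reads the token.
rvt = soup.find("input", attrs={"name": "__RequestVerificationToken"})["value"]
print(rvt)  # → tok-42
```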

Answer 1 (score: 1)

To avoid hard-coding the names and values in the payload, while still fetching the verification token on the fly, you can try the following. This script is based on the lxml parser; stick with either lxml or BeautifulSoup, but don't mix both.

import requests
from lxml.html import fromstring

gurl = 'https://www.mlslistings.com/' #url for get requests
purl = 'https://www.mlslistings.com/Search/ResultPost' #url for post requests

with requests.Session() as session:
    r = session.get(gurl)
    root = fromstring(r.text)
    payload = {item.get('name'):item.get('value') for item in root.cssselect('input[name]')}
    payload['searchText'] = '94618'
    res = session.post(purl,data=payload)
    tree = fromstring(res.text)
    address = [item.text.strip() for item in tree.cssselect('.listing-address a.search-nav-link')]
    print(address)
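The dictionary comprehension that builds the payload simply collects every named input on the page. A sketch on a small, hypothetical HTML fragment (the real homepage has more fields) shows the idea:

```python
from lxml.html import fromstring

# Hypothetical snippet standing in for the mlslistings.com homepage form.
html_snippet = """
<form>
  <input name="__RequestVerificationToken" value="abc123">
  <input name="transactionType" value="buy">
  <input type="submit" value="Search">
</form>
"""

root = fromstring(html_snippet)
# Only inputs that have a name attribute end up in the payload;
# the nameless submit button is skipped.
payload = {item.get('name'): item.get('value')
           for item in root.xpath('//input[@name]')}
payload['searchText'] = '94618'  # add/overwrite the search term
print(payload)
```

Because the token is read from the live page on every run, the POST payload stays valid even though the token changes between sessions.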