Python web scraping on a site that requires login

Posted: 2020-06-07 04:12:46

Tags: python authentication web-scraping

I'm looking for help scraping a site that requires a login. Essentially, the site provides trading-card sale prices (which I believe come from eBay), but in a format that lets you search more than 90 days back, unlike eBay's own site. The login URL is https://members.pwccmarketplace.com/login and the URL I search against is https://members.pwccmarketplace.com/. I searched previous posts and found one I thought I could adapt, but had no success. The code is below; whether or not it works as-is, any help would be appreciated.

#https://stackoverflow.com/questions/47438699/scraping-a-website-with-python-3-that-requires-login
import requests
from lxml import html
from bs4 import BeautifulSoup
import unicodecsv as csv
import os
import sys
import io
import pandas as pd
import numpy as np
from datetime import datetime, date
from time import sleep
from random import randint
from urllib.parse import quote

Product_name = []
Price = []
Date_sold = []

url = "https://www.pwccmarketplace.com/login"
values = {"email": "xyz@abc.com",
          "password": "password"}

session = requests.Session()

r = session.post(url, data=values)

Search_name = input("Search for: ")
Exclude_terms = input("Exclude these terms (- in front of each, no spaces): ")
qstr = quote(Search_name)
qstrr = quote(Exclude_terms)
Number_pages = int(input("Number of pages you want searched (Number -1): "))

pages = np.arange(1, Number_pages)

for page in pages:

    params = {"Category": 6, "deltreeid": 6, "do": "Delete Tree"}
    url = "https://www.pwccmarketplace.com/market-price-research?q=" + qstr + "+" + qstrr + "&year_min=2004&year_max=2020&price_min=0&price_max=10000&sort_by=date_desc&sale_type=auction&items_per_page=250&page=" + str(page)

    result = session.get(url, data=params)

    soup = BeautifulSoup(result.text, "lxml")

    search = soup.find_all('tr')

    sleep(randint(2,10))

    for container in search:

The code continues, but the rest is not relevant to this question.
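As an aside, the hand-built query string in the loop above can instead be expressed with requests' `params=` argument, which handles URL encoding automatically. The parameter names below are copied from the URL in the question; the helper itself is just an illustrative sketch:

```python
import requests

def build_search_url(query, exclude, page):
    # Parameter names taken from the question's URL; requests
    # URL-encodes the values, so quote() is no longer needed.
    params = {
        "q": f"{query} {exclude}",
        "year_min": 2004,
        "year_max": 2020,
        "price_min": 0,
        "price_max": 10000,
        "sort_by": "date_desc",
        "sale_type": "auction",
        "items_per_page": 250,
        "page": page,
    }
    req = requests.Request(
        "GET",
        "https://www.pwccmarketplace.com/market-price-research",
        params=params,
    )
    # prepare() builds the final URL without sending any request.
    return req.prepare().url

print(build_search_url("mickey mantle", "-psa", 1))
```

In the loop, `session.get(url)` with such a prepared URL (or `session.get(base_url, params=params)`) replaces the string concatenation; note that `data=` on a GET request sends a request body, which the server will ignore.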

1 Answer:

Answer 0 (score: 0)

When the browser POSTs to https://members.pwccmarketplace.com/login, a CSRF token is sent along in the payload. That token sits in a hidden input tag on the login page and can be scraped with BeautifulSoup before logging in:

import requests
from bs4 import BeautifulSoup

session = requests.Session()

email = "your@email.com"
password = "your_password"

r = session.get("https://members.pwccmarketplace.com/login")

soup = BeautifulSoup(r.text, "html.parser")
token = soup.find("input", { "name": "_token"})["value"]

r = session.post(
    "https://members.pwccmarketplace.com/login",
    data = {
        "_token": token,
        "redirect": "",
        "email": email,
        "password": password,
        "remember": "true"
    }
)
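The token-extraction step can be checked offline against a sample of the login form's markup. The HTML below is a hypothetical sketch of what the page might return, not a capture of the real form:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real login page's HTML.
sample_html = """
<form method="POST" action="/login">
  <input type="hidden" name="_token" value="abc123xyz">
  <input type="email" name="email">
  <input type="password" name="password">
</form>
"""

soup = BeautifulSoup(sample_html, "html.parser")
token = soup.find("input", {"name": "_token"})["value"]
print(token)  # abc123xyz
```

After the real POST, checking `r.status_code` and inspecting `r.url` or the response body for a logged-in marker is a quick way to confirm the login actually succeeded before starting the search loop.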