Scraping a website that requires input, using Selenium and BeautifulSoup?

Posted: 2020-01-07 21:57:10

Tags: python selenium beautifulsoup

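I am trying to scrape the Western Union send money website to get the current "Euro Blue" exchange rate against the Argentine peso. Western Union is the only company that gives you the real exchange rate, the one that is also traded on the street. If you are interested in how this secondary market for currency trading in Argentina developed, look up "Dollar-Blue".

My goal is to get the rate for converting euros into Argentine pesos. To reach it on the website, you first have to click an "Accept" button and then type the name of the country you want to send money to; only after that step is the exchange rate shown.

I first tried requests, but since it cannot handle JavaScript, I switched to Selenium, and I am now quite close.

My code is as follows: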

import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

WesternUnion = 'https://www.westernunion.com/de/en/web/send-money'

# create a new Chrome session
driver = webdriver.Chrome()
driver.implicitly_wait(30)
driver.get(WesternUnion)

python_button = driver.find_element_by_id('button-fraud-warning-accept')
python_button.click()

time.sleep(0.25)
python_button = driver.find_element_by_id('country')
python_button.click()  # focus the country input field
time.sleep(0.15)
text_area = driver.find_element_by_id('country')
text_area.send_keys("Argentina")

soup = BeautifulSoup(driver.page_source, 'lxml')

div = soup.find('div', id="OptimusApp")
div2 = soup.find('div', class_="text-center")

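The problem is that the exchange rate is not displayed when I do this with Python (screenshot: navigated automatically with Python), whereas it is displayed when I perform exactly the same steps by hand (screenshot: navigated by hand).

I am still very new to scraping and Python. Does anyone have a simple solution to this problem?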

2 answers:

Answer 0 (score: 1)

I made some modifications to your code, adding a couple of optional arguments, and on execution got the following result:

  • Code block:

    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC

    # Start Chrome maximized, with the automation switch and extension disabled
    options = webdriver.ChromeOptions()
    options.add_argument("start-maximized")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
    driver.get('https://www.westernunion.com/de/en/web/send-money')
    # Accept the fraud warning once the button is clickable
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#button-fraud-warning-accept"))).click()
    # Click the country field and type the destination country
    python_button = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input#country")))
    python_button.click()
    python_button.send_keys("Argentina")
    # Print the text of the exchange rate element once it is visible
    print(WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "span#smoExchangeRate"))).text)

  • Console output:

    1.00 EUR = Argentine Peso (ARS)

  • Observation: like you, I found that the exchange rate is not displayed.
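
To check this condition in a script rather than by eye, the text of the rate element can be tested for a number. The following is only a minimal sketch: it assumes the label has the form "1.00 EUR = <number> Argentine Peso (ARS)" when the rate is shown, and the helper name parse_rate and the value 65.4321 are hypothetical.

```python
import re
from typing import Optional

# Assumed label layout: "1.00 EUR = <number> Argentine Peso (ARS)".
# When the site withholds the rate, the number after "=" is simply absent.
RATE_PATTERN = re.compile(r"=\s*(\d+(?:\.\d+)?)")


def parse_rate(label_text: str) -> Optional[float]:
    """Return the rate found after '=', or None if no number follows it."""
    match = RATE_PATTERN.search(label_text)
    if match is None:
        return None
    return float(match.group(1))


# Label as printed when the rate is missing
print(parse_rate("1.00 EUR = Argentine Peso (ARS)"))
# Hypothetical label with a rate filled in
print(parse_rate("1.00 EUR = 65.4321 Argentine Peso (ARS)"))
```

A helper like this could be applied to the .text of span#smoExchangeRate, so that a missing rate is reported as None instead of being parsed as data.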


Deep dive

Inspecting the DOM tree of the web page shows that some of the <script> and <link> tags refer to JavaScript and CSS files whose paths contain the keyword dist. For example:

  • <script src="/content/wucom/dist/2.7.1.8f57d9b1/js/smo-configs/smo-config.de.js"></script>
  • <link rel="stylesheet" type="text/css" href="/content/wucom/dist/2.7.1.8f57d9b1/css/responsive_css.min.css">
  • <link rel="stylesheet" href="https://nebula-cdn.kampyle.com/resources/dist/assets/css/liveform-web-vendor-f84dfc85d6.css">
  • <link rel="stylesheet" href="https://nebula-cdn.kampyle.com/resources/dist/assets/css/kampyle/liveform-web-style-a4ce961d15.css">
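  • <script src="https://nebula-cdn.kampyle.com/resources/dist/assets/js/liveform-web-vendor-919a2c71c3.js"></script>
  • <script src="https://nebula-cdn.kampyle.com/resources/dist/assets/js/liveform-web-app-2c4e3adeb6.js"></script>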

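This clearly indicates that the website is protected by the bot management service provider Distil Networks, and that navigation by ChromeDriver is detected and subsequently blocked.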
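
The inspection described above can be repeated programmatically. The sketch below uses only the standard library's html.parser so that it is self-contained; the names AssetCollector and dist_assets and the /static/app.js entry are hypothetical, and in practice the input would be driver.page_source. A path match like this is only a heuristic and does not by itself prove which vendor is involved.

```python
from html.parser import HTMLParser


class AssetCollector(HTMLParser):
    """Collect the src/href URLs of <script> and <link> tags."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("script", "link"):
            return
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.urls.append(value)


def dist_assets(html, keyword="/dist/"):
    """Return the asset URLs in the HTML whose path contains the keyword."""
    collector = AssetCollector()
    collector.feed(html)
    return [url for url in collector.urls if keyword in url]


# Two tags taken from the list above plus one hypothetical non-matching tag
SAMPLE_HTML = """
<script src="/content/wucom/dist/2.7.1.8f57d9b1/js/smo-configs/smo-config.de.js"></script>
<link rel="stylesheet" href="https://nebula-cdn.kampyle.com/resources/dist/assets/css/liveform-web-vendor-f84dfc85d6.css">
<script src="/static/app.js"></script>
"""

for url in dist_assets(SAMPLE_HTML):
    print(url)
```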


Distil

According to the article There Really Is Something About Distil.it...:

Distil protects sites against automatic content scraping bots by observing site behavior and identifying patterns peculiar to scrapers. When Distil identifies a malicious bot on one site, it creates a blacklisted behavioral profile that is deployed to all its customers. Something like a bot firewall, Distil detects patterns and reacts.

Further,

"One pattern with Selenium was automating the theft of Web content," Distil CEO Rami Essaid said in an interview last week.



Answer 1 (score: 0)

The exchange rate comes from https://www.westernunion.com/wuconnect/prices/catalog via a POST request. For example:

  • Assume a $payload variable containing:
{
  "header_request": {
    "version": "0.5",
    "request_type": "PRICECATALOG",
    "correlation_id": "web-x",
    "transaction_id": "web-x"
  },
  "sender": {
    "client": "WUCOM",
    "channel": "WWEB",
    "cty_iso2_ext": "DE",
    "curr_iso3": "EUR",
    "funds_in": "*",
    "send_amount": 300,
    "air_requested": "Y",
    "efl_type": "STATE",
    "efl_value": "CA"
  },
  "receiver": {
    "curr_iso3": "ARS",
    "cty_iso2_ext": "AR",
    "cty_iso2": "AR"
  }
}
  • Assume an innocent-looking User-Agent
  • Then curl -s 'https://www.westernunion.com/wuconnect/prices/catalog' --data-raw "$payload" | jq '.services_groups[0].pay_groups[0] | .fx_rate' returns it.

This used to work (until a few weeks ago).
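
For reference, the curl and jq pipeline above translates to Python roughly as follows. This is a sketch under assumptions, not a verified client: the Content-Type header is an assumption (the curl command does not set one explicitly), the function names are hypothetical, and the request reflects the endpoint only as it behaved when it accepted plain requests.

```python
import json
import urllib.request

CATALOG_URL = "https://www.westernunion.com/wuconnect/prices/catalog"

# Same payload as the $payload variable in the curl example
PAYLOAD = {
    "header_request": {
        "version": "0.5",
        "request_type": "PRICECATALOG",
        "correlation_id": "web-x",
        "transaction_id": "web-x",
    },
    "sender": {
        "client": "WUCOM",
        "channel": "WWEB",
        "cty_iso2_ext": "DE",
        "curr_iso3": "EUR",
        "funds_in": "*",
        "send_amount": 300,
        "air_requested": "Y",
        "efl_type": "STATE",
        "efl_value": "CA",
    },
    "receiver": {"curr_iso3": "ARS", "cty_iso2_ext": "AR", "cty_iso2": "AR"},
}


def extract_fx_rate(catalog):
    """Mirror the jq filter .services_groups[0].pay_groups[0] | .fx_rate."""
    return catalog["services_groups"][0]["pay_groups"][0]["fx_rate"]


def fetch_fx_rate(user_agent, timeout=10.0):
    """POST the payload and return the rate; raises if the request is refused."""
    request = urllib.request.Request(
        CATALOG_URL,
        data=json.dumps(PAYLOAD).encode("utf-8"),
        headers={
            "User-Agent": user_agent,
            "Content-Type": "application/json",  # assumption, not set by the curl command
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_fx_rate(json.load(response))
```

Keeping extract_fx_rate separate from the network call makes it possible to check the parsing against a saved response without contacting the site.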

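But the endpoint is now protected: it requires a set of custom cryptographic headers that the browser computes, relying heavily on obfuscated and convoluted JavaScript. They look like this: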

X-NYUPe9Cs-a: IExHQTfwEnWwuyWbWjmR2fyBEQW9X9nnqFqIio78zzCKFA78iBDudN=NnOpQd=725d_urqfAN2sKK7UOdTnkCpUqFvQ9TF2nK=M1jDmrMBYy-4iq5kUqSdEN1PjBjEC=Nx742P1np7qAKK8q8qWd5UQIQ8Wqnqx51np7kIavPFenB9dSvnKou0A2nfv7qE-q7k_2EdNyuKffAYxcqbnjnCYIDfe=IKCc8JdPzpDecynafP1fVKq=z2SJCKiaMXu-Dxp2z5CpfznOPcs4WFH2D4C5JTTnDDUQ7vOPFVKnKCdcamPqOnK8wOQb9FYoxWs=Pksn4vmeC5Ia9EoVReH8uj0q_PRu2q522kk-9jnRTYJIP9VWP_50hhxPMds9eX_kAC2DbBnKzy24sICkO7bkkyAT82s5YuKECP=fnzXixxC8=81WX4jqnNBJ_qxbbqV=InUWmKYWimbUaB5qwOCA2iqSXNDw25PmHq8_2XEAx7nTnjkwYS2qvNBa8sAjxxHU8ibNFr_iiZH=4JuS2Q=RJrnTDonA1vFxKe812s-CMJ8HFay0VqrC2kQZVzCV2w0bqZyEuJksehxE22W8-Smd5V5XnvENHFcn72wkeN=boc=PIbv=XYNqEknrCyEX2r8BJvYCipnKdnkohrIvPovqfJMB7emybSTy2Eeu9h9VBrqYMW2NrXb2wc1kxC5WJAFv_cXE_vqsvRqeS-wYJ9vD1Y-1Cvo8RRqkFWAXuq1CBYXndSQ_A1e0aqO7sTB=nyKFd1=rJ4=z15z-qFMEQfy_x=qedJTzvWf8SE9yMqVCYUuSrhMnpEFdeJYiEdX-KS2In0-uZ0zzrn2qn27zY-jo7qkrvrq8V8v2aACd7PFEnMbCyUUUI-MdTcD8nCDiC2yuPOpbUcwID7Y1d=2aIubdAhErSn82C9FnSm9IVj8Z_WHwBvBPCI_o=_2pdRVk0jS5qYb_OjyVrrxqXnZOp9TVnAVnWZOWn798a8qhX-hYuFjJ-z84rzQRo2M70vHAMuNSMT_8yqkrujEr7JcyU2CmY1NKpev0w8R19227=qVqdemsq00nx-UAYz0=UYA2hT2IaqoqRie7Jbzjikb2snnnQynoHUpnYxRVs9ORc7I2MVhqqCVonnVk5Pi1xns2--iqqSKH8Rhium-nRcWurBu=TFiZ-5Qq-_WDiMQ5n7BqmAZkjWZM97MNkqakw8nq9CXav2fq4OqUok997VTOFkP7DEm-W5ckkwInQNMBNqTrK25DnSHRiyP5m5zqh1RjWp48f_9QCO2HiPS9A8j58zoF_8abn0H1qUERd_Cq8-7zqOnkEeAAWCywi18wUD5qfbQd22BJDNq90sMSbNVsJy0P2CBf-hq9fjSCB=uA5y8xT2-CJunFwUCx85ujxiq-bu5BAbSpqUCAXDP8iq02ET5-xRq7CD22n=E4keqVnKpzq2=RUKWP_jDnsiKRn4xxsRM0QYnbCC=m2KjCE9BjJ1nrn8EDvUS52bmaixqosRq5SNOPEHKyrQy8nqI9E9OAMYm5=TpVNvn-oqeDF_-jkcqIdyHqn1QYxaZbn4xVFqIOzQ9eV7A9QbC5zPcPeD=qqpqqK=YxNzKwTSCOnA70SrhiB2r1VkqKuuBJQYZoIC_87Mmuo8znpQnH29fI7Oh99sKO5aoEQIMOrIDwQDZvWqwwH=ZKnnn8T=5o9MTdDkpr472DPdqOEq8Ffii0q00r8OwkZX_oXY2UEKdCaX88zZamSqaY8iZzqiIYdeMjqMFKqVAv-82PxBWQv1Kr1OibYSh0QTp14BqBhEf-WKrVECI_y7517nZa8ndFpjznkfcnY2KufY0iFwnx2zx99iuUbF84nerZH88Rxx=pKBbsjeqJZ-0xZScnrn9hReJ--oh40mcxMXn1V0PzwcMaEACo0dWouDZeZYHViqd9RQAnso2DIF-wI-Pe_q5srKK8nmCZNI2hZqwjzOM7bwF4_4-S=9BzYF
DaYw0SknMJTq9VReaM297ir-CYsdM9VN29TpDRnC=8aQ5o9yXZpEDyfqmJuwzs7N7he8FPrfIdDVK5iaW8Jm8YcHnqnno7EHSqKeTRNuzkeHqcn0u87OX=ByhQMQJ4QacaxqqFVmPqQEHSVbx1PsQDq780PWDKbvK5PBMnZksBZm0VIOHxu_q2xnfPWsixuqaIm2sXn2Jz2yByvdNeT5r2F14zEaiiEFfNqICZ_DHCXpr2K4HURNd5n_vyJTe2UVakZE_9T01W9cFUxBOur0xfN0=h4vmOoUAnwISSDxc5EmAefWviW2PvqevpnnS7YuMPMY5aHi2c2RrP=i-mfPpKzRSHpAn82sJ9izMdWcWq=qI5O_UBm==vFHrFOzHQK8AH9qcRM8=KHpwyoV-b0WzuErxZhZmMV_iKors2JCAeWn-jn-q_Mrqau1Xz88nTBQFO=vnKPfFoqY9Z81KUqyAn2N5dwbnKWHUZh4Ke4OnyOr=22=rKZneB9PmQDUDq=97vOSqqNq=bHNriSf=xT48cXy7AqWOnncwEqwbVcA25ds8O8S0WI9=ipEfIyiiJ7qSMoHY=kn7rwiE94jsVx5n7Syj=m58Fqvi=HCFI0Bwf8byFhWbeJsAK5UaDqchCY5qC9n-OUqmeJHay8OAqm-HQPnP9qBfyd08nini0FsrdvHmru4qA=sK4OKmzcY_wSj8D8D2jBQWHF2avq4UP8-D2Ysh4C_bXXhqmqK9RPyuXRoeC5Oad-FmUXy_5F_r0OKEnrAMC
X-NYUPe9Cs-f: A_v7kP18AQAAbfq9_kCtmTqfX2Eq0otHnwqUQCck5dPjX88Nxz2rTVnAnVxYAcmzs1ScuAA7wH8AADQwAAAAAA==
X-NYUPe9Cs-b: -8qa21q
X-NYUPe9Cs-c: AOBWjv18AQAAqntYtdrBc9F0C0KawiRISfcOH_ruhEoV4NNn-IemnXnq5vi1
X-NYUPe9Cs-d: AAaixIihDKqOocqASZAQjICihCKHpi15Rub4tUEPqzn1Pxi1AAd7zRXqBBDKOTmM_r5nbhq
X-NYUPe9Cs-z: q

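This set of headers is only valid for a limited time (no more than 24 hours, AFAICT).

I would be curious to see someone pin down where the logic lives (some cryptographic initialization vectors are probably provided by cookies delivered during the initial page load). If so, Node.js could compute this set of headers.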