Web scraper - Python requests POST returns no data

Time: 2017-11-21 19:44:01

Tags: python jsf web-scraping xmlhttprequest python-requests

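I am working on my first web scraper, and I am trying to get some telephone number data for Mexico. The website that provides the data is: site, and it works through XHR requests. So far I have this code: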

from requests import Request, Session
import xml.etree.ElementTree as ET
import requests
import lxml.etree as etree

url = 'https://sns.ift.org.mx:8081/sns-frontend/consulta-numeracion/numeracion-geografica.xhtml'

s = Session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
    'Content-Type': 'text/html; charset=UTF-8',
}

str1 = s.post(url, headers=headers) #Loading the page
xhtml=str1.text.encode('utf-8')

#Saving the first response, to get the ViewState
text_file = open("loaded.txt", "w")
text_file.write(xhtml)
text_file.close()
x = ET.fromstring(xhtml)

namespace = "{http://www.w3.org/1999/xhtml}"
path = './/*[@id="javax.faces.ViewState"]'

e = x.findall(path.format(namespace))
for i in e:
    VS = i.attrib['value'] #ViewState

print VS #ViewState

At this point I have the page's ViewState. Now I send a new POST containing the form data, the number I want to look up, and the ViewState.

data = {
    "javax.faces.partial.ajax": "true",
    "javax.faces.source": "FORM_myform:BTN_publicSearch",
    "javax.faces.partial.execute": "@all",
    "javax.faces.partial.render": "FORM_myform:P_containerConsulta+FORM_myform:P_containerpoblaciones+FORM_myform:P_containernumeracion+FORM_myform:P_containerinfo+FORM_myform:P_containerLocal+FORM_myform:P_containerDesplegable",
    "FORM_myform:BTN_publicSearch": "FORM_myform:BTN_publicSearch",
    "FORM_myform": "FORM_myform",
    "FORM_myform:TXT_NationalNumber": "6564384757",
    "javax.faces.ViewState=": VS #ViewState
}

req = s.post(url, data=data, headers=headers)
#Saving the new response, this is supposed to bring the results
text_file = open("Output.txt", "w")
text_file.write(req.text.encode('utf-8'))
text_file.close()

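The problem is that the response I get is the complete code of the page with no information in it. I noticed that it comes with a new ViewState, and I believe that is why the data is not being looked up. Also, I don't want to use Selenium, because I have no graphical interface on the server and I need to look up a lot of numbers every day.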

... ... UPDATE I believe the problem lies in JSF; I need to know how to handle the data and the JSF values.

2 answers:

Answer 0: (score: 0)

To get data from a website with requests, you first need this...

r = requests.get(url)

After that I would print the text of the result held in the 'r' variable...

print (r.text)

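Then I would use a loop, treat the response text as an array (r.text[0]), and check whether any part of the text looks like a phone number. This is just one way to do what you are trying to do with a web scraper, and it does not use XML at all.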

All in all, my code would look like this...

import requests

url = "myurl"
r = requests.get(url)
text = r.text  # response body as a string; the Response object itself cannot be indexed
counter = 0
length = len(text)
while counter != length:
    # a digit may be the start of a phone number
    if text[counter] in '0123456789':
        data = text[counter:counter+12]
        print (data)
    counter += 1
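
The same scan can probably be written more compactly with a regular expression. This is only a sketch: `find_digit_runs` is a made-up helper name, and the default length of 10 digits is an assumption based on the sample numbers in the question.

```python
import re


def find_digit_runs(text, length=10):
    """Return every run of exactly `length` consecutive digits in `text`."""
    # The lookarounds reject runs that are part of a longer digit sequence.
    pattern = r"(?<!\d)\d{%d}(?!\d)" % length
    return re.findall(pattern, text)
```

With a real response you would call it as `find_digit_runs(r.text)`. It returns each candidate once instead of printing a 12-character slice at every digit.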

Answer 1: (score: 0)

You should try using curl, for example:

#!/bin/bash

CURL='/usr/bin/curl --connect-timeout 5 --max-time 50'
URL='https://sns.ift.org.mx:8081/sns-frontend/consulta-numeracion/numeracion-geografica.xhtml'
CURLARGS='-sD - -j'
NUM='6564193195'
c_FRONTAPPID="$($CURL $CURLARGS $URL)"
arr=($c_FRONTAPPID)

i=0
for var in "${arr[@]}"
do
  if [[ $var == *"FRONTAPPID="* ]]; then
        FRONTAPPID=$(echo "$var" | sed 's/.*FRONTAPPID=\(.*\);.*/\1/' | sed 's/!/"'"'"'!'"'"'"/g')
        #echo $var
        #echo $FRONTAPPID       
  fi
  if [[ $var == *"id=\"javax.faces.ViewState\""* ]]; then
        VIEWSTATE=$(echo ${arr[i+1]} | sed 's/.*"\(.*\)".*/\1/')
        #echo ${arr[i+1]}
        #echo $VIEWSTATE
  fi
  ((i++))
done

(
  $CURL 'https://sns.ift.org.mx:8081/sns-frontend/consulta-numeracion/numeracion-geografica.xhtml' -X POST \
    -H 'Host: sns.ift.org.mx:8081' \
    -H 'Accept: application/xml, text/xml, */*; q=0.01' \
    -H 'Accept-Language: en-US,en;q=0.5' \
    -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0' \
    --compressed \
    -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' \
    -H 'Faces-Request: partial/ajax' \
    -H 'X-Requested-With: XMLHttpRequest' \
    -H 'Referer: https://sns.ift.org.mx:8081/sns-frontend/consulta-numeracion/numeracion-geografica.xhtml' \
    -H "Cookie: FRONTAPPID=$FRONTAPPID" \
    -H 'Connection: keep-alive' \
    --data "javax.faces.partial.ajax=true" \
    --data "javax.faces.source=FORM_myform:BTN_publicSearch" \
    --data "javax.faces.partial.execute=@all" \
    --data "javax.faces.partial.render=FORM_myform:P_containerConsulta+FORM_myform:P_containerpoblaciones+FORM_myform:P_containernumeracion+FORM_myform:P_containerinfo+FORM_myform:P_containerLocal+FORM_myform:P_containerDesplegable" \
    --data "FORM_myform:BTN_publicSearch=FORM_myform:BTN_publicSearch" \
    --data "FORM_myform=FORM_myform" \
    --data "FORM_myform:TXT_NationalNumber=$NUM" \
    --data "javax.faces.ViewState=$VIEWSTATE"
)
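
For anyone who wants to stay in Python, here is a rough sketch of the same two-step flow as the curl script: load the page to get the cookie and ViewState, then POST the ajax form. It has not been tested against the live site. The helper names (`extract_view_state`, `build_payload`, `lookup`, `parse_updates`) are made up for illustration, `session` is assumed to be a `requests.Session`, and the hidden input is assumed to carry `javax.faces.ViewState` as its `name` or `id`. Compared with the code in the question, it loads the page with GET, uses the key `javax.faces.ViewState` without a trailing `=`, separates the render ids with spaces (a `+` in an encoded form body stands for a space), and does not force a `text/html` Content-Type. Any of these may or may not be what the server was objecting to.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

URL = ("https://sns.ift.org.mx:8081/sns-frontend/"
       "consulta-numeracion/numeracion-geografica.xhtml")

# Component ids taken from the curl command; joined with spaces because
# "+" in an already-encoded form body stands for a space.
RENDER_IDS = " ".join("FORM_myform:P_container" + suffix for suffix in (
    "Consulta", "poblaciones", "numeracion", "info", "Local", "Desplegable"))


class ViewStateParser(HTMLParser):
    """Remember the value of the hidden javax.faces.ViewState input."""

    def __init__(self):
        super().__init__()
        self.view_state = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        names = (attrs.get("name"), attrs.get("id"))
        if tag == "input" and "javax.faces.ViewState" in names:
            self.view_state = attrs.get("value")


def extract_view_state(html):
    """Return the ViewState found in `html`, or None if there is none."""
    parser = ViewStateParser()
    parser.feed(html)
    return parser.view_state


def build_payload(number, view_state):
    """Form fields for the search button, mirroring the curl --data string."""
    return {
        "javax.faces.partial.ajax": "true",
        "javax.faces.source": "FORM_myform:BTN_publicSearch",
        "javax.faces.partial.execute": "@all",
        "javax.faces.partial.render": RENDER_IDS,
        "FORM_myform:BTN_publicSearch": "FORM_myform:BTN_publicSearch",
        "FORM_myform": "FORM_myform",
        "FORM_myform:TXT_NationalNumber": number,
        "javax.faces.ViewState": view_state,  # no trailing "=" in the key
    }


def lookup(session, number):
    """GET the page for cookie + ViewState, then POST the ajax search."""
    page = session.get(URL, timeout=50)
    page.raise_for_status()
    view_state = extract_view_state(page.text)
    if view_state is None:
        raise RuntimeError("javax.faces.ViewState not found in the page")
    headers = {
        "Faces-Request": "partial/ajax",
        "X-Requested-With": "XMLHttpRequest",
        "Referer": URL,
    }
    # No Content-Type here: passing data= lets the session send the body as
    # application/x-www-form-urlencoded, as the curl command does.
    reply = session.post(URL, data=build_payload(number, view_state),
                         headers=headers, timeout=50)
    reply.raise_for_status()
    return reply.text


def parse_updates(xml_text):
    """Map each <update id="..."> of a JSF partial response to its content."""
    root = ET.fromstring(xml_text)
    return {u.get("id"): u.text or "" for u in root.iter("update")}
```

With requests installed, a call would look like `parse_updates(lookup(requests.Session(), "6564193195"))`. If the server answers with a full HTML page instead of a `<partial-response>` document, `parse_updates` will most likely raise a parse error, which is itself a hint that the ajax request was not accepted.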