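I'm working on my first web scraper, and I'm trying to get some phone-number data for Mexico. The site that provides the data is: site, and it works through XHR requests. So far I have this code: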
from requests import Request, Session
import xml.etree.ElementTree as ET
import requests
import lxml.etree as etree
url = 'https://sns.ift.org.mx:8081/sns-frontend/consulta-numeracion/numeracion-geografica.xhtml'
s = Session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
    'Content-Type': 'text/html; charset=UTF-8',
}
str1 = s.post(url, headers=headers) #Loading the page
xhtml=str1.text.encode('utf-8')
#Saving the first response, to get the ViewState
text_file = open("loaded.txt", "w")
text_file.write(xhtml)
text_file.close()
x = ET.fromstring(xhtml)
namespace = "{http://www.w3.org/1999/xhtml}"
path = './/*[@id="javax.faces.ViewState"]'
e = x.findall(path.format(namespace))
for i in e:
    VS = i.attrib['value'] #ViewState
    print(VS) #ViewState
At this point I have the page's ViewState, and now I send a new POST containing the form data, the number I want to look up, and the ViewState.
data = {
    "javax.faces.partial.ajax": "true",
    "javax.faces.source": "FORM_myform:BTN_publicSearch",
    "javax.faces.partial.execute": "@all",
    "javax.faces.partial.render": "FORM_myform:P_containerConsulta+FORM_myform:P_containerpoblaciones+FORM_myform:P_containernumeracion+FORM_myform:P_containerinfo+FORM_myform:P_containerLocal+FORM_myform:P_containerDesplegable",
    "FORM_myform:BTN_publicSearch": "FORM_myform:BTN_publicSearch",
    "FORM_myform": "FORM_myform",
    "FORM_myform:TXT_NationalNumber": "6564384757",
    "javax.faces.ViewState=": VS #ViewState
}
req = s.post(url, data=data, headers=headers)
#Saving the new response, this is supposed to bring the results
text_file = open("Output.txt", "w")
text_file.write(req.text.encode('utf-8'))
text_file.close()
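The problem is that the response I get is the full source of the page with no information in it. I noticed that it comes with a new ViewState, and I believe that is why the lookup is not performed. Also, I don't want to use Selenium, because the server has no graphical interface and I need to look up many numbers every day.
... ... UPDATE: I believe the problem lies with JSF; I need to know how to handle the data and the JSF values.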
Answer 0 (score: 0)
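To get data from a website with requests, you first need this...
r = requests.get(url)
After that I would print the text that the 'r' variable received...
print(r.text)
Then I would use a loop, treat the response text as an array (r.text[0]), and check the whole text for anything that might look like a phone number. This is just one way to do what you are attempting with a web scraper, and it does not use XML at all.
Put together, my code would look like this...
import requests
url = "myurl"  # placeholder: replace with the page to scrape
r = requests.get(url)
text = r.text  # the body as a string; the Response object itself cannot be indexed
counter = 0
length = len(text)
while counter != length:
    # When a digit is found, print it together with the 11 characters that follow
    if text[counter] in '0123456789':
        data = text[counter:counter+12]
        print(data)
    counter += 1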
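The character-by-character scan above prints an overlapping 12-character slice for every digit it meets. As a hedged alternative, here is a minimal Python 3 sketch that uses a regular expression to pull out standalone digit runs instead. The function name `find_phone_candidates` is hypothetical, and the default length of 10 digits is an assumption based on the example numbers in the question.

```python
import re


def find_phone_candidates(text, digits=10):
    """Return every run of exactly `digits` consecutive digits found in text.

    The lookarounds reject runs that are part of a longer digit sequence,
    so an 11-digit id does not yield a 10-digit "phone number".
    """
    pattern = r"(?<!\d)\d{%d}(?!\d)" % digits
    return re.findall(pattern, text)


if __name__ == "__main__":
    sample = "Call 6564384757 or ext. 123; ticket 12345678901"
    print(find_phone_candidates(sample))
```

This only finds candidate strings in text that has already been downloaded; it does not help if the server never returns the lookup result in the first place.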
Answer 1 (score: 0)
You should try using curl, for example:
#!/bin/bash
CURL='/usr/bin/curl --connect-timeout 5 --max-time 50'
URL='https://sns.ift.org.mx:8081/sns-frontend/consulta-numeracion/numeracion-geografica.xhtml'
CURLARGS='-sD - -j'
NUM='6564193195'
c_FRONTAPPID="$($CURL $CURLARGS $URL)"
arr=($c_FRONTAPPID)
i=0
for var in "${arr[@]}"
do
    if [[ $var == *"FRONTAPPID="* ]]; then
        FRONTAPPID=$(echo "$var" | sed 's/.*FRONTAPPID=\(.*\);.*/\1/' | sed 's/!/"'"'"'!'"'"'"/g')
        #echo $var
        #echo $FRONTAPPID
    fi
    if [[ $var == *"id=\"javax.faces.ViewState\""* ]]; then
        VIEWSTATE=$(echo ${arr[i+1]} | sed 's/.*"\(.*\)".*/\1/')
        #echo ${arr[i+1]}
        #echo $VIEWSTATE
    fi
    ((i++))
done
($CURL 'https://sns.ift.org.mx:8081/sns-frontend/consulta-numeracion/numeracion-geografica.xhtml' -X POST -H 'Host: sns.ift.org.mx:8081' -H 'Accept: application/xml, text/xml, */*; q=0.01' -H 'Accept-Language: en-US,en;q=0.5' -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0' --compressed -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Faces-Request: partial/ajax' -H 'X-Requested-With: XMLHttpRequest' -H 'Referer: https://sns.ift.org.mx:8081/sns-frontend/consulta-numeracion/numeracion-geografica.xhtml' -H "Cookie: FRONTAPPID=$FRONTAPPID" -H 'Connection: keep-alive' --data "javax.faces.partial.ajax=true&javax.faces.source=FORM_myform:BTN_publicSearch&javax.faces.partial.execute=@all&javax.faces.partial.render=FORM_myform:P_containerConsulta+FORM_myform:P_containerpoblaciones+FORM_myform:P_containernumeracion+FORM_myform:P_containerinfo+FORM_myform:P_containerLocal+FORM_myform:P_containerDesplegable&FORM_myform:BTN_publicSearch=FORM_myform:BTN_publicSearch&FORM_myform=FORM_myform&FORM_myform:TXT_NationalNumber=$NUM&javax.faces.ViewState=$VIEWSTATE" )
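For anyone who prefers to stay in Python, the same flow as the curl script (load the page, read the ViewState, post it back in the same session) might be sketched roughly as follows. This is an untested Python 3 sketch, not a confirmed fix. The helper names `extract_view_state`, `build_payload` and `lookup` are hypothetical. It also assumes that the `+` signs in the captured `javax.faces.partial.render` value are URL-encoded spaces, and that the hidden input carries `name="javax.faces.ViewState"`. Compared with the question's code, the field is named `javax.faces.ViewState` without a trailing `=`, and no `text/html` Content-Type is forced, so requests can send the form as `application/x-www-form-urlencoded`.

```python
from html.parser import HTMLParser

URL = ("https://sns.ift.org.mx:8081/sns-frontend/"
       "consulta-numeracion/numeracion-geografica.xhtml")

# Component ids copied from the request shown in the question.
RENDER_IDS = [
    "FORM_myform:P_containerConsulta",
    "FORM_myform:P_containerpoblaciones",
    "FORM_myform:P_containernumeracion",
    "FORM_myform:P_containerinfo",
    "FORM_myform:P_containerLocal",
    "FORM_myform:P_containerDesplegable",
]


class _ViewStateParser(HTMLParser):
    """Remembers the value of the first javax.faces.ViewState input."""

    def __init__(self):
        super().__init__()
        self.view_state = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (tag == "input" and self.view_state is None
                and attrs.get("name") == "javax.faces.ViewState"):
            self.view_state = attrs.get("value")


def extract_view_state(html):
    """Return the ViewState value found in html, or None if absent."""
    parser = _ViewStateParser()
    parser.feed(html)
    return parser.view_state


def build_payload(number, view_state):
    """Build the partial-ajax form fields for one number lookup."""
    return {
        "javax.faces.partial.ajax": "true",
        "javax.faces.source": "FORM_myform:BTN_publicSearch",
        "javax.faces.partial.execute": "@all",
        # Assumption: ids are space-separated; requests encodes spaces as '+'.
        "javax.faces.partial.render": " ".join(RENDER_IDS),
        "FORM_myform:BTN_publicSearch": "FORM_myform:BTN_publicSearch",
        "FORM_myform": "FORM_myform",
        "FORM_myform:TXT_NationalNumber": number,
        "javax.faces.ViewState": view_state,  # no trailing '=' in the key
    }


def lookup(number):
    """Fetch the page, then post the search inside the same session."""
    # Imported here so the helpers above work without requests installed.
    import requests

    with requests.Session() as session:  # keeps the session cookies
        page = session.get(URL, timeout=30)
        view_state = extract_view_state(page.text)
        if view_state is None:
            raise ValueError("ViewState not found in the initial page")
        response = session.post(
            URL,
            data=build_payload(number, view_state),
            headers={
                "Faces-Request": "partial/ajax",
                "X-Requested-With": "XMLHttpRequest",
            },
            timeout=30,
        )
        return response.text
```

Calling `lookup("6564193195")` would return the raw partial-response XML, which still has to be parsed for the fields of interest. Whether the server accepts the request in this form is not verified here.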