提前感谢任何帮助。
交易是我一直在尝试从这个网站的废料数据(https://www.mptax.mp.gov.in/mpvatweb/leftMenu.do),but直接访问该网站是不可能的。然后我需要的数据,我得到无效访问。要访问该网站,我必须去( https://www.mptax.mp.gov.in/mpvatweb/index.jsp)然后在鼠标悬停在经销商信息上时,从下拉菜单中点击“经销商搜索”。 我正在寻找Python的解决方案, 这是我试过的东西。我刚开始网络报废:import requests
from bs4 import BeautifulSoup
with requests.session() as request:
MAIN="https://www.mptax.mp.gov.in/mpvatweb/leftMenu.do"
INITIAL="https://www.mptax.mp.gov.in/mpvatweb/"
page=request.get(INITIAL)
jsession=page.cookies["JSESSIONID"]
print(jsession)
print(page.headers)
result=request.post(INITIAL,headers={"Cookie":"JSESSIONID="+jsession+"; zoomType=0","Referer":INITIAL})
page1=request.get(MAIN,headers={"Referer":INITIAL})
soup=BeautifulSoup(page1.content,'html.parser')
data=soup.find_all("tr",class_="whitepapartd1")
print(data)
交易是我想根据公司名称废弃有关公司的数据。
答案 0 :(得分:0)
你介意使用浏览器吗?
您可以使用浏览器访问xpath(// * [@ id =“dropmenudiv”] / a [1])中的链接。
如果您之前没有使用过chromedriver,则可能需要下载并将chromedriver放入上述目录中。如果你想进行无头浏览(每次不打开浏览器),你也可以使用selenium + phantomjs。
from selenium import webdriver
xpath = "//*[@id="dropmenudiv"]/a[1]"
browser = webdriver.Chrome('/usr/local/bin/chromedriver')
browser.set_window_size(1120,550)
browser.get('https://www.mptax.mp.gov.in/mpvatweb')
link = browser.find_element_by_xpath("//*[@id="dropmenudiv"]/a[1]")
link.click()
url = browser.current_url
答案 1 :(得分:0)
感谢告诉我一个方法@Arnav和@Arman,所以这里是最后的代码:
from selenium import webdriver #to work with website
from bs4 import BeautifulSoup #to scrap data
from selenium.webdriver.common.action_chains import ActionChains #to initiate hovering
from selenium.webdriver.common.keys import Keys #to input value
PROXY = "10.3.100.207:8080" # IP:PORT or HOST:PORT
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s' % PROXY)
#ask for input
company_name=input("tell the company name")
#import website
browser = webdriver.Chrome(chrome_options=chrome_options)
browser.get("https://www.mptax.mp.gov.in/mpvatweb/")
#perform hovering to show hovering
element_to_hover_over = browser.find_element_by_css_selector("#mainsection > form:nth-child(2) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(3) > td:nth-child(3) > a:nth-child(1)")
hover = ActionChains(browser).move_to_element(element_to_hover_over)
hover.perform()
#click on dealer search from dropdown menu
browser.find_element_by_css_selector("#dropmenudiv > a:nth-child(1)").click()
#we are now on the leftmenu page
#click on radio button
browser.find_element_by_css_selector("#byName").click()
#input company name
inputElement = browser.find_element_by_css_selector("#showNameField > td:nth-child(2) > input:nth-child(1)")
inputElement.send_keys(company_name)
#submit form
inputElement.submit()
#now we are on dealerssearch page
#scrap data
soup=BeautifulSoup(browser.page_source,"lxml")
#get the list of values we need
list=soup.find_all('td',class_="tdBlackBorder")
#check length of 'list' and on that basis decide what to print
if(len(list)!=0):
#company name at index=9
#tin no. at index=10
#registration status at index=11
#circle name at index=15
#store the values
name=list[9].get_text()
tin=list[10].get_text()
status=list[11].get_text()
circle=list[15].get_text()
#make dictionary
Company_Details={"TIN":tin ,"Firm name":name ,"Circle_Name":circle, "Registration_Status":status}
print(Company_Details)
else:
Company_Details={"VAT RC No":"Not found in database"}
print(Company_Details)
#close the chrome
browser.stop_client()
browser.close()
browser.quit()