Question

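I am trying to scrape NGO data such as name, mobile number, and city from https://ngodarpan.gov.in/index.php/search/. The site lists the NGO names in a table, and clicking a name opens a pop-up with the details. In the code below I extract the onclick attribute for each NGO, then make a GET request followed by a POST request to fetch the data. I have also tried accessing it with Selenium, but the JSON data does not come through.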

list_of_cells = []
for cell in row.find_all('td'):
    text = cell.text.replace("&nbsp;", "")
    list_of_cells.append(text)
list_of_rows.append(list_of_cells)
writer = csv.writer(f)
writer.writerow(list_of_cells)

Running the part above gives us the full table contents for every page. The site has 7721 pages; we only need to change the number_of_pages variable.

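But our real goal is the NGO's phone number and email ID, which only appear after clicking the NGO name link. That link is not an href; it triggers an API GET request followed by a POST request that fetches the data. I found these requests in the Network tab of the browser's inspector.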

driver.get("https://ngodarpan.gov.in/index.php/search/") # load the web page
sleep(2)
....
....
driver.find_element(By.NAME,"commit").submit()
for page in range(number_of_pages - 1):
    list_of_rows = []
    src = driver.page_source # gets the html source of the page
    parser = BeautifulSoup(src,'html.parser') 
    sleep(1)
    table = parser.find("table",{ "class" : "table table-bordered table-striped" })
    sleep(1)
    for row in table.find_all('tr')[:]:
        list_of_cells = []
        for cell in row.find_all('td'):
                x = requests.get("https://ngodarpan.gov.in/index.php/ajaxcontroller/get_csrf")
                dat=x.json()
                z=dat["csrf_token"]
                print(z) # prints csrf token
                r= requests.post("https://ngodarpan.gov.in/index.php/ajaxcontroller/show_ngo_info", data = {'id':'','csrf_test_name':'z'})
                json_data=r.text  # i guess here is something not working it is printing html text but we need text data of post request like mob,email,and here it will print all the data .
                with open('data1.json', 'a') as outfile:
                    json.dump(json_data, outfile)
    driver.find_element_by_xpath("//a[contains(text(),'»')]").click()

The code is not working; instead of the JSON data it prints this HTML content:

<html>
...
...
<body>
        <div id="container">
                <h1>An Error Was Encountered</h1>
                <p>The action you have requested is not allowed.</p>    </div>
</body>
</html>

Answer 1

Switch to an iframe through Selenium and python

You can use an XPath to locate the iframe:

from selenium.webdriver.common.by import By  # find_element_by_xpath was removed in Selenium 4.3
iframe = driver.find_element(By.XPATH, "//iframe[@name='Dialogue Window']")

Then switch to it:

driver.switch_to.frame(iframe)

Here is how to switch back to the default content (out of the iframe):

driver.switch_to.default_content()

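In your case, I believe the 'Dialogue Window' name would be CalendarControlIFrame.

Once you have switched to that frame, you will be able to use Beautiful Soup to get the frame's HTML.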
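The switch-in and switch-back calls can be paired in a small context manager, so the driver returns to the default content even if the parsing step raises. This is only a sketch: `inside_frame` is a hypothetical helper name, and it relies on nothing beyond the `driver.switch_to.frame(...)` and `driver.switch_to.default_content()` calls already shown.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame):
    """Switch the driver into `frame`, and always switch back afterwards.

    `inside_frame` is a hypothetical helper, not part of Selenium.
    `frame` is whatever driver.switch_to.frame() accepts, for example
    the iframe element located earlier.
    """
    driver.switch_to.frame(frame)
    try:
        yield driver
    finally:
        # Runs on normal exit and on exceptions alike.
        driver.switch_to.default_content()
```

Usage would look like `with inside_frame(driver, iframe): html = driver.page_source`, after which `html` can be handed to Beautiful Soup.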

Answer 2

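This can be done much faster by avoiding Selenium. Their site appears to request a fresh token before every request; you might find it is possible to skip this.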

The following shows how to get the JSON containing the mobile number and email address:

from bs4 import BeautifulSoup
import requests
import time

def get_token(sess):
    req_csrf = sess.get('https://ngodarpan.gov.in/index.php/ajaxcontroller/get_csrf')
    return req_csrf.json()['csrf_token']


search_url = "https://ngodarpan.gov.in/index.php/ajaxcontroller/search_index_new/{}"
details_url = "https://ngodarpan.gov.in/index.php/ajaxcontroller/show_ngo_info"

sess = requests.Session()

for page in range(0, 10000, 10):    # Advance 10 at a time
    print(f"Getting results from {page}")

    for retry in range(1, 10):

        data = {
            'state_search' : 7, 
            'district_search' : '',
            'sector_search' : 'null',
            'ngo_type_search' : 'null',
            'ngo_name_search' : '',
            'unique_id_search' : '',
            'view_type' : 'detail_view',
            'csrf_test_name' : get_token(sess), 
        }

        req_search = sess.post(search_url.format(page), data=data, headers={'X-Requested-With' : 'XMLHttpRequest'})
        soup = BeautifulSoup(req_search.content, "html.parser")
        table = soup.find('table', id='example')

        if table:
            for tr in table.find_all('tr'):
                row = [td.text for td in tr.find_all('td')]
                link = tr.find('a', onclick=True)

                if link:
                    link_number = link['onclick'].strip("show_ngif(')")
                    req_details = sess.post(details_url, headers={'X-Requested-With' : 'XMLHttpRequest'}, data={'id' : link_number, 'csrf_test_name' : get_token(sess)})
                    json = req_details.json()
                    details = json['infor']['0']

                    print([details['Mobile'], details['Email'], row[1], row[2]])
            break
        else:
            print(f'No data returned - retry {retry}')
            time.sleep(3)

This would give you the following output for the first page:

['9871249262', 'pnes.delhi@yahoo.com', 'Pragya Network Educational Society', 'S-52559, Narela, DELHI']
['9810042046', 'mathew.cherian@helpageindia.org', 'HelpAge India', '9270, New Delhi, DELHI']
['9811897589', 'aipssngo@yahoo.com', 'All India Parivartan Sewa Samiti', 's-43282, New Delhi, DELHI']
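One detail in the script worth knowing: `str.strip` treats its argument as a set of characters to remove from both ends, not as a prefix. That works in `link['onclick'].strip("show_ngif(')")` as long as the id consists only of characters outside that set, such as digits. A more explicit alternative is sketched below; the attribute shape `some_function('12345')` is an assumption inferred from the strip characters, and `extract_onclick_id` is a hypothetical helper name.

```python
import re

# Assumed attribute shape: some_function('12345'). The function name is not
# relied on, only the first single-quoted argument after an opening parenthesis.
_ONCLICK_ARG = re.compile(r"\(\s*'([^']*)'")


def extract_onclick_id(onclick):
    """Return the first single-quoted argument of an onclick call, or None."""
    match = _ONCLICK_ARG.search(onclick)
    return match.group(1) if match else None


# Made-up attribute values, for illustration only.
print(extract_onclick_id("show_info('12345')"))        # 12345
print(extract_onclick_id("show_info('info9')"))        # info9
# The strip approach eats id characters that are also in the strip set:
print("show_info('info9')".strip("show_inf(')"))       # 9
```

In the script above this would replace the `strip` call with `link_number = extract_onclick_id(link['onclick'])`.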

Answer 3

for retry in range(1, 10):
    for i in range(0,50,10):
        search_url = "https://ngodarpan.gov.in/index.php/ajaxcontroller/search_index_new/"+str(i)
        req_search = sess.post(search_url, data=data)
        soup = BeautifulSoup(req_search.content, "html.parser")
        table = soup.find('table', id='example')
        #table = None
        if table:
            for tr in table.find_all('tr'):
                row = [td.text for td in tr.find_all('td')]
                link = tr.find('a', onclick=True)

                if link:
                    link_number = link['onclick'].strip("show_ngif(')")
                    req_details = sess.post(details_url, headers={'X-Requested-With' : 'XMLHttpRequest'}, data={'id' : link_number, 'csrf_test_name' : get_token(sess)})
                    json = req_details.json()
                    details = json['infor']['0']

                    print([details['Mobile'], details['Email'], row[1], row[2]])
            break
        else:
            print(f'No data returned - retry {retry}')
            time.sleep(5)

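I want to iterate over all the pages and extract the data in one run, but after extracting data from one page it does not move on to the other pages.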

....
....

    ['9829059202', 'cecoedecon@gmail.com', 'CECOEDECON', '206, Jaipur, RAJASTHAN']
    ['9443382475', 'odamindia@gmail.com', 'ODAM', '43/1995, TIRUCHULI, TAMIL NADU']
    ['9816510096', 'shrisaisnr@gmail.com', 'OPEN EDUCATIONAL DEVELOPMENT RESEARCH AND WELFARE', '126/2004, SUNDERNAGAR, HIMACHAL PRADESH']
    ['9425013029', 'card_vivek@yahoo.com', 'Centre for Advanced Research and Development', '25634, Bhopal, MADHYA PRADESH']
    ['9204645161', 'secretary_smvm@yahoo.co.in', 'Srijan Mahila Vikas Manch', '833, Chakradharpur, JHARKHAND']
    ['9419107550', 'amarjit.randwal@gmail.com', 'J and K Sai Star Society', '4680-S, Jammu, JAMMU & KASHMIR']
    No data returned - retry 2
    No data returned - retry 2
    No data returned - retry 2
    No data returned - retry 2
    No data returned - retry 2
...
...
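A likely cause, judging from the snippet: the retry loop is the outer loop and the page loop is the inner one, so the `break` after a successful page leaves the page loop, and the remaining offsets are only visited on later retries. Answer 2 nests them the other way round. Below is a minimal sketch of that ordering with the network call factored out; `scrape_pages` and `fetch_page` are hypothetical names, and `fetch_page(offset)` is assumed to return a list of rows, or `None` when no table came back.

```python
import time


def scrape_pages(fetch_page, offsets, max_retries=3, delay=0.0):
    """Call fetch_page(offset) for every offset, retrying each page on its own.

    fetch_page is a hypothetical callable standing in for the POST request
    plus table parsing; it returns a list of rows, or None on failure.
    """
    results = {}
    for offset in offsets:                          # outer loop: one pass per page
        for attempt in range(1, max_retries + 1):   # inner loop: retries for this page only
            rows = fetch_page(offset)
            if rows is not None:
                results[offset] = rows
                break                               # leaves the retry loop, not the page loop
            print(f"No data returned for offset {offset} - retry {attempt}")
            time.sleep(delay)
        else:
            # Every retry failed: record an empty page and carry on.
            results[offset] = []
    return results


# Stand-in for the real request: offset 10 fails once, then succeeds.
_calls = {}

def fake_fetch(offset):
    _calls[offset] = _calls.get(offset, 0) + 1
    if offset == 10 and _calls[offset] == 1:
        return None
    return [f"row-{offset}"]

print(scrape_pages(fake_fetch, range(0, 30, 10)))
```

With the real site, `fetch_page` would wrap the `sess.post(...)` call and the `soup.find('table', id='example')` check from the snippet, using a non-zero `delay`.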

How to scrape a pop-up window using Python and Selenium
