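I am trying to scrape data from http://corp.sec.state.ma.us/CorpWeb/CorpSearch/CorpSearchResults.aspx. I want to open every company link on the results page so that I can scrape each company's data. From each company's page I want to collect: the exact name of the domestic profit corporation, the entity type, the identification number, the date of organization in Massachusetts, the date of involuntary dissolution, the location of the principal office, the name and address of the registered agent, and the officers and directors of the corporation. After collecting the data for all companies on the first page, I want to click the next-page link at the bottom to go to page 2 and repeat the same process.

So far I can open the site with Selenium and search for all companies starting with L, and I can also open the next pages, but I do not know how to scrape the data from each company's table. I want to scrape the data for each company and save it to a CSV file. This is my code so far: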
import requests, os
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import re
import pandas as pd
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import string
browser = webdriver.Firefox()
# Site URL (starting page with the search form)
url = 'http://corp.sec.state.ma.us/corpweb/CorpSearch/CorpSearch.aspx'
browser.get(url)
userElem = browser.find_element(By.ID, 'MainContent_txtEntityName')
# Loop over the letters a-z plus the digits 1-9 (string.digits[1:] skips 0).
# For now this only prints each character; the search below uses 'l' only.
for i in string.ascii_lowercase + string.digits[1:]:
    print(i)
userElem.send_keys('l')
linkEleme = browser.find_element(By.ID, 'MainContent_btnSearch')
linkEleme.click()
time.sleep(20)  # crude wait for the results page to load
p = browser.current_url
print('This is the new url: ', p)
# TODO: Find the href link to each corporation on the page
url2 = p
browser.get(p)
business_elements = browser.find_elements(By.CLASS_NAME, 'link')
j = browser.find_element(By.ID, 'MainContent_SearchControl_grdSearchResultsEntity')
links = [link.get_attribute('href') for link in j.find_elements(By.TAG_NAME, 'a')]
# Find all the links on the first page
linkList = browser.find_elements(By.TAG_NAME, 'a')
correct_links = []
for link in linkList:
    f = link.get_attribute('href')
    # href is None for anchors without an href attribute, so check it first
    if f and f.startswith('http://corp.sec.state.ma.us'):
        correct_links.append(f)
print(len(correct_links))
print('Collecting links on the next pages.....')
for i in range(2, 27):
    try:
        # Pager link for page i, e.g. for page 2:
        # '//*[@id="MainContent_SearchControl_grdSearchResultsEntity"]/tbody/tr[27]/td/table/tbody/tr/td[2]/a'
        x_path_element = ('//*[@id="MainContent_SearchControl_grdSearchResultsEntity"]'
                          '/tbody/tr[27]/td/table/tbody/tr/td[' + str(i) + ']/a')
        print('Going to the page {}.....'.format(i))
        browser.find_element(By.XPATH, x_path_element).click()
        time.sleep(20)
        r = browser.current_url
        print('This is the new url: ', r)
        print('Finding links on the page {}.....'.format(r))
        linkList2 = browser.find_elements(By.TAG_NAME, 'a')
        for link in linkList2:
            f = link.get_attribute('href')
            if f and f.startswith('http://corp.sec.state.ma.us'):
                correct_links.append(f)
        for found in correct_links:
            # Print the links collected so far
            print(found)
    except Exception:
        # The loop variables above no longer shadow i, so this reports the right page
        print('The link for page {} is dead'.format(i))
print(len(correct_links))
print('Finding the data about the companies')
print('Preparing a csv file to store data')
with open('Mass_company_data.csv', 'w') as f:
    f.write("ID Number, Company Name, Entity Type, Date of Organization in Mass, Address, City, Owner, Owner Address \n")
for i in correct_links:
    # Open each link that we found in correct_links
    print(i)
    browser.get(i)
    # Scrape the new page's data
    j = browser.current_url
    company_data = browser.find_elements(By.CLASS_NAME, 'p1')
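One thing worth noting about the CSV step: the header is written by hand with commas, and company names and addresses usually contain commas themselves, which would shift the columns. A minimal sketch of how the rows could be written with the standard `csv` module instead is below. The column names are copied from the header line in the script; the function name `save_companies`, the example file name, and the sample record are made up for illustration only.

```python
import csv

# Column names taken from the header line in the script above.
FIELDS = [
    "ID Number", "Company Name", "Entity Type",
    "Date of Organization in Mass", "Address", "City",
    "Owner", "Owner Address",
]


def save_companies(path, companies):
    """Write one CSV row per company dict; missing fields become empty cells."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=FIELDS, restval="", extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(companies)


# Hypothetical record, only to show that embedded commas are quoted.
sample = [{
    "ID Number": "000000000",
    "Company Name": "EXAMPLE, INC.",
    "Address": "1 MAIN ST, SUITE 2",
}]
save_companies("Mass_company_data_example.csv", sample)
```

With this approach each scraped company would be collected as a dict inside the loop over `correct_links`, and the whole list would be written once at the end.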
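Update: I have been able to page through the results and find all the companies on every page, but I still do not know how to scrape the data from each company's page.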
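To make the missing step concrete, here is a rough sketch of pulling labelled fields out of one company page's HTML by element id. It uses only the standard library's `html.parser` so it runs without extra dependencies. The element ids in `FIELD_IDS` are placeholders I made up, not the site's real ids; they would have to be replaced after inspecting a company page in the browser. `FieldExtractor`, `extract_fields`, and the sample HTML are likewise hypothetical.

```python
from html.parser import HTMLParser

# ASSUMPTION: placeholder ids, not the real ids used by the site.
FIELD_IDS = {
    "Company Name": "example_lblEntityName",
    "ID Number": "example_lblIDNumber",
}


class FieldExtractor(HTMLParser):
    """Collect the text inside elements whose id is in wanted_ids."""

    def __init__(self, wanted_ids):
        super().__init__()
        self.wanted_ids = set(wanted_ids)
        self.values = {}
        self._current = None  # id of the element currently being captured
        self._tag = None      # tag name of that element
        self._depth = 0       # nesting depth of same-named tags inside it

    def handle_starttag(self, tag, attrs):
        if self._current is None:
            elem_id = dict(attrs).get("id")
            if elem_id in self.wanted_ids:
                self._current, self._tag, self._depth = elem_id, tag, 1
                self.values[elem_id] = ""
        elif tag == self._tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.values[self._current] += data


def extract_fields(html, field_ids=FIELD_IDS):
    """Return {column name: text}, with whitespace collapsed."""
    parser = FieldExtractor(field_ids.values())
    parser.feed(html)
    parser.close()
    return {
        name: " ".join(parser.values.get(elem_id, "").split())
        for name, elem_id in field_ids.items()
    }


# Hypothetical page fragment standing in for a real company page.
sample_html = """
<table>
  <tr><td><span id="example_lblEntityName">EXAMPLE  WIDGETS, INC.</span></td></tr>
  <tr><td><span id="example_lblIDNumber"> 000000000 </span></td></tr>
</table>
"""
record = extract_fields(sample_html)
print(record)
```

In the script this would presumably be called as `extract_fields(browser.page_source)` right after `browser.get(i)`, since Selenium exposes the loaded page's HTML through `page_source`. Fields that are not found come back as empty strings rather than raising an error.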