I'm trying to scrape a data table from a web page. All the tutorials I've found online are too specific and don't explain what each parameter/element is, so I can't figure out how to adapt them to my case. Any advice on where to find good tutorials for scraping this kind of data would be much appreciated.
import urllib
import requests
from lxml import html

query = urllib.urlencode({'q': company})  # built but not yet used in the request below
page = requests.get('http://www.hoovers.com/company-information/company-search.html?term=company')
tree = html.fromstring(page.text)
table = tree.xpath('//*[@id="shell"]/div/div/div[2]/div[5]/div[1]/div/div[1]')
# Can't get the xpath correct

# This will create a list of companies:
companies = tree.xpath('//...')
# This will create a list of locations:
locations = tree.xpath('//....')
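For comparison, here is a minimal sketch of how that first approach could be wired together, assuming the search page really accepts a term query parameter and renders its results as a plain HTML table (if the table is built by JavaScript, neither requests nor lxml will see it); the company value is a hypothetical example:

import requests
from lxml import html

company = 'intel'  # hypothetical search term, just for illustration

# let requests build the query string instead of hard-coding ?term=company
page = requests.get(
    'http://www.hoovers.com/company-information/company-search.html',
    params={'term': company},
)
tree = html.fromstring(page.content)

# a relative XPath keyed on the table itself is usually sturdier than a long
# absolute //*[@id="shell"]/div/... chain copied from the browser dev tools
for row in tree.xpath('//table//tr'):
    cells = [td.text_content().strip() for td in row.xpath('./td')]
    if cells:  # header rows usually contain <th>, not <td>
        print(cells)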
I have also tried:
import urllib2
from bs4 import BeautifulSoup

hoover = 'http://www.hoovers.com/company-information/company-search.html?term=company'
req = urllib2.Request(hoover)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
table = soup.find("table", {"class": "clear data-table sortable-header dashed-table-tr alternate-rows"})
f = open('output.csv', 'w')
for row in table.findAll('tr'):
    f.write(','.join(''.join([str(i).replace(',', '') for i in row.findAll('td', text=True) if i[0] != '&']).split('\n')[1:-1]) + '\n')
f.close()
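If the end goal is just a comma-separated file of the cell text, the standard-library csv module avoids the fragile hand-rolled string joins; a rough sketch, still assuming the table carries that exact class attribute and that BeautifulSoup 4 is available:

import csv
import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen('http://www.hoovers.com/company-information/company-search.html?term=company')
soup = BeautifulSoup(page)
table = soup.find("table", {"class": "clear data-table sortable-header dashed-table-tr alternate-rows"})

with open('output.csv', 'wb') as f:  # 'wb' is what the Python 2 csv module expects
    writer = csv.writer(f)
    for row in table.findAll('tr'):
        cells = [td.get_text(strip=True).encode('utf-8') for td in row.findAll('td')]
        if cells:  # skip header rows that only contain <th>
            writer.writerow(cells)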
But I'm getting an invalid syntax error on the second line.

Answer 0 (score: 3):
Yes, Beautiful Soup. Here is a quick example to get the names:
import urllib2
from bs4 import BeautifulSoup

hoover = 'http://www.hoovers.com/company-information/company-search.html?term=company'
req = urllib2.Request(hoover)
page = urllib2.urlopen(req)
# urlopen() returns a file-like object with no .text attribute, so hand it to BeautifulSoup directly
soup = BeautifulSoup(page)
trs = soup.find("div", attrs={"class": "clear data-table sortable-header dashed-table-tr alternate-rows"}).find("table").findAll("tr")
for tr in trs:
    tds = tr.findAll("td")
    if len(tds) < 1:
        continue
    name = tds[0].text
    print name
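Since the question also wanted locations and a CSV, the same loop can be extended; this sketch continues from the trs variable above and assumes the location sits in the second <td> of each row (the real column index would need checking against the live page):

import csv

with open('output.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(['company', 'location'])
    for tr in trs:
        tds = tr.findAll("td")
        if len(tds) < 2:
            continue
        name = tds[0].text.strip()
        location = tds[1].text.strip()  # assumption: location is the second column
        writer.writerow([name.encode('utf-8'), location.encode('utf-8')])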