How to scrape the Wikipedia tables of multiple companies

Asked: 2019-05-06 02:27:58

Tags: python web-scraping

I am trying to scrape the Wikipedia tables of several companies such as Samsung and Alibaba, but I am unable to do so. Below is my code:

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

csvFile = open('Information.csv', 'wt+')
writer = csv.writer(csvFile)
lst=['Samsung','Facebook','Google','Tata_Consultancy_Services','Wipro','IBM','Alibaba_Group','Baidu','Yahoo!','Oracle_Corporation']
for a in lst:
    html = urlopen("https://en.wikipedia.org/wiki/a")
    bs = BeautifulSoup(html, 'html.parser')
    table = bs.findAll('table')
    for tr in table:
        rows = tr.findAll('tr')
        for row in rows:
            csvRow = [] 
            for cell in row.findAll(['td', 'th']):
                csvRow.append(cell.get_text())

            print(csvRow)
            writer.writerow(csvRow)

2 Answers:

Answer 0 (score: 1)

You are passing a as a literal string instead of referencing an item of the list. Here is the corrected code:

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

csvFile = open('Information.csv', 'wt+')
writer = csv.writer(csvFile)
lst=['Samsung','Facebook','Google','Tata_Consultancy_Services','Wipro','IBM','Alibaba_Group','Baidu','Yahoo!','Oracle_Corporation']
for a in lst:
    html = urlopen("https://en.wikipedia.org/wiki/{}".format(a))
    bs = BeautifulSoup(html, 'html.parser')
    table = bs.findAll('table')
    for tr in table:
        rows = tr.findAll('tr')
        for row in rows:
            csvRow = [] 
            for cell in row.findAll(['td', 'th']):
                csvRow.append(cell.get_text())

            print(csvRow)
            writer.writerow(csvRow)
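
As a follow-up note that is not part of the original answer: the corrected code above never closes Information.csv, and on Windows the csv module can produce blank rows unless the file is opened with newline=''. A minimal sketch of the same loop using a with block (assuming bs4 is installed and network access is available):

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

lst = ['Samsung', 'Facebook', 'Google', 'Tata_Consultancy_Services', 'Wipro',
       'IBM', 'Alibaba_Group', 'Baidu', 'Yahoo!', 'Oracle_Corporation']

# newline='' avoids blank rows on Windows; the with block flushes and closes the file.
with open('Information.csv', 'w', newline='', encoding='utf-8') as csvFile:
    writer = csv.writer(csvFile)
    for a in lst:
        html = urlopen("https://en.wikipedia.org/wiki/{}".format(a))
        bs = BeautifulSoup(html, 'html.parser')
        for table in bs.find_all('table'):
            for row in table.find_all('tr'):
                # One CSV row per table row: the text of every header/data cell.
                csvRow = [cell.get_text(strip=True) for cell in row.find_all(['td', 'th'])]
                writer.writerow(csvRow)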

Answer 1 (score: 0)

html = urlopen("https://en.wikipedia.org/wiki/a") is the problem.

You are iterating over lst intending to build each company's URL, but by passing a string literal to urlopen you never actually use the loop variable a.

The way to fix this is to replace html = urlopen("https://en.wikipedia.org/wiki/a") with any one of the following options (a quick equivalence check follows the list):

  • html = urlopen("https://en.wikipedia.org/wiki/" + a)
  • html = urlopen(f"https://en.wikipedia.org/wiki/{a}") #requires python 3.6+
  • html = urlopen("https://en.wikipedia.org/wiki/{}".format(a))
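
All three options build the same URL string; the following quick check uses one company name from the question's list purely as an example:

a = "Alibaba_Group"
url1 = "https://en.wikipedia.org/wiki/" + a           # plain concatenation
url2 = f"https://en.wikipedia.org/wiki/{a}"           # f-string, Python 3.6+
url3 = "https://en.wikipedia.org/wiki/{}".format(a)   # str.format
assert url1 == url2 == url3 == "https://en.wikipedia.org/wiki/Alibaba_Group"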