How do I scrape a specific table from a website using Python (beautifulsoup4 and requests, or any other library)?

Asked: 2018-09-10 14:20:37

Tags: python web-scraping

https://en.wikipedia.org/wiki/Economy_of_the_European_Union

Above is the link to the website. I want to scrape the table titled "Fortune top 10 E.U. corporations by revenue (2016)".

Please share code that does this. My attempt so far:

import requests
from bs4 import BeautifulSoup

def web_crawler(url):
    page = requests.get(url)
    plain_text = page.text
    soup = BeautifulSoup(plain_text, "html.parser")
    tables = soup.findAll("tbody")[1]
    print(tables)

soup = web_crawler("https://en.wikipedia.org/wiki/Economy_of_the_European_Union")

2 Answers:

Answer 0 (score: 0)

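Following up on what @FanMan said, here is some simple code to get you started. Keep in mind that you will need to clean it up and do the rest yourself.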

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Economy_of_the_European_Union'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# Collect the text nodes of every non-empty <p> element on the page.
temp_datastore = []
for paragraph in soup.find_all('p'):
    w = paragraph.find_all(string=True)
    if w:
        temp_datastore.append(w)
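
The loop above gathers paragraph text rather than the table the question asks about. One way to pick out a table by its title is to compare each table's caption against the wanted text. The sketch below assumes the title lives in a <caption> element, which may not match the real page markup. SAMPLE_HTML and find_table_by_caption are hypothetical names made up for illustration, and the sample rows are placeholders rather than real data.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<table class="wikitable">
  <caption>Some other table</caption>
  <tr><th>A</th></tr>
</table>
<table class="wikitable">
  <caption>Fortune top 10 E.U. corporations by revenue (2016)</caption>
  <tr><th>Rank</th><th>Corporation</th></tr>
  <tr><td>1</td><td>Example Corp</td></tr>
</table>
"""


def find_table_by_caption(html, caption_text):
    """Return the first <table> whose <caption> contains caption_text, else None."""
    soup = BeautifulSoup(html, "html.parser")
    for table in soup.find_all("table"):
        caption = table.find("caption")
        # Case-insensitive substring match on the caption text.
        if caption is not None and caption_text.lower() in caption.get_text().lower():
            return table
    return None


table = find_table_by_caption(SAMPLE_HTML, "Fortune top 10")
print(table.caption.get_text(strip=True) if table else "table not found")
```

To try this against the live page, pass r.text from the snippet above instead of SAMPLE_HTML. If the function returns None, the title is probably not stored in a caption and the table has to be located some other way, for example by its position or class.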

Some documentation:

Beautiful Soup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Requests: http://docs.python-requests.org/en/master/user/intro/

urllib: https://docs.python.org/2/library/urllib.html

Answer 1 (score: 0)

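The first problem is that your URL is not defined correctly. After that, you need to find the table you want to extract and its class. In this case the class is "wikitable", and the table you want is the first one with that class. I have started the code for you, so it gives you the data extracted from the table. Learning web crawling is a good idea, but if you are just starting out with programming, try something simpler first.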

import requests
from bs4 import BeautifulSoup

def webcrawler():

    url = "https://en.wikipedia.org/wiki/Economy_of_the_European_Union"
    page = requests.get(url)
    soup = BeautifulSoup(page.text,"html.parser")
    tables = soup.findAll("table", class_='wikitable')[0]
    print(tables)

webcrawler()
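
The function above prints the raw HTML of the table. As a possible next step, the rows can be turned into plain Python lists. This is only a minimal sketch: SAMPLE_HTML and table_to_rows are hypothetical names, the sample values are placeholders rather than real figures, and merged cells (rowspan or colspan) are not handled.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page; values are placeholders, not real data.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Rank</th><th>Corporation</th><th>Revenue</th></tr>
  <tr><td>1</td><td>Example Corp A</td><td>100</td></tr>
  <tr><td>2</td><td>Example Corp B</td><td>90</td></tr>
</table>
"""


def table_to_rows(table):
    """Convert a bs4 <table> tag into a list of rows, each a list of cell strings."""
    rows = []
    for tr in table.find_all("tr"):
        # Header cells (<th>) and data cells (<td>) are read in document order.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
first_wikitable = soup.find_all("table", class_="wikitable")[0]
rows = table_to_rows(first_wikitable)
for row in rows:
    print(row)
```

With the live page, the soup object built inside webcrawler() can be used in place of the sample, and the resulting list of rows can then be written to a file or processed further.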