Scraping a specific div section from a table

Time: 2018-06-06 08:40:55

Tags: python web-scraping

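This 'td' section contains many div sections that have no name or class, and I only want the data from one specific div section. How can I do this? I tried the code below, but it returns far too much output.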

import requests
from bs4 import BeautifulSoup

url = "https://www.bloomberg.com/research/stocks/private/person.asp?personId=45794107&privcapId=8032555&previousCapId=12437591&previousTitle=Pawan%20Hans%20Limited"

response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')

data = []

for table in soup.findAll('table'):
    for row in table.findAll('tr'):
        for col in row.findAll('td'):
            #print(col.findAll('div'))
            data.append(col.get_text())
print(data)

I want the following output:

2017-Present
Independent Director
Air India Limited

2 answers:

Answer 0 (score: 0)

import requests
from bs4 import BeautifulSoup

url = "https://www.bloomberg.com/research/stocks/private/person.asp?personId=45794107&privcapId=8032555&previousCapId=12437591&previousTitle=Pawan%20Hans%20Limited"

response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')

data = []

table = soup.find_all('table', cellpadding="0")[2]
divs = table.find_all('div')[1:4]

for div in divs:
    print(div.get_text())

Answer 1 (score: 0)

Alternatively, you can achieve the same result without hard-coded indices:

import requests
from bs4 import BeautifulSoup

url = "https://www.bloomberg.com/research/stocks/private/person.asp?personId=45794107&privcapId=8032555&previousCapId=12437591&previousTitle=Pawan%20Hans%20Limited"

response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')
for items in soup.find_all(class_="sectionTitle"):
    if "Board Members" in items.text:
        # The element right after the section title holds the period
        item = items.find_next_sibling()
        presence = item.text
        # Position and company are the next div and link after that element
        position = item.find_next("div")
        company = item.find_next("a")
        print("{}\n{}\n{}".format(presence, position.text, company.text))

Output:

2017-Present
Independent Director
Air India Limited
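
Since the live page may change or become unavailable, the same idea can be tried offline. The sketch below is only an illustration: SAMPLE_HTML is made-up markup that assumes the period, position and company sit in sibling divs right after the section title, and extract_board_entry is a hypothetical helper name, not something from the answers above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an assumption about the layout, not the real Bloomberg HTML.
SAMPLE_HTML = """
<table>
  <tr><td>
    <div class="sectionTitle">Corporate Headquarters</div>
    <div>New Delhi</div>
    <div class="sectionTitle">Board Members Memberships</div>
    <div>2017-Present</div>
    <div>Independent Director</div>
    <div><a href="#">Air India Limited</a></div>
  </td></tr>
</table>
"""


def extract_board_entry(html):
    """Return [period, position, company] for the first 'Board Members' section, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for title in soup.find_all(class_="sectionTitle"):
        if "Board Members" not in title.get_text():
            continue
        # Locate the values relative to the title instead of by absolute index.
        period = title.find_next_sibling("div")
        if period is None:
            continue
        position = period.find_next_sibling("div")
        company = position.find_next("a") if position is not None else None
        if position is None or company is None:
            continue
        return [
            period.get_text(strip=True),
            position.get_text(strip=True),
            company.get_text(strip=True),
        ]
    return None


entry = extract_board_entry(SAMPLE_HTML)
print("\n".join(entry))
```

Anchoring on the section title keeps the lookup working if unrelated tables or divs are added elsewhere on the page, which is where index-based slicing tends to break. If the real markup nests these values differently, the sibling lookups would need adjusting.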