DataFrame problem when using Beautiful Soup

Time: 2020-09-26 22:46:28

Tags: python pandas dataframe beautifulsoup

I am scraping data with Beautiful Soup to build a DataFrame. However, I have two problems.

  1. Why does the for loop run twice?
  2. How do I remove the brackets from the values in the DataFrame?

import urllib.request as req

from bs4 import BeautifulSoup
import bs4
import requests
import pandas as pd

url = "https://finance.yahoo.com/quote/BF-B/profile?p=BF-B"

root = requests.get(url)

soup = BeautifulSoup(root.text, 'html.parser')

records = []

for result in soup:
    name = soup.find_all('h1', attrs={'D(ib) Fz(18px)'})
    website = soup.find_all('a')[44]
    sector = soup.find_all('span')[35]
    industry = soup.find_all('span')[37]
    records.append((name, website, sector, industry))

df = pd.DataFrame(records, columns=['name', 'website', 'sector', 'industry'])
df.head()

The result is as follows:

DataFrame Output

1 answer:

Answer 0 (score: 0)

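To get the information about the company, you don't need to loop over soup; just extract the values you need directly. To get rid of the [..] brackets, use the .text attribute: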

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = 'https://finance.yahoo.com/quote/BF-B/profile?p=BF-B'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []

all_data.append({
    'Name': soup.h1.text,
    'Website': soup.select_one('.asset-profile-container a[href^="http"]')['href'],
    # Note: newer soupsieve releases deprecate :contains in favor of :-soup-contains
    'Sector': soup.select_one('span:contains("Sector(s)") + span').text,
    'Industry': soup.select_one('span:contains("Industry") + span').text
})

df = pd.DataFrame(all_data)
print(df)

Prints:

                              Name                      Website              Sector                           Industry
0  Brown-Forman Corporation (BF-B)  http://www.brown-forman.com  Consumer Defensive  Beverages—Wineries & Distilleries
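
To see both effects without a network request, here is a minimal offline sketch. The HTML string is a hypothetical stand-in, not Yahoo's real markup, and the company name in it is made up. It shows that iterating over the soup walks its top-level nodes (a doctype and the html element here, which would explain two iterations if the real page has the same two nodes), and that find_all() returns a list-like object while .text on a single tag returns a plain string.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the downloaded page (not Yahoo's real markup).
html = (
    "<!DOCTYPE html><html><body>"
    "<h1>Example Corp (EX)</h1>"
    "<p><span>Sector(s)</span>: <span>Consumer Defensive</span></p>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# 1. Iterating over the soup walks its top-level nodes, here the doctype
#    and the <html> element, so a loop body runs once per node.
top_level = list(soup)
print(len(top_level))

# 2. find_all() returns a list-like ResultSet, which is displayed with brackets.
found = soup.find_all("h1")
print(found)

# .text on a single tag gives a plain string instead.
name = soup.h1.text

# Locate the label span by its exact text, then take the next <span> sibling.
label = soup.find("span", string="Sector(s)")
sector = label.find_next_sibling("span").text

df = pd.DataFrame([{"Name": name, "Sector": sector}])
print(df)
```

Looking up the label with find(..., string=...) and find_next_sibling() is a swapped-in alternative to the :contains selector used in the answer; either approach avoids relying on fixed positions such as soup.find_all('span')[35], which break as soon as the page layout changes.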