Using the `.find_next_siblings` function in Beautiful Soup

Time: 2014-12-06 08:35:56

Tags: python html csv beautifulsoup html-parsing

I am trying to write the output of a web scrape to a CSV file. Here is my code:

import bs4
import requests
import csv

#get webpage for Apple inc. September income statement
page = requests.get("https://au.finance.yahoo.com/q/is?s=AAPL")

#put into beautiful soup
soup = bs4.BeautifulSoup(page.content)

#select table that holds data of interest
table = soup.find("table", class_="yfnc_tabledata1")

#find the header row of the table
headers = table.find('tr', class_="yfnc_modtitle1")

#step through the rows after the header one sibling at a time
total_revenue = headers.next_sibling
cost_of_revenue = total_revenue.next_sibling
gross_profit = cost_of_revenue.next_sibling.next_sibling
#list of every <tr> that follows the header row
wang = headers.find_next_siblings("tr")

#write the header row and each data row to the CSV file
with open('/home/kwal0203/Desktop/Apple.csv', 'a') as csvfile:
    writer = csv.writer(csvfile,delimiter="|")
    writer.writerow([value.get_text(strip=True).encode("utf-8") for value in headers])
    writer.writerow([value.get_text(strip=True).encode("utf-8") for value in total_revenue])
    writer.writerow([value.get_text(strip=True).encode("utf-8") for value in cost_of_revenue])
    writer.writerow([value.get_text(strip=True).encode("utf-8") for value in gross_profit])
    for dude in wang:
        writer.writerow([dude.get_text(strip=True).encode("utf-8")])

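The problem is that I repeat a lot of code to create each row of the CSV. As you can see, I keep chaining next_sibling to reach the next row of values. I found the .find_next_siblings() function in Beautiful Soup, which does almost what I want, but every row it reads ends up in a single cell of the CSV file.

Any ideas? Let me know if the question is unclear.

Thanks.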

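1 answer:

Answer 0: (score: 0)

Well, I don't think this is a perfect solution, but the idea is to check each following sibling row for amounts and skip the rows that have none: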

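import re

#text of every <td> in each <tr> after the header row
#(.encode("utf-8") suits Python 2; on Python 3 drop it so the regex below matches str)
next_rows = [[td.get_text(strip=True).encode("utf-8") for td in row('td')]
             for row in headers.find_next_siblings("tr")]

#keep only cells that look like amounts, then drop rows left empty
pattern = re.compile(r'^[\d,]+$')
data = [[item for item in l if pattern.match(item)] for l in next_rows]
data = [l for l in data if l]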

with open('/home/kwal0203/Desktop/Apple.csv', 'a') as csvfile:
    writer = csv.writer(csvfile, delimiter="|")
    writer.writerows(data)

This produces:

42,123,000|37,432,000|45,646,000|57,594,000
26,114,000|22,697,000|27,699,000|35,748,000
16,009,000|14,735,000|17,947,000|21,846,000
1,686,000|1,603,000|1,422,000|1,330,000
3,158,000|2,850,000|2,932,000|3,053,000
11,165,000|10,282,000|13,593,000|17,463,000
307,000|202,000|225,000|246,000
11,472,000|10,484,000|13,818,000|17,709,000
11,472,000|10,484,000|13,818,000|17,709,000
3,005,000|2,736,000|3,595,000|4,637,000
8,467,000|7,748,000|10,223,000|13,072,000
8,467,000|7,748,000|10,223,000|13,072,000
8,467,000|7,748,000|10,223,000|13,072,000

These are essentially all of the amounts in the table.
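
For anyone who wants to try the idea without hitting the live page, here is a minimal self-contained sketch. It assumes Python 3 and uses a small made-up table: `SAMPLE_HTML`, its header labels, the "Operating Expenses" row, and the helper `rows_after` are hypothetical stand-ins rather than anything taken from the real Yahoo page. It differs from the code above in one respect: instead of stripping the non-amount cells, it keeps each row's label and only skips rows that contain no amount at all.

```python
import csv
import io
import re

from bs4 import BeautifulSoup

# Hypothetical stand-in for the income-statement table (not the real page).
SAMPLE_HTML = """
<table class="yfnc_tabledata1">
  <tr class="yfnc_modtitle1"><td>Period</td><td>Col A</td><td>Col B</td></tr>
  <tr><td>Total Revenue</td><td>42,123,000</td><td>37,432,000</td></tr>
  <tr><td>Operating Expenses</td></tr>
  <tr><td>Cost of Revenue</td><td>26,114,000</td><td>22,697,000</td></tr>
</table>
"""

# A cell counts as an amount if it holds only digits and commas.
AMOUNT = re.compile(r"^[\d,]+$")


def rows_after(header_row):
    """Return one list of cell texts per <tr> sibling that follows header_row."""
    return [
        [td.get_text(strip=True) for td in row.find_all("td")]
        for row in header_row.find_next_siblings("tr")
    ]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
headers = soup.find("tr", class_="yfnc_modtitle1")

# Keep rows that contain at least one amount; the row label is preserved.
data = [row for row in rows_after(headers) if any(AMOUNT.match(c) for c in row)]

buffer = io.StringIO()  # in-memory stand-in for the CSV file
csv.writer(buffer, delimiter="|").writerows(data)
print(buffer.getvalue())
```

The key point is that each `<td>` becomes its own list element, so `csv.writer` puts it in its own cell. Passing a one-element list such as `[row.get_text(strip=True)]` is what collapses a whole row into a single cell. To write to a real file on Python 3, replace `buffer` with `open(path, "a", newline="")`.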