Scraping table content with BeautifulSoup 4

Posted: 2020-05-19 23:36:47

Tags: python web-scraping beautifulsoup

I am trying to scrape the "TWITTER STATS Summary" table from this page.

Here is my code:

import urllib2
from bs4 import BeautifulSoup

rank_page = 'https://socialblade.com/twitter/user/bill%20gates'
request = urllib2.Request(rank_page, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'})
page = urllib2.urlopen(request)
soup = BeautifulSoup(page, 'html.parser')

channels = soup.find('div', attrs={'id': 'socialblade-user-content'}).find_all('div', recursive=False)[10:]

for row in channels:
    date = row.find('div', attrs={'style': 'width: 80px; float: left;'})
    print date

But I get None in the terminal. I only want to pull the dates from the table (the dates shown alongside the follower, following, and media counts). I know how to carry on and save them to Excel, but I am having trouble locating the right divs and their text. Any help is appreciated.

1 answer:

Answer 0 (score: 2):

Using Python 3 and the requests library:

import requests
from bs4 import BeautifulSoup

rank_page = 'https://socialblade.com/twitter/user/bill%20gates'
r = requests.get(rank_page, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'})
soup = BeautifulSoup(r.content, 'html.parser')

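# select the div that immediately follows the header div containing "Date";
# this sibling div wraps all of the table rows (newer soupsieve versions spell :contains as :-soup-contains)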
d = soup.select_one('div:has(>div:contains("Date")) + div')

all_data = []
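# each direct child div is one table row; get_text with a separator splits it into the 8 cell values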
for div in d.find_all('div', recursive=False):
    row = div.get_text(strip=True, separator='|').split('|')
    if len(row) == 8:
        all_data.append(row)

#pretty print to screen:
print(('{:<20}'*8).format('Date', 'Day', 'Followers(chng)', 'Followers', 'Following(chng)', 'Following', 'Media(chng)', 'Media'))
for row in all_data:
    print(('{:<20}'*8).format(*row))

Prints:

Date                Day                 Followers(chng)     Followers           Following(chng)     Following           Media(chng)         Media               
2020-05-06          Wed                 --                  50,310,276          --                  218                 --                  3,309               
2020-05-07          Thu                 +20,293             50,330,569          --                  218                 --                  3,309               
2020-05-08          Fri                 +17,884             50,348,453          --                  218                 +1                  3,310               
2020-05-09          Sat                 +21,294             50,369,747          --                  218                 --                  3,310               
2020-05-10          Sun                 +19,186             50,388,933          --                  218                 --                  3,310               
2020-05-11          Mon                 +19,892             50,408,825          --                  218                 --                  3,310               
2020-05-12          Tue                 +16,876             50,425,701          --                  218                 --                  3,310               
2020-05-13          Wed                 +18,973             50,444,674          --                  218                 +1                  3,311               
2020-05-14          Thu                 +16,764             50,461,438          --                  218                 --                  3,311               
2020-05-15          Fri                 +16,554             50,477,992          --                  218                 +1                  3,312               
2020-05-16          Sat                 +17,031             50,495,023          --                  218                 --                  3,312               
2020-05-17          Sun                 +14,046             50,509,069          --                  218                 --                  3,312               
2020-05-18          Mon                 +14,394             50,523,463          --                  218                 --                  3,312               
2020-05-19          Tue                 +9,208              50,532,671          --                  218                 +1                  3,313               

EDIT (saving to a csv file):

#saving to csv:
import csv

with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in all_data:
        writer.writerow(row)

This produces the file output.csv (the original answer showed a LibreOffice screenshot of the result).

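The question also mentions saving the results to Excel. As a minimal sketch (assuming pandas and openpyxl are installed; neither is used in the answer above), the collected all_data rows could be written to a workbook like this:

import pandas as pd

# hypothetical follow-up: build a DataFrame from the rows collected above and write an .xlsx file
columns = ['Date', 'Day', 'Followers(chng)', 'Followers',
           'Following(chng)', 'Following', 'Media(chng)', 'Media']
df = pd.DataFrame(all_data, columns=columns)
df.to_excel('output.xlsx', index=False)  # to_excel writes .xlsx via openpyxl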