I'm trying to scrape the "TWITTER STATS Summary" table from this page.
Here is my code:
import urllib2
from bs4 import BeautifulSoup

rank_page = 'https://socialblade.com/twitter/user/bill%20gates'
request = urllib2.Request(rank_page, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'})
page = urllib2.urlopen(request)
soup = BeautifulSoup(page, 'html.parser')
channels = soup.find('div', attrs={'id': 'socialblade-user-content'}).find_all('div', recursive=False)[10:]
for row in channels:
    date = row.find('div', attrs={'style': 'width: 80px; float: left;'})
    print date
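But I get None in the terminal. I just want to extract the table's values (date, followers, following, media). I know how to carry on from there and save them to Excel, but I'm struggling to locate the right divs and their text. Thanks for your help.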
Answer 0 (score: 2)
Using Python 3 and the requests library:
import requests
from bs4 import BeautifulSoup

rank_page = 'https://socialblade.com/twitter/user/bill%20gates'
r = requests.get(rank_page, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'})
soup = BeautifulSoup(r.content, 'html.parser')

# the div that directly follows the header row (the one containing "Date")
d = soup.select_one('div:has(>div:contains("Date")) + div')

all_data = []
for div in d.find_all('div', recursive=False):
    row = div.get_text(strip=True, separator='|').split('|')
    # keep only complete rows (8 columns)
    if len(row) == 8:
        all_data.append(row)

# pretty print to screen:
print(('{:<20}'*8).format('Date', 'Day', 'Followers(chng)', 'Followers', 'Following(chng)', 'Following', 'Media(chng)', 'Media'))
for row in all_data:
    print(('{:<20}'*8).format(*row))
Prints:
Date                Day                 Followers(chng)     Followers           Following(chng)     Following           Media(chng)         Media
2020-05-06          Wed                 --                  50,310,276          --                  218                 --                  3,309
2020-05-07          Thu                 +20,293             50,330,569          --                  218                 --                  3,309
2020-05-08          Fri                 +17,884             50,348,453          --                  218                 +1                  3,310
2020-05-09          Sat                 +21,294             50,369,747          --                  218                 --                  3,310
2020-05-10          Sun                 +19,186             50,388,933          --                  218                 --                  3,310
2020-05-11          Mon                 +19,892             50,408,825          --                  218                 --                  3,310
2020-05-12          Tue                 +16,876             50,425,701          --                  218                 --                  3,310
2020-05-13          Wed                 +18,973             50,444,674          --                  218                 +1                  3,311
2020-05-14          Thu                 +16,764             50,461,438          --                  218                 --                  3,311
2020-05-15          Fri                 +16,554             50,477,992          --                  218                 +1                  3,312
2020-05-16          Sat                 +17,031             50,495,023          --                  218                 --                  3,312
2020-05-17          Sun                 +14,046             50,509,069          --                  218                 --                  3,312
2020-05-18          Mon                 +14,394             50,523,463          --                  218                 --                  3,312
2020-05-19          Tue                 +9,208              50,532,671          --                  218                 +1                  3,313
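To make the selector in the answer easier to follow, here is a minimal sketch that runs it against a small inline HTML snippet instead of the live site. The markup below is hypothetical and heavily simplified (the `wrap`, `header` and `body` class/id names are invented for illustration and are not Social Blade's real markup). It also uses `:-soup-contains`, which recent versions of Soup Sieve (the selector engine behind BeautifulSoup) prefer over the older `:contains` spelling; that is an assumption about the installed version.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: a header row followed by a container of data rows.
html = """
<div id="wrap">
  <div class="header">
    <div>Date</div><div>Followers</div>
  </div>
  <div class="body">
    <div><div>2020-05-06</div><div>50,310,276</div></div>
    <div><div>2020-05-07</div><div>50,330,569</div></div>
  </div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# "div:has(> div:-soup-contains("Date"))" matches a div with a direct child div
# whose text contains "Date" (the header row); "+ div" then selects the div
# that immediately follows it as a sibling (the container of data rows).
body = soup.select_one('div:has(> div:-soup-contains("Date")) + div')

# Each direct child div is one row; join its text pieces with '|' and split again.
rows = [div.get_text(strip=True, separator='|').split('|')
        for div in body.find_all('div', recursive=False)]
print(rows)
```

Passing `recursive=False` matters here: without it, `find_all` would also return the inner cell divs, not just the row divs.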
Edit (saving to a CSV file):
# saving to csv:
import csv

with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in all_data:
        writer.writerow(row)
This produces the file output.csv, which can be opened in LibreOffice.
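Since the scraped cells are strings such as '50,310,276' or '+20,293', a spreadsheet may treat them as text. If numeric columns are wanted, the rows could be cleaned before writing. The following is only a sketch: the helper names `to_int` and `clean_row` are made up for illustration, and mapping '--' to None (which `csv.writer` writes as an empty cell) is an assumption about how "no change" should be stored.

```python
def to_int(cell):
    """Convert '50,310,276' or '+20,293' to an int; return None for '--'."""
    cell = cell.strip()
    if cell == '--':
        # assumption: '--' means "no change"; stored as an empty value
        return None
    return int(cell.replace(',', ''))


def clean_row(row):
    """Keep date and day as text, convert the remaining columns to numbers."""
    date, day, *numbers = row
    return [date, day] + [to_int(n) for n in numbers]


# one row in the same shape as the entries of all_data
sample = ['2020-05-07', 'Thu', '+20,293', '50,330,569', '--', '218', '--', '3,309']
cleaned = clean_row(sample)
print(cleaned)
```

With this in place, `writer.writerow(clean_row(row))` could be used instead of `writer.writerow(row)` in the CSV loop.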