I'm scraping app names from the Google Play store. For each URL given as input, I can only get 60 apps, because the site only renders 60 apps until the user scrolls down. How does this work, and how can I scrape all the apps on the page using BeautifulSoup and/or Selenium?
Thanks
Here is my code:
from requests import get
from bs4 import BeautifulSoup

urls = []
urls.extend(["https://play.google.com/store/apps/category/NEWS_AND_MAGAZINES/collection/topselling_paid"])

for url in urls:
    response = get(url)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    app_container = html_soup.find_all('div', class_="card no-rationale square-cover apps small")
    file = open("./InputFiles/applications.txt", "w+")
    for i in range(0, len(app_container)):
        # print(app_container[i].div['data-docid'])
        file.write(app_container[i].div['data-docid'] + "\n")
    file.close()

num_lines = sum(1 for line in open('./InputFiles/applications.txt'))
print("Applications : " + str(num_lines))
Answer 0 (score: 2)
In this case, you need to use Selenium. I tried it for you and retrieved all the apps. I'll do my best to explain so it's clear. Selenium is more powerful here than plain requests/BeautifulSoup because it drives a real browser, so it can trigger the scrolling that makes the page load the remaining apps. I used ChromeDriver, so if you haven't installed it yet, install it first.
from time import sleep
from selenium import webdriver

options = webdriver.ChromeOptions()
driver = webdriver.Chrome(chrome_options=options,
                          executable_path=r'This part is your Driver path')
driver.get('https://play.google.com/store/apps/category/NEWS_AND_MAGAZINES/collection/topselling_paid')

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")  # Scroll to the bottom of the page
sleep(5)  # Give the page time to load the extra cards; without this delay we would only get the first 60 elements

x = driver.find_elements_by_css_selector("div[class='card-content id-track-click id-track-impression']")  # Select the app cards by class
for a in x:
    print(a.text)
driver.close()
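A single scroll was enough to load all 75 apps in this collection, but if a page keeps loading more cards as you scroll, you can repeat the scroll until the page height stops growing. This is a minimal sketch of that common pattern, assuming the same driver as above (the loop and variable names are mine, not part of the original answer):

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    sleep(5)  # wait for newly loaded cards to render
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:  # no new content appeared, so we reached the end
        break
    last_height = new_height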
Output:
1. Pocket Casts
Podcast Media LLC
₺24,99
2. Broadcastify Police Scanner Pro
RadioReference.com LLC
₺18,99
3. Relay for reddit (Pro)
DBrady
₺8,00
4. Sync for reddit (Pro)
Red Apps LTD
₺15,00
5. reddit is fun golden platinum (unofficial)
TalkLittle
₺9,99
... UP TO 75
Note: don't mind the prices. They are shown in my local currency, so they will be different for you.
Update based on your comment:
The same data-docid is also present in a span tag. You can get it with get_attribute. Just add the following code to your project:
y = driver.find_elements_by_css_selector("span[class=preview-overlay-container]")
for b in y:
    print(b.get_attribute('data-docid'))
Output:
au.com.shiftyjelly.pocketcasts
com.radioreference.broadcastifyPro
reddit.news
com.laurencedawson.reddit_sync.pro
com.andrewshu.android.redditdonation
com.finazzi.distquakenoads
com.twitpane.premium
org.fivefilters.kindleit
.... UP TO 75
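If you want to keep your original file-based flow, you can write these package IDs to applications.txt instead of printing them. This is a minimal sketch, assuming the driver and the y list from the snippet above (the file path is the one from your own script):

with open("./InputFiles/applications.txt", "w+") as file:
    for b in y:
        file.write(b.get_attribute('data-docid') + "\n")

num_lines = sum(1 for line in open('./InputFiles/applications.txt'))
print("Applications : " + str(num_lines))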