For example, I am trying to read this link with Beautiful Soup: https://sensortower.com/ios/us/ninja-kiwi/app/bloons-td-6/1118115766/overview
The HTML looks like this:
<div id="app-revenue-downloads" class="shadowed-st content">
<div class="downloads-revenue-container" data-bind="with: $data.downloadsAndRevenue">
<a class="downloads-holder" data-bind="
sense: ['App Intel - Overview - Click Download Estimate', { 'App ID': $parent.appId }],
attr: { href: $parent.buildStoreIntelUrl('units') },
click: function() { $parent.switchToStoreIntel('units') },
tooltip: { animation: false, title: 'View estimates in Store Intelligence', placement: 'bottom' }" href="/store-intel/app-analysis?measure=units&os=ios&country=US&app_id=1118115766" data-original-title="">
<h3>Downloads</h3>
<span class="downloads" data-bind="text: $data.downloads">60k</span>
<span class="downloads-month">
<span data-bind="text: moment().subtract(1, 'months').subtract(10, 'days').format('MMM YYYY') + ' Worldwide'">Jun 2021 Worldwide</span>
</span>
</a>
<!-- ko if: $data.revenue -->
<a class="revenue-holder" data-bind="
sense: ['App Intel - Overview - Click Revenue Estimate', { 'App ID': $parent.appId }],
attr: { href: $parent.buildStoreIntelUrl('revenue') },
click: function() { $parent.switchToStoreIntel('revenue') },
tooltip: { animation: false, title: 'View estimates in Store Intelligence (All Revenue is Net)', placement: 'bottom' }" href="/store-intel/app-analysis?measure=revenue&os=ios&country=US&app_id=1118115766" data-original-title="">
<h3>Revenue</h3>
<span class="revenue" data-bind="text: $data.revenue">$1m</span>
<span class="revenue-month">
<span data-bind="text: moment().subtract(1, 'months').subtract(10, 'days').format('MMM YYYY') + ' Worldwide'">Jun 2021 Worldwide</span>
</span>
</a>
<!-- /ko -->
</div>
</div>
So I first tried to read the element with id="app-revenue-downloads":
tmpElem = soup.find(id="app-revenue-downloads")
print(tmpElem)
But for whatever reason, I only get the first a tag (the one with the class "downloads-holder"), not the a tag with the class "revenue-holder":
<div class="shadowed-st content" id="app-revenue-downloads">
<div class="downloads-revenue-container" data-bind="with: $data.downloadsAndRevenue">
<a class="downloads-holder" data-bind="
sense: ['App Intel - Overview - Click Download Estimate', { 'App ID': $parent.appId }],
attr: { href: $parent.buildStoreIntelUrl('units') },
click: function() { $parent.switchToStoreIntel('units') },
tooltip: { animation: false, title: 'View estimates in Store Intelligence', placement: 'bottom' }" data-original-title="" href="/store-intel/app-analysis?measure=units&os=ios&country=US&app_id=1574888366">
<h3>Downloads</h3>
<span class="downloads" data-bind="text: $data.downloads">< 5k</span>
<span class="downloads-month">
<span data-bind="text: moment().subtract(1, 'months').subtract(10, 'days').format('MMM YYYY') + ' Worldwide'">Jun 2021 Worldwide</span>
</span>
</a>
<!-- ko if: $data.revenue --><!-- /ko -->
</div>
</div>
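A quick way to check which blocks actually made it into the captured markup is to list the classes of the a tags together with the HTML comments. The following is a minimal sketch using only the standard library's `html.parser`; the `captured` string is an abbreviated copy of the output above (attributes shortened), and the class name `AnchorClassCollector` is made up for illustration.

```python
from html.parser import HTMLParser


class AnchorClassCollector(HTMLParser):
    """Collect the class of every <a> tag and the text of every HTML comment."""

    def __init__(self):
        super().__init__()
        self.anchor_classes = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            self.anchor_classes.append(dict(attrs).get("class"))

    def handle_comment(self, data):
        self.comments.append(data.strip())


# Abbreviated copy of the printed output (hypothetical shortening of attributes).
captured = """
<div class="shadowed-st content" id="app-revenue-downloads">
<div class="downloads-revenue-container">
<a class="downloads-holder" href="/store-intel/app-analysis?measure=units">
<h3>Downloads</h3>
<span class="downloads">&lt; 5k</span>
</a>
<!-- ko if: $data.revenue --><!-- /ko -->
</div>
</div>
"""

collector = AnchorClassCollector()
collector.feed(captured)
print(collector.anchor_classes)
print(collector.comments)
```

Here only `downloads-holder` is found, and the `ko if: $data.revenue` comment pair has nothing between it. In Knockout's containerless syntax that usually means the condition was falsy when the markup was captured, so the revenue link was never rendered into the page source that Beautiful Soup received.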
I initialize the driver like this:
import os
import sys
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--window-size=1920,800")  # Chrome expects width,height
options.add_argument('--no-sandbox')
options.add_argument('--disable-gpu')
options.add_experimental_option('excludeSwitches', ['enable-logging'])
options.add_argument('--headless')
# Look for the chromedriver binary next to the running script
path = os.path.abspath(os.path.dirname(sys.argv[0]))
if sys.platform == "win32":
    cd = '/chromedriver.exe'
elif sys.platform in ("linux", "darwin"):
    cd = '/chromedriver'
driver = webdriver.Chrome(path + cd, options=options)
driver.get(elem)  # elem is the URL to load
driver.set_window_size(1800, 1000)
soup = BeautifulSoup(driver.page_source, 'html.parser')
Why does Beautiful Soup only read the first a tag and not the second?
Answer 0 (score: 1)
import requests
import re
r = requests.get('https://sensortower.com/ios/us/ninja-kiwi/app/bloons-td-6/1118115766/overview').text
downloads = re.findall(r'"downloads":"([^"]*)"', r)[0]
revenue = re.findall(r'"revenue":"([^"]*)"', r)[0]
print(downloads, revenue)  # 60k $1m
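Note that indexing with `[0]` raises an `IndexError` when a key is missing, which may happen for apps without a revenue estimate (the output in the question shows an empty revenue block). A minimal sketch of a more forgiving variant follows; it assumes the raw response embeds the values as JSON-like `"key":"value"` pairs, as the regex above implies. The helper name `extract_estimates` and both sample strings are hypothetical, used so the example runs offline.

```python
import re


def extract_estimates(page_text):
    """Return the first downloads/revenue values found, or None where absent."""
    result = {}
    for key in ("downloads", "revenue"):
        # Same pattern as in the answer, built once per key
        match = re.search(r'"%s":"([^"]*)"' % key, page_text)
        result[key] = match.group(1) if match else None
    return result


# Hypothetical fragments standing in for the text returned by requests.get(...).text
with_revenue = '{"downloads":"60k","revenue":"$1m"}'
without_revenue = '{"downloads":"< 5k"}'

print(extract_estimates(with_revenue))
print(extract_estimates(without_revenue))
```

With a real page, `page_text` would be the `.text` of the response; a `None` value then signals that the key was not present rather than crashing the script.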
Answer 1 (score: 1)
import requests
import re
def main(url):
    r = requests.get(url)
    # Capture the value that follows either the "downloads" or the "revenue" key
    match = re.findall(r'"(?:downloads|revenue)":"(.+?)"', r.text)
    print(match)

main("https://sensortower.com/ios/us/ninja-kiwi/app/bloons-td-6/1118115766/overview")
Output:
['60k', '$1m']