I have some HTML that I want to scrape.
<div class="prw_rup prw_common_static_map_no_style staticMap" data-prwidget-name="common_static_map_no_style" data-prwidget-init="handlers">
<div class="prv_map clickable" onclick="requireCallLast('ta/maps/opener', 'open', 2, null, null,{customFilters: []})">
<img width="310" style="width:310px;height:270px;" id="lazyload_-1295083988_4" height="270" src="https://trip-raster.citymaps.io/staticmap?scale=2&zoom=18&size=310x270&language=en&center=32.769936,-117.252693&markers=icon:http%3A%2F%2Fc1.tacdn.com%2Fimg2%2Fmaps%2Ficons%2Fpin_v2_CurrentCenter.png|32.769936,-117.25269&markers=icon:http%3A%2F%2Fc1.tacdn.com%2Fimg2%2Fmaps%2Ficons%2Fpin_lg_Restaurant.png|32.769936,-117.25269|32.770027,-117.25272&markers=icon:http%3A%2F%2Fc1.tacdn.com%2Fimg2%2Fmaps%2Ficons%2Fpin_lg_ThingToDo.png|32.77055,-117.25273|32.770683,-117.251884|32.770664,-117.25131">
</div>
</div>
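How can I retrieve the src of the child element inside the inner div? That is, I want the URL returned as a string.
The closest I have gotten so far is: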
try:
    mappa = driver.find_element_by_xpath("""//*[@id="taplc_location_detail_overview_restaurant_0"]/div[1]/div[2]/div[1]/div""")  # .get_attribute("src")
    print(mappa, "this is mappa")
    # Take the first descendant of the located div and read its src attribute
    child_mappa = mappa.find_element_by_xpath('.//*').get_attribute("src")
    print(child_mappa)
except Exception as e:
    print(e)
This produces:
$ <selenium.webdriver.remote.webelement.WebElement (session="4c6acf0a93bc9c184a351ddbc2180977", element="0.5263477154236882-1")>
$ https://static.tacdn.com/img2/x.gif
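Since the id of the img is dynamic, I can't use it to build the XPath, because the XPath would be tied to that id. Also, why does the src change?
How can I get that src?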
Answer 0 (score: 0)
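So, this is a bit odd, but I managed to get it with a regular expression. Instead of reading the element through Selenium, I read the whole page HTML, used a regex to find every URL, and then split the static map URL at the points I needed.
It's not clean, but it works.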
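import re

driver.get(url)
innerHTML = driver.execute_script("return document.body.innerHTML")
print(type(innerHTML))
lat, long = None, None  # stays None if no static map URL is found
try:
    # Collect every URL that appears in the page source
    urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', innerHTML)
    #print(urls)
    for page_url in urls:
        if 'staticmap?scale=' in page_url:
            # The coordinates sit between "center=" and the first "&markers="
            map_click = page_url.split('language=en&center=')[1].split('&markers=icon:http')[0]
            lat, long = map_click.split(',')
            break
except (IndexError, ValueError):
    # The URL did not have the expected shape
    lat, long = None, None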
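The two chained split calls depend on the exact order of the query parameters. As a possibly more forgiving alternative, the coordinates could be read from the center parameter with the standard library's URL parser. This is only a sketch: the helper name center_from_static_map_url is made up for illustration, and it assumes the URL may have been taken from serialized HTML where "&" shows up as "&amp;".

```python
from urllib.parse import urlsplit, parse_qs


def center_from_static_map_url(page_url):
    """Return (lat, lon) as strings from the `center` query parameter.

    Hypothetical helper, not part of Selenium. Returns (None, None) when
    the parameter is missing or does not look like "lat,lon".
    """
    # Assumption: the URL may come from serialized HTML, where "&" is
    # written as "&amp;". Undo only that one entity; a general HTML
    # unescape would also turn "&center" into a cent sign.
    cleaned = page_url.replace("&amp;", "&")
    query = parse_qs(urlsplit(cleaned).query)
    center = query.get("center")
    if not center:
        return None, None
    parts = center[0].split(",")
    if len(parts) != 2:
        return None, None
    return parts[0], parts[1]
```

Inside the loop above, the two split lines could then be replaced with lat, long = center_from_static_map_url(page_url), which would no longer depend on language=en coming directly before center.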