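I've been trying for a while to get the text from the list of items on petango.com. Below is a screenshot of the HTML. I use Selenium to load the page so that it renders, then grab it as a BeautifulSoup object: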
PetSoup = BeautifulSoup(page_source, 'html.parser')
I tried:
ul_tag = PetSoup.findAll('div', {'class': 'group details-list'})
pet_details = [i.get_text() for i in ul_tag]
print(pet_details)
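This gives me all the items as a single list object (I think), but I can't seem to parse it so that I can assign a variable to each item (e.g., Age:, Breed:, etc.):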
['\n\nBreed: Chihuahua, Short Coat / Mix\nAge: 15y 5m Gender: Male\nColor: Black / Grey\nSpayed/Neutered: Yes\nSize: Small\nDeclawed: No\nAdoption Date: \n\n']
I also tried:
for ultag in PetSoup.find_all('ul', {'class': 'group details-list'}):
    for litag in ultag.find_all('li'):
        print(litag.text)
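This prints all the content, but again, I can't seem to parse it so that I can assign variables.
What I really want to do is just grab the text inside the span tags, but I can't seem to get hold of it. I think it has something to do with how the span tag is structured, but none of the variations I've tried for selecting the items in the span work. I just get an empty result. The specific span is: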
<span data-bind="text: breed">Terrier, Rat / Mix </span>
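Can anyone point me in the right direction? Thanks for your help! The specific page is here: https://www.petango.com/Adopt/Dog-Terrier-Rat-22192827
Answer 0: (score: 1)
To extract the text from a span tag, just use span.text. This is how I extract the span's text: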
from bs4 import BeautifulSoup
html = '<span data-bind="text: breed">Terrier, Rat / Mix </span>'
soup = BeautifulSoup(html,'html.parser')
span = soup.find('span')
print(span.text)
Output:
Terrier, Rat / Mix
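When the page contains several spans, `soup.find('span')` only returns the first one. A minimal sketch of picking a specific span by its `data-bind` attribute follows. The HTML fragment is hypothetical: only the breed span comes from the question, and the second span with its `text: age` binding is assumed for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the span from the question; the
# second span and its "text: age" binding are assumed for illustration.
html = '''
<ul>
  <li><strong>Breed:</strong> <span data-bind="text: breed">Terrier, Rat / Mix </span></li>
  <li><strong>Age:</strong> <span data-bind="text: age">11y 7m</span></li>
</ul>
'''

soup = BeautifulSoup(html, 'html.parser')

# Match on the exact attribute value...
breed_span = soup.find('span', attrs={'data-bind': 'text: breed'})
# ...or use the equivalent CSS attribute selector.
age_span = soup.select_one('span[data-bind="text: age"]')

# strip=True removes the trailing space inside the breed span
breed = breed_span.get_text(strip=True)
age = age_span.get_text(strip=True)
print(breed)
print(age)
```

If a span is found but its text is empty, one possible cause is that the page's JavaScript had not yet filled it in when the HTML was captured.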
This works fine. But I went a step further and scraped all the data from the site. Here is the complete code to do that:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get('https://www.petango.com/Adopt/Dog-Terrier-Rat-22192827')
html = driver.page_source
driver.close()
soup = BeautifulSoup(html,'html.parser')
ul = soup.find('div',class_ = 'group details-list').ul #Gets the ul tag
li_items = ul.find_all('li') #Finds all the li tags within the ul tag
headings = []
values = []
for li in li_items:
    heading = li.strong.text
    headings.append(heading)
    value = li.span.text
    if value:
        values.append(value)
    else:
        values.append(None)
You can also create a nice Pandas DataFrame from these lists (for better readability) by adding the following lines to the code:
import pandas as pd

# Combine the two lists into a two-column table
details_dict = {'Headings': headings,
                'Values': values}
df = pd.DataFrame(details_dict)
print(df)
Output:
           Headings              Values
0            Breed:  Terrier, Rat / Mix
1              Age:              11y 7m
2            Color:       White / Black
3  Spayed/Neutered:                 Yes
4             Size:               Small
5         Declawed:                  No
6    Adoption Date:                None
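Since the question asks for one variable per field, the two lists can also be paired up in a dictionary keyed by heading. This is only a sketch: `build_details` is a hypothetical helper name, and the sample lists are copied from the output above rather than scraped live.

```python
# Sample data copied from the output above; in the real script these
# lists are filled by the loop over the <li> tags.
headings = ['Breed:', 'Age:', 'Color:', 'Spayed/Neutered:',
            'Size:', 'Declawed:', 'Adoption Date:']
values = ['Terrier, Rat / Mix ', '11y 7m', 'White / Black',
          'Yes', 'Small', 'No', None]


def build_details(headings, values):
    """Pair each heading with its value, dropping the trailing colon.

    Hypothetical helper, not part of the original answer.
    """
    details = {}
    for heading, value in zip(headings, values):
        key = heading.strip().rstrip(':')
        # Keep None for empty fields, strip whitespace otherwise
        details[key] = value.strip() if value else None
    return details


details = build_details(headings, values)
breed = details['Breed']
age = details['Age']
print(breed)
print(age)
```

Looking values up by key avoids depending on the order of the list items on the page.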
Hope this helps!
Answer 1: (score: 0)
A solution without selenium:
import re
import json
import requests
from bs4 import BeautifulSoup
url = 'https://www.petango.com/Adopt/Dog-Terrier-Rat-22192827'
ajax_url = 'https://www.petango.com/DesktopModules/Pethealth.Petango/Pethealth.Petango.DnnModules.AnimalDetails/API/Main/GetAnimalDetails?moduleId={}&animalId={}&clientZip=null'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0',}
html_text = requests.get(url, headers=headers).text
soup = BeautifulSoup(html_text, 'html.parser')
animal_id = url.split('-')[-1]
module_id = soup.select_one('[id^="Module-"]:has(.pet-id)')['id'].split('-')[-1]
tab_id = re.search(r'`sf_tabId`:`(\d+)`', html_text).group(1)
headers['TabId'] = tab_id
data = json.loads( requests.get(ajax_url.format(module_id, animal_id), headers=headers).text )
# print data to screen:
print(json.dumps(data, indent=4))
Prints:
{
    "id": 22192827,
    "speciesId": 1,
    "species": "Dog",
    "name": "Charlie 3",
    "photo": "https://g.petango.com/photos/1488/6955fc68-7a2e-4da1-af34-930c5fabd713.jpg",
    "photo2": "https://g.petango.com/photos/1488/450f6d7d-e501-475c-9b1f-9322dbc225eb.jpg",
    "photo3": "https://g.petango.com/photos/1488/4fbdbf1e-58b0-45c8-8c12-b5122ad1fff8.jpg",
    "youtubeVideoId": null,
    "age": "11y 7m",
    "distance": 0,
    "gender": "Male",
    "breed": "Terrier, Rat / Mix",
    "color": "White / Black",
    "spayedNeutered": "Yes",
    "size": "Small",
    "memo": null,
    "noDogs": false,
    "noCats": false,
    "noKids": false,
    "addToFavorites": false,
    "removeFromFavorites": false,
    "adoptionDate": null,
    "declawed": false,
    "adoptionApplicationUrl": "https://www.petango.com/Adoption-Application?shelterId=1897&animalId=22192827",
    "shelterName": "OH",
    "shelterCityState": "CINCINNATI, OH"
}
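Once the response is parsed, each field is an ordinary dictionary entry, so it can be assigned to a variable directly. A small sketch using a trimmed copy of the response printed above as sample data; the `wanted` list is an illustrative choice of fields, not something the API requires.

```python
import json

# A trimmed copy of the response printed above, used as sample data.
raw = '''
{
    "name": "Charlie 3",
    "age": "11y 7m",
    "gender": "Male",
    "breed": "Terrier, Rat / Mix",
    "color": "White / Black",
    "spayedNeutered": "Yes",
    "size": "Small",
    "adoptionDate": null,
    "declawed": false
}
'''
data = json.loads(raw)

# Illustrative selection of fields to keep
wanted = ['breed', 'age', 'gender', 'color', 'size', 'adoptionDate']
# dict.get returns None if a key is missing from the response
details = {key: data.get(key) for key in wanted}

breed = details['breed']
age = details['age']
print(breed)
print(age)
```

Note that JSON `null` and `false` arrive as Python `None` and `False`, so `adoptionDate` is `None` here, while `declawed` is a boolean rather than the "No" shown on the page.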