How to scrape the text inside a span tag with Beautiful Soup in Python

Date: 2020-10-09 13:49:01

Tags: python selenium web-scraping beautifulsoup

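I have been trying for a while to get the text from a list of items on petango.com. A screenshot of the HTML is below. I load the page with Selenium so that it renders, then parse the page source into a BeautifulSoup object: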

PetSoup = BeautifulSoup(page_source, 'html.parser')

I tried:

ul_tag = PetSoup.findAll('div', {'class': 'group details-list'})
pet_details = [i.get_text() for i in ul_tag]
print(pet_details)

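This gives me all of the items as a single list object (I think), but I can't seem to parse it so that I can assign a variable to each item (e.g. Age:, Breed:, etc.):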

['\n\nBreed: Chihuahua, Short Coat / Mix\nAge: 15y 5m Gender: Male\nColor: Black / Grey\nSpayed/Neutered: Yes\nSize: Small\nDeclawed: No\nAdoption Date: \n\n']

I also tried:

for ultag in PetSoup.find_all('ul', {'class': 'group details-list'}):
    for litag in ultag.find_all('li'):
        print(litag.text)

This gets me all of the content, but again I can't seem to parse it to assign variables.

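What I really want is just the text inside the span tags, but I can't seem to get hold of it. I think it has something to do with how the span tag is structured, but none of the variations I tried for selecting the span worked; I just get an empty result. The specific span is: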

<span data-bind="text: breed">Terrier, Rat / Mix </span> 

Can anyone point me in the right direction? Thanks for your help! The specific page is here: https://www.petango.com/Adopt/Dog-Terrier-Rat-22192827

[Image: screenshot of the page's HTML]

2 Answers:

Answer 0: (score: 1)

To extract the text from a span tag, just use span.text. Here is how I extracted the span's text:

from bs4 import BeautifulSoup

html = '<span data-bind="text: breed">Terrier, Rat / Mix </span>'

soup = BeautifulSoup(html,'html.parser')

span = soup.find('span')

print(span.text)

Output:

Terrier, Rat / Mix 
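
When a page contains many spans, soup.find('span') only returns the first one. A minimal sketch of targeting one span by its data-bind attribute follows. The HTML fragment is a hypothetical stand-in modeled on the span in the question, and the "text: age" binding is an assumption, so the real page may differ. If the spans are filled in by JavaScript, they may also be empty in HTML captured before rendering finished, which could explain an empty result.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the question's markup; the real page may differ.
html = """
<ul>
  <li><strong>Breed:</strong> <span data-bind="text: breed">Terrier, Rat / Mix </span></li>
  <li><strong>Age:</strong> <span data-bind="text: age">11y 7m</span></li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')

# Match on the attribute value instead of taking the first <span> in the document.
breed_span = soup.find('span', attrs={'data-bind': 'text: breed'})

# The same idea written as a CSS attribute selector.
age_span = soup.select_one('span[data-bind="text: age"]')

# strip=True removes the trailing space inside the breed span.
breed = breed_span.get_text(strip=True)
age = age_span.get_text(strip=True)

print(breed)
print(age)
```

Both find() and select_one() return None when nothing matches, so it is worth checking the result before calling get_text() on real pages.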

That works for a single span. I went a step further and scraped all of the details from the page. Here is the full code:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()

driver.get('https://www.petango.com/Adopt/Dog-Terrier-Rat-22192827')

html = driver.page_source

driver.close()

soup = BeautifulSoup(html,'html.parser')

ul = soup.find('div',class_ = 'group details-list').ul #Gets the ul tag

li_items = ul.find_all('li') #Finds all the li tags within the ul tag

headings = []
values = []

for li in li_items:
    heading = li.strong.text
    headings.append(heading)
    
    value = li.span.text
    
    if value:
        values.append(value)
    else:
        values.append(None)
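
Since the question asks for one variable per field, one possible way to use the two lists built by the loop above is to zip them into a dictionary. This is only a sketch: the sample lists below are copied from the values shown in this thread rather than scraped live, and the name details is made up for illustration.

```python
# Sample lists shaped like the ones built by the loop above (values copied from this thread).
headings = ['Breed:', 'Age:', 'Color:', 'Spayed/Neutered:', 'Size:', 'Declawed:', 'Adoption Date:']
values = ['Terrier, Rat / Mix', '11y 7m', 'White / Black', 'Yes', 'Small', 'No', None]

# Pair each heading with its value; drop surrounding whitespace and the trailing colon
# so the keys read as plain field names.
details = {h.strip().rstrip(':'): v for h, v in zip(headings, values)}

# Individual fields can now be looked up by name.
breed = details['Breed']
age = details['Age']

print(breed, '|', age)
```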

You can also build a pandas DataFrame from these lists (for readability) by adding the following lines to the code:

import pandas as pd

details_dict = {'Headings':headings,
                'Values':values}

df = pd.DataFrame(details_dict)

print(df)

Output:

           Headings              Values
0            Breed:  Terrier, Rat / Mix
1              Age:              11y 7m
2            Color:       White / Black
3  Spayed/Neutered:                 Yes
4             Size:               Small
5         Declawed:                  No
6    Adoption Date:                None  

Hope this helps!

Answer 1: (score: 0)

A solution without Selenium:

import re
import json
import requests 
from bs4 import BeautifulSoup


url = 'https://www.petango.com/Adopt/Dog-Terrier-Rat-22192827'
ajax_url = 'https://www.petango.com/DesktopModules/Pethealth.Petango/Pethealth.Petango.DnnModules.AnimalDetails/API/Main/GetAnimalDetails?moduleId={}&animalId={}&clientZip=null'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0',}

html_text = requests.get(url, headers=headers).text
soup = BeautifulSoup(html_text, 'html.parser')

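# The animal id is the last hyphen-separated part of the page URL.
animal_id = url.split('-')[-1]
# The module id comes from the element whose id starts with "Module-" and that contains .pet-id.
module_id = soup.select_one('[id^="Module-"]:has(.pet-id)')['id'].split('-')[-1]
# The tab id is embedded in the page source.
tab_id = re.search(r'`sf_tabId`:`(\d+)`', html_text).group(1)

# Send the tab id as a header with the request to the details endpoint.
headers['TabId'] = tab_id
data = json.loads( requests.get(ajax_url.format(module_id, animal_id), headers=headers).text ) 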

# print data to screen:
print(json.dumps(data, indent=4))

Prints:

{
    "id": 22192827,
    "speciesId": 1,
    "species": "Dog",
    "name": "Charlie 3",
    "photo": "https://g.petango.com/photos/1488/6955fc68-7a2e-4da1-af34-930c5fabd713.jpg",
    "photo2": "https://g.petango.com/photos/1488/450f6d7d-e501-475c-9b1f-9322dbc225eb.jpg",
    "photo3": "https://g.petango.com/photos/1488/4fbdbf1e-58b0-45c8-8c12-b5122ad1fff8.jpg",
    "youtubeVideoId": null,
    "age": "11y 7m",
    "distance": 0,
    "gender": "Male",
    "breed": "Terrier, Rat / Mix",
    "color": "White / Black",
    "spayedNeutered": "Yes",
    "size": "Small",
    "memo": null,
    "noDogs": false,
    "noCats": false,
    "noKids": false,
    "addToFavorites": false,
    "removeFromFavorites": false,
    "adoptionDate": null,
    "declawed": false,
    "adoptionApplicationUrl": "https://www.petango.com/Adoption-Application?shelterId=1897&animalId=22192827",
    "shelterName": "OH",
    "shelterCityState": "CINCINNATI, OH"
}
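
Once the response is parsed, each field is an ordinary dictionary entry, which covers the "one variable per item" goal from the question. The sketch below assumes the response has the shape printed above and uses a pasted subset of it as a string so that it runs without a network request.

```python
import json

# A subset of the JSON printed above, pasted as a string so the example runs offline.
raw = '''
{
    "id": 22192827,
    "name": "Charlie 3",
    "age": "11y 7m",
    "gender": "Male",
    "breed": "Terrier, Rat / Mix",
    "color": "White / Black",
    "spayedNeutered": "Yes",
    "size": "Small",
    "adoptionDate": null,
    "declawed": false
}
'''

data = json.loads(raw)

# Each field can be read by key.
breed = data['breed']
age = data['age']

# .get() returns None instead of raising KeyError if a key is absent.
adoption_date = data.get('adoptionDate')

print(f"{data['name']}: {breed}, {age}, adopted: {adoption_date}")
```

Note that JSON null and false become Python None and False, so "declawed" is a boolean here rather than the "No" text shown on the page.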