How do I parse data that shares the same tag?

Time: 2018-03-06 08:29:13

Tags: python python-3.x parsing web-scraping beautifulsoup

I am trying to parse a page and collect details that sit under the same tags, but I cannot get it to work. The script I tried:

import re
import pytz
import requests
import datetime
from flask import url_for
from bs4 import BeautifulSoup
from urllib.parse import urljoin    

bigbash_article_link = "http://www.espncricinfo.com/ci/content/squad/1134829.html"

r = requests.get(bigbash_article_link)
bigbash_article_html = r.text

soup = BeautifulSoup(bigbash_article_html, "html.parser")    

items = soup.find_all("div",{"class":"large-7 medium-7 small-7 columns"})
items1 = soup.find_all("h3")
items2 = soup.find_all("span")        

bigbash_article_dict = []

for div in items:    
     a =div.find('img')['src']   
     b = 'http://www.espncricinfo.com/'
     c = urljoin(b,a)
     print(c)
     #c[bigbash_article_dict]
     #print(bigbash_article_dict)
for div in items1:
     a =div.find('a').string         
     print(a)
for div in items2:
     a =(div.find('span')).text  
     print(a)

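Printing the image URLs and the names works, but I get an AttributeError when I try to parse the details inside the span tags. Is there a way to extract all the parsed details into a list of dictionaries? The output I get: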

http://www.espncricinfo.com/inline/content/image/1099912.html?alt=icon
http://www.espncricinfo.com/inline/content/image/751925.html?alt=icon
http://www.espncricinfo.com/inline/content/image/599004.html?alt=icon
http://www.espncricinfo.com/inline/content/image/549144.html?alt=icon
http://www.espncricinfo.com/inline/content/image/986769.html?alt=icon
http://www.espncricinfo.com/inline/content/image/1099468.html?alt=icon
http://www.espncricinfo.com/inline/content/image/1100136.html?alt=icon
http://www.espncricinfo.com/inline/content/image/1100133.html?alt=icon
http://www.espncricinfo.com/inline/content/image/721225.html?alt=icon
http://www.espncricinfo.com/inline/content/image/818215.html?alt=icon
http://www.espncricinfo.com/inline/content/image/443920.html?alt=icon
http://www.espncricinfo.com/inline/content/image/1080507.html?alt=icon
http://www.espncricinfo.com/inline/content/image/986785.html?alt=icon
http://www.espncricinfo.com/inline/content/image/517833.html?alt=icon
http://www.espncricinfo.com/inline/content/image/1099482.html?alt=icon
http://www.espncricinfo.com/inline/content/image/708777.html?alt=icon
http://www.espncricinfo.com/inline/content/image/1093893.html?alt=icon
http://www.espncricinfo.com/inline/content/image/818165.html?alt=icon
http://www.espncricinfo.com/inline/content/image/1099914.html?alt=icon

                        Virat Kohli


                        Moeen Ali


                        Murugan Ashwin


                        Yuzvendra Chahal


                        Aniket Choudhary


                        Nathan Coulter-Nile


                        Colin de Grandhomme


                        Quinton de Kock


                        Pavan Deshpande


                        AB de Villiers


                        Aniruddha Joshi


                        Sarfaraz Khan


                        Kulwant Khejroliya


                        Brendon McCullum


                        Mandeep Singh


                        Mohammed Siraj


                        Pawan Negi


                        Parthiv Patel


                        Navdeep Saini


                        Tim Southee


                        Manan Vohra


                        Washington Sundar


                        Chris Woakes


                        Umesh Yadav

Traceback (most recent call last):
  File "qwe.py", line 41, in <module>
    a =(div.find('span')).text   
AttributeError: 'NoneType' object has no attribute 'text'

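1 answer:

Answer 0 (score: 3)

Try the following approach. I iterate over the li tags, so the image, name and span details of each player stay together: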

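# Continues from the question's script: `soup` and `urljoin` are already defined.
details = soup.find("div", {"class": "large-20 medium-20 small-20 columns"})
players = details.find_all("li")  # renamed from `list` to avoid shadowing the built-in
bigbash_article_list = []  # one dictionary per player

domain = "http://www.espncricinfo.com/"

for li in players:
    # Start a fresh dictionary each time so keys from the previous player do not leak in.
    bigbash_article_dict = {}

    image_div = li.find("div", {"class": "large-7 medium-7 small-7 columns"})
    image_present = image_div is not None
    # Fall back to a placeholder path when the player has no image.
    image_sub_path = "http://www.espncricinfo.com/dummyImage"
    if image_present:
        image_sub_path = image_div.find("img")["src"]
    bigbash_article_dict["image"] = urljoin(domain, image_sub_path)

    # The details column uses a different class when there is no image column.
    if image_present:
        details_div = li.find("div", {"class": "large-13 medium-13 small-13 columns"})
    else:
        details_div = li.find("div", {"class": "large-13 medium-13 small-20 columns"})

    bigbash_article_dict["name"] = details_div.find("a").text.strip()

    for span in details_div.find_all("span"):
        info = span.text
        if ":" not in info:
            # The answer treats a span without a colon as the player's role.
            key, value = "Role", info
        else:
            # Split on the first colon only, in case the value contains one.
            key, value = info.split(":", 1)
        bigbash_article_dict[key.strip()] = value.strip()

    bigbash_article_list.append(bigbash_article_dict)
    print(bigbash_article_dict)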
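
As for the AttributeError in the question: items2 is already the list of span tags, so div.find('span') searches for a span nested inside each span, and find() returns None when there is none. The sketch below reproduces that on a small piece of markup and then builds a list of dictionaries while checking for None first. The markup, the class names, the player names and the parse_players helper are all invented for illustration; the real page may be structured differently.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup, invented for this sketch; the real page may differ.
SAMPLE_HTML = """
<ul>
  <li>
    <div class="photo"><img src="/inline/content/image/1.html?alt=icon"></div>
    <div class="info">
      <h3><a href="#"> Player One </a></h3>
      <span>Batsman</span>
      <span><b>Age:</b> 30 years</span>
    </div>
  </li>
  <li>
    <div class="info">
      <h3><a href="#"> Player Two </a></h3>
      <span>Allrounder</span>
    </div>
  </li>
</ul>
"""

BASE_URL = "http://www.espncricinfo.com/"


def parse_players(html):
    """Return one dictionary per <li>, tolerating missing tags."""
    soup = BeautifulSoup(html, "html.parser")
    players = []
    for li in soup.find_all("li"):
        player = {}  # fresh dictionary for every player

        # find() returns None when nothing matches, so test before indexing.
        img = li.find("img")
        player["image"] = urljoin(BASE_URL, img["src"]) if img is not None else None

        link = li.find("a")
        player["name"] = link.get_text(strip=True) if link is not None else None

        for span in li.find_all("span"):
            text = span.get_text(strip=True)
            if ":" in text:
                # Split on the first colon only.
                key, value = text.split(":", 1)
            else:
                key, value = "Role", text
            player[key.strip()] = value.strip()

        players.append(player)
    return players


if __name__ == "__main__":
    soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
    first_span = soup.find("span")
    # This span has no nested <span>, so find() gives None, as in the traceback.
    print(first_span.find("span"))
    for player in parse_players(SAMPLE_HTML):
        print(player)
```

Walking one li at a time keeps each player's fields in the same dictionary, which three separate find_all calls over the whole page cannot guarantee when some players have no image.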