How do I parse data from different tags and store it separately?

Time: 2018-03-02 05:20:13

Tags: python python-3.x parsing web-scraping beautifulsoup

I am trying to parse class: fixture_date and class: play_team separately from the following website:

http://www.espncricinfo.com/ci/content/series/1128817.html?template=fixtures

Code:

import re
import pytz
import requests
import datetime
from bs4 import BeautifulSoup
from espncricinfo.exceptions import MatchNotFoundError, NoScorecardError
from espncricinfo.match import Match

bigbash_article_link = "http://www.espncricinfo.com/ci/content/series/1128817.html?template=fixtures"

# Download the fixtures page and parse it.
r = requests.get(bigbash_article_link)
bigbash_article_html = r.text

soup = BeautifulSoup(bigbash_article_html, "html.parser")

# Dates and times both use the "fixture_date" class; match titles use "play_team".
bigbash1_items = soup.find_all("span", {"class": "fixture_date"})
#print(bigbash1_items)
bigbash_items = soup.find_all("span", {"class": "play_team"})
date = {}
team = {}

# Print each match title, then each date/time entry, as one-element lists.
for div in bigbash_items:
    team = [div.find('a').string.strip("\n\r")]
    print(team)
for div in bigbash1_items:
    date = [div.string.strip("\xa0local\n\r\t")]
    print(date)

Output:

['1st Match - Peshawar Zalmi v Multan Sultans']
['2nd Match - Karachi Kings v Quetta Gladiators']
['3rd Match - Multan Sultans v Lahore Qalandars']
['4th Match - Islamabad United v Peshawar Zalmi']
['5th Match - Quetta Gladiators v Lahore Qalandars']
['6th Match - Multan Sultans v Islamabad United']
['7th Match - Karachi Kings v Peshawar Zalmi']
['8th Match - Karachi Kings v Lahore Qalandars']
['9th Match - Islamabad United v Quetta Gladiators']
['10th Match - Quetta Gladiators v Peshawar Zalmi']
['11th Match - Multan Sultans v Karachi Kings']
['12th Match - Lahore Qalandars v Islamabad United']
['13th Match - Multan Sultans v Quetta Gladiators']
['14th Match - Peshawar Zalmi v Lahore Qalandars']
['15th Match - Islamabad United v Karachi Kings']
['16th Match - Peshawar Zalmi v Multan Sultans']
['17th Match - Multan Sultans v Quetta Gladiators']
['18th Match - Islamabad United v Lahore Qalandars']
['19th Match - Karachi Kings v Quetta Gladiators']
['20th Match - Multan Sultans v Lahore Qalandars']
['21st Match - Peshawar Zalmi v Islamabad United']
['22nd Match - Multan Sultans v Karachi Kings']
['23rd Match - Peshawar Zalmi v Quetta Gladiators']
['24th Match - Karachi Kings v Lahore Qalandars']
['25th Match - Multan Sultans v Islamabad United']
['26th Match - Quetta Gladiators v Lahore Qalandars']
['27th Match - Peshawar Zalmi v Karachi Kings']
['28th Match - Quetta Gladiators v Islamabad United']
['29th Match - Peshawar Zalmi v Lahore Qalandars']
['30th Match - Islamabad United v Karachi Kings']
['Qualifier - TBC v TBC']
['Eliminator 1 - TBC v TBC']
['Eliminator 2 - TBC v TBC']
['Final - TBC v TBC']
['Thu Feb 22']
['21:00']
['Fri Feb 23']
['15:30']
['Fri Feb 23']
['20:00']
['Sat Feb 24']
['15:30']
['Sat Feb 24']
['20:00']
['Sun Feb 25']
['15:30']
['Sun Feb 25']
['20:00']
['Mon Feb 26']
['20:00']
['Wed Feb 28']
['20:00']
['Thu Mar 1']
['20:00']
['Fri Mar 2']
['15:30']
['Fri Mar 2']
['20:00']
['Sat Mar 3']
['15:30']
['Sat Mar 3']
['20:00']
['Sun Mar 4']
['20:00']
['Tue Mar 6']
['20:00']
['Wed Mar 7']
['20:00']
['Thu Mar 8']
['15:30']
['Thu Mar 8']
['20:00']
['Fri Mar 9']
['15:30']
['Fri Mar 9']
['20:00']
['Sat Mar 10']
['15:30']
['Sat Mar 10']
['20:00']
['Sun Mar 11']
['20:00']
['Tue Mar 13']
['20:00']
['Wed Mar 14']
['20:00']
['Thu Mar 15']
['15:30']
['Thu Mar 15']
['20:00']
['Fri Mar 16']
['15:30']
['Fri Mar 16']
['20:00']
['Sun Mar 18']
['20:00']
['Tue Mar 20']
['Wed Mar 21']
['Sun Mar 25']

I want to store these values in a list of dictionaries, like this:

Expected output:

[{'team':'1st Match - Peshawar Zalmi v Multan Sultans','date':'Thu Feb 22', 'time':'21:00'}
{'team':'2nd Match - Karachi Kings v Quetta Gladiators','date':'Thu Feb 23', 'time':'15:30'}
{'team':'3rd Match - Multan Sultans v Lahore Qalandars','date':'Thu Feb 24', 'time':'20:00'}
.....{'team':'Eliminator 1 - TBC v TBC','date':'Wed Mar 21', 'time':''}{'team':'Final - TBC v TBC','date':'Sun Mar 25', 'time':''}]

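The problem is that date = {} ends up holding the date and time values as separate lists. How can I combine them into one record per match?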

2 Answers:

Answer 0: (score: 0)

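This code parses the "fixtures" file that you can download at the top of the URL you provided. I know this may not look like the preferred approach, but the information appears to be up to date. For example, the website shows matches that seem to have already been played (starting in February), whereas the .ics file starts with the matches to be played tomorrow (March 2).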

>>> import re
... from datetime import datetime
... 
... REGEX = re.compile(r'''\
... SUMMARY:(?P<team>.+)\n
... DTSTART:(?P<start>.+)\n
... DTEND:(?P<end>.+)\n
... LOCATION:(?P<location>.+)\n''', re.VERBOSE)
... 
... 
... def to_datetime(s):
...     return datetime.strptime(s, '%Y%m%dT%H%M00Z')
... 
... 
... result = []
... with open('Pakistan_Super_League.ics', 'r') as f:
...     for m in REGEX.finditer(f.read()):
...         current = m.groupdict()
...         start = to_datetime(current['start'])
...         result.append({
...             'team': current['team'],
...             'date': start.strftime('%a %b %d'),
...             'time': start.strftime('%H:%M')
...         })
... 
>>> for event in result:
...     print(event)
... 
{'team': '11th Match Multan Sultans v Karachi Kings', 'date': 'Fri Mar 02', 'time': '11:30'}
{'team': '12th Match Lahore Qalandars v Islamabad United', 'date': 'Fri Mar 02', 'time': '16:00'}
{'team': '13th Match Multan Sultans v Quetta Gladiators', 'date': 'Sat Mar 03', 'time': '11:30'}
{'team': '14th Match Peshawar Zalmi v Lahore Qalandars', 'date': 'Sat Mar 03', 'time': '16:00'}
{'team': '15th Match Islamabad United v Karachi Kings', 'date': 'Sun Mar 04', 'time': '16:00'}
{'team': '16th Match Peshawar Zalmi v Multan Sultans', 'date': 'Tue Mar 06', 'time': '16:00'}
{'team': '17th Match Multan Sultans v Quetta Gladiators', 'date': 'Wed Mar 07', 'time': '16:00'}
{'team': '18th Match Islamabad United v Lahore Qalandars', 'date': 'Thu Mar 08', 'time': '11:30'}
{'team': '19th Match Karachi Kings v Quetta Gladiators', 'date': 'Thu Mar 08', 'time': '16:00'}
{'team': '20th Match Multan Sultans v Lahore Qalandars', 'date': 'Fri Mar 09', 'time': '11:30'}
{'team': '21st Match Peshawar Zalmi v Islamabad United', 'date': 'Fri Mar 09', 'time': '16:00'}
{'team': '22nd Match Multan Sultans v Karachi Kings', 'date': 'Sat Mar 10', 'time': '11:30'}
{'team': '23rd Match Peshawar Zalmi v Quetta Gladiators', 'date': 'Sat Mar 10', 'time': '16:00'}
{'team': '24th Match Karachi Kings v Lahore Qalandars', 'date': 'Sun Mar 11', 'time': '16:00'}
{'team': '25th Match Multan Sultans v Islamabad United', 'date': 'Tue Mar 13', 'time': '16:00'}
{'team': '26th Match Quetta Gladiators v Lahore Qalandars', 'date': 'Wed Mar 14', 'time': '16:00'}
{'team': '27th Match Peshawar Zalmi v Karachi Kings', 'date': 'Thu Mar 15', 'time': '11:30'}
{'team': '28th Match Quetta Gladiators v Islamabad United', 'date': 'Thu Mar 15', 'time': '16:00'}
{'team': '29th Match Peshawar Zalmi v Lahore Qalandars', 'date': 'Fri Mar 16', 'time': '11:30'}
{'team': '30th Match Islamabad United v Karachi Kings', 'date': 'Fri Mar 16', 'time': '16:00'}
{'team': 'Qualifier TBD v TBD', 'date': 'Sun Mar 18', 'time': '16:00'}
{'team': 'Eliminator 1 TBD v TBD', 'date': 'Tue Mar 20', 'time': '00:00'}
{'team': 'Eliminator 2 TBD v TBD', 'date': 'Wed Mar 21', 'time': '00:00'}
{'team': 'Final TBD v TBD', 'date': 'Sun Mar 25', 'time': '00:00'}
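
Note that the times printed above (for example 11:30) are four hours behind the local times shown on the website (15:30). The trailing Z in an iCalendar stamp marks the value as UTC. Below is a minimal sketch of converting such a stamp to local time. The fixed UTC+4 offset is an assumption inferred from comparing the two outputs, and the helper name utc_stamp_to_local and the sample stamp are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC+4 offset, inferred from the outputs, not confirmed.
LOCAL_TZ = timezone(timedelta(hours=4))


def utc_stamp_to_local(stamp, tz=LOCAL_TZ):
    """Convert an iCalendar UTC stamp such as '20180302T113000Z' to tz."""
    utc_dt = datetime.strptime(stamp, '%Y%m%dT%H%M%SZ')
    # strptime returns a naive datetime; mark it as UTC before converting.
    utc_dt = utc_dt.replace(tzinfo=timezone.utc)
    return utc_dt.astimezone(tz)


# Hypothetical sample stamp matching the 11th match in the output above.
local = utc_stamp_to_local('20180302T113000Z')
print({'date': local.strftime('%a %b %d'), 'time': local.strftime('%H:%M')})
```

Entries without a real start time (printed as 00:00 above) would also be shifted by this conversion, so they may need to be handled separately.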

Answer 1: (score: 0)

If you take a quick look at the page with the element inspector, each row (each fixture) sits inside the following tag:

<li class="large-20 medium-20 columns" team1="xxxx" team2="xxxx" venue="xxxx">

So you can iterate over these rows and get the team, date, and time in each loop.

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.espncricinfo.com/ci/content/series/1128817.html?template=fixtures')
soup = BeautifulSoup(r.text, 'lxml')

fixtures = []

for row in soup.find_all('li', class_='large-20 medium-20 columns'):
    team = row.find('span', class_='play_team').a.text.strip('\n\r')
    date_and_time = row.find_all('span', class_='fixture_date')
    date = date_and_time[0].text.strip()
    try:
        time = date_and_time[1].text.strip('\xa0local\n\r\t')
    except IndexError:
        time = ''

    fixtures.append({'team': team, 'date': date, 'time': time})

for f in fixtures:
    print(f)

Output:

{'team': '1st Match - Peshawar Zalmi v Multan Sultans', 'date': 'Thu Feb 22', 'time': '21:00'}
{'team': '2nd Match - Karachi Kings v Quetta Gladiators', 'date': 'Fri Feb 23', 'time': '15:30'}
{'team': '3rd Match - Multan Sultans v Lahore Qalandars', 'date': 'Fri Feb 23', 'time': '20:00'}
{'team': '4th Match - Islamabad United v Peshawar Zalmi', 'date': 'Sat Feb 24', 'time': '15:30'}
{'team': '5th Match - Quetta Gladiators v Lahore Qalandars', 'date': 'Sat Feb 24', 'time': '20:00'}
{'team': '6th Match - Multan Sultans v Islamabad United', 'date': 'Sun Feb 25', 'time': '15:30'}
{'team': '7th Match - Karachi Kings v Peshawar Zalmi', 'date': 'Sun Feb 25', 'time': '20:00'}
{'team': '8th Match - Karachi Kings v Lahore Qalandars', 'date': 'Mon Feb 26', 'time': '20:00'}
{'team': '9th Match - Islamabad United v Quetta Gladiators', 'date': 'Wed Feb 28', 'time': '20:00'}
{'team': '10th Match - Quetta Gladiators v Peshawar Zalmi', 'date': 'Thu Mar 1', 'time': '20:00'}
{'team': '11th Match - Multan Sultans v Karachi Kings', 'date': 'Fri Mar 2', 'time': '15:30'}
{'team': '12th Match - Lahore Qalandars v Islamabad United', 'date': 'Fri Mar 2', 'time': '20:00'}
{'team': '13th Match - Multan Sultans v Quetta Gladiators', 'date': 'Sat Mar 3', 'time': '15:30'}
{'team': '14th Match - Peshawar Zalmi v Lahore Qalandars', 'date': 'Sat Mar 3', 'time': '20:00'}
{'team': '15th Match - Islamabad United v Karachi Kings', 'date': 'Sun Mar 4', 'time': '20:00'}
{'team': '16th Match - Peshawar Zalmi v Multan Sultans', 'date': 'Tue Mar 6', 'time': '20:00'}
{'team': '17th Match - Multan Sultans v Quetta Gladiators', 'date': 'Wed Mar 7', 'time': '20:00'}
{'team': '18th Match - Islamabad United v Lahore Qalandars', 'date': 'Thu Mar 8', 'time': '15:30'}
{'team': '19th Match - Karachi Kings v Quetta Gladiators', 'date': 'Thu Mar 8', 'time': '20:00'}
{'team': '20th Match - Multan Sultans v Lahore Qalandars', 'date': 'Fri Mar 9', 'time': '15:30'}
{'team': '21st Match - Peshawar Zalmi v Islamabad United', 'date': 'Fri Mar 9', 'time': '20:00'}
{'team': '22nd Match - Multan Sultans v Karachi Kings', 'date': 'Sat Mar 10', 'time': '15:30'}
{'team': '23rd Match - Peshawar Zalmi v Quetta Gladiators', 'date': 'Sat Mar 10', 'time': '20:00'}
{'team': '24th Match - Karachi Kings v Lahore Qalandars', 'date': 'Sun Mar 11', 'time': '20:00'}
{'team': '25th Match - Multan Sultans v Islamabad United', 'date': 'Tue Mar 13', 'time': '20:00'}
{'team': '26th Match - Quetta Gladiators v Lahore Qalandars', 'date': 'Wed Mar 14', 'time': '20:00'}
{'team': '27th Match - Peshawar Zalmi v Karachi Kings', 'date': 'Thu Mar 15', 'time': '15:30'}
{'team': '28th Match - Quetta Gladiators v Islamabad United', 'date': 'Thu Mar 15', 'time': '20:00'}
{'team': '29th Match - Peshawar Zalmi v Lahore Qalandars', 'date': 'Fri Mar 16', 'time': '15:30'}
{'team': '30th Match - Islamabad United v Karachi Kings', 'date': 'Fri Mar 16', 'time': '20:00'}
{'team': 'Qualifier - TBC v TBC', 'date': 'Sun Mar 18', 'time': '20:00'}
{'team': 'Eliminator 1 - TBC v TBC', 'date': 'Tue Mar 20', 'time': ''}
{'team': 'Eliminator 2 - TBC v TBC', 'date': 'Wed Mar 21', 'time': ''}
{'team': 'Final - TBC v TBC', 'date': 'Sun Mar 25', 'time': ''}
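
If you only have the two flat lists from the question rather than per-row access, one possible approach is to walk the date list and treat any HH:MM entry as the time of the date just before it. This is a sketch under that assumption. The helper name pair_fixtures and the shortened sample lists are illustrative, not part of the original code.

```python
import re

# An entry such as '21:00' or '9:30' is treated as a time; anything else is a date.
TIME_RE = re.compile(r'^\d{1,2}:\d{2}$')


def pair_fixtures(teams, stamps):
    """Combine a flat team list with an interleaved date/time list."""
    slots = []
    for value in stamps:
        if TIME_RE.match(value) and slots:
            # A time belongs to the most recently seen date.
            slots[-1]['time'] = value
        else:
            # A new date starts a new slot; time stays '' if none follows.
            slots.append({'date': value, 'time': ''})
    # zip stops at the shorter list if the counts differ.
    return [{'team': team, **slot} for team, slot in zip(teams, slots)]


# Shortened sample data taken from the question's output.
teams = [
    '1st Match - Peshawar Zalmi v Multan Sultans',
    'Eliminator 1 - TBC v TBC',
    'Final - TBC v TBC',
]
stamps = ['Thu Feb 22', '21:00', 'Tue Mar 20', 'Sun Mar 25']

fixtures = pair_fixtures(teams, stamps)
for f in fixtures:
    print(f)
```

This relies on the teams and dates appearing in the same order on the page, so iterating per row as shown above is the more robust choice when the HTML is available.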