How do I parse the query strings on a web page?

Asked: 2018-02-11 20:50:16

Tags: python python-3.x parsing web-scraping beautifulsoup

I am trying to parse all the query strings present on a page, so that with those query strings I can navigate to specific pages. The code I tried is shown below:

    import datetime
    import time

    import dateutil.parser
    import pytz
    import requests
    from bs4 import BeautifulSoup

    """python espncricinfo library module https://github.com/dwillis/python-espncricinfo"""
    from espncricinfo.match import Match
    from espncricinfo.exceptions import MatchNotFoundError, NoScorecardError

    """----time-zone calculation----"""
    time_zone = pytz.timezone("Asia/Kolkata")
    datetime_today = datetime.datetime.now(time_zone)
    datestring_today = datetime_today.strftime("%Y-%m-%d")

    """------URL of the page to parse, with today's date------"""
    """e.g. url = http://www.espncricinfo.com/ci/engine/match/index.html?date=2018-02-12"""
    url = "http://www.espncricinfo.com/ci/engine/match/index.html?date=" + datestring_today

    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')

    """------parsing for match_no------"""
    match_no = [x['href'].split('/',4)[4].split('.')[0] for x in soup.findAll('a', href=True, text='Scorecard')]

    for p in match_no:
        """where p is a match no, e.g. p = '1122282'"""
        m = Match(p)
        m.latest_batting
        print(m.latest_batting)

When I print match_no, I get this output:

    ['8890/scorecard/1118760/andhra-vs-tamil-nadu-group-c-vijay-hazare-trophy-2017-18/', '8890/scorecard/1118743/assam-vs-odisha-group-a-vijay-hazare-trophy-2017-18/', '8890/scorecard/1118745/bengal-vs-delhi-group-b-vijay-hazare-trophy-2017-18/', '8890/scorecard/1118763/chhattisgarh-vs-vidarbha-group-d-vijay-hazare-trophy-2017-18/']

This page (http://www.espncricinfo.com/ci/engine/match/index.html?date=datestring_today) contains the match_no of every match taking place on that day. I want to trim these strings so that I get just the 7-digit match_no, e.g. [1118743, 1118745, ...] (see the sketch just below). How can I do that? Then I can pass each match_no to Match() and get the details of a particular match happening that day. PS: if there are no matches on a given day, match_no returns None.
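
To make the goal concrete, here is a rough sketch of the kind of trimming I mean (only an illustration, assuming every entry keeps the .../scorecard/<match_no>/<slug>/ shape shown above):

    trimmed = [s.split('/')[2] for s in match_no]
    print(trimmed)   # hoped-for result: ['1118760', '1118743', '1118745', '1118763']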

1 Answer:

Answer 0 (score: 0):

First of all, your code is hard to read. You need to let your code breathe and make it appealing for others to read.

Second, what is probably causing your problem is this line:

    match_no = [x['href'].split('/',4)[4].split('.')[0] for x in soup.findAll('a', href=True, text='Scorecard')]

which is also hard to read. There are better, more readable ways to parse the match id out of the URL.

Here is an example that should work. I kept the match date set to today's date, as in your code:

    import re
    import datetime

    import pytz
    import requests
    from bs4 import BeautifulSoup

    # python espncricinfo library module: https://github.com/dwillis/python-espncricinfo
    from espncricinfo.match import Match
    from espncricinfo.exceptions import MatchNotFoundError, NoScorecardError


    def get_match_id(link):
        """Return the first 7-digit match id found in a link, or None."""
        match_id = re.search(r'([0-9]{7})', link)
        if match_id is None:
            return None
        return match_id.group()


    # ----time-zone calculation----
    time_zone = pytz.timezone("Asia/Kolkata")
    datetime_today = datetime.datetime.now(time_zone)
    datestring_today = datetime_today.strftime("%Y-%m-%d")

    # ------URL of the page to parse, with today's date------
    url = "http://www.espncricinfo.com/ci/engine/match/index.html?date=" + datestring_today

    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')

    # look inside the <span class="match-no"> elements for scorecard links
    spans = soup.findAll('span', {"class": "match-no"})

    matches_ids = []

    for s in spans:
        # only follow the 'Scorecard' links; guard against <a> tags without an href
        for a in s.findAll('a', href=lambda href: href and 'scorecard' in href):
            match_id = get_match_id(a['href'])
            if match_id is None:
                continue
            matches_ids.append(match_id)


    # ------use each match id to fetch the match details------
    for p in matches_ids:
        # where p is a match no, e.g. p = '1122282'
        m = Match(p)
        print(m.latest_batting)

Now, I don't have every lib you are using here, but this should give you an idea of how to do it.
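
For instance, taking one of the hrefs from your own output as a quick sanity check (just an illustration, not part of the script above), get_match_id pulls out exactly the 7-digit id:

    sample = '8890/scorecard/1118760/andhra-vs-tamil-nadu-group-c-vijay-hazare-trophy-2017-18/'
    print(get_match_id(sample))  # -> '1118760', the first run of 7 digits in the link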

And once again, my advice: blank lines are your friend. They are definitely your reader's friend. Let your code 'breathe'.