How to get the href and the adjacent text from <a> tags

Date: 2018-08-22 20:18:38

Tags: python html regex beautifulsoup href

I'm trying to find certain hrefs within HTML. I had been using the following, which worked:

for a in soup.find_all('a', href=True):
    if a['href'].startswith('/game/'):
        chunk = str(a).split('''"''')
        game = chunk[3]

for the following HTML:

<td colspan="4">
    <a href="/game/index/4599712?org id=418" class="skipMask" target="TEAM_WIN">35-28 </a>
</td>

my code successfully gave me /game/index/4599712?org id=418

However, there are other tags that have separate hrefs for the teams, and the record of the teams. Example:

<td nowrap bgcolor="#FFFFFF">
    <a href="/team/145/18741">Philadelphia</a> == $0
    " (3-1)                                     "
</td>

I would like some advice on this. I think I want two things: 1) if the href starts with "/game/", I'd like a better way of getting that href than splitting on quotation marks (probably regular expressions?); 2) if the href starts with "/team/", I'd like to build a dictionary that pairs Philadelphia with (3-1). Any suggestions or ideas would be appreciated.

2 answers:

Answer 0 (score: 0)

To get all hrefs that start with /game/, just append the href value of each matching node to a list:

>>> result1 = []
>>> for a in soup.find_all('a', href=True):
    if a['href'].startswith('/game/'):
        result1.append(a['href'])

>>> print(result1)
['/game/index/4599712?org id=418']

For the second part, you can use a regular expression, but apply it to the plain-text sibling that follows the a tag:

>>> import re
>>> result2 = {}
>>> for a in soup.find_all('a', href=True):
    if a['href'].startswith('/team/'):
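        # next_sibling can be None or a tag; only search real text nodes
        sibling = a.next_sibling
        m = re.search(r"\((\d+-\d+)\)", sibling if isinstance(sibling, str) else "")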
        if m:
            result2[a.string] = m.group(1)
        else:
            result2[a.string] = ""

>>> print(result2)
{'Philadelphia': '3-1'}

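\((\d+-\d+)\) captures a number, a hyphen, and another number enclosed in parentheses. If no such value is present, the key is still added to the dictionary, with an empty string as its value.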
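Both loops can also be folded into one pass over the links. The following is only a minimal sketch using the markup from the question; the names extract_links and RECORD_RE are made up for illustration, and the built-in html.parser is an assumption (any parser Beautiful Soup supports should behave the same way here).

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = '''
<td colspan="4">
    <a href="/game/index/4599712?org id=418" class="skipMask" target="TEAM_WIN">35-28 </a>
</td>
<td nowrap bgcolor="#FFFFFF">
    <a href="/team/145/18741">Philadelphia</a> == $0
    " (3-1)                                     "
</td>
'''

# Hypothetical helper name: a win-loss record such as "(3-1)".
RECORD_RE = re.compile(r"\((\d+-\d+)\)")


def extract_links(markup):
    """Return (game_hrefs, team_records) for the given HTML string."""
    soup = BeautifulSoup(markup, "html.parser")
    game_hrefs = []
    team_records = {}
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if href.startswith("/game/"):
            game_hrefs.append(href)
        elif href.startswith("/team/"):
            # The record lives in the text node right after the link;
            # next_sibling may be None or a tag, so guard before searching.
            sibling = a.next_sibling
            text = sibling if isinstance(sibling, str) else ""
            m = RECORD_RE.search(text)
            team_records[a.get_text(strip=True)] = m.group(1) if m else ""
    return game_hrefs, team_records


games, records = extract_links(html)
print(games)
print(records)
```

Reading a['href'] directly avoids splitting str(a) on quotation marks, so the result does not depend on the order of the tag's attributes.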

Answer 1 (score: 0)

You can use CSS selectors to match tags whose attribute starts with a given string: for example, soup.select('a[href^="/game/"]') matches all <a> tags whose href starts with /game/.

For the second part, you can use the re module:

from bs4 import BeautifulSoup
import re

data = '''
<td colspan="4">
    <a href="/game/index/4599712?org id=418" class="skipMask" target="TEAM_WIN">35-28 </a>
</td>
<td nowrap bgcolor="#FFFFFF">
    <a href="/team/145/18741">Philadelphia</a> == $0
    " (3-1)                                     "
</td>
'''

soup = BeautifulSoup(data, 'lxml')

for a in soup.select('a[href^="/game/"]'):
    print(a['href'])

for a in soup.select('a[href^="/team/"]'):
    m = re.findall(r'\s*(.*?)(?=\s*==).*?(\(.*?\))', a.parent.text, flags=re.DOTALL)
    if m:
        print(dict(m))

Prints:

/game/index/4599712?org id=418
{'Philadelphia': '(3-1)'}
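If the page has several team cells, the per-link dictionaries above can be merged into one. This is a rough sketch rather than a drop-in solution: the second cell ("Dallas", with a placeholder href) is a hypothetical row added purely for illustration, the html.parser choice is an assumption, and the stricter \d+-\d+ pattern assumes records always look like (3-1).

```python
import re
from bs4 import BeautifulSoup

# The first cell is from the question; the second ("Dallas") is a
# hypothetical extra row added only to show several teams being merged.
data = '''
<td nowrap bgcolor="#FFFFFF">
    <a href="/team/145/18741">Philadelphia</a> == $0
    " (3-1)                                     "
</td>
<td nowrap bgcolor="#FFFFFF">
    <a href="/team/000/00000">Dallas</a>
    (2-2)
</td>
'''

soup = BeautifulSoup(data, "html.parser")

records = {}
for a in soup.select('a[href^="/team/"]'):
    # Search the whole parent cell, so the record is found whether or not
    # it sits in the text node immediately after the link.
    m = re.search(r"\((\d+-\d+)\)", a.parent.get_text())
    records[a.get_text(strip=True)] = m.group(1) if m else ""

print(records)
```

Searching the parent cell assumes one team per cell; if a cell held two team links, both would pick up the first record in that cell.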