I'm trying to find certain hrefs within HTML. I had been using the following, which was working:
for a in soup.find_all('a', href=True):
    if a['href'].startswith('/game/'):
        chunk = str(a).split('''"''')
        game = chunk[3]
for the following HTML:
<td colspan="4">
<a href="/game/index/4599712?org id=418" class="skipMask" target="TEAM_WIN">35-28 </a>
</td>
my code successfully gave me /game/index/4599712?org id=418
However, there are other tags that have separate hrefs for the teams, and the record of the teams. Example:
<td nowrap bgcolor="#FFFFFF">
<a href="/team/145/18741">Philadelphia</a> == $0
" (3-1) "
</td>
I would like some advice on this. I think I want to: 1) if the href starts with "/game/", have a better way of getting that href than splitting on quotation marks (probably regular expressions?); 2) if the href starts with "/team/", create a dictionary that pairs Philadelphia with (3-1). Any suggestions or ideas would be appreciated.
Answer 0 (score: 0)
To get all hrefs that start with /game/, just append the href value of each matching node to a list:
>>> result1 = []
>>> for a in soup.find_all('a', href=True):
...     if a['href'].startswith('/game/'):
...         result1.append(a['href'])
...
>>> print(result1)
['/game/index/4599712?org id=418']
For the second one, you can use a regular expression, but apply it to the plain-text sibling that follows the a tag:
>>> import re
>>> result2 = {}
>>> for a in soup.find_all('a', href=True):
...     if a['href'].startswith('/team/'):
...         m = re.search(r"\((\d+-\d+)\)", a.next_sibling.string)
...         if m:
...             result2[a.string] = m.group(1)
...         else:
...             result2[a.string] = ""
...
>>> print(result2)
{'Philadelphia': '3-1'}
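The pattern \((\d+-\d+)\) extracts the digits, hyphen, digits sequence inside the parentheses. If no such value is present, the key is still added, but with an empty value.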
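To see what the pattern does in isolation, here is a minimal sketch that runs it against a few sample strings. The helper name extract_record and the sample inputs other than " (3-1) " are made up for illustration; they are not from the page being scraped.

```python
import re

# Same pattern as above: a literal "(", digits, "-", digits, a literal ")".
# The inner parentheses capture only the "3-1" part.
RECORD_RE = re.compile(r"\((\d+-\d+)\)")

def extract_record(text):
    """Return the win-loss record found in text, or "" if there is none.

    extract_record is a hypothetical helper name used only for this sketch.
    """
    m = RECORD_RE.search(text)
    return m.group(1) if m else ""

print(extract_record('" (3-1) "'))     # the sibling text from the question
print(extract_record("(10-12)"))       # multi-digit records also match
print(repr(extract_record("no record here")))  # no match gives ''
```

Because the pattern requires digits on both sides of the hyphen, text such as "(home)" is ignored rather than stored as a record.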
Answer 1 (score: 0)
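You can use a CSS selector to match tag attributes that start with a given string. For example, soup.select('a[href^="/game/"]') matches all <a> tags whose href attribute starts with /game/.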
For the second part, you can use the re module:
from bs4 import BeautifulSoup
import re
data = '''
<td colspan="4">
<a href="/game/index/4599712?org id=418" class="skipMask" target="TEAM_WIN">35-28 </a>
</td>
<td nowrap bgcolor="#FFFFFF">
<a href="/team/145/18741">Philadelphia</a> == $0
" (3-1) "
</td>
'''
soup = BeautifulSoup(data, 'lxml')
for a in soup.select('a[href^="/game/"]'):
    print(a['href'])
for a in soup.select('a[href^="/team/"]'):
    m = re.findall(r'\s*(.*?)(?=\s*==).*?(\(.*?\))', a.parent.text, flags=re.DOTALL)
    if m:
        print(dict(m))
Prints:
/game/index/4599712?org id=418
{'Philadelphia': '(3-1)'}
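As a variation, the CSS selector can be combined with a check on the text node that follows each link, so the result does not depend on the "==" marker appearing in the parent text and a link without a record does not raise an error. This is only a sketch under a few assumptions: it uses the built-in 'html.parser' instead of 'lxml', and the second table cell (the "Dallas" link and its href) is invented here to show the missing-record case.

```python
import re
from bs4 import BeautifulSoup

# Sample markup: the first cell follows the question; the second cell
# ("Dallas", "/team/200/18741") is hypothetical and has no record text.
data = '''
<td nowrap bgcolor="#FFFFFF">
<a href="/team/145/18741">Philadelphia</a>
" (3-1) "
</td>
<td nowrap bgcolor="#FFFFFF"><a href="/team/200/18741">Dallas</a></td>
'''

soup = BeautifulSoup(data, 'html.parser')

records = {}
for a in soup.select('a[href^="/team/"]'):
    sibling = a.next_sibling
    # next_sibling may be None or another tag; only search plain text nodes
    # (NavigableString is a subclass of str).
    text = sibling if isinstance(sibling, str) else ""
    m = re.search(r"\((\d+-\d+)\)", text)
    records[a.get_text(strip=True)] = m.group(1) if m else ""

print(records)
```

Teams without a record end up with an empty string, matching the behaviour of the first answer.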