Extract all href and title values from this file:
<a data-parent="#accordion1" data-toggle="collapse" href="# fruitName1" title="Click to expand drug name">
<span class="list-unstyled" style="text-decoration: none;"></span> GLIPIZIDE
</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&ApplNo=114223" title="Click to view LEMONS (LEMONS) | POQ #114223 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 1 ">
LEMONS (LEMONS)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&ApplNo=114226" title="Click to view LEMONS (LEMONS) | POQ #114226 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 2 ">
LEMONS (LEMONS)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&ApplNo=114305" title="Click to view LEMONS (LEMONS) | POQ #114305 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 3 ">
LEMONS (LEMONS)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&ApplNo=114370" title="Click to view LEMONS (LEMONS) | POQ #114370 | BOX;67 PZ | Discontinued | FRUIT COMPANY 1 ">
LEMONS (LEMONS)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&ApplNo=114378" title="Click to view LEMONS (LEMONS) | POQ #114378 | BOX;67 PZ | Discontinued | FRUIT COMPANY 4 ">
LEMONS (LEMONS)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&ApplNo=114387" title="Click to view LEMONS (LEMONS) | POQ #114387 | BOX;67 PZ | Discontinued | FRUIT COMPANY 5 ">
LEMONS (LEMONS)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&ApplNo=114438" title="Click to view LEMONS (LEMONS) | POQ #114438 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 2 ">
LEMONS (LEMONS)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&ApplNo=114497" title="Click to view LEMONS (LEMONS) | POQ #114497 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 5 ">
LEMONS (LEMONS)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&ApplNo=114542" title="Click to view LEMONS (LEMONS) | POQ #114542 | BOX;67 PZ | Discontinued | FRUIT COMPANY 3 ">
LEMONS (LEMONS)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&ApplNo=114550" title="Click to view LEMONS (LEMONS) | POQ #114550 |
</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&ApplNo=117270" title="Click to view GRAPES (GREEN GRAPES ; AUS) | POQ #117270 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 10 ">
GRAPES (GREEN GRAPES ; AUS)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&ApplNo=117511" title="Click to view GRAPES (GREEN GRAPES ; AUS) | POQ #117511 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 11 ">
GRAPES (GREEN GRAPES ; AUS)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&ApplNo=117620" title="Click to view GRAPES (GREEN GRAPES ; AUS) | POQ #117620 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 12 ">
Using regular expressions or Beautiful Soup, how can I extract every <a href="" title=""> tag and prepend www.example.com to the href, producing output like this:
www.example.com/loads/data/usersindex.cfm?event=overview.subprocess&ApplNo=114223 | title= | Click to view LEMONS (LEMONS) | POQ #114223 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 1 | LEMONS (LEMONS)
www.example.com/loads/data/usersindex.cfm?event=overview.subprocess&ApplNo=114226 | title= | Click to view LEMONS (LEMONS) | POQ #114226 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 2 | LEMONS (LEMONS)
www.example.com/loads/data/usersindex.cfm?event=overview.subprocess&ApplNo=114305 | title= | Click to view LEMONS (LEMONS) | POQ #114305 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 3 | LEMONS (LEMONS)
www.example.com/loads/data/usersindex.cfm?event=overview.subprocess&ApplNo=114370 | title= | Click to view LEMONS (LEMONS) | POQ #114370 | BOX;67 PZ | Discontinued | FRUIT COMPANY 1 | LEMONS (LEMONS)
I tried:
for a in soup.tbody.findAll('a', href=True):
    r = re.compile('(?<=href=").*?(?=")')
    r.findall(str(a))
and
for a in soup.tbody.findAll('a', href=True):
    print (a.find('a')['href'])
    print (a.find('a')['title'])
However, I don't know how to rearrange the title and the href.
Update:
Based on odradek's answer, I tried this:
soup = BeautifulSoup(open('file.htm'), 'lxml')
for a in soup.tbody.findAll('a', href=True):
    html = a
    PREFIX = 'www.example.com'
    template = '{prefix}{url} | {title}'.format
    links = [template(prefix=PREFIX, url=e['href'], title=e['title']) for e in html.find_all('a', href=True)]
    print(links)
But I got:
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
Answer 0 (score: 2)
You can use BeautifulSoup's parsing methods instead of a complicated regular expression:
# this is the url you want to add at the beginning
PREFIX = 'www.example.com'
# the template of your desired output
template = '{prefix}{url} | {title}'.format
# the resulting list, please note that "html" variable is
# the given source code.
links = [template(prefix=PREFIX, url=e.get('href'), title=e.get('title'))
for e in html.find_all('a', href=True)]
When run against two of the a tags in the list:
$ python get_all_a.py
www.example.com/loads/data/usersindex.cfm?event=overview.subprocess&ApplNo=117511 | Click to view GRAPES (GREEN GRAPES ; AUS) | POQ #117511 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 11
www.example.com/loads/data/usersindex.cfm?event=overview.subprocess&ApplNo=117620 | Click to view GRAPES (GREEN GRAPES ; AUS) | POQ #117620 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 12
Based on your update, you should not put this code inside a for loop. Instead:
html = BeautifulSoup(open('file.htm'), 'html.parser')
PREFIX = 'www.example.com'
template = '{prefix}{url} | {title}'.format
# inside this list comprehension is your for loop implied
links = [template(prefix=PREFIX, url=e.get('href'), title=e.get('title'))
for e in html.find_all('a', href=True)]
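A minimal end-to-end sketch of the approach above, assuming the markup is available as a string. The names `SAMPLE_HTML` and `format_links` are hypothetical, and skipping hrefs that do not start with `/` (such as the accordion toggle) is an assumption about what the question wants. It also appends the link text as a last field, as in the question's expected output. The empty lists in the update arise because `find_all` searches only the descendants of the tag it is called on, and none of these `<a>` tags contains a nested `<a>`.

```python
from bs4 import BeautifulSoup

PREFIX = 'www.example.com'

# Hypothetical sample: one data link and one accordion toggle.
SAMPLE_HTML = '''
<a data-toggle="collapse" href="#fruitName1" title="Click to expand drug name">GLIPIZIDE</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&ApplNo=114223" title="Click to view LEMONS (LEMONS) | POQ #114223 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 1 ">
LEMONS (LEMONS)</a>
'''


def format_links(markup, prefix=PREFIX):
    """Return one 'url | title= | title | text' line per matching <a> tag."""
    soup = BeautifulSoup(markup, 'html.parser')
    lines = []
    # Search the whole document once, requiring both attributes.
    for a in soup.find_all('a', href=True, title=True):
        href = a['href']
        # Assumption: only site-relative links are wanted.
        if not href.startswith('/'):
            continue
        title = a['title'].strip()
        text = a.get_text(strip=True)
        lines.append('{}{} | title= | {} | {}'.format(prefix, href, title, text))
    return lines


if __name__ == '__main__':
    for line in format_links(SAMPLE_HTML):
        print(line)
```

To read from disk instead, pass the file's contents, for example `format_links(open('file.htm').read())`.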
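Answer 1 (score: 1)
This is not a job for regular expressions. You can use BeautifulSoup as shown in odradek's answer, or my favorite alternative, lxml, which in my opinion produces more readable code: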
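from lxml import etree
# etree.HTML uses the lenient HTML parser; etree.fromstring expects
# well-formed XML and rejects the unescaped "&" in these hrefs
tree = etree.HTML(html)
# select only anchors that carry both attributes
for element in tree.xpath('//a[@href and @title]'):
    print('www.example.com' + element.get('href'))
    print('title: ' + element.get('title'))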
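A self-contained sketch of the lxml route, here using the `lxml.html` module, whose parser tolerates real-world HTML such as an unescaped `&` in a query string. The names `SAMPLE_HTML` and `extract_links` are hypothetical, and returning tuples rather than printing is just one possible design.

```python
from lxml import html as lxml_html

PREFIX = 'www.example.com'

# Hypothetical sample wrapped in a single root element.
SAMPLE_HTML = '''<div>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&ApplNo=117511" title="Click to view GRAPES (GREEN GRAPES ; AUS) | POQ #117511 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 11 ">
GRAPES (GREEN GRAPES ; AUS)</a>
<a href="#fruitName1">GLIPIZIDE</a>
</div>'''


def extract_links(markup, prefix=PREFIX):
    """Return (url, title, text) tuples for anchors with href and title."""
    tree = lxml_html.fromstring(markup)
    results = []
    # The XPath predicate skips anchors missing either attribute.
    for element in tree.xpath('//a[@href and @title]'):
        url = prefix + element.get('href')
        title = element.get('title').strip()
        text = element.text_content().strip()
        results.append((url, title, text))
    return results


if __name__ == '__main__':
    for url, title, text in extract_links(SAMPLE_HTML):
        print(' | '.join([url, title, text]))
```

Keeping the result as tuples leaves the output format to the caller, so the same function can feed either a printed report or a CSV writer.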