In Python I have a string containing the source code of a website. Within this source code I want to get the link from an <a> tag, if the tag contains a specific substring.
The input looks like this, for example:
AnyKindOfString <a href="http://www.link-to-get.com">SearchString</a> AndEvenMoreString
So what I want to tell Python is to search for SearchString in all the <a> tags within the string and give me back the first http://www.link-to-get.com found.
This should only match if SearchString is within the tag, and it should also match if "SearchString" is part (a substring) of http://www.link-to-get.com.
I have been searching for an answer for more than 30 minutes now, and the only thing I found for Python was how to extract every link (or only the external or only the internal ones) from a string.
Does anyone have an idea?
Thanks in advance!
Answer 0 (score: 1)
Using BeautifulSoup 3.2.1 and Python 2.7:
from BeautifulSoup import BeautifulSoup
search_string = 'SearchString'
website_source = '<a href="http://www.link-to-get.com">SearchString</a> <a href="http://www.link-to-get.com">OtherString</a>\
<a href="http://www.link-to-getSearchString.com">otherString</a>'
soup = BeautifulSoup(website_source)
# build a list of [url, link text] pairs for every anchor whose URL or text contains search_string
anchors = [[row['href'], row.text] for row in soup.findAll('a') if search_string in row['href'] or search_string in row.text]
# prints the whole list
print anchors
# prints the first [url, link text] pair
print anchors[0]
# prints the URL of the first pair
print anchors[0][0]
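The problem seems to be that I tested the code above with BeautifulSoup 3.2.1, which only works on Python 2.x, while you are using Python 3.4, hence the error.
If you install BeautifulSoup4 and try the following code, it should work. Also note that BeautifulSoup4 works on both 2.x and 3.x.
Note that the following has not been tested.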
from bs4 import BeautifulSoup
search_string = 'SearchString'
website_source = '<a href="http://www.link-to-get.com">SearchString</a> <a href="http://www.link-to-get.com">OtherString</a>\
<a href="http://www.link-to-getSearchString.com">otherString</a>'
# name the parser explicitly; bs4 warns when it has to guess one
soup = BeautifulSoup(website_source, 'html.parser')
# build a list of [url, link text] pairs; href=True skips anchors that have no href attribute
anchors = [[row['href'], row.text] for row in soup.find_all('a', href=True) if search_string in row['href'] or search_string in row.text]
# prints whole list
print(anchors)
# prints first list
print(anchors[0])
# prints the url for the first list
print(anchors[0][0])
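If installing BeautifulSoup is not an option, the same filter can be sketched with the standard library's html.parser module. This is only a rough sketch, not part of the tested answer above: the class name FirstLinkFinder and the function first_matching_link are made up for illustration, and nested <a> tags are not handled.

```python
from html.parser import HTMLParser


class FirstLinkFinder(HTMLParser):
    """Collect an (href, text) pair for every <a> tag in the fed markup."""

    def __init__(self):
        super().__init__()
        self.links = []          # finished (href, text) pairs, in document order
        self._href = ''          # href of the <a> that is currently open
        self._text = []          # text chunks seen inside the open <a>
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._in_anchor = True
            # a missing or valueless href becomes an empty string
            self._href = dict(attrs).get('href') or ''
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._in_anchor:
            self.links.append((self._href, ''.join(self._text)))
            self._in_anchor = False


def first_matching_link(html, search_string):
    """Return the first href whose URL or link text contains search_string, else None."""
    finder = FirstLinkFinder()
    finder.feed(html)
    finder.close()
    for href, text in finder.links:
        if href and (search_string in href or search_string in text):
            return href
    return None


website_source = ('<a href="http://www.link-to-get.com">SearchString</a> '
                  '<a href="http://www.other.com">OtherString</a>')
print(first_matching_link(website_source, 'SearchString'))
```

Returning None when nothing matches avoids the IndexError that anchors[0] would raise on an empty list.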
Answer 1 (score: 0)
I've roughed out some code that should work; at least it works for the example string you gave.
myString = 'AnyKindOfString <a href="http://www.link-to-get.com">SearchString</a> AndEvenMoreString'
theLimit = len(myString)
searchStringLinkPairs = []
tempStr = myString[:]
i = 0
while i < theLimit:
    # locate the next anchor in what is left of the string
    startLoc = tempStr.find('<a')
    endLoc = tempStr.find("</a")
    print("%d\t%d" % (startLoc, endLoc))
    subStr = tempStr[startLoc:endLoc]
    # the link is whatever sits between the first pair of double quotes
    startLink = subStr.find("\"")
    subTwo = subStr[startLink+1:]
    endLink = subTwo.find("\"")
    myLink = subStr[startLink+1:startLink+1+endLink]
    # the link text is everything after the first ">"
    searchStringStart = subStr.find(">")
    searchString = subStr[searchStringStart+1:endLoc]
    if myLink != "" and searchString != "":
        searchStringLinkPairs.append([myLink, searchString])
    # drop the part already scanned and continue after it
    tempStr = tempStr[endLoc+1:]
    i = endLoc
    if startLoc == -1 or endLoc == -1:
        # no further anchor found: force the loop to end
        i = 10 * theLimit
print(searchStringLinkPairs)
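The manual scan above can also be written more compactly with a regular expression. The following is only a sketch under strong assumptions: every anchor has a double-quoted href, and the link text contains no nested <a> tags. Regular expressions are brittle on real-world HTML, so a proper parser is usually the safer choice. The names ANCHOR_RE and first_link_by_regex are made up for illustration.

```python
import re

# Assumption: anchors look like <a ... href="URL" ...>text</a> with a double-quoted href.
ANCHOR_RE = re.compile(
    r'<a\b[^>]*?\bhref="([^"]*)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def first_link_by_regex(source, search_string):
    """Return the first href whose URL or link text contains search_string, else None."""
    for href, text in ANCHOR_RE.findall(source):
        if search_string in href or search_string in text:
            return href
    return None


myString = 'AnyKindOfString <a href="http://www.link-to-get.com">SearchString</a> AndEvenMoreString'
print(first_link_by_regex(myString, 'SearchString'))
```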
Answer 2 (score: 0)
This can be done with the help of pyquery (http://pythonhosted.org/pyquery/index.html) + lxml (http://lxml.de/tutorial.html), as follows:
from pyquery import PyQuery as pq
from lxml import etree
pq_obj = pq(etree.fromstring('<body><p>AnyKindOfString <a href="http://www.link-to-get.com">SearchString</a> AndEvenMoreString</p><p>this is another string goes here</p><a> other</a></body>'))
search_string = 'SearchString'
# select every <a> element in the document
links = pq_obj('a')
for link in links:
    # link.text is None for an empty <a>, so check it before searching
    if link.text and search_string in link.text:
        attrib = link.attrib
        print(attrib.get('href'))
# output
# http://www.link-to-get.com
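The loop above only looks at the link text, while the question also asks for a match when SearchString is part of the URL. A minimal sketch that covers both cases follows. It uses the standard library's xml.etree.ElementTree (whose element API lxml.etree largely mirrors) so that it runs without extra packages. It assumes the markup is well-formed XML, which real web pages often are not, and the function name first_link_in_xml is made up for illustration.

```python
import xml.etree.ElementTree as ET


def first_link_in_xml(markup, search_string):
    """Return the first href of an <a> whose text or href contains search_string, else None."""
    root = ET.fromstring(markup)  # raises ParseError if the markup is not well-formed
    for link in root.iter('a'):
        href = link.get('href')
        if href is None:
            continue  # skip anchors without a link target
        text = ''.join(link.itertext())  # also includes text of nested elements
        if search_string in text or search_string in href:
            return href
    return None


markup = ('<body><p>AnyKindOfString <a href="http://www.link-to-get.com">SearchString</a>'
          ' AndEvenMoreString</p><a> other</a></body>')
print(first_link_in_xml(markup, 'SearchString'))
```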