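If I search for the keyword "sales", I want to get the nearest link, "http://www.somewebsite.com", even when the file contains several links. I want the nearest link, not the first one, which means I need to search for the link just before the keyword match.
This doesn't work...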
regex = (http|https)://[-A-Za-z0-9./]+.*(?!((http|https)://[-A-Za-z0-9./]+))sales
sales
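What is the best way to find the link closest to a keyword?
Answer 0: (score: 3)
Using an HTML parser is usually easier and more robust than using a regex.
Using the third-party module lxml: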
import lxml.html as LH
content = '''<html><a href="http://www.not-this-one.com"></a>
<a href="http://www.somewebsite.com"></a><p>other stuff</p><p>sales</p>
</html>
'''
doc = LH.fromstring(content)
for url in doc.xpath('''
    //*[contains(text(),"sales")]
    /preceding::*[starts-with(@href,"http")][1]/@href'''):
    print(url)
Output:
http://www.somewebsite.com
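I find lxml (and XPath) a convenient way to express which elements I am looking for. However, if installing a third-party module is not an option, you can also do this particular job with HTMLParser from the standard library (the html.parser module in Python 3):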
from html.parser import HTMLParser  # Python 3; in Python 2 the module was named HTMLParser
import contextlib
class MyParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.last_link = None
    def handle_starttag(self, tag, attrs):
        # remember the most recent href seen so far
        attrs = dict(attrs)
        if 'href' in attrs:
            self.last_link = attrs['href']
content = '''<html><a href="http://www.not-this-one.com"></a>
<a href="http://www.somewebsite.com"></a><p>other stuff</p><p>sales</p>
</html>
'''
idx = content.find('sales')
with contextlib.closing(MyParser()) as parser:
    parser.feed(content[:idx])
    print(parser.last_link)
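The snippet above handles only the first occurrence of "sales", and content.find would also match the word inside a URL or attribute. As a rough sketch of one way around that, the parser itself can note the last href each time a text node contains the keyword. KeywordLinkParser and links_before_keyword are hypothetical names introduced for this illustration, the third link in the sample is made up, and the sketch assumes Python 3 with the whole document fed in a single call:

```python
from html.parser import HTMLParser

class KeywordLinkParser(HTMLParser):
    """Hypothetical helper: records the last href seen before each keyword hit."""
    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword.lower()
        self.last_link = None
        self.hits = []  # one entry per text node that contains the keyword

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get('href')
        if href is not None:
            self.last_link = href

    def handle_data(self, data):
        # only text nodes reach this method, so attribute values are ignored
        if self.keyword in data.lower():
            self.hits.append(self.last_link)

def links_before_keyword(html, keyword):
    parser = KeywordLinkParser(keyword)
    parser.feed(html)
    parser.close()
    return parser.hits

content = '''<html><a href="http://www.not-this-one.com"></a>
<a href="http://www.somewebsite.com"></a><p>other stuff</p><p>sales</p>
<a href="http://www.third.example">sales team</a>
</html>
'''
print(links_before_keyword(content, 'sales'))
```

Each entry in the returned list is the most recent link at the point where the keyword appeared, or None if no link had been seen yet.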
About the XPath used in the lxml solution: it has the following meaning:
//*                            # find all elements
[contains(text(),"sales")]     # whose text contains "sales"
/preceding::*                  # search the elements that precede it
[starts-with(@href,"http")]    # keeping those whose href attribute starts with "http"
[1]                            # select only the nearest such element (preceding:: counts backwards)
/@href                         # return the value of the href attribute
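Answer 1: (score: 0)
I don't think you can do this with a regex alone (especially looking backwards from the keyword match), because a regex has no notion of comparing distances.
I think you are best off doing something like this: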
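1. Find all occurrences of sales and get their substring indices, called salesIndex.
2. Find all occurrences of https?://[-A-Za-z0-9./]+ and get their substring indices, called urlIndex.
3. Loop over salesIndex. For each position i in salesIndex, find the closest urlIndex.
Depending on how you want to judge "closest", you may need the start and end indices of both the sales and the http... occurrences for the comparison. That is, find the URL whose end index is closest to the start index of the current occurrence of sales, find the URL whose start index is closest to the end index of the current occurrence of sales, and pick whichever of the two is closer.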
You can use matches = re.finditer(pattern, string, re.IGNORECASE) to get an iterator over the matches, and then match.span() to get the start/end substring indices of each match in matches.
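The "closest" judgment described above can be sketched compactly by measuring the gap between two spans. This is only an illustration under stated assumptions: gap and nearest_urls are hypothetical helper names, the sample string is made up, and ties go to the earlier URL.

```python
import re

URL_PATTERN = r'https?://[-A-Za-z0-9./]+'

def gap(span_a, span_b):
    """Characters between two (start, end) spans; 0 if they touch or overlap."""
    return max(span_a[0] - span_b[1], span_b[0] - span_a[1], 0)

def nearest_urls(text, keyword):
    """For each keyword occurrence, return the URL with the smallest gap to it."""
    urls = list(re.finditer(URL_PATTERN, text, re.IGNORECASE))
    if not urls:
        return []
    result = []
    for key in re.finditer(re.escape(keyword), text, re.IGNORECASE):
        # compare end-of-URL to start-of-keyword and start-of-URL to end-of-keyword
        best = min(urls, key=lambda m: gap(m.span(), key.span()))
        result.append(best.group())
    return result

sample = "http://www.not-this-one.com text http://www.somewebsite.com more text sales"
print(nearest_urls(sample, "sales"))
```

Because gap works on whole spans, a URL just after the keyword and a URL just before it are judged on equal terms.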
Answer 2: (score: 0)
Building on math.coffee's suggestion, you could try something along these lines:
import re
myString = ""  ## the string you want to search
link_matches = re.finditer(r'(http|https)://[-A-Za-z0-9./]+', myString, re.IGNORECASE)
sales_matches = re.finditer('sales', myString, re.IGNORECASE)
link_locations = []
for match in link_matches:
    link_locations.append([match.span(), match.group()])
for match in sales_matches:
    match_loc = match.span()
    distances = []
    for link_loc in link_locations:
        if match_loc[0] >= link_loc[0][1]:  ## if the link ends before your keyword starts
            ## append the distance between the END of the link and the START of the keyword
            distances.append(match_loc[0] - link_loc[0][1])
        else:
            ## append the distance between the END of the keyword and the START of the link
            distances.append(link_loc[0][0] - match_loc[1])
    for d in range(len(distances)):  ## check every link, including the last one
        if distances[d] == min(distances):
            print("Closest Link: " + link_locations[d][1] + "\n")
            break
Answer 3: (score: -1)
I tested this code and it seems to work...
import re
def closesturl(keyword, website):
    keylist = []
    urllist = []
    closest = []
    urls = []
    urlregex = r"(http|https)://[-A-Za-z0-9\./]+"
    urlmatches = re.finditer(urlregex, website, re.IGNORECASE)
    keymatches = re.finditer(keyword, website, re.IGNORECASE)
    for n in keymatches:
        keylist.append([n.start(), n.end()])
    if len(keylist) > 0:
        for m in urlmatches:
            urllist.append([m.start(), m.end()])
    if len(keylist) > 0 and len(urllist) > 0:
        for i in range(len(keylist)):
            # start with the first URL as the best candidate (store the distance as a number)
            closest.append(abs(urllist[0][0] - keylist[i][0]))
            urls.append(website[urllist[0][0]:urllist[0][1]])
            for j in range(1, len(urllist)):
                if abs(urllist[j][0] - keylist[i][0]) < closest[i]:
                    closest[i] = abs(keylist[i][0] - urllist[j][0])
                    urls[i] = website[urllist[j][0]:urllist[j][1]]
                if abs(urllist[j][0] - keylist[i][0]) > closest[i]:
                    break  # local minimum / inflection point break from url list
    if len(keylist) > 0 and len(urllist) > 0:
        return urls  # one closest URL per keyword occurrence
    else:
        return ""
somestring = "hey whats up... http://www.firstlink.com some other test http://www.secondlink.com then mykeyword"
keyword = "mykeyword"
print(closesturl(keyword, somestring))
When run, the above displays http://www.secondlink.com (as a one-element list).
It would be great if anyone knows how to speed up this code!
Thanks, V$H.
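On the speed-up question: one possible direction, offered as a sketch rather than a measured improvement, is to use the fact that re.finditer yields matches from left to right, so the URL start offsets are already sorted and each keyword can be located with a binary search instead of a scan. closest_urls is a hypothetical name, and the distance is start-to-start as in the function above:

```python
import bisect
import re

URL_PATTERN = r'https?://[-A-Za-z0-9./]+'

def closest_urls(keyword, text):
    """For each keyword occurrence, return the URL whose start is nearest to it."""
    url_matches = list(re.finditer(URL_PATTERN, text, re.IGNORECASE))
    if not url_matches:
        return []
    # finditer scans left to right, so these start offsets are in ascending order
    starts = [m.start() for m in url_matches]
    result = []
    for key in re.finditer(re.escape(keyword), text, re.IGNORECASE):
        pos = bisect.bisect_left(starts, key.start())
        # only the neighbours on either side of the insertion point can be nearest
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(starts)]
        best = min(candidates, key=lambda i: abs(starts[i] - key.start()))
        result.append(url_matches[best].group())
    return result

somestring = "hey whats up... http://www.firstlink.com some other test http://www.secondlink.com then mykeyword"
print(closest_urls("mykeyword", somestring))
```

With K keyword occurrences and U URLs, the lookups cost roughly K log U comparisons instead of up to K times U, which should matter mainly when a page contains many links.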