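Using regex to search for an HTML link near a keyword

Asked: 2012-01-23 01:05:22

Tags: python regex negative-lookahead

If I'm looking for the keyword "sales", I want to get the nearest "http://www.somewebsite.com", even if the file contains multiple links. I want the nearest link, not the first link. That means I need to search for the link just before the keyword match.

This does not work...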

regex = (http|https)://[-A-Za-z0-9./]+.*(?!((http|https)://[-A-Za-z0-9./]+))sales sales

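What is the best way to find the link closest to a keyword?

4 answers:

Answer 0 (score: 3)

It is generally easier and more robust to use an HTML parser than regex.

Using the third-party module lxml: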

import lxml.html as LH

content = '''<html><a href="http://www.not-this-one.com"></a>
<a href="http://www.somewebsite.com"></a><p>other stuff</p><p>sales</p>
</html>
'''

doc = LH.fromstring(content)    
for url in doc.xpath('''
    //*[contains(text(),"sales")]
    /preceding::*[starts-with(@href,"http")][1]/@href'''):
    print(url)

yields

http://www.somewebsite.com

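I find lxml (and XPath) a convenient way to express which elements I'm looking for. However, if installing a third-party module is not an option, you can also accomplish this particular job with HTMLParser from the standard library: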

from html.parser import HTMLParser  # Python 3; in Python 2 the module was named HTMLParser
import contextlib

class MyParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.last_link = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if 'href' in attrs:
            self.last_link = attrs['href']

content = '''<html><a href="http://www.not-this-one.com"></a>
<a href="http://www.somewebsite.com"></a><p>other stuff</p><p>sales</p>
</html>
'''

idx = content.find('sales')

with contextlib.closing(MyParser()) as parser:
    parser.feed(content[:idx])
    print(parser.last_link)
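
One caveat with slicing at content.find('sales'): the keyword can also occur inside a tag or an attribute value (for example inside a URL), which would cut the document in the wrong place. Below is a minimal sketch, assuming Python 3's html.parser, that only looks at text nodes through handle_data. The class name KeywordLinkParser, its hits attribute, and the decoy URL are made up for illustration.

```python
from html.parser import HTMLParser

class KeywordLinkParser(HTMLParser):
    """Record the most recent href seen before each text node containing the keyword."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword
        self.last_link = None   # most recent href seen so far
        self.hits = []          # link in effect at each keyword occurrence

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get('href')
        if href is not None:
            self.last_link = href

    def handle_data(self, data):
        # Only text nodes reach this method, so attribute values are ignored.
        if self.keyword in data:
            self.hits.append(self.last_link)

# Hypothetical sample: the first URL contains the keyword but is not the target.
content = '''<html><a href="http://www.sales-decoy.com"></a>
<a href="http://www.somewebsite.com"></a><p>other stuff</p><p>sales</p>
</html>
'''

parser = KeywordLinkParser('sales')
parser.feed(content)
parser.close()
print(parser.hits)
```

Because the parser walks the whole document, this variant also reports one link per keyword occurrence rather than only the first.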

About the XPath used in the lxml solution: it has the following meaning:

 //*                              # Find all elements
   [contains(text(),"sales")]     # whose text content contains "sales"
   /preceding::*                  # search the preceding elements 
     [starts-with(@href,"http")]  # such that it has an href attribute that starts with "http"
       [1]                        # select only the nearest such element (preceding:: is a reverse axis)
         /@href                   # return the value of the href attribute

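Answer 1 (score: 0)

I don't think you can do this with regex alone (especially looking before the keyword match), because regex has no notion of comparing distances.

I think you'd be better off doing something like this:

  • Find all occurrences of sales and get their substring indices; call these salesIndex
  • Find all occurrences of https?://[-A-Za-z0-9./]+ and get their substring indices; call these urlIndex
  • Loop through salesIndex. For each position i in salesIndex, find the nearest urlIndex

Depending on how you want to judge "closest", you may need both the start and end indices of the sales and http... occurrences for the comparison. That is, find the URL whose end index is closest to the start index of the current occurrence of sales, find the URL whose start index is closest to the end index of the current occurrence of sales, and pick whichever of the two is closer.

You can use matches = re.finditer(pattern,string,re.IGNORECASE) to get an iterator over the matches, and then match.span() to get the start/end substring indices of each match in matches.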
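
The steps above can be sketched roughly as follows. This is only an illustration, not a definitive implementation: the helper gap(), the sample text, and the example.com URLs are assumptions made for the example, and "closest" is taken to mean the number of characters between the two spans.

```python
import re

def gap(a, b):
    """Number of characters between two (start, end) spans; 0 if they touch or overlap."""
    return max(a[0] - b[1], b[0] - a[1], 0)

# Hypothetical sample text with one URL far before and one just after the keyword.
text = "http://a.example.com is mentioned long before the word sales, see http://b.example.com/x"

url_spans = [m.span() for m in re.finditer(r'https?://[-A-Za-z0-9./]+', text, re.IGNORECASE)]
sales_spans = [m.span() for m in re.finditer(r'sales', text, re.IGNORECASE)]

nearest = []
for s in sales_spans:
    if url_spans:  # min() would raise on an empty list
        start, end = min(url_spans, key=lambda u: gap(u, s))
        nearest.append(text[start:end])

print(nearest)
```

Using the gap between span edges, rather than the distance between start offsets, means a long URL just before the keyword is not penalised for its length.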

Answer 2 (score: 0)

Building on what math.coffee suggested, you could try something along these lines:

import re
myString = "" ## the string you want to search

link_matches = re.finditer('(http|https)://[-A-Za-z0-9./]+',myString,re.IGNORECASE)
sales_matches = re.finditer('sales',myString,re.IGNORECASE)

link_locations = []

for match in link_matches:
    link_locations.append([match.span(),match.group()])

for match in sales_matches:
    match_loc = match.span()
    distances = []
    for link_loc in link_locations:
        if match_loc[0] > link_loc[0][1]: ## if the link comes before your keyword
            ## append the distance between the END of the link and the START of the keyword
            distances.append(match_loc[0] - link_loc[0][1])
        else:
            ## append the distance between the END of the keyword and the START of the link
            distances.append(link_loc[0][0] - match_loc[1])

    if distances:
        ## pick the link with the smallest distance (the first one wins on ties)
        closest = distances.index(min(distances))
        print ("Closest Link: " + link_locations[closest][1] + "\n")

Answer 3 (score: -1)

I tested this code and it seems to work...

import re

def closesturl(keyword, website):
    """Return, for each keyword match, the URL whose start offset is nearest."""
    urlregex = r"(http|https)://[-A-Za-z0-9./]+"
    keylist = [(n.start(), n.end()) for n in re.finditer(keyword, website, re.IGNORECASE)]
    urllist = [(m.start(), m.end()) for m in re.finditer(urlregex, website, re.IGNORECASE)]
    if not keylist or not urllist:
        return ""
    urls = []
    for keystart, _ in keylist:
        # start with the first URL as the best candidate
        closest = abs(urllist[0][0] - keystart)
        best = urllist[0]
        for url in urllist[1:]:
            distance = abs(url[0] - keystart)
            if distance < closest:
                closest = distance
                best = url
            elif distance > closest:
                break  # URLs are in document order; past the minimum, distances only grow
        urls.append(website[best[0]:best[1]])
    return urls

somestring = "hey whats up... http://www.firstlink.com some other test http://www.secondlink.com then mykeyword"
keyword = "mykeyword"
print(closesturl(keyword, somestring))

When run, the above prints... http://www.secondlink.com

It would be great if anyone knows how to speed up this code!

Thanks, V$h.
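
On the speed question, one possible direction is to avoid scanning every URL for every keyword. Since re.finditer returns matches in document order, the URL start offsets are already sorted, so a binary search can find the two neighbouring URLs directly. The sketch below is an assumption-laden illustration rather than a tuned solution: the name closest_urls_fast is made up, distance is measured between start offsets as in the code above, and the keyword is treated as literal text via re.escape (the code above treats it as a regex).

```python
import bisect
import re

URL_RE = re.compile(r'https?://[-A-Za-z0-9./]+', re.IGNORECASE)

def closest_urls_fast(keyword, text):
    """For each keyword occurrence, return the URL whose start offset is nearest."""
    urls = [(m.start(), m.group()) for m in URL_RE.finditer(text)]
    if not urls:
        return []
    starts = [s for s, _ in urls]  # already sorted: finditer scans left to right
    result = []
    for k in re.finditer(re.escape(keyword), text, re.IGNORECASE):
        i = bisect.bisect_left(starts, k.start())
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(urls)]
        best = min(candidates, key=lambda j: abs(starts[j] - k.start()))
        result.append(urls[best][1])
    return result

somestring = "hey whats up... http://www.firstlink.com some other test http://www.secondlink.com then mykeyword"
print(closest_urls_fast("mykeyword", somestring))
```

With n URLs and m keyword matches this does roughly m binary searches after a single pass over the text, instead of up to n comparisons per keyword.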