Question

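I'm currently building something that works with eBay auctions, but I'm having trouble stopping it from including the items that come after "More items related to", which I obviously don't want.

At the moment, all the links are standard hrefs in the format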

<a href="http://www.ebay.co.uk/blahblah" class="vip" title="x" itemprop="name">

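class="vip" is in every item link, so that seems like a good thing to match on, but it is also in the related items' links, so I need the match to go no further than the "More items related to" section.

It needs to be a regex because I'm making this with uBot (much quicker than writing it in a real language). Sorry for the very noob question; regex is not my strong suit.

Thanks! :)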

Answer 1

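Description

This regex will:

match anchor tags whose class attribute is vip
capture the href attribute value of those anchor tags
avoid problematic edge cases
allow class and href to appear in any order in the anchor tag
skip anchors that do not come before the "More to explore" section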

<a\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\shref=('[^']*'|"[^"]*"|[^'"][^\s>]*))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sclass=['"]?vip['"]?)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?>.*?</a>(?=.*?More\sto\sexplore)
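
The last condition relies on the trailing lookahead `(?=.*?More\sto\sexplore)`: a link only counts if the marker text still appears somewhere after it. Here is a minimal JavaScript sketch of that idea. The anchor pattern is deliberately simplified and the sample URLs are made up, so both are assumptions for illustration rather than the full expression above. It needs the `s` flag so that `.` can cross line breaks.

```javascript
// Simplified stand-in for the full pattern: any <a ...>...</a> that is
// still followed, somewhere later in the text, by "More to explore".
const beforeMarker = /<a\b[^>]*>.*?<\/a>(?=.*?More\sto\sexplore)/gis;

// Made-up sample: one link before the marker, one after it.
const html =
  '<a class="vip" href="/item-1">one</a>\n' +
  '<h2>More to explore</h2>\n' +
  '<a class="vip" href="/item-2">two</a>';

// Only the first link has the marker text after it, so only it matches.
const found = html.match(beforeMarker) || [];
console.log(found);
```

If the page uses the wording from the question ("More items related to") instead, the marker text inside the lookahead would have to be changed to match.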

PHP Code Example:

Sample Text

Note that the second line contains some potentially problematic text

<a href="http://www.ebay.co.uk/blahblah-11" class="vip" title="x" itemprop="name">text here</a>
<a onmouseover=' var class="vip"  ; funClassSwap(class); ' href="http://www.ebay.co.uk/blahblah-22"><form><input type="image" src="submit.gif"></form></a>
<a class="vip" href="http://www.ebay.co.uk/blahblah-33" title="x" itemprop="name">more text</a>
<div class="seoi-c">
    <h2 class="seoi-h">More to explore</h2>
    <div class="fl">
        <ul class="tso-u">
                <li><a href="http://www.ebay.com/sch/Lathes-/97230/i.html?_dcat=97230&amp;Type=CNC&amp;_trksid=p2045573.m2389" title="Lathes in Metalworking Equipment CNC">Lathes in Metalworking Equipment CNC</a></li>
        </ul>
    </div>
    <div class="fl">
        <ul class="tso-u">
        </ul>
    </div>
</div>
<a class="vip" href="http://www.ebay.co.uk/blahblah-44" title="x" itemprop="name">more text</a>

Code

<?php
$sourcestring="your source string";
preg_match_all('/<a\b(?=\s) # capture the open tag
(?=(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?\shref=(\'[^\']*\'|"[^"]*"|[^\'"][^\s>]*)) # get the href attribute
(?=(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?\sclass=[\'"]?vip[\'"]?) # validate the class attribute
(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"\s]*)*"\s?> # get the entire tag
.*?<\/a>   # capture the entire anchor tag
(?=.*?More\sto\sexplore)  # validate this match is before the "More to explore" section
/imsx',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>

Matches

With the default PREG_PATTERN_ORDER, $matches[0] holds the full matches and $matches[1] holds the captured href values (quotes included):

[0][0] = <a href="http://www.ebay.co.uk/blahblah-11" class="vip" title="x" itemprop="name">text here</a>
[0][1] = <a class="vip" href="http://www.ebay.co.uk/blahblah-33" title="x" itemprop="name">more text</a>
[1][0] = "http://www.ebay.co.uk/blahblah-11"
[1][1] = "http://www.ebay.co.uk/blahblah-33"
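
To try the same expression outside PHP, here is a rough JavaScript port run against a shortened copy of the sample text. It is a sketch, not part of the original answer. The name `vipLinkPattern` is made up, the comments of the `x`-mode version are dropped because JavaScript has no such flag, and the surrounding quotes are stripped from each captured value.

```javascript
// The pattern from the answer, split into the same pieces as the PHP version.
// String.raw keeps the backslashes intact.
const vipLinkPattern = new RegExp(
  [
    String.raw`<a\b(?=\s)`,
    String.raw`(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\shref=('[^']*'|"[^"]*"|[^'"][^\s>]*))`,
    String.raw`(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sclass=['"]?vip['"]?)`,
    String.raw`(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?>`,
    String.raw`.*?</a>`,
    String.raw`(?=.*?More\sto\sexplore)`,
  ].join(''),
  'gis' // global, case-insensitive, dot matches newlines
);

// Shortened sample: the "More to explore" block is reduced to its heading.
const sample = [
  '<a href="http://www.ebay.co.uk/blahblah-11" class="vip" title="x" itemprop="name">text here</a>',
  `<a onmouseover=' var class="vip"  ; funClassSwap(class); ' href="http://www.ebay.co.uk/blahblah-22"><form><input type="image" src="submit.gif"></form></a>`,
  '<a class="vip" href="http://www.ebay.co.uk/blahblah-33" title="x" itemprop="name">more text</a>',
  '<h2 class="seoi-h">More to explore</h2>',
  '<a class="vip" href="http://www.ebay.co.uk/blahblah-44" title="x" itemprop="name">more text</a>',
].join('\n');

// Capture group 1 is the href value including its quotes; strip them.
const hrefs = [...sample.matchAll(vipLinkPattern)].map((m) =>
  m[1].replace(/^['"]|['"]$/g, '')
);
console.log(hrefs); // the blahblah-11 and blahblah-33 URLs
```

The second link is skipped because its `class="vip"` sits inside the quoted `onmouseover` value, and the last link is skipped because no "More to explore" text follows it.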

Answer 2

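I find the "run javascript" function very useful when you want to remove unwanted stuff from the page that you don't want to scrape. Find the id or class of the "More items related to" section, then do something like this:

x = document.getElementById("more items id"); x.remove()

This removes it from the page. Then you can tell the bot to start scraping.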
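
A slightly fuller sketch of the same idea is below: remove the section, then read the remaining item links straight from the DOM. The helper name `collectVipLinks` and the id passed to it are hypothetical, and how uBot hands the result back from "run javascript" is not covered here.

```javascript
// Hypothetical helper: remove one section by id, then return the href of
// every remaining a.vip link. `doc` is a DOM Document (in a browser, the
// global `document`); `sectionId` is an assumed id for the related-items block.
function collectVipLinks(doc, sectionId) {
  const section = doc.getElementById(sectionId);
  if (section) {
    section.remove(); // the related-items links disappear with the section
  }
  return Array.from(doc.querySelectorAll('a.vip'), (a) => a.href);
}
```

In the page this would be called as `collectVipLinks(document, "more items id")`, with the real id substituted. If the section only has a class, `doc.querySelector(".that-class")` could replace the `getElementById` call.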

Find all URLs with a certain class up until some text on the page
