Question

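Hey guys, I'm really trying to understand regular expressions while scraping a site. I've used them enough in my code to get by, but I'm stuck here. I need to quickly grab this: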

http://www.example.com/online/store/TitleDetail?detail&sku=123456789

from:

('<a href="javascript:if(handleDoubleClick(this.id)){window.location=\'http://www.example.com/online/store/TitleDetail?detail&sku=123456789\';}" id="getTitleDetails_123456789">\r\n\t\t\t            \tcheck store inventory\r\n\t\t\t            </a>', 1)

This is where I get confused. Any ideas?

Edit: the sku number changes for every product, which is what makes this tricky for me.

Answer 1

http://www\.example\.com/online/store/TitleDetail\?detail&sku=\d+

Use the \d character class with the greedy + quantifier to match any integer value in the sku field.
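
A minimal sketch of how this pattern could be applied in Python. The variable names `html` and `sku_url` are illustrative, and the sample string is abbreviated from the question:

```python
import re

# Sample anchor tag, abbreviated from the question.
html = (
    '<a href="javascript:if(handleDoubleClick(this.id)){window.location='
    "'http://www.example.com/online/store/TitleDetail?detail&sku=123456789';}\" "
    'id="getTitleDetails_123456789">check store inventory</a>'
)

# Dots and the question mark are escaped so they match literally;
# \d+ matches one or more digits, so any sku length is accepted.
pattern = re.compile(
    r"http://www\.example\.com/online/store/TitleDetail\?detail&sku=\d+"
)

match = pattern.search(html)
sku_url = match.group(0) if match else None
print(sku_url)
```

Checking `match` before calling `group` avoids an AttributeError on pages where the link is missing.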

Answer 2

You don't need a regular expression; just use string methods:

result = html[0].split("window.location='")[1].split("'")[0]
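
The one-liner raises an IndexError if the marker is not in the string. A slightly more defensive sketch follows, assuming `html` is the (markup, count) tuple shown in the question; `extract_location` is a hypothetical helper name, not part of any library:

```python
def extract_location(markup, marker="window.location='"):
    """Return the text between the marker and the next single quote, or None."""
    # partition never raises; `found` is empty when the marker is absent.
    _, found, rest = markup.partition(marker)
    if not found:
        return None
    url, quote, _ = rest.partition("'")
    # Without a closing quote the URL is incomplete, so treat it as no match.
    return url if quote else None


# Sample tuple, abbreviated from the question.
html = (
    '<a href="javascript:if(handleDoubleClick(this.id)){window.location='
    "'http://www.example.com/online/store/TitleDetail?detail&sku=123456789';}\" "
    'id="getTitleDetails_123456789">check store inventory</a>',
    1,
)
result = extract_location(html[0])
print(result)
```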

Answer 3

import re

# haystack is a raw string, so each \' is a literal backslash plus quote; the group captures up to the next backslash.
pattern = re.compile(r"window\.location=\\'([^\\]*)")
haystack = r"""<a href="javascript:if(handleDoubleClick(this.id)){window.location=\'http://www.example.com/online/store/TitleDetail?detail&sku=123456789\';}" id="getTitleDetails_123456789">\r\n\t\t\t\tcheck store inventory\r\n\t\t\t</a>"""
url = re.search(pattern, haystack).group(1)
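
Note that the haystack above is a raw string, so it contains literal backslashes. If the value in the question is the repr of a Python tuple, as it appears to be, the \' sequences are just escaped quotes and the real string holds plain single quotes. Under that assumption, a sketch of the same idea looks like this (`markup` is an illustrative name):

```python
import re

# The first element of the tuple from the question, written as a normal
# (non-raw) string literal, so \' is a plain single quote.
markup = '<a href="javascript:if(handleDoubleClick(this.id)){window.location=\'http://www.example.com/online/store/TitleDetail?detail&sku=123456789\';}" id="getTitleDetails_123456789">\r\n\t\t\tcheck store inventory\r\n\t\t\t</a>'

# Capture everything between the quotes that follow window.location=
pattern = re.compile(r"window\.location='([^']*)'")

match = pattern.search(markup)
url = match.group(1) if match else None
print(url)
```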

Answer 4

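If there are always 9 digits:

http://www\.example\.com/online/store/TitleDetail\?detail&sku=[0-9]{9}

If there can be any number of digits:

http://www\.example\.com/online/store/TitleDetail\?detail&sku=[0-9]*

More generally:

http.*?sku=[0-9]*

(The ? in *? makes the quantifier lazy, so it finds the shorter match first and is less likely to match across multiple URLs.)

Edit: [0-9], not [1-9].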
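
The difference a lazy quantifier makes can be sketched as follows. The two URLs and their sku values are made up for illustration, and the patterns use `.*` / `.*?` after `http` with `[0-9]+` for the digits:

```python
import re

# Two URLs on one line, to show how greedy and lazy quantifiers differ.
# The sku values here are made up for illustration.
text = (
    "http://www.example.com/a?sku=111 and "
    "http://www.example.com/b?sku=222"
)

greedy = re.findall(r"http.*sku=[0-9]+", text)
lazy = re.findall(r"http.*?sku=[0-9]+", text)

print(greedy)  # one match spanning both URLs
print(lazy)    # one match per URL
```

The greedy version runs to the last `sku=` on the line, which is why the lazy form is the safer choice when a page contains several links.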

Answer 5

http://txt2re.com/ may help you.

How do I use a regular expression to pull out a substring? (screen scraping)
