Question

I'm using Selenium for this, and my code is as follows:

import re
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

driver = webdriver.Firefox()
omegaBase = "https://www.omegawatches.com/de/"          
productRegex = re.compile(r'[https://](w){3}')

driver.get(omegaBase + "watches/" + "constellation")
links = driver.find_elements_by_tag_name("a")
for link in links:
    pageUrls = link.get_attribute("href")
    print(pageUrls)
    productRegex.findall(pageUrls)

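If I comment out the regex and just print(pageUrls), I get every link on the page, which is fine. However, I'm trying to select only a few specific links from the page, of the format https://www.omegawatches.com/de/watch/name_of_product

I'm not very good with regular expressions and definitely need to practice and learn more, but I've been experimenting to see whether one would apply at all, and I keep getting an error.

Does anyone know how I can fix the regex so that it is at least applied correctly? The regex in the example above was really only meant to drop a few links, so that I could see it was at least working.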

Answer 1

First, let's look at your regular expression. You are doing this:

productRegex = re.compile(r'[https://](w){3}')

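When you build a regular expression, square brackets match any single character from the set they contain. For example, [aeiou] matches only a, e, i, o, or u. Here you want to match the string https://, so just write it without the square brackets: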

productRegex = re.compile(r'https://(w){3}')
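To make the difference concrete, here is a small sketch that runs both patterns against one URL. The sample URL is the placeholder format from the question, and the variable names are illustrative only.

```python
import re

# Original pattern: ONE character from the set {h, t, p, s, :, /}, then "www".
char_class = re.compile(r'[https://](w){3}')
# Corrected pattern: the literal text "https://", then "www".
literal = re.compile(r'https://(w){3}')

url = "https://www.omegawatches.com/de/watch/name_of_product"

# The character class only consumes the last "/" before "www".
print(char_class.search(url).group(0))   # /www
# The literal version consumes the whole scheme.
print(literal.search(url).group(0))      # https://www

# Because the pattern contains a capturing group, findall() returns the
# group's content (the last "w" it matched), not the full match.
print(char_class.findall(url))           # ['w']

# The character-class version also accepts strings that are not URLs at all.
print(char_class.search("swww") is not None)   # True
```

Note that findall() returning ['w'] is a side effect of the (w) group. Writing www directly avoids it.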

You can change it further by using ^ to match only at the start of the string, and by simplifying (w){3} to www:

productRegex = re.compile(r'^https://www')

Now let's look at how you use the regular expression:

for link in links:
    pageUrls = link.get_attribute("href")
    print(pageUrls)
    productRegex.findall(pageUrls)

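Here you are using get_attribute() to get the link's URL. This returns a single URL, so I suggest renaming the variable from pageUrls to pageUrl. You then need to check whether the URL matches the regular expression, like this: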

if productRegex.match(pageUrl):
    print(pageUrl)
else:
    print('No match')

(Of course, by this point we have noticed that the ^ in the regular expression is not needed if you use match(), because match() only matches at the beginning of the string.)
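As a rough sketch of that point, the snippet below compares match() and search() on a plain list of strings rather than live Selenium elements. The hrefs list is made-up sample data, and the None entry is an assumption that some anchors on the page have no href attribute, in which case get_attribute("href") returns None and passing it to a regex method raises a TypeError.

```python
import re

anchored = re.compile(r'^https://www')
unanchored = re.compile(r'https://www')

# Hypothetical sample data standing in for link.get_attribute("href") results.
hrefs = [
    "https://www.omegawatches.com/de/watch/name_of_product",
    "http://example.com/?next=https://www.omegawatches.com/de/",
    None,  # an <a> tag without an href attribute
]

def starts_with_www_url(href):
    """True if href is a string that begins with https://www."""
    # match() only tries the start of the string, so no ^ is needed here.
    return href is not None and unanchored.match(href) is not None

matching = [h for h in hrefs if starts_with_www_url(h)]
print(matching)

# search() scans the whole string, so it needs the ^ to behave like match().
print(unanchored.search(hrefs[1]) is not None)  # True: found mid-string
print(anchored.search(hrefs[1]) is not None)    # False: not at the start
```

Guarding against None before calling the regex keeps the loop from stopping on the first anchor that has no href.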

Answer 2

You don't need a regular expression to do what you are trying to do. You can use a simple CSS selector.

a[href^='https://www.omegawatches.com/de/watches/']

This simply looks for an A tag whose href starts with the URL you want.
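A minimal sketch of how that selector might be used from Python follows. The helper names build_prefix_selector and collect_hrefs are hypothetical, and the string "css selector" is the value behind Selenium's By.CSS_SELECTOR, used here so the helper does not need a Selenium import of its own.

```python
def build_prefix_selector(prefix):
    """Return a CSS selector for <a> tags whose href starts with prefix."""
    return f"a[href^='{prefix}']"

def collect_hrefs(driver, prefix):
    """Return the href of every link on the current page that starts with prefix."""
    # "css selector" is the string value of By.CSS_SELECTOR in Selenium.
    elements = driver.find_elements("css selector", build_prefix_selector(prefix))
    return [element.get_attribute("href") for element in elements]

selector = build_prefix_selector("https://www.omegawatches.com/de/watches/")
print(selector)
```

With the driver from the question, this would be called as collect_hrefs(driver, omegaBase + "watches/") after driver.get(...). In a script that already imports By, driver.find_elements(By.CSS_SELECTOR, selector) is the more conventional spelling.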

You can refine it further to target specific links, such as only the watch links in the footer,

...and so on.

How do I change the regex so that it is applied correctly to the URLs I am scraping?
