Question

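I have a list of website links that are identical except for the year, which is the part I want to extract. I'm using re.match to find it, since the strings are exactly the same apart from those 4 characters (20xx). For some reason it only returns None, and I don't know why.

I also tried other re functions such as findall and fullmatch, but that didn't help.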

state_links = ["https://2009-2017.state.gov/r/pa/prs/ps/2009/index.htm",
               "https://2009-2017.state.gov/r/pa/prs/ps/2010/index.htm",
               "https://2009-2017.state.gov/r/pa/prs/ps/2011/index.htm",
               "https://2009-2017.state.gov/r/pa/prs/ps/2012/index.htm",
               "https://2009-2017.state.gov/r/pa/prs/ps/2013/index.htm",
               "https://2009-2017.state.gov/r/pa/prs/ps/2014/index.htm",
               "https://2009-2017.state.gov/r/pa/prs/ps/2015/index.htm",
               "https://2009-2017.state.gov/r/pa/prs/ps/2016/index.htm"]

import re

for link in state_links:
    year = re.match(r"https://2009-2017.state.gov/r/pa/prs/ps/(.*)/index.htm", link)
    print(year)

Answer 1

As @Drubio pointed out, your regex pattern is correct, so check the rest of your code. The following works:

import re

regex = r"https://2009-2017.state.gov/r/pa/prs/ps/(\d{4})/index.htm"
# re.finditer expects a string, so join the list into one newline-separated string
years = re.finditer(regex, "\n".join(state_links))
for year in years:
    # group(1) is the four-digit year captured by (\d{4})
    print(year.group(1))

Output
## 2009 2010 2011 2012 2013 2014 2015 2016

Credit to @Brad for the \d{4} suggestion/correction and for the .split option.

Answer 2

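The example you've shown works as posted: it prints a series of re.Match instances. (That said, . doesn't behave the way you think it does, and using \d{4} in the capture group would probably be more sensible. A plain . is a pattern that matches any character; you probably want a literal dot, \..)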
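To illustrate the point about escaping, here is a minimal sketch of the question's pattern with literal dots and a \d{4} capture group. The name YEAR_PATTERN and the "indexXhtm" test string are made up for the illustration.

```python
import re

# Same shape as the pattern in the question, but with literal dots (\.)
# and exactly four digits for the year. YEAR_PATTERN is a hypothetical name.
YEAR_PATTERN = re.compile(
    r"https://2009-2017\.state\.gov/r/pa/prs/ps/(\d{4})/index\.htm"
)

link = "https://2009-2017.state.gov/r/pa/prs/ps/2012/index.htm"

match = YEAR_PATTERN.match(link)  # a re.Match object, or None if no match
year = match.group(1) if match else None
print(year)  # 2012

# An unescaped "." matches any character, so the loose pattern accepts
# a string that has no dot at all; the escaped one does not.
loose = re.match(r"index.htm", "indexXhtm")
strict = re.match(r"index\.htm", "indexXhtm")
print(loose is not None, strict is not None)  # True False
```

Note that printing the match object itself shows the re.Match instance; group(1) is what returns just the captured year.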

In any case, if your links always follow this tidy format, you can also get by with str methods alone:

>>> [int(i.rsplit("/", 2)[-2]) for i in state_links]
[2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016]

This splits each link into 3 parts, starting from the right, so that the year ends up as the middle element:

>>> state_links[0].rsplit("/", 2)
['https://2009-2017.state.gov/r/pa/prs/ps', '2009', 'index.htm']

The [-2] index then picks out the year part.
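If some links might not follow the format, the same rsplit idea can be wrapped with a small check. This is only a sketch: year_from_link is a hypothetical helper, and it assumes a valid link always ends in /YYYY/index.htm.

```python
def year_from_link(link):
    """Return the year segment of a link as an int, or None if it is missing.

    Hypothetical helper: assumes a valid link ends in /YYYY/<file>.
    """
    parts = link.rsplit("/", 2)  # [prefix, year, filename] for a valid link
    if len(parts) == 3 and len(parts[1]) == 4 and parts[1].isdecimal():
        return int(parts[1])
    return None


print(year_from_link("https://2009-2017.state.gov/r/pa/prs/ps/2014/index.htm"))  # 2014
print(year_from_link("https://2009-2017.state.gov/index.htm"))  # None
```

Returning None for unexpected links avoids the ValueError that int() would raise on a non-numeric segment.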

Python re.match not finding characters in the middle of a string