Question

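I'm stuck trying to get my regular expressions to work in Python 3.5. I have a list containing a large number of URLs. Some of the URLs are short and some are long.

I can extract most of the URLs I want, but this one URL is never extracted.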

http://www.forbes.com/sites/julianmitchell/2016/09/27/this-startup-uses-drones-to-map-and-manage-massive-construction-projects/#1ca4d634334e

Here is the code.

import re

urlList = []  # Assume there are many URLs in this list.

interdrone = re.compile(r"http://www.interdrone.com/news/(?:.*)")
hp = re.compile(r"http://www.interdrone.com/$")

restOfThem=re.compile(r'\#|youtube|bzmedia|facebook|twitter|mailto|geoconnexion.com|linkedin|gplus|resources\.sdtimes\.com|precisionagvision')


cleanuplist =[] # Adding URLs I need to this new list.

for i in range(0,len(urlList)):
    if restOfThem.findall(urlList[i]):
        continue

    elif hp.findall(urlList[i]):
        continue

    elif interdrone.findall(urlList[i]):
        cleanuplist.append(urlList[i])

    else:
        cleanuplist.append(urlList[i])

logmsg("Generated Interdrone clean URL list")
return (cleanuplist)

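The forbes.com URL should fall into the "else:" clause, so it should be added to cleanuplist. But it isn't. Again, this is the only URL that is not added to the new list.

I tried to pick out the Forbes URL specifically with this: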

forbes = re.compile(r"http://www.forbes.com/(?:.*)")

Then I added the following elif statement.

elif forbes.findall(urlList[i]):
    cleanuplist.append(urlList[i])

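However, that did not pick up the Forbes URL either.

So I wonder whether there is some maximum character limit when applying a regex (so that findall is skipped?). I may be wrong. How can I extract the forbes.com URL above?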

Answer 1

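Your restOfThem regex matches the URL you gave: the \# alternative matches the # that appears in the last part of the URL. That is why the URL is skipped. There is no "character limit" (short of Python running out of memory).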
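A quick way to confirm this is to ask the pattern what it matched. The sketch below reuses the pattern and URL from the question; the snake_case variable names are just for illustration.

```python
import re

# Same alternatives as restOfThem in the question.
rest_of_them = re.compile(
    r'\#|youtube|bzmedia|facebook|twitter|mailto|geoconnexion.com|linkedin'
    r'|gplus|resources\.sdtimes\.com|precisionagvision'
)

url = ("http://www.forbes.com/sites/julianmitchell/2016/09/27/"
       "this-startup-uses-drones-to-map-and-manage-massive-construction-projects/"
       "#1ca4d634334e")

# search() returns the leftmost match, so group(0) shows which alternative fired.
match = rest_of_them.search(url)
matched_text = match.group(0) if match else None
print(matched_text)  # '#', the fragment marker near the end of the URL
```

Because the restOfThem test comes first in the if/elif chain, the loop hits continue before any later branch is considered.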

You need to make the regex more restrictive. For example, if your URL were http://www.forbes.com/sites/julianmitchell/2016/09/27/twitter-stock-down, what would happen when it matched the twitter part of your regex?
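One possible way to tighten the filter, sketched here under assumptions, is to test only the host name instead of the whole URL. The blocked_hosts pattern, the is_blocked helper, and the sample URLs are hypothetical and cover just a few of the sites from the question; mailto: links and fragment-only links would still need their own checks.

```python
import re
from urllib.parse import urlsplit

# Hypothetical host blocklist: matches e.g. twitter.com and www.twitter.com,
# but not a path that merely contains the word "twitter".
blocked_hosts = re.compile(r'(?:^|\.)(?:youtube|facebook|twitter|linkedin)\.com$')

def is_blocked(url):
    """Return True when the URL's host name is on the blocklist."""
    host = urlsplit(url).hostname or ""
    return blocked_hosts.search(host) is not None

article = "http://www.forbes.com/sites/julianmitchell/2016/09/27/twitter-stock-down"
social = "https://twitter.com/forbes"

print(is_blocked(article))  # False: "twitter" appears only in the path
print(is_blocked(social))   # True: the host itself is twitter.com
```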

Also, you probably want to use re.search() rather than re.findall().
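A small sketch of the difference, using the hp pattern from the question (the news URL path is made up for the example): search() returns a match object or None, which is all a yes/no test needs, while findall() builds a list of every match.

```python
import re

hp = re.compile(r"http://www.interdrone.com/$")

homepage = "http://www.interdrone.com/"
news = "http://www.interdrone.com/news/some-story"  # hypothetical path

# search() gives a match object or None, so it reads naturally as a condition.
homepage_hit = hp.search(homepage) is not None
news_hit = hp.search(news) is not None

# findall() collects every match into a list, which is unnecessary here.
all_hits = hp.findall(homepage)

print(homepage_hit, news_hit, all_hits)
```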

Also, you don't seem to need the last elif clause, since the same thing happens whether or not it is true.

Finally, the correct way to iterate is for url in urlList: rather than using indices. This is Python, not Java.
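Putting these suggestions together, the loop could look roughly like the sketch below. The clean_urls helper and the sample URLs are hypothetical, and the pattern drops the bare \# alternative on the assumption that URLs with a fragment should be kept; adjust that to whatever the filter is really meant to exclude.

```python
import re

def clean_urls(url_list, skip_patterns):
    """Return the URLs that match none of the compiled skip patterns."""
    cleaned = []
    for url in url_list:  # iterate over the items directly, no index needed
        if any(pattern.search(url) for pattern in skip_patterns):
            continue
        cleaned.append(url)  # one branch replaces the redundant elif/else pair
    return cleaned

hp = re.compile(r"http://www.interdrone.com/$")
# Assumption: same alternatives as the question's restOfThem, minus the bare '#'.
rest_of_them = re.compile(
    r'youtube|bzmedia|facebook|twitter|mailto|geoconnexion.com|linkedin'
    r'|gplus|resources\.sdtimes\.com|precisionagvision'
)

forbes_url = ("http://www.forbes.com/sites/julianmitchell/2016/09/27/"
              "this-startup-uses-drones-to-map-and-manage-massive-construction-projects/"
              "#1ca4d634334e")
news_url = "http://www.interdrone.com/news/example"  # hypothetical
urls = ["http://www.interdrone.com/", news_url, forbes_url, "https://twitter.com/example"]

cleanuplist = clean_urls(urls, [rest_of_them, hp])
print(cleanuplist)  # keeps the news URL and the Forbes URL
```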
