Question

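I'm writing a crawler to fetch certain parts of an HTML file, but I can't figure out how to use re.findall().

Here is an example. When I want to find all the <div>...</div> parts in the file, I might write something like this: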

re.findall("<div>.*\</div>", result_page)

If result_page is the string "<div> </div> <div> </div>", the result is

['<div> </div> <div> </div>']

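That is just the entire string as a single match. This is not what I want; I expected the two divs to be returned separately. What should I do?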

Answer 1

Quoting the documentation:

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.

Just add the question mark:

In [6]: re.findall("<div>.*?</div>", result_page)
Out[6]: ['<div> </div>', '<div> </div>']

Also, you shouldn't use regular expressions to parse HTML, since HTML parsers are made exactly for that. An example using BeautifulSoup 4:

In [7]: import bs4
In [8]: [str(tag) for tag in bs4.BeautifulSoup(result_page, "html.parser")('div')]
Out[8]: ['<div> </div>', '<div> </div>']
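The difference between the greedy and non-greedy forms can be compared side by side. The following is a minimal sketch using only the standard re module; the variable names and the nested example string are illustrative assumptions, not part of the original question.

```python
import re

result_page = "<div> </div> <div> </div>"

# Greedy: .* runs on to the LAST </div> in the string, so there is one big match.
greedy = re.findall(r"<div>.*</div>", result_page)

# Non-greedy: .*? stops at the FIRST </div> it can reach, so each div matches separately.
lazy = re.findall(r"<div>.*?</div>", result_page)

print(greedy)  # ['<div> </div> <div> </div>']
print(lazy)    # ['<div> </div>', '<div> </div>']

# Illustrative nested input (an assumption for this sketch): the non-greedy
# pattern pairs the outer <div> with the inner </div>, cutting the element short.
nested = re.findall(r"<div>.*?</div>", "<div><div>a</div></div>")
print(nested)  # ['<div><div>a</div>']
```

The last case suggests why a regular expression is a fragile tool for HTML: it has no notion of nesting, so the non-greedy fix only works for flat, simple markup.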

Answer 2

* is a greedy operator; you want *? for a non-greedy match:

re.findall("<div>.*?</div>", result_page)

Alternatively, use a parser such as BeautifulSoup rather than a regular expression for this task.
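As a rough illustration of the parser-based approach, the sketch below uses the standard library's html.parser module in place of BeautifulSoup (a substitution made here only to avoid a third-party dependency). The DivCollector class and the sample input are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class DivCollector(HTMLParser):
    """Hypothetical helper: collects the text inside each top-level <div>."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many <div> elements are currently open
        self.divs = []   # text content of each completed top-level div
        self._buf = []   # text pieces of the div being read

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth == 0:
                self._buf = []  # a new top-level div starts here
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.divs.append("".join(self._buf))

    def handle_data(self, data):
        if self.depth > 0:
            self._buf.append(data)


collector = DivCollector()
collector.feed("<div>first</div> <div>second <div>nested</div></div>")
collector.close()
print(collector.divs)  # ['first', 'second nested']
```

Because the parser tracks open and close tags, nested divs are handled correctly, which the regular expression cannot do.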

Python re.findall() returns the entire string as a single match
