Question

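I'm new to regular expressions, so I hope this isn't an obvious question.

I'm trying to find the neighborhood in craigslist apartment listings. The neighborhood is listed like this: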

(castro / upper market)
</h2>

Here is a sample of the HTML:

<a class="backup" disabled="disabled">&#9650;</a>
<a class="next" disabled="disabled"> next &#9654;</a>
</span>

</section>

<h2 class="postingtitle">
<span class="star"></span>
&#x0024;5224 / 2br - Stunning Furnished 2BR with Hardwwod Floors &amp; Newly  renovated Kitchen (pacific heights)
</h2>
<section class="userbody">
<figure class="iw">


<div class="slidernav">
    <button class="sliderback">&lt;</button>
    <span class="sliderinfo"></span>
    <button class="sliderforward">&gt;</button>

This should find all the different neighborhoods,

but on a full page of HTML it takes too long:

\w+\s?(\/)?\s?\w+\s?(\/)?\s?\w+\s?(\/)?\s?\w+\)\n<\/h2>

# \w+ to find the word 
# \s?(\/)?\s? for a space or space, forward slash, space
# \n<\/h2> because </h2> is uniquely next to the neighborhood in the html

Is there a way to find

</h2>

and then look at the string of text near it?

Any help, or a pointer in the right direction, would be greatly appreciated.

Answer 1

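Use an HTML parser to extract the title (the contents of the h2 tag), then use a regular expression to extract the neighborhood (the text inside the parentheses).

Example (using the BeautifulSoup HTML parser):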

import re
from bs4 import BeautifulSoup
import requests

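# Fetch the listing page and parse it (html.parser ships with Python)
response = requests.get('http://sfbay.craigslist.org/sfc/apa/4849806764.html')
soup = BeautifulSoup(response.content, 'html.parser')

# Capture the text inside the parentheses at the end of the title
pattern = re.compile(r'\((.*?)\)$')
text = soup.find('h2', class_='postingtitle').text.strip()
print(pattern.search(text).group(1))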

This prints pacific heights.

Note the \((.*?)\)$ regular expression: it captures everything inside the parentheses located directly before the end of the string.
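
To see the pattern in isolation, here is a small offline sketch that applies it to the title text from the sample HTML in the question. The title string is copied by hand instead of being fetched, and extract_neighborhood is a hypothetical helper name, not part of any library.

```python
import re

# Capture the text between a "(" and a ")" that sits at the very end of the string.
PATTERN = re.compile(r'\((.*?)\)$')


def extract_neighborhood(title):
    """Return the text in the trailing parentheses, or None if there is none."""
    match = PATTERN.search(title.strip())
    return match.group(1) if match else None


# Title text taken from the sample HTML, with the entities already decoded.
title = ("$5224 / 2br - Stunning Furnished 2BR with Hardwwod Floors "
         "& Newly  renovated Kitchen (pacific heights)\n")
print(extract_neighborhood(title))
```

One caveat: if a title contains an earlier opening parenthesis, the search still starts at the first one, so the captured group can span more than the final pair. Replacing .*? with [^()]* is one way to restrict the match to the last pair.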

With the Scrapy web-scraping framework you can solve it in one line, since the Selector has built-in support for regular expressions. Example from the "Scrapy shell":

$ scrapy shell http://sfbay.craigslist.org/sfc/apa/4849806764.html
In [1]: response.xpath('//h2[@class="postingtitle"]/text()').re(r'\((.*?)\)$')[0]
Out[1]: u'pacific heights'

See also the hundred reasons why regular expressions should not be used for parsing HTML:

RegEx match open tags except XHTML self-contained tags

Answer 2

How about using string.find to get the index of the closing tag, then slicing backwards from that index?

 In [1]: import re

 In [2]: c = "123456</h2>7890"

 In [3]: x = c.find("</h2>")

 In [4]: print(c[x-6:x])
 123456
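
Applied to the listing HTML, the same idea might look like the sketch below: locate </h2> with str.find, then search backwards from that index for the nearest pair of parentheses with str.rfind. It assumes the first </h2> in the text is the one that closes the posting title, which may not hold on the full page. The html snippet is a shortened version of the sample from the question, and neighborhood_before_h2 is a hypothetical helper name.

```python
def neighborhood_before_h2(html):
    """Return the text inside the last (...) before the first </h2>, or None."""
    end_tag = html.find("</h2>")
    if end_tag == -1:
        return None
    # Search backwards, looking only at the text that precedes </h2>.
    close = html.rfind(")", 0, end_tag)
    if close == -1:
        return None
    open_ = html.rfind("(", 0, close)
    if open_ == -1:
        return None
    return html[open_ + 1:close]


html = (
    '<h2 class="postingtitle">\n'
    '<span class="star"></span>\n'
    '&#x0024;5224 / 2br - Stunning Furnished 2BR (pacific heights)\n'
    '</h2>\n'
)
print(neighborhood_before_h2(html))
```

Unlike the fixed six-character slice above, this does not need to know the length of the neighborhood name in advance.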

Answer 3

Assuming your HTML is stored in a variable named page, how about this pattern?

re.findall(r"\(([^\(\)]+)\)\n<\/h2>", page)

For good measure, allow extra whitespace as well:

re.findall(r"\(([^\(\)]+)\)\s*\n\s*<\/h2>", page)

Finally, precompile the pattern:

neighborhoods = re.compile(r"\(([^\(\)]+)\)\s*\n\s*<\/h2>")

# somewhere else, for each page 
for nh in neighborhoods.findall(page):
    print(nh)

For your sample HTML page, this prints the following list of unique neighborhoods:

pacific heights

If there is only one location per page, re.search() would be faster. Keep in mind that search() returns an intermediate match object, not the string itself.
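
A minimal sketch of the re.search() variant, assuming one location per page. The page string here is a shortened stand-in for the real HTML, not the actual listing.

```python
import re

neighborhoods = re.compile(r"\(([^()]+)\)\s*\n\s*</h2>")

# Shortened stand-in for the full page source.
page = 'Newly renovated Kitchen (pacific heights)\n</h2>\n<section class="userbody">'

match = neighborhoods.search(page)  # a match object, or None if nothing matched
if match is not None:
    print(match.group(1))  # group(1) is the text inside the parentheses
```

Checking for None before calling group() avoids an AttributeError on pages where the pattern does not match.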

Regular expression: find a string, then look behind it
