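I am trying to write a regular expression that extracts the telephone, street address, and pages values (9440717256, H.No. 3-11-62, RTC Colony, ...) from an HTML page in Python. All three fields are optional. I tried the following regex, but its output is inconsistent: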
telephone\S+>(.+)</em>.*(?:streetAddress\S+(.+)</span>)?.*(?:pages\S+>(.+)</a></span>)?
Sample string:
<em phone="**telephone**">9440717256</em></div></div></li><li class="row"><i class="icon-sm icon-address"></i><div class="profile-details"><strong>Address</strong><div class="profi`enter code here`le-child"><address itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress" class="data-item"><span itemprop="**streetAddress**">H.No. 3-11-62, RTC Colony</span>, <span>Vastu Colony, </span><span class="text-black" itemprop="addressLocality"><a href="/hyderabad/lal-bahadur-nagar/allcategory.aspx" title="**Pages**">Lal Bahadur Nagar</a></span>
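Can someone help me build the regex?
Answer 0 (score: 3)
Since your input is not valid HTML and may vary, you could use an HTML parser such as BeautifulSoup. Note that if your input changes, you will have to adapt these simple selectors.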
from bs4 import BeautifulSoup
h = """<em phone="**telephone**">9440717256</em></div></div></li><li class="row"><i class="icon-sm icon-address"></i><div class="profile-details"><strong>Address</strong><div class="profi`enter code here`le-child"><address itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress" class="data-item"><span itemprop="**streetAddress**">H.No. 3-11-62, RTC Colony</span>, <span>Vastu Colony, </span><span class="text-black" itemprop="addressLocality"><a href="/hyderabad/lal-bahadur-nagar/allcategory.aspx" title="**Pages**">Lal Bahadur Nagar</a></span>"""
soup = BeautifulSoup(h, "html.parser")  # name the parser explicitly for consistent results
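Edit: since you have now told us that you need the text of elements with specific attribute values, you can pass a function as a filter to find: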
def find_phone(tag):
    return tag.has_attr("phone") and tag.get("phone") == "**telephone**"

def find_streetAddress(tag):
    return tag.has_attr("itemprop") and tag.get("itemprop") == "**streetAddress**"

def find_pages(tag):
    return tag.has_attr("title") and tag.get("title") == "**Pages**"

print(soup.find(find_phone).string)
print(soup.find(find_streetAddress).string)
print(soup.find(find_pages).string)
Output:
9440717256
H.No. 3-11-62, RTC Colony
Lal Bahadur Nagar
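Because the question says all three fields are optional, a lookup that returns nothing needs handling: `soup.find(...)` returns `None` for a missing element, and calling `.string` on it raises `AttributeError`. As a rough sketch of the same attribute-based lookup that tolerates missing fields, here is a version built on the standard library's `html.parser` instead of BeautifulSoup (a swapped-in technique, not what the answer above uses). The names `TARGETS`, `FieldParser`, and `extract_fields` are hypothetical, and the sample is shortened so that the pages element is absent.

```python
from html.parser import HTMLParser

# (attribute name, attribute value) -> field name.
# The attribute values are copied from the sample markup; the field names are made up.
TARGETS = {
    ("phone", "**telephone**"): "phone",
    ("itemprop", "**streetAddress**"): "street",
    ("title", "**Pages**"): "pages",
}

class FieldParser(HTMLParser):
    """Collect the text that directly follows a start tag carrying a target attribute."""

    def __init__(self):
        super().__init__()
        self.fields = {name: None for name in TARGETS.values()}
        self._pending = None  # field whose text is expected next

    def handle_starttag(self, tag, attrs):
        self._pending = None
        for attr in attrs:  # attrs is a list of (name, value) tuples
            if attr in TARGETS:
                self._pending = TARGETS[attr]

    def handle_data(self, data):
        # Keep only the first hit for each field.
        if self._pending and self.fields[self._pending] is None:
            self.fields[self._pending] = data.strip()
        self._pending = None

    def handle_endtag(self, tag):
        self._pending = None

def extract_fields(markup):
    """Return a dict of field -> text, with None for fields that are absent."""
    parser = FieldParser()
    parser.feed(markup)
    parser.close()
    return parser.fields

# Shortened sample without the pages element.
sample = (
    '<em phone="**telephone**">9440717256</em>'
    '<span itemprop="**streetAddress**">H.No. 3-11-62, RTC Colony</span>'
)
print(extract_fields(sample))
# {'phone': '9440717256', 'street': 'H.No. 3-11-62, RTC Colony', 'pages': None}
```

With BeautifulSoup itself, the equivalent guard would be to check the result of `soup.find(...)` against `None` before reading `.string`.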
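Answer 1 (score: 1)
If you know the HTML provider and what the markup looks like, you can use a regular expression.
In that case, just use alternation and named capture groups.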
telephone[^>]*>(?P<Telephone>[^<]+)|streetAddress[^>]*>(?P<Address>[^<]+)|Pages[^>]*>(?P<Pages>[^<]+)
See the demo.
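A minimal sketch of how the single-line pattern above might be applied in Python 3. Each alternative contains exactly one named group, so `m.lastgroup` tells which field a given match belongs to, and a field with no match simply stays `None`. The helper name `extract` and the shortened sample (which leaves out the street address) are assumptions for illustration.

```python
import re

# Single-line form of the alternation pattern from this answer.
PATTERN = re.compile(
    r"telephone[^>]*>(?P<Telephone>[^<]+)"
    r"|streetAddress[^>]*>(?P<Address>[^<]+)"
    r"|Pages[^>]*>(?P<Pages>[^<]+)"
)

def extract(text):
    """Return {group name: first captured text}, with None where a field is missing."""
    found = dict.fromkeys(PATTERN.groupindex)  # every named group starts as None
    for m in PATTERN.finditer(text):
        # Exactly one named group participates in each match.
        if found[m.lastgroup] is None:
            found[m.lastgroup] = m.group(m.lastgroup)
    return found

# Shortened sample: telephone and pages are present, the street address is not.
sample = (
    '<em phone="**telephone**">9440717256</em>'
    '<a href="/hyderabad/lal-bahadur-nagar/allcategory.aspx" title="**Pages**">'
    'Lal Bahadur Nagar</a>'
)
print(extract(sample))
```

For this sample the result should map `Telephone` and `Pages` to their text and `Address` to `None`, and the order of the nodes in the input does not matter.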
If `>` is not serialized, you can use this regex (a more generic one; edit: now in verbose form):
telephone[^<]*> # Looking for telephone
(?P<Telephone>[^<]+) # Capture telephone (all text up to the next tag)
|
streetAddress[^<]*> # Looking for streetAddress
(?P<Address>[^<]+) # Capture address (all text up to the next tag)
|
Pages[^<]*> # Looking for Pages
(?P<Pages>[^<]+) # Capture Pages (all text up to the next tag)
Python code using this regex:
import re

p = re.compile(r'''telephone[^<]*> # Looking for telephone
(?P<Telephone>[^<]+) # Capture telephone (all text up to the next tag)
|
streetAddress[^<]*> # Looking for streetAddress
(?P<Address>[^<]+) # Capture address (all text up to the next tag)
|
Pages[^<]*> # Looking for Pages
(?P<Pages>[^<]+) # Capture Pages (all text up to the next tag)''', re.IGNORECASE | re.VERBOSE)
test_str = "YOUR STRING"
# Each match fills only one named group, so skip the matches where the group is None.
print([m.group("Telephone") for m in p.finditer(test_str) if m.group("Telephone")])
print([m.group("Address") for m in p.finditer(test_str) if m.group("Address")])
print([m.group("Pages") for m in p.finditer(test_str) if m.group("Pages")])
Output (the results are doubled because I duplicated the input string with a different node order):
['9440717256', '9440717256']
['H.No. 3-11-62, RTC Colony', 'H.No. 3-11-62, RTC Colony']
['Lal Bahadur Nagar', 'Lal Bahadur Nagar']
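As a side note on why the pattern in the question behaves inconsistently: each optional group `(?: ... )?` is preceded by a greedy `.*`. The `.*` first consumes the rest of the line, and since the group that follows is allowed to match nothing, the engine accepts that result without backtracking. The small check below illustrates this on a shortened version of the sample (the shortening is mine): the street address is present, yet its group stays empty.

```python
import re

# The pattern from the question, split across two literals for readability.
original = re.compile(
    r"telephone\S+>(.+)</em>.*(?:streetAddress\S+(.+)</span>)?"
    r".*(?:pages\S+>(.+)</a></span>)?"
)

# Shortened sample that does contain a street address.
sample = (
    '<em phone="**telephone**">9440717256</em>'
    '<span itemprop="**streetAddress**">H.No. 3-11-62, RTC Colony</span>'
)

m = original.search(sample)
print(m.groups())  # ('9440717256', None, None)
```

Alternation avoids this because each field is matched on its own rather than as an optional tail of one long match.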