Question

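I am trying to extract some contact details from a web page, and I have successfully extracted some of the information using Beautiful Soup.

But I can't extract some of the data because the HTML is not properly structured, so I am using regular expressions. I have spent the last few hours trying to learn regex and I am rather stuck.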

 InstanceBeginEditable name="additional_content" 
<h1>Contact details</h1>
<h2>Diploma coordinator</h2>


                                Mr. Matthew Schultz<br />
<br />
                                    610 Maryhill Drive<br />


                                Green Bay<br />
                                WI<br />
                                United States<br />
                                54303<br />
Contact by email</a><br />
                                Phone (1) 920 429 6158          
                                <hr /><br />

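I need to extract:

Mr. Matthew Schultz

610 Maryhill Drive Green Bay WI United States 54303

and the phone number. I tried things I found through Google searches, but none of them worked (my knowledge of regex is limited, and this is my latest attempt):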

con = ""
for content in contactContent.contents:
    con += str(content)

print con

address = re.search("Mr.\b[a-zA-Z]", con)

print str(address)

Sometimes I get None.

Please help!

P.S. The content is freely available on the web, so there is no copyright violation.

Answer 1

OK, edited to use your data, with the parsing routine wrapped in a function:

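def parse_list(source):
    # Collapse the fragment onto a single line.
    text = ''.join(source.split('\n'))
    # Keep only the text between the closing </h2> and the email link.
    start = text.find('</h2>') + len('</h2>')
    text = text[start:text.find('Contact by email')]
    # Split on <br /> and drop the empty pieces.
    return [line.strip() for line in text.split('<br />') if line.strip()]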

# Parse the page and retrieve contact string from the relevant <div>
con = ''' InstanceBeginEditable name="additional_content" 
<h1>Contact details</h1>
<h2>Diploma coordinator</h2>


                                Mr. Matthew Schultz<br />
<br />
                                    610 Maryhill Drive<br />


                                Green Bay<br />
                                WI<br />
                                United States<br />
                                54303<br />
Contact by email</a><br />
                                Phone (1) 920 429 6158          
                                <hr /><br />'''


# Extract details and print to console

details = parse_list(con)
print(details)

This outputs a list:

['Mr. Matthew Schultz', '610 Maryhill Drive', 'Green Bay', 'WI', 'United States', '54303']
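The list above stops at "Contact by email", so it does not include the phone number the question asks for. A minimal sketch of one way to pick it up with a regular expression follows. The character class (digits, spaces and parentheses) is an assumption based on the single sample in the question, so other number formats may need a different pattern.

```python
import re

# Tail of the HTML fragment from the question.
con = '''Contact by email</a><br />
                                Phone (1) 920 429 6158          
                                <hr /><br />'''

# "Phone", some whitespace, then a run of digits, spaces and parentheses.
# Assumption: phone numbers on the page look like the sample above.
m = re.search(r'Phone\s+([\d() ]+)', con)

# re.search returns None when nothing matches, so guard before using the group.
phone = m.group(1).strip() if m else None
print(phone)
```

The guard on `m` matters: calling `.group()` on a failed search raises an AttributeError.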

Answer 2

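You asked about doing this with regular expressions. Assuming you get a new multi-line string containing this data for each div, you can extract the data like this: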

import re

# Raw string so that \s reaches the regex engine unchanged.
m = re.search(r'</h2>\s+(.*?)<br />\s+<br />\s+(.*?)<br />\s+(.*?)<br />\s+(.*?)<br />\s+(.*?)<br />\s+(.*?)<br />', con)
if m:
    print(m.groups())

Output:

('Mr. Matthew Schultz', '610 Maryhill Drive', 'Green Bay', 'WI', 'United States', '54303')

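I see you are just getting started with regular expressions. The key thing to remember is that you generally define a character or a group of characters, followed by a quantifier that says how many times you want it to repeat. In this case, we start with </h2>, followed by \s+, which tells the regex engine we want one or more whitespace characters (including newlines). The only other nuance here is that the next expression, (.*?), is a lazy catch-all: it grabs anything up to the point where the next part of the pattern, <br />, can match.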
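To make the lazy versus greedy difference concrete, here is a small sketch on a single-line string made up from the question's address fields (the `text` variable is illustrative only):

```python
import re

text = 'Green Bay<br />WI<br />United States<br />'

# Greedy: .* runs to the end of the line, then backs off
# only as far as the LAST <br />.
greedy = re.search(r'(.*)<br />', text).group(1)

# Lazy: .*? grows one character at a time and stops
# at the FIRST <br /> it reaches.
lazy = re.search(r'(.*?)<br />', text).group(1)

print(greedy)  # Green Bay<br />WI<br />United States
print(lazy)    # Green Bay
```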

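Edit: Also, you should be able to clean up the regex by taking advantage of the fact that all the address information after the name is in a uniform format. I played with it a bit but didn't get it working; if you want to improve it, that would be one way to go.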
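One possible reading of "uniform format" is that every field is simply some text followed by <br />. Under that assumption, a sketch using re.split instead of one long pattern might look like this. It is not the regex the answer had in mind, just one way to exploit the repetition, and it assumes the block always sits between </h2> and "Contact by email" as in the sample.

```python
import re

con = '''<h2>Diploma coordinator</h2>


                                Mr. Matthew Schultz<br />
<br />
                                    610 Maryhill Drive<br />


                                Green Bay<br />
                                WI<br />
                                United States<br />
                                54303<br />
Contact by email</a><br />'''

# Keep only the text between the heading and the email link.
# re.DOTALL lets . match newlines, so the block can span lines.
block = re.search(r'</h2>(.*?)Contact by email', con, re.DOTALL).group(1)

# Split on <br />, tolerating <br> and <br/> as well.
parts = re.split(r'<br\s*/?>', block)

# Strip surrounding whitespace and drop the empty pieces.
fields = [p.strip() for p in parts if p.strip()]
print(fields)
```

This no longer depends on the number of address lines, so a contact with an extra or missing line should still parse.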

Regular expressions in Python to extract data
