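Trouble understanding the regular expression method findall

Asked: 2014-09-15 01:16:25

Tags: python html regex html-parsing findall

I'm new to Python, and my professor at school assumes everyone understands the code he posts, but I'm having trouble with the method searchnames, which is supposed to find a certain pattern in an HTML file. This is the code he posted, which he claims does that. I don't know what the symbols in the findall call mean.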

def searchnames(cont):
    try:
        info = re.findall('(\d+)\s(\w+)(\d+,\d+|\d+)\n\s(\w+)\n(\d+,\d+|\d+)', cont)
        return info
    except:
        print "couldn't find child info"
    pass

Here, cont is the contents of the following HTML file:
<head><title>Popular Baby Names</title>
<meta name="dc.language" scheme="ISO639-2" content="eng">
<meta name="dc.creator" content="OACT">
<meta name="lead_content_manager" content="JeffK">
<meta name="coder" content="JeffK">
<meta name="dc.date.reviewed" scheme="ISO8601" content="2006-03-10">
<link rel="stylesheet" href="../OACT/templatefiles/master.css" type="text/css" media="screen">
<link rel="stylesheet" href="../OACT/templatefiles/print.css" type="text/css" media="print">
</head>
<body bgcolor="#ffffff" text="#000000" topmargin="1" leftmargin="0">
<table width="100%" border="0" cellspacing="0" cellpadding="4">
  <tbody>
  <tr> 
    <td class="sstop" valign="bottom" align="left" width="25%">
      Social Security Online
    </td>
    <td valign="bottom" class="titletext">
      <!-- sitetitle -->Popular Baby Names
    </td>
  </tr>
  <tr bgcolor="#333366"><td colspan="2" height="1"></td></tr>
  <tr>
    <td class="graystars" width="25%" valign="top">
       <a href="../OACT/babynames/">Popular Baby Names</a></td>
    <td valign="top"> 
      <a href="http://www.ssa.gov/"><img src="/templateimages/tinylogo.gif"
      width="52" height="47" align="left"
      alt="SSA logo: link to Social Security home page" border="0"></a>
      <h1>Popular Names by Birth Year</h1>September 11, 2014</td>
  </tr>
</tbody></table>
<script type="text/javascript" src="../OACT/babynames/chkinput.js"></script>

<table width="100%" border="0" cellspacing="0" cellpadding="4" summary="formatting">
  <tr valign="top">
    <td width="25%" class="greycell">
      <a href="../OACT/babynames/background.html">Background information</a>
      <p><br />
      &nbsp; Select another <label for="yob">year of birth</label>?<br />
      <form name="popnames" method="post" action="/cgi-bin/popularnames.cgi"
       onSubmit="return submitIt();">
      &nbsp; <input type="text" name="year" id="yob" size="4" value="2012"><input type="hidden" name="top" value="25"><input type="hidden" name="number" value="">
      &nbsp; <input type="submit" value="   Go  "></form>
    </td>
    <td><p align="center"><table width="$tablewidth" border="1" bordercolor="#aaabbb" cellpadding="2" cellspacing="0" summary="Popularity for top 25">
    <caption><h2>Popularity in 2012</h2></caption>
    <tr align="center" valign="bottom">
      <th scope="col" width="12%" bgcolor="#efefef">Rank</th>
      <th scope="col" width="$colwidth" bgcolor="#99ccff">Male name</th>
    <th scope="col" bgcolor="pink" width="41%">Female name</th></tr>
<tr align="right">
 <td>1</td> <td>Jacob</td> <td>Sophia</td>
</tr>
<tr align="right">
 <td>2</td> <td>Mason</td> <td>Emma</td>
</tr>
<tr align="right">
 <td>3</td> <td>Ethan</td> <td>Isabella</td>
</tr>
<tr align="right">
 <td>4</td> <td>Noah</td> <td>Olivia</td>
</tr>
<tr align="right">
 <td>5</td> <td>William</td> <td>Ava</td>
</tr>
<tr align="right">
 <td>6</td> <td>Liam</td> <td>Emily</td>
</tr>
<tr align="right">
 <td>7</td> <td>Michael</td> <td>Abigail</td>
</tr>
<tr align="right">
 <td>8</td> <td>Jayden</td> <td>Mia</td>
</tr>
<tr align="right">
 <td>9</td> <td>Alexander</td> <td>Madison</td>
</tr>
<tr align="right">
 <td>10</td> <td>Aiden</td> <td>Elizabeth</td>
</tr>
<tr align="right">
 <td>11</td> <td>Daniel</td> <td>Chloe</td>
</tr>
<tr align="right">
 <td>12</td> <td>Matthew</td> <td>Ella</td>
</tr>
<tr align="right">
 <td>13</td> <td>Elijah</td> <td>Avery</td>
</tr>
<tr align="right">
 <td>14</td> <td>James</td> <td>Addison</td>
</tr>
<tr align="right">
 <td>15</td> <td>Anthony</td> <td>Aubrey</td>
</tr>
<tr align="right">
 <td>16</td> <td>Benjamin</td> <td>Lily</td>
</tr>
<tr align="right">
 <td>17</td> <td>Joshua</td> <td>Natalie</td>
</tr>
<tr align="right">
 <td>18</td> <td>Andrew</td> <td>Sofia</td>
</tr>
<tr align="right">
 <td>19</td> <td>Joseph</td> <td>Charlotte</td>
</tr>
<tr align="right">
 <td>20</td> <td>David</td> <td>Zoey</td>
</tr>
<tr align="right">
 <td>21</td> <td>Jackson</td> <td>Grace</td>
</tr>
<tr align="right">
 <td>22</td> <td>Logan</td> <td>Hannah</td>
</tr>
<tr align="right">
 <td>23</td> <td>Christopher</td> <td>Amelia</td>
</tr>
<tr align="right">
 <td>24</td> <td>Gabriel</td> <td>Harper</td>
</tr>
<tr align="right">
 <td>25</td> <td>Samuel</td> <td>Lillian</td>
</tr>
<tr><td colspan="3"><small>Note: Rank 1 is the most popular,
rank 2 is the next most popular, and so forth. 
</table></p>
</td></tr></table>
<table class="printhide" width="100%" border="0" cellpadding="1" cellspacing="0">
  <tr bgcolor="#333366"><td height="1" colspan="2"></td></tr>
  <tr>
    <td width="26%" valign="middle">&nbsp;</td>
    <td valign="top" class="seventypercent">
       <a href="http://www.ssa.gov/privacy.html">Privacy Policy</a>&nbsp;
     | <a href="http://www.ssa.gov/websitepolicies.htm">Website Policies
        &amp; Other Important Information</a>&nbsp;
     | <a href="http://www.ssa.gov/sitemap.htm">Site Map</a></td>
  </tr>
</table>
</body></html>

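I can't understand how this is supposed to find the rank or the child, i.e. the rank or the name. When I try to run the program, info comes back empty, and I don't understand why. Any help would be great. Here is my whole program: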

import re
def searchtitle(cont):
    try:        
        title = re.search('Popularity\sin\s(\d\d\d\d)', cont)
        return title.group(0)
    except:
        print "couldn't find title"  
    pass
def searchnames(cont):
    try:
        info = re.findall('(\d+)\s(\w+)(\d+,\d+|\d+)\n\s(\w+)\n(\d+,\d+|\d+)', cont)
        return info
    except:
        print "couldn't find child info"
    pass
if __name__ == '__main__':
    try:
        file = open('Popular_Baby_Names.html')
        cont = file.read()
        file.close()
        ti = searchtitle(cont)
        info = searchnames(cont)
        print ti
        print info
    except:
        print "file couldn't be found"
        SystemExit
    ranks = []
    boysnames = []
    girlsnames = []
    girlsfreq = []
    boysfreq = []
    for info in info:
        ranks.append(int(info[0]))
        boysnames.append(info[1])
        boysfreq.append(int(info[2].replace(',', '')))
        girlsnames.append(info[3])
        girlsfreq.append(int(info[4].replace(',', '')))
    print ranks
    print boysnames
    print boysfreq
    print girlsnames
    print girlsfreq
    print ranks[0] + ranks[1]

    pass
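A possible reading of why info is empty: re.findall returns a list with one tuple per match (one tuple element per capture group) and returns an empty list, without raising anything, when the pattern matches nowhere. The pattern above asks for a rank, one whitespace character, a name directly followed by a count, and then newlines, while the rows in this file look like `<td>1</td> <td>Jacob</td> <td>Sophia</td>` and contain no counts at all. The following is only a sketch in Python 3 syntax; the short test strings are made up for illustration, and `row` is copied from the HTML above.

```python
import re

# The pattern from the question, written as a raw string so the backslashes
# reach the regex engine unchanged.
PATTERN = r'(\d+)\s(\w+)(\d+,\d+|\d+)\n\s(\w+)\n(\d+,\d+|\d+)'

# What the individual symbols mean, shown on tiny made-up strings.
# \d+ : one or more digits
digits = re.findall(r'\d+', 'rank 12 of 25')
# \s : one whitespace character, \w+ : one or more letters, digits or underscores
# ( ) : a capture group; each group becomes one element of the result tuple
pair = re.findall(r'(\d+)\s(\w+)', '1 Jacob')
# a|b : try alternative a first, otherwise alternative b
counts = re.findall(r'\d+,\d+|\d+', '18,899 and 42')

# One row exactly as it appears in the HTML file above.
row = ' <td>1</td> <td>Jacob</td> <td>Sophia</td>\n'
matches = re.findall(PATTERN, row)

print(digits)   # ['12', '25']
print(pair)     # [('1', 'Jacob')]
print(counts)   # ['18,899', '42']
print(matches)  # [] because the rank is followed by "<", not by whitespace
```

Because an empty list is a normal return value rather than an error, the except branch in searchnames never runs, and the later for loop simply has nothing to iterate over.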

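2 Answers:

Answer 0 (score: 3)

Alternatively, if you don't want to dive into the wonderful world of regular expressions, parse the HTML with an HTML parser.

Example using BeautifulSoup: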

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('Popular_Baby_Names.html'))
table = soup.find('table', summary='Popularity for top 25')

boys = []
girls = []
for row in table('tr')[1:-1]:
    cells = row('td')
    boys.append(cells[1].text)
    girls.append(cells[2].text)

print boys
print girls

Prints:

[u'Jacob', u'Mason', u'Ethan', u'Noah', u'William', ... ]
[u'Sophia', u'Emma', u'Isabella', u'Olivia', u'Ava', ... ]

Also, note that there are many reasons why you should not parse HTML with regular expressions.
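If installing BeautifulSoup is not an option, a similar result can probably be had with the html.parser module from the Python 3 standard library. The sketch below is an assumption-laden illustration, not part of the original answer: the class name NameTableParser and the `sample` string (two rows copied from the question's HTML) are made up for the example, and it has only been reasoned through for rows of that shape.

```python
from html.parser import HTMLParser


class NameTableParser(HTMLParser):
    """Hypothetical helper: collect the text of <td> cells, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # cells of the row being read, or None
        self._in_cell = False   # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
            self._in_cell = False
        elif tag == 'td' and self._row is not None:
            self._in_cell = True
            self._row.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_cell = False
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside a <td> is kept; whitespace between tags is ignored.
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


sample = """
<tr align="right">
 <td>1</td> <td>Jacob</td> <td>Sophia</td>
</tr>
<tr align="right">
 <td>2</td> <td>Mason</td> <td>Emma</td>
</tr>
"""

parser = NameTableParser()
parser.feed(sample)

# Keep only rows that look like "rank, boy's name, girl's name".
entries = [r for r in parser.rows if len(r) == 3 and r[0].isdigit()]
ranks = [int(r[0]) for r in entries]
boys = [r[1] for r in entries]
girls = [r[2] for r in entries]

print(ranks)
print(boys)
print(girls)
```

Filtering on three cells with a numeric first cell is one simple way to skip header and footer rows; on the full page, a stricter check (for example on the enclosing table) may be needed.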

Answer 1 (score: 1)

I think you want something like this:

<.*?>(\d{1,2})<.*?>\s*<.*?>(.*?)<.*?>\s*<.*?>(.*?)<.*?>


It captures the rank in group 1, the boy's name in group 2, and the girl's name in group 3.

Explanation
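One way to read the pattern piece by piece is to rewrite it with re.VERBOSE, which allows whitespace and comments inside the pattern. This is a sketch in Python 3 syntax under stated assumptions: the name ROW_PATTERN is made up for illustration, and `sample` holds two rows copied from the HTML in the question.

```python
import re

# The same pattern as above, spread over several lines with comments.
ROW_PATTERN = re.compile(r"""
    <.*?>        # an opening tag, matched lazily up to the first ">"
    (\d{1,2})    # group 1: the rank, one or two digits
    <.*?>        # the closing tag of the rank cell
    \s*          # optional whitespace between cells
    <.*?>        # the opening tag of the next cell
    (.*?)        # group 2: the boy's name, as few characters as possible
    <.*?>        # the closing tag of that cell
    \s*          # optional whitespace between cells
    <.*?>        # the opening tag of the last cell
    (.*?)        # group 3: the girl's name
    <.*?>        # the closing tag of the last cell
""", re.VERBOSE)

sample = (
    ' <td>1</td> <td>Jacob</td> <td>Sophia</td>\n'
    ' <td>10</td> <td>Aiden</td> <td>Elizabeth</td>\n'
)

# findall returns one (rank, boy, girl) tuple per matching row.
rows = ROW_PATTERN.findall(sample)
print(rows)  # [('1', 'Jacob', 'Sophia'), ('10', 'Aiden', 'Elizabeth')]
```

The pattern only checks for "some tag, one or two digits, some tag", so on a full page it could in principle match markup outside the names table; restricting the search to the table's text first would be a safer choice.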