Beautiful Soup code fails to extract the year from <h> tags

Time: 2017-01-12 17:19:40

Tags: python html beautifulsoup

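I am learning Beautiful Soup and Python. In this case I am working through the "Baby Names" exercise from Google's Python regular-expressions tutorial, using a set of HTML files that contain popular baby names for different years (e.g. baby1990.html). If you are interested, the dataset can be found here: https://developers.google.com/edu/python/exercises/baby-names

The HTML files contain a particular table that stores the popular baby names. Its HTML is as follows: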

<table width="100%" border="0" cellspacing="0" cellpadding="4" summary="formatting">
<tr valign="top"><td width="25%" class="greycell">
<a href="../OACT/babynames/background.html">Background information</a>
<p><br />
&nbsp; Select another <label for="yob">year of birth</label>?<br />      
<form method="post" action="/cgi-bin/popularnames.cgi">
&nbsp; <input type="text" name="year" id="yob" size="4" value="1990">
<input type="hidden" name="top" value="1000">
<input type="hidden" name="number" value="">
&nbsp; <input type="submit" value="   Go  "></form>
</td><td>
<h3 align="center">Popularity in 1990</h3>

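I want to loop over all the HTML files in a folder and extract the year stored between the <h3> tags (in some files they are <h2> tags).

I wrote the following code: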

    Years = []  # Initializes an empty list where the Years will be stored
    f = files(path) # Calls the function files() defined earlier
    pattern = re.compile(r'.+(\d\d\d\d)')  # Establishes a regex pattern to extract the Year string from each file
    for file in f:  # loops through the files
        try:
            with open(file,"r") as f: soup = bs(f, 'lxml')  # opens and reads each file in turn from the files list
            h = soup.find_all(re.compile("(h2)|(h3)"))  # Extracts and stores <h3> and <h2> Tags to h ResultSet object
            string = h[0].get_text()  #Passes the first element of the ResultSet to a string variable (only one <h> Tag exists)
            Years.append(pattern.match(string).group(1))   # Extracts the first match (i.e. Year) and appends it to the list
        except:
            Years.append('NaN')
            continue
    Years  # Returns the year

Instead of a list of years, this code returns a list of 'NaN' strings.

The function files() called by the code is as follows:
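Because the bare `except:` replaces every failure with 'NaN', the list alone cannot show which step failed. The sketch below is one way to narrow the handler and record the exception type. The names `extract_year` and `collect_years` are hypothetical, and `extract_year` is only a stand-in for the Beautiful Soup lookup: it treats the whole file content as the heading text.

```python
import re

pattern = re.compile(r'.+(\d\d\d\d)')  # pattern from the question

def extract_year(text):
    # Hypothetical stand-in for the parsing step. Raises AttributeError
    # when the pattern does not match, because match() returns None.
    return pattern.match(text).group(1)

def collect_years(paths):
    """Return (years, problems); problems lists (path, exception name) pairs."""
    years, problems = [], []
    for p in paths:
        try:
            with open(p, "r") as fh:
                years.append(extract_year(fh.read().strip()))
        except (OSError, AttributeError, IndexError) as err:
            # Keep the 'NaN' placeholder, but remember why it was needed.
            years.append('NaN')
            problems.append((p, type(err).__name__))
    return years, problems
```

If every entry in `problems` turns out to be a `FileNotFoundError`, the paths are at fault rather than the parsing.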

def files(path):
    # This function returns a list with the full paths (including the file name) of all the files that are stored in a directory
    # and whose names match a regex pattern.  The function takes the path of the target directory as its argument.

    files = [f for f in os.listdir(path)
             if re.match(r'.+\.html', f)]  # extracts all the filenames matching the pattern and stores them to a list
    files = [path + s for s in files]  # Concatenates the path string to the name of the files
    return files
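One possible cause, offered only as a guess because the question does not show the value of `path`: `path + s` produces a valid path only when `path` already ends with a separator. The sketch below builds the paths with `os.path.join` instead. The function name `html_files` and the directory `data/babies` are hypothetical.

```python
import os
import re

def html_files(path):
    """Variant of files() that builds each full path with os.path.join."""
    names = [n for n in os.listdir(path) if re.match(r'.+\.html', n)]
    return sorted(os.path.join(path, n) for n in names)

# Plain concatenation silently drops the separator when path has no
# trailing slash; os.path.join inserts one where needed.
print("data/babies" + "baby1990.html")  # data/babiesbaby1990.html
print(os.path.join("data/babies", "baby1990.html"))
```

With a missing separator, every `open()` call would fail and the bare `except:` would append 'NaN' for each file, which matches the symptom described.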

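Can you see what is wrong with the code?

Any suggestions would be greatly appreciated.

1 Answer:

Answer 0: (score: 0)

Most of your code works. I just removed the function that finds the HTML files, and it seems to work for me. Change "path\to\file" to your folder and try this.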

from bs4 import BeautifulSoup as bs
import glob, os
import re

pattern = re.compile(r'.+(\d\d\d\d)')
os.chdir(r"path\to\file")  # raw string, so \t and \f are not read as escape sequences
for htmlfile in glob.glob("*.html"):
    print(htmlfile)
    with open(htmlfile, "r") as f:
        soup = bs(f, 'lxml')
    header = soup.find_all(re.compile("(h2)|(h3)"))  # all <h2> and <h3> tags
    string = header[0].get_text()  # text of the first heading
    print(pattern.match(string).group(1))

Output

baby1990.html
1990
baby1992.html
1992
baby1994.html
1994
baby1996.html
1996
baby1998.html
1998
baby2000.html
2000
baby2002.html
2002
baby2004.html
2004
baby2006.html
2006
baby2008.html
2008
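To get a list of years, as the question asked, rather than printed lines, the same loop can be wrapped in a function. This is a sketch that assumes bs4 is installed. The name `years_from_folder` is hypothetical, and `html.parser` is used in place of `lxml` only to avoid the extra dependency.

```python
import glob
import os
import re

from bs4 import BeautifulSoup

pattern = re.compile(r'.+(\d\d\d\d)')  # pattern from the question

def years_from_folder(folder):
    """Return one year string per *.html file, or 'NaN' when none is found."""
    years = []
    for html_path in sorted(glob.glob(os.path.join(folder, "*.html"))):
        with open(html_path, "r") as fh:
            soup = BeautifulSoup(fh, "html.parser")
        heading = soup.find(["h2", "h3"])  # first <h2> or <h3> in the file
        found = pattern.match(heading.get_text()) if heading else None
        years.append(found.group(1) if found else "NaN")
    return years
```

Checking for a missing heading or a failed match explicitly means no `try`/`except` is needed, so a wrong path surfaces as an error instead of a silent 'NaN'.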