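I am learning Beautiful Soup and Python. In this case I am working on the "Baby Names" exercise from Google's Python tutorial on regular expressions, using a set of HTML files that contain the popular baby names for different years (e.g. baby1990.html). If you are interested, the dataset can be found here: https://developers.google.com/edu/python/exercises/baby-names
Each HTML file contains a table that stores the popular baby names. The HTML around it looks like this: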
<table width="100%" border="0" cellspacing="0" cellpadding="4" summary="formatting">
<tr valign="top"><td width="25%" class="greycell">
<a href="../OACT/babynames/background.html">Background information</a>
<p><br />
Select another <label for="yob">year of birth</label>?<br />
<form method="post" action="/cgi-bin/popularnames.cgi">
<input type="text" name="year" id="yob" size="4" value="1990">
<input type="hidden" name="top" value="1000">
<input type="hidden" name="number" value="">
<input type="submit" value=" Go "></form>
</td><td>
<h3 align="center">Popularity in 1990</h3>
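I want to loop over all the HTML files in a folder and extract the year, which is stored between <h3> tags (in some files, <h2> tags).
I wrote the following code: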
Years = []  # Initializes an empty list where the years will be stored
f = files(path)  # Calls the function files() defined earlier
pattern = re.compile(r'.+(\d\d\d\d)')  # Establishes a regex pattern to extract the year string from each file
for file in f:  # Loops through the files
    try:
        with open(file, "r") as f:
            soup = bs(f, 'lxml')  # Opens and reads each file in turn from the files list
        h = soup.find_all(re.compile("(h2)|(h3)"))  # Extracts and stores <h3> and <h2> tags in the ResultSet h
        string = h[0].get_text()  # Passes the first element of the ResultSet to a string variable (only one <h> tag exists)
        Years.append(pattern.match(string).group(1))  # Extracts the first match (i.e. the year) and appends it to the list
    except:
        Years.append('NaN')
        continue
Years  # Returns the years
Instead of a list of years, this code returns a list of the string 'NaN'.
The function files() called by the code is as follows:
def files(path):
    # This function returns a list with the full paths (including the file name) of all the files that are stored in a directory
    # and whose names match a regex pattern. The function takes the path of the target directory as its argument.
    files = [f for f in os.listdir(path)
             if re.match(r'.+\.html', f)]  # Extracts all the file names matching the pattern and stores them in a list
    files = [path + s for s in files]  # Concatenates the path string and the file names
    return files
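One detail of files() worth checking, offered as a sketch rather than a confirmed diagnosis: path + s only produces a valid path when path already ends with a separator. Assuming that is not guaranteed, os.path.join avoids the issue. The name html_files is made up for this illustration and is not part of the original code.

```python
import os
import re


def html_files(path):
    """Return the full paths of the .html files directly inside `path`.

    Hypothetical variant of files(), for illustration only.
    """
    names = [name for name in os.listdir(path)
             if re.match(r'.+\.html', name)]  # same pattern as files()
    # os.path.join adds the separator when `path` has no trailing one.
    return sorted(os.path.join(path, name) for name in names)
```

Calling html_files on the data folder with and without a trailing separator should then give paths that open() can actually find.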
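Can you see what is wrong with the code?
Any suggestions would be much appreciated.
Answer 0 (score: 0)
Most of your code works. I just removed the function that finds the HTML files, and it seems to work for me. Change "path\to\file" to your folder and try this.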
from bs4 import BeautifulSoup as bs
import glob, os
import re

pattern = re.compile(r'.+(\d\d\d\d)')
os.chdir(r"path\to\file")  # raw string, so \t and \f are not read as escape sequences
for htmlfile in glob.glob("*.html"):
    print(htmlfile)  # file name
    with open(htmlfile, "r") as f:
        soup = bs(f, 'lxml')
    header = soup.find_all(re.compile("(h2)|(h3)"))  # <h2> or <h3> tags
    string = header[0].get_text()
    print(pattern.match(string).group(1))  # the four-digit year
Output
baby1990.html
1990
baby1992.html
1992
baby1994.html
1994
baby1996.html
1996
baby1998.html
1998
baby2000.html
2000
baby2002.html
2002
baby2004.html
2004
baby2006.html
2006
baby2008.html
2008
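If the original loop still yields only 'NaN', the bare except is hiding the real error (for example a file that cannot be found, a missing parser, or a heading that is absent). A minimal sketch of one way to make the failure visible while keeping the 'NaN' placeholder: collect_years, year_from_heading and the read_heading argument are hypothetical names introduced here, and only the regular expression is taken from the question.

```python
import re

# Same pattern as in the question: greedy prefix, then four digits.
PATTERN = re.compile(r'.+(\d\d\d\d)')


def year_from_heading(text):
    """Return the year captured from a heading such as 'Popularity in 1990'."""
    match = PATTERN.match(text)
    if match is None:
        raise ValueError('no four-digit year in %r' % text)
    return match.group(1)


def collect_years(paths, read_heading):
    """Return (years, errors); a failed file gets 'NaN' plus a recorded error.

    read_heading is a hypothetical callable that takes a path and returns
    the heading text, e.g. a wrapper around the Beautiful Soup lookup.
    """
    years, errors = [], []
    for path in paths:
        try:
            years.append(year_from_heading(read_heading(path)))
        except Exception as err:  # broad on purpose, but no longer silent
            years.append('NaN')
            errors.append((path, repr(err)))
    return years, errors
```

With the Beautiful Soup code above, read_heading could open the file and return the text of the first <h2> or <h3> tag. Printing the errors list should then show which step fails for each file.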