BeautifulSoup: validating the "title" td to extract values from multiple tables

Date: 2017-12-15 21:44:33

Tags: python html beautifulsoup web-crawler

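I am trying to scrape an old website whose output is not standardised and which has no style or id attributes. The data is simply displayed like this: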

<table BORDER="0" VALIGN="top" CELLPADDING="3" CELLSPACING="0" WIDTH="100%">
	<tr>
		<td ALIGN="left" VALIGN="top" WIDTH="175">
			<strong>Surname</strong>
		</td>
		<td valign="top">
Bloggs
		</td>
	</tr>
	<tr>
		<td ALIGN="left" VALIGN="top" WIDTH="175">
			<strong>Forename(s)</strong>
		</td>
		<td valign="top">
Joe
		</td>
	</tr>
	<tr>
		<td ALIGN="left" VALIGN="top" WIDTH="175">
			<strong>Title</strong>
		</td>
		<td valign="top">
Mr
		</td>
	</tr>
	<tr>
	    <td ALIGN="left" VALIGN="top" WIDTH="175">
			<strong>Gender</strong>
		</td>
		<td valign="top">
Male
		</td>
	</tr>
	<tr>
		<td ALIGN="left" VALIGN="top" WIDTH="175">
			<strong>Occupation</strong>
		</td>
		<td valign="top">

		</td>
	</tr>
	<tr>
		<td ALIGN="left" VALIGN="top" WIDTH="175">
			<strong>Date of Birth</strong>
		</td>
		<td valign="top">
13/05/12
		</td>
	</tr>
</table>

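The problem is that if a field does not exist in the database, the page does not even show an empty row for it. In addition, some extra data is sometimes inserted as an additional table between the two core data tables, and nothing indicates when this happens.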

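My Python approach is a bit long-winded, but the idea is to check the left-hand TD as the heading and then grab the right-hand TD, which holds the relevant data, as follows: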

title, forename, surname, gender, occupation, dob = '', '', '', '', '', ''

tbl1 = soup.findAll('table')[1]

for tr in tbl1.findAll('tr'):
    content = tr.findAll('td')
    if content[0].text.strip() == 'Title':
        title = content[1].text.strip()
    if content[0].text.strip() == 'Forename(s)':
        forename = content[1].text.strip()
    if content[0].text.strip() == 'Surname':
        surname = content[1].text.strip()
    if content[0].text.strip() == 'Gender':
        gender = content[1].text.strip()
    if content[0].text.strip() == 'Occupation':
        occupation = content[1].text.strip()
    if content[0].text.strip() == 'Date of Birth':
        dob = content[1].text.strip()

print('"' + title + '","' + forename + '","' + surname + '","' + gender + '","' + occupation + '","' + dob + '"')

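Whenever I try to iterate over all of the tables, I get: AttributeError: ResultSet object has no attribute 'findAll'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?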
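
This error appears when `findAll` is called on the result of another `findAll`: the result is a `ResultSet` (a list of tags), and only an individual tag has a `findAll` method. A minimal sketch of looping over every table and then over each table's rows follows; the sample markup and the variable names are assumptions made for illustration, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: two small tables standing in for the real page.
html = """
<table><tr><td><strong>Surname</strong></td><td>Bloggs</td></tr></table>
<table><tr><td><strong>Title</strong></td><td>Mr</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
# find_all returns a list-like ResultSet, so call find_all on each
# element of it rather than on the ResultSet itself.
for table in soup.find_all("table"):
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) >= 2:  # skip rows that are not label/value pairs
            label = cells[0].get_text(strip=True)
            value = cells[1].get_text(strip=True)
            rows.append((label, value))

print(rows)
```

`find_all` is the current spelling of `findAll`; both names refer to the same method in Beautiful Soup 4.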

1 answer:

Answer 0 (score: 0)

You can create a list of headers and use itertools.izip_longest:

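import itertools
import re
from bs4 import BeautifulSoup as soup
# Field names, listed in the same order as the rows of the table in the question.
headers = ['surname', 'forename', 'title', 'gender', 'occupation', 'dob']
# web_data is assumed to hold the HTML shown in the question.
s = soup(web_data, 'lxml')
# Strip newlines and tabs from the text of every td.
new_s = [re.sub('\n+|\t+', '', i.text) for i in s.findAll('td')]
# Odd-indexed cells hold the values; pair them with the headers by position (Python 2; on Python 3 use itertools.zip_longest).
final_data = {a:b for a, b in itertools.izip_longest(headers, [c for i, c in enumerate(new_s) if i%2 != 0])}

Output:

{'surname': u'Bloggs', 'title': u'Mr', 'dob': u'13/05/12', 'gender': u'Male', 'forename': u'Joe', 'occupation': u''}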
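
Pairing values with headers by position depends on every row being present and in a fixed order, and the question says that rows can be missing and that extra tables can appear. One alternative, sketched below, is to look each value up by the label text in its left-hand cell across all tables. This is not the method from the answer above; the `LABELS` mapping, the `extract_fields` helper, and the sample markup are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Maps the label text shown on the page to an output field name.
LABELS = {
    "Title": "title",
    "Forename(s)": "forename",
    "Surname": "surname",
    "Gender": "gender",
    "Occupation": "occupation",
    "Date of Birth": "dob",
}


def extract_fields(html):
    """Return a dict of field values, using '' for rows that are absent."""
    soup = BeautifulSoup(html, "html.parser")
    record = {field: "" for field in LABELS.values()}
    # Walk every row of every table; unknown labels are simply ignored.
    for tr in soup.find_all("tr"):
        cells = tr.find_all("td", recursive=False)
        if len(cells) < 2:
            continue
        label = cells[0].get_text(strip=True)
        if label in LABELS:
            record[LABELS[label]] = cells[1].get_text(strip=True)
    return record


# Hypothetical sample: some rows missing, plus an unrelated table in between.
sample = """
<table>
<tr><td><strong>Surname</strong></td><td>
Bloggs
</td></tr>
<tr><td><strong>Title</strong></td><td>Mr</td></tr>
</table>
<table><tr><td>Unrelated</td><td>extra data</td></tr></table>
<table><tr><td><strong>Date of Birth</strong></td><td>13/05/12</td></tr></table>
"""

record = extract_fields(sample)
print(record)
```

Because each value is keyed on its label, a missing row leaves that field as an empty string, and a table whose labels are not in `LABELS` contributes nothing.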