Question

I am trying to scrape an old website that has no standardized output and no style or id attributes; the tables simply look like this:

<table BORDER="0" VALIGN="top" CELLPADDING="3" CELLSPACING="0" WIDTH="100%">
	<tr>
		<td ALIGN="left" VALIGN="top" WIDTH="175">
			<strong>Surname</strong>
		</td>
		<td valign="top">
Bloggs
		</td>
	</tr>
	<tr>
		<td ALIGN="left" VALIGN="top" WIDTH="175">
			<strong>Forename(s)</strong>
		</td>
		<td valign="top">
Joe
		</td>
	</tr>
	<tr>
		<td ALIGN="left" VALIGN="top" WIDTH="175">
			<strong>Title</strong>
		</td>
		<td valign="top">
Mr
		</td>
	</tr>
	<tr>
	    <td ALIGN="left" VALIGN="top" WIDTH="175">
			<strong>Gender</strong>
		</td>
		<td valign="top">
Male
		</td>
	</tr>
	<tr>
		<td ALIGN="left" VALIGN="top" WIDTH="175">
			<strong>Occupation</strong>
		</td>
		<td valign="top">

		</td>
	</tr>
	<tr>
		<td ALIGN="left" VALIGN="top" WIDTH="175">
			<strong>Date of Birth</strong>
		</td>
		<td valign="top">
13/05/12
		</td>
	</tr>
</table>

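The problem is that if a field does not exist in the database, the site does not even render an empty row for it. It also sometimes inserts extra data as an additional table between the two core data tables, with no indication of when this has happened.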

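My approach in Python is a bit long-winded, but the idea is to check the left-hand TD as the heading and grab the right-hand TD, which holds the relevant data, like so: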

title, forename, surname, gender, occupation, dob = '', '', '', '', '', ''

tbl1 = soup.findAll('table')[1]

for tr in tbl1.findAll('tr'):
    content = tr.findAll('td')
    if content[0].text.strip() == 'Title':
        title = content[1].text.strip()
    if content[0].text.strip() == 'Forename(s)':
        forename = content[1].text.strip()
    if content[0].text.strip() == 'Surname':
        surname = content[1].text.strip()
    if content[0].text.strip() == 'Gender':
        gender = content[1].text.strip()
    if content[0].text.strip() == 'Occupation':
        occupation = content[1].text.strip()
    if content[0].text.strip() == 'Date of Birth':
        dob = content[1].text.strip()

print('"' + title + '","' + forename + '","' + surname + '","' + gender + '","' + occupation + '","' + dob + '"')

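Whenever I try to iterate over all the tables, I get: AttributeError: ResultSet object has no attribute 'findAll'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?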

Answer 1

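You can create a list of headers and pair it with the cell values using itertools.zip_longest (named izip_longest in Python 2). The headers must be listed in the same order as the rows appear on the page:

import itertools
import re
from bs4 import BeautifulSoup as soup

# Headers in the same order as the rows of the table in the question.
headers = ['surname', 'forename', 'title', 'gender', 'occupation', 'dob']

# web_data is assumed to hold the HTML of the page as a string.
s = soup(web_data, 'lxml')
# Strip newlines and tabs from the text of every cell.
new_s = [re.sub(r'\n+|\t+', '', i.text) for i in s.findAll('td')]
# Cells alternate label, value; keep the values (odd indices) and pair them with the headers.
final_data = dict(itertools.zip_longest(headers, new_s[1::2]))

Output:

{'surname': 'Bloggs', 'forename': 'Joe', 'title': 'Mr', 'gender': 'Male', 'occupation': '', 'dob': '13/05/12'}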
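
Pairing values by position depends on every row being present, which the question says is not guaranteed. As a rough alternative, the sketch below matches each value to its label instead and loops over every table, which also avoids calling findAll on a ResultSet. The sample HTML, the LABELS mapping and the parse_table helper are assumptions made up for illustration; they are not part of the original site or of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: the second table has no "Title" row at all,
# mimicking a field that is missing from the database.
html = """
<table><tr><td><strong>Surname</strong></td><td>
Bloggs
</td></tr>
<tr><td><strong>Title</strong></td><td>
Mr
</td></tr></table>
<table><tr><td><strong>Surname</strong></td><td>
Smith
</td></tr>
<tr><td>Unrelated single-cell row</td></tr></table>
"""

# Page label -> output field name (labels taken from the question's table).
LABELS = {
    'Title': 'title',
    'Forename(s)': 'forename',
    'Surname': 'surname',
    'Gender': 'gender',
    'Occupation': 'occupation',
    'Date of Birth': 'dob',
}


def parse_table(table):
    """Return a dict of the known fields in one table, '' when a row is absent."""
    record = dict.fromkeys(LABELS.values(), '')
    for tr in table.find_all('tr'):
        cells = tr.find_all('td')
        if len(cells) < 2:
            continue  # skip rows that are not label/value pairs
        field = LABELS.get(cells[0].get_text(strip=True))
        if field:
            record[field] = cells[1].get_text(strip=True)
    return record


soup = BeautifulSoup(html, 'html.parser')
# find_all() returns a ResultSet (a list), so loop over it table by table
# instead of calling find_all() on the ResultSet itself.
records = [parse_table(table) for table in soup.find_all('table')]
for record in records:
    print(record)
```

A table that contains none of the known labels yields a record of empty strings, so the extra tables mentioned in the question could be dropped with a filter such as `[r for r in records if any(r.values())]`.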
