Question

I need to extract data from a table in an HTML page. The data is always structured like this:

<td class="def">
            <div><b>First Name:</b></div>
        </td>
        <td class="def">Jhon
        </td>

<td class="def">
            <div><b>Last Name:</b></div>
        </td>
        <td class="def">Smith
        </td>

I need to extract each value separately. For example:

print first_name
>> Jhon
print last_name
>> Smith

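A simple soup.find('td', {'class':'def'}) will not work, because the selector matches every cell (First Name:, Jhon, Last Name:, Smith).

Any ideas on how to find a specific value? The same question was posted here, but the solutions given there did not work at all...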

Answer 1

How about this:

>>> tds = soup.find_all('td', {'class':'def'})
>>> [td.find_next_sibling('td', {'class':'def'}).text.strip() \
...     for td in tds if "First Name:" in td.text]
... 
[u'Jhon']
>>> [td.find_next_sibling('td', {'class':'def'}).text.strip() \
...     for td in tds if "Last Name:" in td.text]
... 
[u'Smith']
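
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same sibling-lookup idea. The helper name value_for_label and the HTML constant are made up for illustration. It assumes Python 3 with the bs4 package installed and uses the built-in "html.parser" backend, so treat it as one possible way to wrap the approach rather than the definitive one.

```python
from bs4 import BeautifulSoup

# The fragment from the question, reproduced as a test input (assumption:
# the real page has the same label-cell / value-cell layout).
HTML = """
<td class="def">
    <div><b>First Name:</b></div>
</td>
<td class="def">Jhon
</td>

<td class="def">
    <div><b>Last Name:</b></div>
</td>
<td class="def">Smith
</td>
"""


def value_for_label(soup, label):
    """Return the text of the cell that follows the cell containing `label`.

    Hypothetical helper: returns None if no matching label cell is found
    or the label cell has no following <td class="def"> sibling.
    """
    for td in soup.find_all("td", class_="def"):
        if label in td.get_text():
            value_td = td.find_next_sibling("td", class_="def")
            if value_td is not None:
                return value_td.get_text(strip=True)
    return None


soup = BeautifulSoup(HTML, "html.parser")
first_name = value_for_label(soup, "First Name:")
last_name = value_for_label(soup, "Last Name:")
print(first_name)
print(last_name)
```

Returning None instead of raising makes it easier to handle pages where a label is missing.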

Answer 2

Try this:

First Name:.*?<td class="def">([^\n]+).*?Last Name:.*?<td class="def">([^\n]+)


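Explanation
.: any character except a newline
*: zero or more times
?: once or not at all (directly after *, as in .*?, it makes the match lazy)
( … ): capturing group
[^x]: one character that is not x
\: escapes a special character
+: one or more times

Because . does not match a newline by default, the pattern needs the dot-all (s) flag to match across the line breaks in the HTML above.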
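
As a rough sketch of how the pattern above could be applied from Python, the snippet below compiles it with the standard re module and passes re.DOTALL so that . can cross the line breaks between the cells. The function name extract_names and the HTML constant are hypothetical, and the approach assumes the markup looks exactly like the fragment in the question; a regex like this is brittle if attributes or whitespace change.

```python
import re

# The fragment from the question (assumed to be representative of the page).
HTML = """
<td class="def">
    <div><b>First Name:</b></div>
</td>
<td class="def">Jhon
</td>

<td class="def">
    <div><b>Last Name:</b></div>
</td>
<td class="def">Smith
</td>
"""

# Same pattern as in the answer; re.DOTALL lets "." match newlines.
PATTERN = re.compile(
    r'First Name:.*?<td class="def">([^\n]+)'
    r'.*?Last Name:.*?<td class="def">([^\n]+)',
    re.DOTALL,
)


def extract_names(html):
    """Return (first_name, last_name), or None if the pattern does not match."""
    match = PATTERN.search(html)
    if match is None:
        return None
    # Strip any trailing spaces captured before the end of the line.
    return tuple(group.strip() for group in match.groups())


names = extract_names(HTML)
print(names)
```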

Find a tag's value using specific text (Beautiful Soup)

2 answers: