Question

I have some code that uses BS4 to extract pairs of data from an HTML file:

from bs4 import BeautifulSoup
readfile = """
<html>
  <head>
    <meta name="generator"
    <title></title>
  </head>
  <body>

    <table align="center" border="1" cellpadding="0" cellspacing="1" width="650">
  <tr>
    <td>
    <font size="1"> Title1</font>
    <br /> </td>
    <td>
    <font size="1"> TItle2 type</font>
    <br /> </td>
    <td>
    <font size="1"> Title3</font>
    <br /> 
    <font size="2">value1</font></td>
    <td>
    <font size="1"> Title4 ID</font>
    <br /> 
    <font size="2">value2</font></td>
  </tr>
 """

soup = BeautifulSoup(readfile, "html.parser")
tables = soup.findChildren('table')

for title in soup.find_all("font", {"size": "1"}):
    value = title.find_next_sibling("font", {"size": "2"})
    print (title.text, ":", value.text if value else "No Value")

Suppose I have 30 rows in total. I only want 4 value pairs, so that I can insert them into an RDBMS.

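Should I build a list of the size="1" values I want and then fetch the matching size="2" values? I have looked at some BS4 examples, but it is not sinking in. Thanks.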

Answer 1

If you want to put the first four pairs into the RDBMS, a counter variable and a condition are enough, as shown below.

from bs4 import BeautifulSoup

# readfile holds the HTML text, as in the question
soup = BeautifulSoup(readfile, "html.parser")

count = 0
for i in soup.find_all("font", {"size": "1"}):
    value = i.find_next_sibling("font", {"size": "2"})
    # Only a title that has a value counts as a pair
    if value is not None:
        print(i.text.strip(), ":", value.text)
        count = count + 1
        if count == 4:
            break  # stop after the first four pairs
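
Since the goal is to load the pairs into a database rather than print them, the same loop can collect tuples instead. The following is only a sketch under a few assumptions: SAMPLE_HTML is a trimmed-down stand-in for the question's table, the helper names (iter_pairs, first_pairs, store_pairs) and the table name `pairs` are made up for illustration, and sqlite3 in memory stands in for whatever RDBMS is actually used.

```python
import sqlite3
from itertools import islice

from bs4 import BeautifulSoup

# Assumption: a trimmed-down sample in the shape of the question's table
SAMPLE_HTML = """
<table>
  <tr>
    <td><font size="1"> Title1</font><br /> </td>
    <td><font size="1"> Title3</font><br />
    <font size="2">value1</font></td>
    <td><font size="1"> Title4 ID</font><br />
    <font size="2">value2</font></td>
  </tr>
</table>
"""


def iter_pairs(html):
    """Yield (title, value) for each size-1 font followed by a size-2 sibling."""
    soup = BeautifulSoup(html, "html.parser")
    for title in soup.find_all("font", {"size": "1"}):
        value = title.find_next_sibling("font", {"size": "2"})
        if value is not None:  # skip titles that have no value
            yield title.get_text(strip=True), value.get_text(strip=True)


def first_pairs(html, limit=4):
    """Return at most `limit` pairs as a list of tuples."""
    return list(islice(iter_pairs(html), limit))


def store_pairs(conn, pairs):
    """Insert the pairs into a hypothetical table named `pairs`."""
    conn.execute("CREATE TABLE IF NOT EXISTS pairs (title TEXT, value TEXT)")
    conn.executemany("INSERT INTO pairs (title, value) VALUES (?, ?)", pairs)
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real RDBMS
    store_pairs(conn, first_pairs(SAMPLE_HTML))
    print(conn.execute("SELECT title, value FROM pairs").fetchall())
```

Using islice stops the scan as soon as the limit is reached, which plays the same role as the counter above. The parameterized INSERT keeps the scraped text out of the SQL string itself.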

Hope this helps.

How do I extract specific data from an HTML file?
