Question

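I'm trying to build a web scraper. It must find all lines that correspond to the selected tags and save them to a new file.md file in the same order as in the original HTML.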

The tags are specified in an array:

list_of_tags_you_want_to_scrape = ['h1', 'h2', 'h3', 'p', 'li']

Then this gives me only the content inside the specified tag:

soup_each_html = BeautifulSoup(particular_page_content, "html.parser")
inner_content = soup_each_html.find("article", "container")

Let's say this is the result:

<article class="container">
  <h1>this is headline 1</h1>
  <p>this is paragraph</p>
  <h2>this is headline 2</h2>
  <a href="bla.html">this won't be shown bcs 'a' tag is not in the array</a>
</article>

Then I have a method that is responsible for writing the lines to the file.md file if a tag from the array is present in the content:

with open("file.md", 'a+') as f:
    for tag in list_of_tags_you_want_to_scrape:
        inner_content_tag = inner_content.find_all(tag)

        for x in inner_content_tag:
            f.write(str(x))
            f.write("\n")

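And it does. The problem is that it loops over the array (for each) and saves all <h1> elements first, all <h2> elements second, and so on, because that is the order specified in the list_of_tags_you_want_to_scrape array.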

This would be the result:

<article class="container">
  <h1>this is headline 1</h1>
  <h2>this is headline 2</h2>
  <p>this is paragraph</p>
</article>

So I would like to have them in the correct order, as in the original HTML. After the first <h1> there should be the <p> element.

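That means I probably also need a for-each loop over inner_content, checking whether each line in that inner_content matches at least one of the tags in the array. If it does, save it and move on to the next line. I tried to do that, iterating over inner_content line by line, but it gave me an error and I'm not sure what the correct approach is. (First day using the BeautifulSoup module.)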

Any tips or suggestions on how to modify my method to achieve this? Thanks!

Answer 1

To keep the original order of the HTML input, you can recursively traverse the soup.contents attribute:

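from bs4 import BeautifulSoup as soup

def parse(content, to_scrape=('h1', 'h2', 'h3', 'p', 'li')):
    # Yield the node itself if its tag name is one we want.
    if content.name in to_scrape:
        yield content
    # Recurse into the children; nodes without a contents attribute yield nothing.
    for i in getattr(content, 'contents', []):
        yield from parse(i, to_scrape)  # pass to_scrape along so a custom list is respected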

Example:

html = """   
<html>
  <body>
      <h1>My website</h1>
      <p>This is my first site</p>
      <h2>See a listing of my interests below</h2>
      <ul>
         <li>programming</li>
         <li>math</li>
         <li>physics</li>
      </ul>
      <h3>Thanks for visiting!</h3>
  </body>
</html>
"""

result = list(parse(soup(html, 'html.parser')))

Output:

[<h1>My website</h1>, <p>This is my first site</p>, <h2>See a listing of my interests below</h2>, <li>programming</li>, <li>math</li>, <li>physics</li>, <h3>Thanks for visiting!</h3>]

As you can see, the original order of the HTML is preserved, and the result can now be written to the file:

with open('file.md', 'w') as f:
   f.write('\n'.join(map(str, result)))
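To tie this back to the markup from the question, here is a minimal sketch that runs the same recursive generator over the question's <article class="container"> snippet. The names question_html and lines are made up for illustration, and parse is repeated so the block runs on its own.

```python
from bs4 import BeautifulSoup

def parse(content, to_scrape=('h1', 'h2', 'h3', 'p', 'li')):
    # Yield the node itself when its tag name is one of the wanted tags.
    if content.name in to_scrape:
        yield content
    # Nodes without a contents attribute (e.g. plain text) contribute no children.
    for child in getattr(content, 'contents', []):
        yield from parse(child, to_scrape)

# Hypothetical variable holding the HTML shown in the question.
question_html = """
<article class="container">
  <h1>this is headline 1</h1>
  <p>this is paragraph</p>
  <h2>this is headline 2</h2>
  <a href="bla.html">this won't be shown bcs 'a' tag is not in the array</a>
</article>
"""

inner_content = BeautifulSoup(question_html, "html.parser").find("article", "container")
lines = [str(tag) for tag in parse(inner_content)]
print("\n".join(lines))
```

This should print the <h1>, <p> and <h2> lines in document order and skip the <a> element, which is the order the question asks for.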

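Each bs4 tag object has a name and a contents attribute, among others. The name attribute is the tag name itself, while contents stores all of the tag's child nodes. parse is a generator: it first checks whether the bs4 object passed in has a tag name that belongs to the to_scrape list and, if so, yields that object. Finally, parse iterates over the contents of content and calls itself on each element.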
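As an alternative to the recursive generator (not part of the original answer), find_all also accepts a list of tag names and returns the matches in document order, which may be enough for this use case. The sketch below assumes the question's markup; tags_to_scrape, matches and names are illustrative names.

```python
from bs4 import BeautifulSoup

# Markup modelled on the question's example.
html = """
<article class="container">
  <h1>this is headline 1</h1>
  <p>this is paragraph</p>
  <h2>this is headline 2</h2>
  <a href="bla.html">skipped</a>
</article>
"""

tags_to_scrape = ['h1', 'h2', 'h3', 'p', 'li']
soup = BeautifulSoup(html, "html.parser")

# Passing a list matches any of the given tag names; results follow document order.
matches = soup.find_all(tags_to_scrape)
names = [tag.name for tag in matches]
print(names)
```

With both approaches, an element nested inside another matching element (for example a <p> inside an <li>) is returned in addition to its parent, so its text can appear twice in the written file.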
