Question

I'm not sure what I'm overlooking, but I have a fairly simple question.

I'm scraping a page that contains several article tags, which I select like this (simplified):

soup = BeautifulSoup(page, 'lxml')
articles = soup.find_all("article", "product-tile promotion")

for a in articles:
   print(a)

# This nicely prints all of my article tags and their inner HTML, so up to here everything is fine

str = ''.join(articles)

# Here things obviously go wrong, because I am trying to join bs4 Tag objects as if they were strings, and that's not supported...

file_name = 'list.xml'
complete_name = os.path.join(user_path, file_name)
file_object = codecs.open(complete_name, "w", "utf-8")
file_object.write(str)

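I can print the articles, and the output shows exactly what I need. But when I try to write all of these articles into a single string, I get stuck, because I want the complete inner HTML, as opposed to the plain-text solutions I have found so far.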

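So my actual question is: how do I keep the tags as they are (not just the text, but all the elements and attributes that were found), so that I can save them as XML?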

Answer 1

If I understand correctly, you want to write the complete article tags to the XML file, not just their text, right?

In that case, you can try this: first save the articles in a list, then loop over the list and write each element, converting it with str():

soup = BeautifulSoup(page, 'lxml')
articles = soup.find_all("article", "product-tile promotion")

articles_list = []
for a in articles:
    articles_list.append(a)
    #print(a)

# Printing the tags works because print() converts each Tag to a string

#str = ''.join(articles)

# The join above fails because the list holds bs4 Tag objects, not strings,
# so convert each one with str() when writing instead

file_name = 'list.xml'
complete_name = os.path.join(user_path, file_name)
# The with block closes the file even if a write fails
with codecs.open(complete_name, "w", "utf-8") as file_object:
    for al in articles_list:
        file_object.write(str(al))

Edit: or you can simply use the original list directly:

soup = BeautifulSoup(page, 'lxml')
articles = soup.find_all("article", "product-tile promotion")

file_name = 'list.xml'
complete_name = os.path.join(user_path, file_name)
# The with block closes the file even if a write fails
with codecs.open(complete_name, "w", "utf-8") as file_object:
    for a in articles:
        file_object.write(str(a))
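
As a self-contained sketch of the same idea, the snippet below parses a small made-up HTML string, converts each matched tag with str(), and writes the joined result to a file. The sample `page` markup, the `save_articles` helper name, and the temporary output directory are assumptions for illustration and do not come from the question. It uses the built-in "html.parser" so that it does not depend on lxml being installed.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Made-up sample markup standing in for the scraped page (an assumption).
page = """
<html><body>
<article class="product-tile promotion" data-id="1"><h2>First</h2><p>Price: <b>1.99</b></p></article>
<article class="product-tile"><h2>Not a promotion</h2></article>
<article class="product-tile promotion" data-id="2"><h2>Second</h2></article>
</body></html>
"""


def save_articles(html, path):
    """Write every matching <article> tag, markup included, to `path`."""
    soup = BeautifulSoup(html, "html.parser")
    articles = soup.find_all("article", "product-tile promotion")
    # str(tag) serialises the tag itself plus all of its children and attributes.
    markup = "\n".join(str(a) for a in articles)
    with open(path, "w", encoding="utf-8") as file_object:
        file_object.write(markup)
    return markup


# Hypothetical output location; the question uses user_path instead.
out_path = os.path.join(tempfile.mkdtemp(), "list.xml")
saved = save_articles(page, out_path)
print(saved)
```

Joining the converted strings first and writing once is a design choice; writing each tag in a loop, as in the answer above, works just as well.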

Answer 2

find_all returns a list (a ResultSet) of bs4.element.Tag objects, not a list of strings. You can convert each element to a string.

Try replacing

for a in articles:
   print(a)

with

for i, a in enumerate(articles):
    articles[i] = str(a)
    print(articles[i])
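
To make the distinction concrete, here is a small sketch (the sample markup is made up for illustration) showing why ''.join fails on Tag objects and how three conversions differ: str(tag) keeps the outer tag and its attributes, tag.decode_contents() returns only the inner HTML, and tag.get_text() returns only the text.

```python
from bs4 import BeautifulSoup

# Made-up sample markup (an assumption for illustration).
page = '<article class="product-tile promotion"><h2>First</h2><p>1.99</p></article>'
soup = BeautifulSoup(page, "html.parser")
articles = soup.find_all("article", "product-tile promotion")

# Joining Tag objects directly fails: str.join only accepts strings.
try:
    "".join(articles)
    join_failed = False
except TypeError:
    join_failed = True

tag = articles[0]
outer_html = str(tag)               # the tag itself, with attributes and children
inner_html = tag.decode_contents()  # only what is inside the tag
text_only = tag.get_text()          # markup stripped, text only

print(join_failed)
print(outer_html)
print(inner_html)
print(text_only)
```

Which form to save depends on whether the enclosing article element should be part of the output; for the XML file in the question, str(tag) is probably the one that is wanted.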

How to store the inner HTML of a Beautiful Soup tag as a string
