How to store the inner HTML of a Beautiful Soup tag as a string

Time: 2019-03-31 21:15:43

Tags: python beautifulsoup

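I'm not sure what I'm overlooking, but I have a relatively simple problem.

I am scraping a page that contains several article tags, which I select like this (simplified):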

soup = BeautifulSoup(page, 'lxml')
articles = soup.find_all("article", "product-tile promotion")

for a in articles:
   print(a)

# This nicely prints all of my article tags and their inner HTML, so up to here all is OK

str = ''.join(articles)

# Here things obviously go wrong, as I am trying to convert a bs4 tag to a string, and that's not supported...

file_name = 'list.xml'
complete_name = os.path.join(user_path, file_name)
file_object = codecs.open(complete_name, "w", "utf-8")
file_object.write(str)

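I can print the articles, and the output shows exactly what I need. However, when I try to write all of these articles into one string I get stuck, because I want the complete inner HTML, as opposed to the plain-text solutions I have found.

So my actual question is: how do I keep the tags as they are (not just the text, but all the elements and attributes they contain) so that I can save them as XML?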

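2 answers:

Answer 0: (score: 1)

If I understand correctly, you want to write the complete articles tags to the XML file, not just their text, right?

In that case, you can try this: first save the articles in a list, then loop over the list and write each element, using str to force the conversion to a string: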

import codecs
import os

from bs4 import BeautifulSoup

# page and user_path come from the question's surrounding code
soup = BeautifulSoup(page, 'lxml')
articles = soup.find_all("article", "product-tile promotion")

# Collect the matched tags in a plain list
articles_list = []
for a in articles:
    articles_list.append(a)
    #print(a)

# ''.join(articles) fails because join() expects strings, not bs4 Tag objects
#str = ''.join(articles)

file_name = 'list.xml'
complete_name = os.path.join(user_path, file_name)
file_object = codecs.open(complete_name, "w", "utf-8")
for al in articles_list:
    # str() serializes each tag with its elements and attributes
    file_object.write(str(al))
file_object.close()

Edit: or you can simply use the original list directly:

soup = BeautifulSoup(page, 'lxml')
articles = soup.find_all("article", "product-tile promotion")

file_name = 'list.xml'
complete_name = os.path.join(user_path, file_name)
file_object = codecs.open(complete_name, "w", "utf-8")
for a in articles:
    file_object.write(str(a))
file_object.close()
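
The loop above writes one tag at a time. If a single string is what is wanted, as in the question, the same str() conversion can feed join(). The following is only a minimal sketch under stated assumptions: the inline sample markup stands in for the scraped page, Python's built-in 'html.parser' is used instead of 'lxml', a temporary directory replaces user_path, and the helper name tags_to_string is hypothetical and not part of Beautiful Soup.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the scraped page
page = """
<html><body>
<article class="product-tile promotion" data-id="1"><h2>First</h2><p>Price: <b>10</b></p></article>
<article class="product-tile promotion" data-id="2"><h2>Second</h2></article>
<article class="product-tile">Not a promotion</article>
</body></html>
"""


def tags_to_string(tags, separator="\n"):
    """Serialize each tag (markup, attributes and children) and join the results."""
    return separator.join(str(tag) for tag in tags)


soup = BeautifulSoup(page, "html.parser")
articles = soup.find_all("article", "product-tile promotion")

# Avoid naming the result `str`: that would shadow the built-in used in the helper
articles_markup = tags_to_string(articles)

# Temporary directory used here in place of user_path from the question
complete_name = os.path.join(tempfile.mkdtemp(), "list.xml")
with open(complete_name, "w", encoding="utf-8") as file_object:
    file_object.write(articles_markup)
```

Note that the question assigns the joined result to a variable called str; had that line succeeded, any later call to the built-in str() in the same scope would have failed, so a different variable name is the safer choice.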

Answer 1: (score: 0)

find_all returns a list of bs4.element.Tag elements, not a list of strings. You can convert each element to a string.

Try replacing

for a in articles:
   print(a)

with

for i in range(len(articles)):
   articles[i] = str(articles[i])
   print(articles[i])
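
Since the title asks about inner HTML, it may help to compare the usual ways of turning a tag into text: str(tag) keeps the tag itself, tag.decode_contents() returns only the markup inside it, and tag.get_text() drops the markup entirely. The sketch below is illustrative rather than definitive; the one-article sample markup and the use of 'html.parser' are assumptions, and the list comprehension is offered as an alternative to the index-based loop, not as the answerer's own code.

```python
from bs4 import BeautifulSoup

# Hypothetical one-article sample, not taken from the original page
page = '<article class="product-tile promotion"><h2>Title</h2><p>Price: <b>10</b></p></article>'

soup = BeautifulSoup(page, "html.parser")
articles = soup.find_all("article", "product-tile promotion")

# join() only accepts strings, so passing Tag objects directly raises TypeError
try:
    "".join(articles)
    join_failed = False
except TypeError:
    join_failed = True

# List comprehensions build new lists instead of overwriting the find_all result
outer_html = [str(a) for a in articles]               # tag plus its contents
inner_html = [a.decode_contents() for a in articles]  # contents only
plain_text = [a.get_text() for a in articles]         # text without markup

for markup in outer_html:
    print(markup)
```

For saving the articles as XML with their own attributes intact, the str() form is the one that matches the question; decode_contents() would drop the enclosing article element and its attributes.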