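I'm not sure what I'm overlooking, but I have a relatively simple problem.
I'm scraping a page that contains several article tags, which I select like this (simplified):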
soup = BeautifulSoup(page, 'lxml')
articles = soup.find_all("article", "product-tile promotion")
for a in articles:
    print(a)
# This nicely prints all of my article tags and their inner HTML, so up to here all is OK
str = ''.join(articles)
# Here things obviously go wrong, as I am trying to convert a bs4 tag to a string, and that's not supported...
file_name = 'list.xml'
complete_name = os.path.join(user_path, file_name)
file_object = codecs.open(complete_name, "w", "utf-8")
file_object.write(str)
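I can print the articles, and the output shows exactly what I need. I get stuck, however, when I try to write all of these articles out as one string, because I want the complete inner HTML rather than the plain-text-only solutions I have found so far.
So my actual question is: how do I keep the tags intact (not just the text, but also all the elements and attributes they contain) so that I can save them as XML?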
Answer 0 (score: 1)
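If I understand correctly, you want to write all of the tags in articles to the XML file, not just their text, right?
In that case you can try the following: first store the articles in a list, then loop over that list and convert each element with str() before writing it: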
import codecs
import os

from bs4 import BeautifulSoup

# page and user_path are assumed to be defined as in the question
soup = BeautifulSoup(page, 'lxml')
articles = soup.find_all("article", "product-tile promotion")
# Copy the matched tags into a plain list
articles_list = []
for a in articles:
    articles_list.append(a)
# Do not assign to the name str: that would shadow the built-in str() used below
file_name = 'list.xml'
complete_name = os.path.join(user_path, file_name)
# The with block closes the file once everything has been written
with codecs.open(complete_name, "w", "utf-8") as file_object:
    for al in articles_list:
        # str() serializes the tag together with its child elements and attributes
        file_object.write(str(al))
Edit: or you can simply iterate over the original result directly:
soup = BeautifulSoup(page, 'lxml')
articles = soup.find_all("article", "product-tile promotion")
file_name = 'list.xml'
complete_name = os.path.join(user_path, file_name)
# The with block closes the file once everything has been written
with codecs.open(complete_name, "w", "utf-8") as file_object:
    for a in articles:
        file_object.write(str(a))
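One caveat if the file really has to be parsed as XML later: an XML document needs a single root element, so several article tags written one after another will not be well-formed on their own. Below is a minimal, self-contained sketch of the same str() approach that wraps the output in a root element. The sample markup, the save_articles helper, the articles root name, and the temporary output path are all assumptions made for illustration, and html.parser is used only so that the sketch does not depend on lxml.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the scraped page
page = """
<html><body>
<article class="product-tile promotion"><h2>Tea</h2><a href="/tea">Buy</a></article>
<article class="product-tile promotion"><h2>Coffee</h2><a href="/coffee">Buy</a></article>
<article class="product-tile"><h2>Not on promotion</h2></article>
</body></html>
"""


def save_articles(html, path):
    """Write every matching <article> tag, markup included, under one root element."""
    soup = BeautifulSoup(html, "html.parser")
    articles = soup.find_all("article", "product-tile promotion")
    # The built-in open() accepts an encoding in Python 3, so codecs is not needed
    with open(path, "w", encoding="utf-8") as file_object:
        file_object.write("<articles>\n")  # assumed root element name
        for a in articles:
            # str() keeps the tag, its children, and its attributes
            file_object.write(str(a) + "\n")
        file_object.write("</articles>\n")
    return len(articles)


# Write into a fresh temporary directory instead of the question's user_path
output_path = os.path.join(tempfile.mkdtemp(), "list.xml")
count = save_articles(page, output_path)
print(count, "articles written to", output_path)
```

Whether the result parses as XML still depends on the scraped markup itself (for example, HTML entities that XML does not define), so treat the wrapper as a starting point rather than a guarantee.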
Answer 1 (score: 0)
find_all returns a list of bs4.element.Tag elements, not a list of strings. You can convert each element to a string.
Try replacing
for a in articles:
    print(a)
with
for i in range(len(articles)):
    articles[i] = str(articles[i])
    print(articles[i])
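The same conversion also makes the join from the question work: str.join accepts only strings, so each Tag has to go through str() first. The sketch below shows both the failure and the fix. The sample markup and the join_error and combined names are made up for the example, and html.parser is used so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the scraped page
page = (
    '<div>'
    '<article class="product-tile promotion"><p>Tea</p></article>'
    '<article class="product-tile promotion"><p>Coffee</p></article>'
    '</div>'
)

soup = BeautifulSoup(page, "html.parser")
articles = soup.find_all("article", "product-tile promotion")

# Joining the Tag objects directly fails, because join expects str items
try:
    "".join(articles)
    join_error = None
except TypeError as exc:
    join_error = exc

# Converting each tag while joining keeps the full markup of every article
combined = "".join(str(a) for a in articles)
print(combined)
```

Unlike the loop above, the generator expression leaves the articles list untouched, so the Tag objects stay available for further searching.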