Question

I'm new to Python. I want to store the text of each HTML tag as a separate list item.

from bs4 import BeautifulSoup
text = """
 <body>
    <div class="product">
    <div class="x">orange</div>
    <div class="x">apple</div>
    <p> This is text </p>
    </div>
</body>"""
soup = BeautifulSoup(text, "html.parser")  # name the parser explicitly to avoid a bs4 warning

y=[]
for i in (soup.find_all("div", class_="product")):
   y.append(i.get_text().encode("utf-8").strip())

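With the code above, y has length 1 and all of the text is stored in a single list item. However, I need to start parsing from the "div product" element and store the text inside each HTML tag as a separate list item.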

So y should be:

y =['orange', 'apple', 'This is text']

instead of:

 y=['orange\napple\n This is text']

Answer 1

If all you want is the directly contained strings, don't use text; instead, ask only for the elements contained in the div.product tag:

for elem in soup.select("div.product *"):
    y.append(elem.string.strip().encode('utf8'))

Demo:

>>> y = []
>>> for elem in soup.select("div.product *"):
...     y.append(elem.string.strip().encode('utf8'))
... 
>>> y
['orange', 'apple', 'This is text']
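
On Python 3 the same idea might look like the sketch below. It assumes the built-in "html.parser" backend and drops the .encode('utf8') call, since Python 3 strings are already Unicode. It also swaps .string for get_text(strip=True), because .string is None for any element that has more than one child, and calling .strip() on None would fail.

```python
from bs4 import BeautifulSoup

text = """
<body>
    <div class="product">
    <div class="x">orange</div>
    <div class="x">apple</div>
    <p> This is text </p>
    </div>
</body>"""

soup = BeautifulSoup(text, "html.parser")

# Every element nested inside div.product, in document order.
y = [elem.get_text(strip=True) for elem in soup.select("div.product *")]
print(y)  # ['orange', 'apple', 'This is text']
```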

Answer 2

soup.find_all("div",class_="product")

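returns all div tags with the class product, so what you get is a list (here with a single item). The for loop therefore iterates only once and gives you the full text inside that div tag.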
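
A quick way to see this for yourself is to inspect what find_all() returns. The sketch below uses a condensed, illustrative copy of the markup (the html variable is not from the question) and assumes the "html.parser" backend.

```python
from bs4 import BeautifulSoup

html = '<div class="product"><div class="x">orange</div><div class="x">apple</div></div>'
soup = BeautifulSoup(html, "html.parser")

matches = soup.find_all("div", class_="product")
print(len(matches))           # 1: only the outer div carries class "product"
print(matches[0].get_text())  # orangeapple: text of all descendants joined together
```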

So, to get the data of each child, use something like this:

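# find() returns the single matching tag; find_all() returns a list-like
# ResultSet, which has no method for fetching children.
for child in soup.find("div", class_="product").find_all(True):
    y.append(child.string.strip().encode('utf8'))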
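
If the page may contain several product divs, one option is to loop over the result of find_all() and query each tag in turn. This is only a sketch: the second product and the html variable are made up for illustration, the "html.parser" backend is assumed, and recursive=False is used so that only direct children are collected.

```python
from bs4 import BeautifulSoup

html = """
<div class="product"><div class="x">orange</div><p> This is text </p></div>
<div class="product"><div class="x">pear</div></div>
"""
soup = BeautifulSoup(html, "html.parser")

y = []
for product in soup.find_all("div", class_="product"):
    # recursive=False restricts the search to direct children of this div.
    for child in product.find_all(True, recursive=False):
        y.append(child.get_text(strip=True))
print(y)  # ['orange', 'This is text', 'pear']
```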

Parsing HTML tags with Python
