Extracting the contents of span tags in Python

Time: 2017-11-29 09:10:02

Tags: python

I want to extract the text (1 Pack, 4 Pack Gift Set, 1 Pencil with Erasers, ...) from the following in Python:

[<span class="a-size-base">1 Pack</span>, <span class="a-size-base">4 Pack Gift Set</span>, <span class="a-size-base">1 Pencil with Erasers</span>, <span class="a-size-base">1 Pencil with Lead and Erasers</span>]

Thanks

3 answers:

Answer 0 (score: 0)

The simplest approach is to use Beautiful Soup, the de facto Python library for parsing HTML. Install it with pip install bs4, or download the source.

from bs4 import BeautifulSoup

string = '[<span class="a-size-base">1 Pack</span>, <span class="a-size-base">4 Pack Gift Set</span>, <span class="a-size-base">1 Pencil with Erasers</span>, <span class="a-size-base">1 Pencil with Lead and Erasers</span>]'

# Represent the string as a nested data structure
soup = BeautifulSoup(string, "html.parser")
# Find all <span> tags in the BeautifulSoup object
spans = soup.find_all('span')
# Get the text inside the <span> tags
print([span.text for span in spans])

This gives you a list of the contents you want:

['1 Pack', '4 Pack Gift Set', '1 Pencil with Erasers', '1 Pencil with Lead and Erasers']
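
If installing a third-party package is not an option, roughly the same result can be had with the standard library. The sketch below swaps Beautiful Soup for html.parser.HTMLParser; the class name SpanTextCollector is made up for illustration, and it assumes the span tags are not nested.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text found inside each <span> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_span = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # Start a new, empty entry whenever a <span> opens
        if tag == 'span':
            self.in_span = True
            self.texts.append('')

    def handle_endtag(self, tag):
        if tag == 'span':
            self.in_span = False

    def handle_data(self, data):
        # Text outside a <span> (brackets, commas) is ignored
        if self.in_span:
            self.texts[-1] += data


string = '[<span class="a-size-base">1 Pack</span>, <span class="a-size-base">4 Pack Gift Set</span>, <span class="a-size-base">1 Pencil with Erasers</span>, <span class="a-size-base">1 Pencil with Lead and Erasers</span>]'

collector = SpanTextCollector()
collector.feed(string)
collector.close()
print(collector.texts)
```

For real-world pages with nested or malformed markup, Beautiful Soup is likely to be the more forgiving choice.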

Answer 1 (score: 0)

Use the standard library module re (regular expression operations).
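
A minimal sketch of the re approach, assuming the input is the single string shown in the question. The pattern and the variable names html and items are illustrative choices, not taken from the answer.

```python
import re

# Input string copied from the question
html = '[<span class="a-size-base">1 Pack</span>, <span class="a-size-base">4 Pack Gift Set</span>, <span class="a-size-base">1 Pencil with Erasers</span>, <span class="a-size-base">1 Pencil with Lead and Erasers</span>]'

# [^>]* skips the tag's attributes; (.*?) is non-greedy, so each match
# stops at the first closing </span> instead of the last one
items = re.findall(r'<span[^>]*>(.*?)</span>', html)

# Join the captured texts into one comma-separated line
print(', '.join(items))
```

A regular expression like this is fine for a small, predictable snippet, but it will not cope with nested tags or unusual markup.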


The output is: 1 Pack, 4 Pack Gift Set, 1 Pencil with Erasers, 1 Pencil with Lead and Erasers

Answer 2 (score: 0)

Could you elaborate on your question and your data structure? Assuming your data is a list of strings:

import re
l = ['<span class="a-size-base">1 Pack</span>', '<span class="a-size-base">4 Pack Gift Set</span>', '<span class="a-size-base">1 Pencil with Erasers</span>', '<span class="a-size-base">1 Pencil with Lead and Erasers</span>']
print([re.match(r'<([a-zA-Z]+).+>(.+)</\1>', i).group(2) for i in l])
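
Note that re.match returns None when a string does not fit the pattern, so calling .group(2) on it raises AttributeError. A possible variant that skips such entries is sketched below; the function name extract_texts, the compiled pattern name, and the sample list are assumptions for illustration.

```python
import re

# Same idea as the pattern above, but [^>]* and (.+?) avoid relying on backtracking
TAG_TEXT = re.compile(r'<([a-zA-Z]+)[^>]*>(.+?)</\1>')


def extract_texts(items):
    """Return the inner text of each tagged string, skipping strings that do not match."""
    texts = []
    for item in items:
        m = TAG_TEXT.match(item)
        if m is not None:
            texts.append(m.group(2))
    return texts


# Sample data: two entries from the question plus one string with no tag
sample = [
    '<span class="a-size-base">1 Pack</span>',
    '<span class="a-size-base">4 Pack Gift Set</span>',
    'no tag here',
]
print(extract_texts(sample))
```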