I have a span block like this:
<span class="selectable-text invisible-space copyable-text" dir="ltr">
some text
<img alt="" class="b61 emoji wa selectable-text invisible-space copyable-text" data-plain-text="" src="URL" style="background-position: -20px -20px;"/>
more some text
<img alt="" class="b62 emoji wa selectable-text invisible-space copyable-text" data-plain-text="" src="URL" style="background-position: -40px -40px;"/>
blah-blah-blah
<img alt="" class="b76 emoji wa selectable-text invisible-space copyable-text" data-plain-text="" src="URL" style="background-position: 0px -20px;"/>
</span>
soup.find('span', {'class': 'selectable-text invisible-space copyable-text'}).get_text()
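This code only gives me the text, without the images.
Everything I have come up with so far:
import re
# find_all applies the class filter; select() expects a CSS selector string instead
spans = soup.find_all('span', {'class': 'selectable-text invisible-space copyable-text'})
for item in spans:
    # re.search scans the whole string; re.match with '.*' stops at the first newline
    if re.search('emoji', str(item)):
        ...
Now I have a string like this: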
<span class="selectable-text invisible-space copyable-text" dir="ltr">some text <img alt="?" class="b61 emoji wa selectable-text invisible-space copyable-text" data-plain-text="?" src="URL" style="background-position: -20px -20px;"/>more some text<img alt="?" class="b62 emoji wa selectable-text invisible-space copyable-text" data-plain-text="?" src="URL" style="background-position: -40px -40px;"/> blah-blah-blah <img alt="?" class="b76 emoji wa selectable-text invisible-space copyable-text" data-plain-text="?" src="URL" style="background-position: 0px -20px;"/></span>
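As far as I can tell, the next step would be to use a regular expression to pull out the elements I need.
Is there another way to get a string like this?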
some text <emoji> more some text <emoji> blah-blah-blah <emoji>
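Answer 0 (score: 0)
If you want to pair each piece of text with the img that follows it and wrap each pair in its own span, the code below should work.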
from bs4 import BeautifulSoup as bs
stra = """
<span class="selectable-text invisible-space copyable-text" dir="ltr">
some text
<img alt="" class="b61 emoji wa selectable-text invisible-space copyable-text" data-plain-text="" src="URL" style="background-position: -20px -20px;"/>
more some text
<img alt="" class="b62 emoji wa selectable-text invisible-space copyable-text" data-plain-text="" src="URL" style="background-position: -40px -40px;"/>
blah-blah-blah
<img alt="" class="b76 emoji wa selectable-text invisible-space copyable-text" data-plain-text="" src="URL" style="background-position: 0px -20px;"/>
</span>
"""
soup = bs(stra, 'html.parser')
ch = list(soup.find('span', {'class': 'selectable-text invisible-space copyable-text'}).children)
# Children alternate text, img, text, img, ...; pair each text node with the img after it
for i in zip(ch[::2], ch[1::2]):
    print('<span>{}{}</span>'.format(*i))
Output:
<span>
some text
<img alt="" class="b61 emoji wa selectable-text invisible-space copyable-text" data-plain-text="" src="URL" style="background-position: -20px -20px;"/>
</span>
<span>
more some text
<img alt="" class="b62 emoji wa selectable-text invisible-space copyable-text" data-plain-text="" src="URL" style="background-position: -40px -40px;"/>
</span>
<span>
blah-blah-blah
<img alt="" class="b76 emoji wa selectable-text invisible-space copyable-text" data-plain-text="" src="URL" style="background-position: 0px -20px;"/>
</span>
Answer 1 (score: 0)
It looks like you need .replaceWith (spelled replace_with in current Beautiful Soup releases).
For example:
from bs4 import BeautifulSoup
html = """<span class="selectable-text invisible-space copyable-text" dir="ltr">
some text
<img alt="" class="b61 emoji wa selectable-text invisible-space copyable-text" data-plain-text="" src="URL" style="background-position: -20px -20px;"/>
more some text
<img alt="" class="b62 emoji wa selectable-text invisible-space copyable-text" data-plain-text="" src="URL" style="background-position: -40px -40px;"/>
blah-blah-blah
<img alt="" class="b76 emoji wa selectable-text invisible-space copyable-text" data-plain-text="" src="URL" style="background-position: 0px -20px;"/>
</span>"""
soup = BeautifulSoup(html, "html.parser")
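for span in soup.find_all('span', {'class': 'selectable-text invisible-space copyable-text'}):
    for img in span.find_all("img"):
        # Swap each emoji image for a literal placeholder string
        img.replace_with("<emoji>")
# formatter=None keeps "<emoji>" from being escaped to "&lt;emoji&gt;"
print(soup.prettify(formatter=None))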
Output:
<span class="selectable-text invisible-space copyable-text" dir="ltr">
some text
<emoji>
more some text
<emoji>
blah-blah-blah
<emoji>
</span>
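If the goal is the single-line string from the question rather than the pretty-printed markup, the same replacement idea can be followed by a whitespace cleanup. This is only a minimal sketch: the helper name flatten_with_placeholder is made up for illustration, the img attributes are shortened, and it assumes the first span with class copyable-text is the one you want.

```python
from bs4 import BeautifulSoup

html = """<span class="selectable-text invisible-space copyable-text" dir="ltr">
some text
<img alt="" class="b61 emoji wa" src="URL"/>
more some text
<img alt="" class="b62 emoji wa" src="URL"/>
blah-blah-blah
<img alt="" class="b76 emoji wa" src="URL"/>
</span>"""


def flatten_with_placeholder(markup, placeholder="<emoji>"):
    """Hypothetical helper: return the span text with each emoji img replaced."""
    soup = BeautifulSoup(markup, "html.parser")
    span = soup.find("span", class_="copyable-text")
    # find_all returns a list, so replacing nodes while looping over it is safe
    for img in span.find_all("img", class_="emoji"):
        img.replace_with(placeholder)
    # split() with no arguments drops the newlines; join restores single spaces
    return " ".join(span.get_text().split())


result = flatten_with_placeholder(html)
print(result)
```

Because the placeholder is inserted as plain text, get_text() returns it as-is and no formatter argument is needed.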
Answer 2 (score: 0)
Find the child tags inside the span tag, then use previous_element on each one (it is the text value that precedes the tag).
from bs4 import BeautifulSoup
data='''<span class="selectable-text invisible-space copyable-text" dir="ltr">
some text
<img alt="" class="b61 emoji wa selectable-text invisible-space copyable-text" data-plain-text="" src="URL" style="background-position: -20px -20px;"/>
more some text
<img alt="" class="b62 emoji wa selectable-text invisible-space copyable-text" data-plain-text="" src="URL" style="background-position: -40px -40px;"/>
blah-blah-blah
<img alt="" class="b76 emoji wa selectable-text invisible-space copyable-text" data-plain-text="" src="URL" style="background-position: 0px -20px;"/>
</span>'''
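soup = BeautifulSoup(data, 'html.parser')
itemtag = soup.find('span', class_='selectable-text invisible-space copyable-text')
# find_all() with no arguments returns every tag nested in the span (here, the three img tags)
children = itemtag.find_all()
items = []
for child in children:
    # previous_element is the text node immediately before each img
    items.append(child.previous_element.replace('\n', '').strip())
    items.append(child)
print(items)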
Output:
['some text', <img alt="" class="b61 emoji wa selectable-text invisible-space copyable-text" data-plain-text="" src="URL" style="background-position: -20px -20px;"/>, 'more some text', <img alt="" class="b62 emoji wa selectable-text invisible-space copyable-text" data-plain-text="" src="URL" style="background-position: -40px -40px;"/>, 'blah-blah-blah', <img alt="" class="b76 emoji wa selectable-text invisible-space copyable-text" data-plain-text="" src="URL" style="background-position: 0px -20px;"/>]
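A variation on the same idea walks the span's direct children in document order and builds the string directly, so it does not depend on every img having text in front of it. Treat it as a sketch under assumptions: span_to_text is a hypothetical helper, the sample markup is shortened, and the ":)" alt value is invented sample data standing in for a page where alt carries the emoji character. When alt is empty, the placeholder is used instead.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = """<span class="selectable-text invisible-space copyable-text" dir="ltr">
some text
<img alt="" class="b61 emoji wa" data-plain-text="" src="URL"/>
more some text
<img alt=":)" class="b62 emoji wa" data-plain-text=":)" src="URL"/>
blah-blah-blah
</span>"""


def span_to_text(span, placeholder="<emoji>"):
    """Hypothetical helper: join text nodes and emoji markers in document order."""
    parts = []
    for child in span.children:
        if isinstance(child, NavigableString):
            text = child.strip()
            if text:  # skip whitespace-only nodes between tags
                parts.append(text)
        elif isinstance(child, Tag) and child.name == "img":
            # Prefer the alt text when it is non-empty, else fall back to the placeholder
            parts.append(child.get("alt") or placeholder)
    return " ".join(parts)


soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", class_="copyable-text")
line = span_to_text(span)
print(line)
```

This handles leading, trailing, or consecutive images without any index arithmetic, at the cost of ignoring tags other than img.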