Question

I want to scrape an HTML page, but I am blocked by the <br> tag. I tried to split the HTML content using <br> as the separator.

from urllib.request import urlopen
import re
from bs4 import BeautifulSoup

url = 'https://www.ouedkniss.com/telephones'

html = urlopen(url)
bs = BeautifulSoup(html, 'html.parser')

text_tag = bs.find('span', class_="annonce_get_description", 
itemprop="description")

words = text_tag.text.split('<br/>')
print(words)

As you can see in the result, nothing happens when I split the text:

['Téléphones portables Mémoire : 128 GO Produit neuf jamais utilisé ▶️iphone6s 16go avec Chargeur original\r\n.kitmains, blanc.gold. état neuf. libéré officiel : 33000da\r\n\r\n▶️iphone6s 32go avec Chargeur original\r\n.kitmains, blanc.gold. état neuf. libéré officiel : 35000da\r\n\r\n▶️iphone6s 128go avec Chargeur original\r\n.kitmains, blanc.gold. ét']

Answer 1

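BeautifulSoup removes all tags when you read .text, so there is no <br/> left to split on.

You can try .get_text(separator=...) instead. It adds the separator between the text coming from different tags, so it should put the separator where each <br/> was. Then you can use split(separator):

words = text_tag.get_text(separator='|', strip=True).split('|')

or use a more unique separator if '|' can appear in the text:

words = text_tag.get_text(separator='|br|', strip=True).split('|br|')

But it may also put the separator in place of other tags, not only <br/>, for example around text such as 'Mémoire : 64 GO'.

Alternatively, you can replace every <br/> with the separator in the original HTML and then use split(separator). You can also do this on the inner HTML only:

get the inner HTML as a string (bytes),
replace <br/> with the separator,
parse again,
find() the tag and get its text (there is no <br/> left),
split(separator)

Code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.ouedkniss.com/telephones'

html = urlopen(url)
html = html.read()  # raw bytes of the page

# put a unique marker where every <br/> was, before parsing
html = html.replace(b'<br/>', b'|br|')

bs = BeautifulSoup(html, 'html.parser')

text_tag = bs.find('span', class_="annonce_get_description", itemprop="description")

# the marker survives .text, so it can be used to split
words = text_tag.text.split('|br|')
print(words)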
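The difference between the two approaches can be sketched offline. This is a minimal sketch, not the answer's own code: SAMPLE_HTML is a hypothetical snippet modelled on the question's page (nothing is fetched from the site), and split_on_br is a helper name introduced only for illustration. Instead of editing the raw bytes, it swaps each parsed <br> element for the separator with replace_with(), so other tags such as <b> do not cause extra splits.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on the question's page (assumption).
SAMPLE_HTML = (
    '<span class="annonce_get_description" itemprop="description">'
    'Smartphones<br/><b>Double puces</b><br/>'
    'Bluetooth Wifi <b>4G</b><br/>Ecran 5.99 pouces'
    '</span>'
)

SEPARATOR = '|br|'


def split_on_br(tag, separator=SEPARATOR):
    """Replace every <br> inside `tag` with `separator`, then split the text.

    Note: this modifies `tag` in place.
    """
    for br in tag.find_all('br'):
        br.replace_with(separator)
    return [part.strip() for part in tag.get_text().split(separator)]


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
span = soup.find('span', class_='annonce_get_description', itemprop='description')

# get_text(separator=...) separates every text node, so it also splits at <b>.
by_any_tag = span.get_text(separator=SEPARATOR, strip=True).split(SEPARATOR)

# Splitting only where a <br> was keeps "Bluetooth Wifi 4G" together.
by_br_only = split_on_br(span)

print(by_any_tag)
print(by_br_only)
```

Which variant fits depends on whether inline tags like <b> should count as line breaks; on the real page the markup may differ from this sample.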

Answer 2

<br/> is an HTML tag. When you use text_tag.text you get only the text, not the HTML markup with its tags.

If you want to get at that information, you can explore a bit further:

print(text_tag.contents)
# output:
# ['Smartphones',
# <br/>,
# <b>Double puces</b>,
# <br/>,
# 'Mémoire : 128 GO ',
# <br/>,
# 'Bluetooth Wifi ',
# <b>4G</b>,
# ' ',
# <br/>,
# 'Ecran 5.99 pouces ',
# <br/>,
# 'Etat neuf / Sous emballage ',
# <br/>,
# <span class="annonce_description_preview ">Le smartphone et comme neuf utilisé pour quelque heures. fourni avec incassables original ! merci </span>]

You can also try:

print(''.join(str(e) for e in text_tag.contents).split('<br/>'))
#output:
# ['Smartphones',
# '<b>Double puces</b>',
# 'Mémoire : 128 GO ',
# 'Bluetooth Wifi <b>4G</b> ',
# 'Ecran 5.99 pouces ',
# 'Etat neuf / Sous emballage ',
# '<span class="annonce_description_preview ">Le smartphone et comme neuf utilisé pour quelque heures. fourni avec incassables original ! merci </span>']
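The same idea can be checked on a small inline snippet. This is a minimal sketch under assumptions: SAMPLE_HTML is a hypothetical stand-in for the page, and decode_contents() is used here as a shortcut for joining the string form of each child, which is not what the answer itself wrote.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (assumption), not taken from the real site.
SAMPLE_HTML = (
    '<span itemprop="description">'
    'Smartphones<br/><b>Double puces</b><br/>Ecran 5.99 pouces'
    '</span>'
)

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
span = soup.find('span', itemprop='description')

# .text contains no markup at all, so splitting on '<br/>' finds nothing.
text_parts = span.text.split('<br/>')

# decode_contents() returns the inner HTML as a string, tags included,
# so the '<br/>' markers are still there to split on.
html_parts = span.decode_contents().split('<br/>')

print(text_parts)
print(html_parts)
```

The pieces obtained this way still contain inline markup such as <b>, so they would need another pass if only plain text is wanted.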

Or, if you want a nicer approach:

content = ['']

for item in text_tag.contents:
    if hasattr(item, 'text'):
        text = item.text
    else:
        text = str(item)

    if '<br/>' in str(item):
        content.append(text.strip())
    else:
        content[-1] = f'{content[-1]} {text.strip()}'.strip()

print(content)
# output
# ['Smartphones',
# 'Double puces',
# 'Mémoire : 128 GO',
# 'Bluetooth Wifi  4G',
# 'Ecran 5.99 pouces',
# 'Etat neuf / Sous emballage',
# 'Le smartphone et comme neuf utilisé pour quelque heures. fourni avec incassables original ! merci']
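The loop above can also be written so that it checks the node type instead of comparing strings. This is a hedged variation, not the answer's code: lines_between_br and SAMPLE_HTML are hypothetical names and data introduced for illustration, and the check relies on bs4's Tag class and its name attribute.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup (assumption), loosely based on the output above.
SAMPLE_HTML = (
    '<span itemprop="description">'
    'Smartphones<br/><b>Double puces</b><br/>'
    'Bluetooth Wifi <b>4G</b> <br/>Etat neuf / Sous emballage'
    '</span>'
)


def lines_between_br(tag):
    """Group the text of `tag`'s direct children into lines, breaking at <br>."""
    lines = ['']
    for child in tag.children:
        if isinstance(child, Tag) and child.name == 'br':
            # A <br> element starts a new line.
            lines.append('')
        else:
            # Nested tags contribute their text; plain strings are used as-is.
            piece = child.get_text() if isinstance(child, Tag) else str(child)
            lines[-1] = f'{lines[-1]} {piece.strip()}'.strip()
    # Drop lines that ended up empty (for example after consecutive <br> tags).
    return [line for line in lines if line]


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
span = soup.find('span', itemprop='description')
result = lines_between_br(span)
print(result)
```

Checking child.name only looks at direct children, so a <br> nested inside another tag would not start a new line in this sketch.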

How to split HTML text by br tags
