Question

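I'm trying to scrape a website, and I need to cut the HTML in half. The problem is that the HTML is not very well organized, so I can't just use findAll.

Here is my code for parsing the HTML: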

import requests
from bs4 import BeautifulSoup
resultats = requests.get(URL)  # URL holds the address of the page to scrape
bs = BeautifulSoup(resultats.text, 'html.parser')

What I want to do is split bs at each <h2> I find.

The solution is probably very simple, but I can't find it...

Edit: the website is here

Answer 1

This extracts the entire text of the page, without the HTML:

import re
import urllib.request
from bs4 import BeautifulSoup

url = "https://fr.wikipedia.org/wiki/Liste_de_sondages_sur_l'%C3%A9lection_pr%C3%A9sidentielle_fran%C3%A7aise_de_2017#Avril"
resultats = urllib.request.urlopen(url)
html = resultats.read()

soup = BeautifulSoup(html, 'html5lib')  # requires the html5lib package
soup = soup.get_text()  # Extracts the text from the HTML; soup is now a plain string

print(soup)
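
To see what get_text() returns without fetching the live page, here is a minimal sketch on a small inline document. The sample_html string and its contents are made up for illustration, and html.parser is used instead of html5lib only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: two <h2> sections.
sample_html = (
    "<html><body>"
    "<h2>Avril</h2><p>Sondage A</p>"
    "<h2>Mars</h2><p>Sondage B</p>"
    "</body></html>"
)

parsed = BeautifulSoup(sample_html, "html.parser")

# separator puts each text fragment on its own line;
# strip=True trims whitespace around each fragment.
text = parsed.get_text(separator="\n", strip=True)

print(text)
```

Note that the result is a plain string: the <h2> tags are gone, so any later cutting has to work on the heading text itself.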

If you want to cut certain parts out of the text, you can add the following:

soup = re.sub(re.compile('yourRegex', re.DOTALL), '', soup)\
       .strip()
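
As an illustration of the regex step, the sketch below removes everything before a chosen heading from already extracted text. The sample text and the pattern are hypothetical; the real pattern depends on the page being scraped.

```python
import re

# Hypothetical text, as it might look after get_text().
extracted = "Menu\nSommaire\nAvril\nSondage A\nMars\nSondage B\n"

# \A anchors the match at the start of the string, and the lookahead
# stops just before the first "Avril", so only the leading part is removed.
# re.DOTALL lets "." match newlines as well.
pattern = re.compile(r"\A.*?(?=Avril)", re.DOTALL)

cleaned = re.sub(pattern, "", extracted).strip()

print(cleaned)
```

Because the pattern is anchored with \A, it can match only once, which keeps the rest of the text untouched.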

Cutting HTML in half using Python BeautifulSoup
