Extract text that does not belong to an inner div

Time: 2018-01-18 16:07:44

Tags: python beautifulsoup

I have this code, but it extracts too much text. I am trying to extract only the title from the top-level content.

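from bs4 import BeautifulSoup
import requests

r = requests.get("https://education.maharashtra.gov.in/saral/27230500360")
data = r.text
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(data, "html.parser")
# Returns the whole top-content div, including the text of its inner divs
soup.find("div", {"class": "top-content"})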

How can I extract the school name, which does not belong to any inner div? Expected output:

BHARATI VIDYAMANDIR HINDI NIGHT HIGH SCHOOL AND JR COLLEGE (27230500360) 

Update

Is it possible to save the text as a dict?

{27230500360 : "BHARATI VIDYAMANDIR HINDI NIGHT HIGH SCHOOL AND JR COLLEGE"} 

3 Answers:

Answer 0: (score: 2)

Try this. It should get you there:

from bs4 import BeautifulSoup
import requests

req  = requests.get("https://education.maharashtra.gov.in/saral/27230500360")
soup = BeautifulSoup(req.text,"lxml")
for item in soup.select("#logo"):
    data = ' '.join(item.text.split())
    item_dict = {data.split(" ")[-1]:' '.join(data.split(" ")[:-1])}
    print(item_dict)

Output:

{'(27230500360)': 'BHARATI VIDYAMANDIR HINDI NIGHT HIGH SCHOOL AND JR COLLEGE'}
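
The output above keeps the parentheses in the key, while the question asks for the bare number. A minimal sketch of one way to get that shape, assuming the text always ends with a numeric ID in parentheses; the helper name `parse_school` is hypothetical, and the sample string stands in for the text scraped from the page:

```python
import re


def parse_school(text):
    """Split 'NAME   (ID)' into a {id: name} dict (hypothetical helper)."""
    # Collapse every run of whitespace into a single space
    cleaned = " ".join(text.split())
    # Assumption: the text ends with a numeric ID wrapped in parentheses
    match = re.fullmatch(r"(.+?)\s*\((\d+)\)", cleaned)
    if match is None:
        raise ValueError("unexpected format: %r" % cleaned)
    name, school_id = match.groups()
    return {int(school_id): name}


sample = "BHARATI VIDYAMANDIR HINDI NIGHT HIGH SCHOOL AND JR COLLEGE        (27230500360)"
print(parse_school(sample))
```

Converting the ID with `int()` gives the integer key shown in the question; drop the conversion if a string key is preferred.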

Answer 1: (score: 1)

The text you want is in the div with the id logo:
text = soup.select('#logo')[0].text
print(text.strip())

Output:

BHARATI VIDYAMANDIR HINDI NIGHT HIGH SCHOOL AND JR COLLEGE
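
If the target div did contain nested tags, `.text` would include their text as well. A rough sketch of keeping only the strings that are direct children of a tag; the markup below is invented for illustration, and the real page's structure may differ:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page's structure may differ
html = """
<div id="logo">
  BHARATI VIDYAMANDIR HINDI NIGHT HIGH SCHOOL AND JR COLLEGE (27230500360)
  <div class="inner">Text that should be skipped</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
logo = soup.find("div", id="logo")

# recursive=False limits the search to direct children,
# so strings inside nested tags are left out
direct_strings = logo.find_all(string=True, recursive=False)

# Join the pieces and collapse the surrounding whitespace
title = " ".join(" ".join(direct_strings).split())
print(title)
```

This relies on `find_all(string=True, recursive=False)`, which returns only the text nodes sitting directly under the tag.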

Answer 2: (score: 1)

To get the school name, you can do this:

>>> text = soup.find('div', {'id': 'logo'}).text.strip()
>>> text
'BHARATI VIDYAMANDIR HINDI NIGHT HIGH SCHOOL AND JR COLLEGE                                                                                                                                                                                                (27230500360)'

As you can see, there is a lot of whitespace between BHARATI VIDYAMANDIR HINDI NIGHT HIGH SCHOOL AND JR COLLEGE and (27230500360). To remove it, you can use a regular expression.

>>> import re
>>> text = re.sub(' +', ' ', text)
>>> text
'BHARATI VIDYAMANDIR HINDI NIGHT HIGH SCHOOL AND JR COLLEGE (27230500360)'

In short,

>>> re.sub(' +', ' ', soup.find('div', {'id': 'logo'}).text.strip())
'BHARATI VIDYAMANDIR HINDI NIGHT HIGH SCHOOL AND JR COLLEGE (27230500360)'
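
Note that the pattern `' +'` only collapses runs of spaces. If the scraped text might also contain newlines or tabs, `\s+` is a safer choice. A small comparison on a made-up string (the embedded newline and tab are assumptions, not taken from the real page):

```python
import re

# Made-up sample with a newline and a tab mixed into the whitespace
raw = "BHARATI VIDYAMANDIR\n      HINDI NIGHT HIGH SCHOOL AND JR COLLEGE\t(27230500360)  "

# Collapses runs of spaces only; the newline and tab survive
spaces_only = re.sub(' +', ' ', raw.strip())

# Collapses any run of whitespace (spaces, tabs, newlines) into one space
any_whitespace = re.sub(r'\s+', ' ', raw.strip())

print(repr(spaces_only))
print(repr(any_whitespace))
```

The `' '.join(text.split())` idiom used in Answer 0 has the same effect as the `\s+` version.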