I am scraping some links with BeautifulSoup.
This is the relevant part of the source of the URL I am scraping:
<div class="description">
Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.
</div>
Here is my BeautifulSoup code (relevant part only) that gets the text inside the description div:
# Python 2: uses urllib2 and the print statement
import sys
import urllib2
from bs4 import BeautifulSoup

quote_page = sys.argv[1]
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
description_box = soup.find('div', {'class':'description'})
description = description_box.get_text(separator=" ").strip()
print description
Running the script with python script.py https://example.com/page/2000 gives the following output:
Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.
How can I replace the line break with a period and a space, so that it looks like this:
Planet Nine was initially proposed to explain the clustering of orbits. Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.
Any ideas on how to do this?
Answer 0 (score: 1)
From here:
from bs4 import BeautifulSoup

html = '''<div class="description">
Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.
</div>'''
n = 2  # occurrence to replace, i.e. the 2nd newline (the 1st follows the opening <div> tag)
sep = '\n'  # separator, i.e. newline
cells = html.split(sep)
# Rejoin the pieces, putting ". " in place of the n-th newline only
html = sep.join(cells[:n]) + ". " + sep.join(cells[n:])
soup = BeautifulSoup(html, 'html.parser')
title_box = soup.find('div', attrs={'class': 'description'})
title = title_box.get_text().strip()
print(title)
Output:
Planet Nine was initially proposed to explain the clustering of orbits. Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.
Edit:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://blablabla.com")
soup = BeautifulSoup(page.content, 'html.parser')
description_box = soup.find('div', attrs={'class': 'description'})
description = description_box.get_text().strip()
# The text is already stripped, so the line break to replace is the 1st newline
n = 1  # occurrence to replace
sep = '\n'  # separator, i.e. newline
cells = description.split(sep)
desired = sep.join(cells[:n]) + ". " + sep.join(cells[n:])
print(desired)
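If the description can contain more than one line break, picking a single occurrence n gets awkward. A minimal sketch that joins every non-empty line instead, assuming the text has already been pulled out with get_text(); the helper name join_lines and the shortened sample string are made up for illustration:

```python
def join_lines(text, joiner=". "):
    """Join the non-empty lines of text with joiner (hypothetical helper)."""
    # Trim each line so stray indentation does not end up in the result
    lines = [line.strip() for line in text.splitlines()]
    # Skip blank lines, such as the ones right after <div> and before </div>
    return joiner.join(line for line in lines if line)


# Shortened stand-in for what get_text() returns for the description div
sample = (
    "\nPlanet Nine was initially proposed to explain the clustering of orbits\n"
    "Of Planet Nine's other effects, one was unexpected.\n"
)
print(join_lines(sample))
```

This needs no strip() beforehand and no counting of newlines, at the cost of treating every line break as a sentence boundary.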
Answer 1 (score: 0)
Try this:
description = description_box.get_text(separator=" ").rstrip("\n")
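One thing to watch with the line above: rstrip("\n") only removes newlines at the very end of the string, and separator=" " is placed between separate text nodes, so a newline inside a single text node is left alone. A small sketch of the difference on a made-up two-line string (the sample text is an assumption, not the real page content):

```python
text = "first line\nsecond line\n"

# rstrip("\n") only trims newlines at the right-hand end of the string
trimmed = text.rstrip("\n")
print(repr(trimmed))  # the newline between the two lines is still there

# strip() followed by replace() rewrites the interior newline as well
joined = text.strip().replace("\n", ". ")
print(repr(joined))
```

So for the question as asked, something along the lines of get_text().strip().replace("\n", ". ") is probably closer to what is wanted.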
Answer 2 (score: 0)
Split the lines, then join them before parsing.
from bs4 import BeautifulSoup
htmldata='''<div class="description">
Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.
</div>'''
# Strip each line and join with "" (no separator), so consecutive lines run together
htmldata = "".join(item.strip() for item in htmldata.split("\n"))
soup = BeautifulSoup(htmldata, 'html.parser')
description_box = soup.find('div', class_='description')
print(description_box.text)
Output:
Planet Nine was initially proposed to explain the clustering of orbitsOf Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.
Edited:
import requests
from bs4 import BeautifulSoup
htmldata=requests.get("url here").text
htmldata="".join(item.strip() for item in htmldata.split("\n"))
soup=BeautifulSoup(htmldata,'html.parser')
description_box = soup.find('div', class_='description')
print(description_box.text.strip())
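Because the lines are joined with an empty string, "orbits" and "Of" run together in the output shown earlier. A possible variant, applied to the extracted text rather than to the raw HTML, that adds a period only where a line does not already end with one. The helper name join_sentences is hypothetical, and note that it also adds a period to the last line if that line has no end punctuation:

```python
def join_sentences(text):
    """Join non-empty lines, adding a period where a line lacks end punctuation."""
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line[-1] not in ".!?":
            line += "."  # the line ended without punctuation, so close the sentence
        parts.append(line)
    return " ".join(parts)


# Shortened stand-in for description_box.text
print(join_sentences("clustering of orbits\nOf Planet Nine's other effects, one was unexpected."))
```

Working on the extracted text also avoids touching newlines elsewhere in the page, which joining the whole HTML string would do.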
Answer 3 (score: 0)
Use split and join together with select_one:
from bs4 import BeautifulSoup as bs
html = '''
<div class="description">
Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.
</div>
'''
soup = bs(html, 'lxml')
text = ' '.join(soup.select_one('.description').text.split('\n'))
print(text)
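The snippet above joins with a single space, so the result has no period between the two sentences and keeps a leading and trailing space. If a period is wanted, one alternative is a regular expression over the extracted text. A rough sketch, assuming the input is the string returned by .text; the function name newlines_to_periods is made up for illustration:

```python
import re


def newlines_to_periods(text):
    # Trim the ends, then turn each run of newlines (plus any whitespace
    # around it) into a single ". "
    return re.sub(r"\s*\n\s*", ". ", text.strip())


# Shortened stand-in for soup.select_one('.description').text
sample = "\nPlanet Nine was initially proposed to explain the clustering of orbits\nOf Planet Nine's other effects, one was unexpected.\n"
print(newlines_to_periods(sample))
```

Consecutive blank lines collapse into one separator here, which a plain replace("\n", ". ") would not do.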