This is my code for scraping and parsing the relevant information from wordsinasentence.com, a site that provides useful context sentences for a given word:
# First import requests to fetch the HTML of the target page;
# in this case the website is https://wordsinasentence.com
import requests
target = input("The word you want to search: ")
res = requests.get("https://wordsinasentence.com/" + target + "-in-a-sentence/")
# Check the HTTP status so that a failed request is reported
try:
    res.raise_for_status()
except Exception as e:
    print("There's a problem while connecting to the wordsinasentence server:", e)
# The raw response is hard to read, so it needs to be parsed.
# Use BeautifulSoup to make it readable
import bs4
html_soup = bs4.BeautifulSoup(res.text, 'html.parser')
# Check that it has been parsed correctly
# Now extract the definition of the target word
keywords = html_soup.select('Definition')
When I call select('Definition'), it always returns an empty list, even though printing the html_soup variable shows the following:
<p onclick='responsiveVoice.speak("not done for any particular reason; chosen or done at random");' style="font-weight: bold; font-family:Arial; font-size:20px; color:#504A4B;padding-bottom:0px;">Definition of Arbitrary</p>
[]
What could be going wrong?
Answer 0 (score: 0)
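The problem is that you are using the wrong method to search for text: select() is for CSS selectors, so select('Definition') looks for a tag named Definition rather than for text containing that word. Instead, you can use find_all with the string keyword argument and a function to pick out the tags you are looking for.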
def has_text_def(s):
    # s is the tag's string, or None if the tag has no single string
    return s and s.startswith('Definition of')
definitions = html_soup.find_all('p', string=has_text_def)
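To make the difference concrete, here is a minimal sketch run against a small inline fragment instead of the live site. The HTML is a hypothetical stand-in modelled on the <p> tag printed in the question, and the second paragraph is an assumption about what follows it on the real page.

```python
import bs4

# Hypothetical fragment modelled on the <p> tag shown in the question;
# the real page layout may differ.
html = (
    '<div>'
    '<p style="font-weight: bold;">Definition of Arbitrary</p>'
    '<p>chosen or done at random</p>'
    '</div>'
)
soup = bs4.BeautifulSoup(html, 'html.parser')

def has_text_def(s):
    # s is None for tags that do not contain exactly one string
    return bool(s) and s.startswith('Definition of')

# A bare word in a CSS selector is a tag name, so this looks for
# <Definition> elements, and there are none in the document.
by_selector = soup.select('Definition')

# Matching on the tag's string finds the heading paragraph.
by_string = soup.find_all('p', string=has_text_def)

print(by_selector)
print([p.get_text() for p in by_string])
```

The selector search comes back empty because no element is named Definition, while the string-based search returns only the paragraph whose text starts with "Definition of".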
By the way, the matched tag only holds the heading. To get the definition itself, you need to move to the next element in the tree (with next_sibling):
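for p in definitions:
    # The first next_sibling is the whitespace between the tags,
    # the second is the element that holds the definition
    print(p.next_sibling.next_sibling.text)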
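The double next_sibling is needed because whitespace between tags is itself a node in the tree. The sketch below illustrates this on a hypothetical fragment (the layout is an assumption, not the site's actual markup) and shows find_next_sibling('p') as an alternative that skips text nodes; that call is a substitution on my part, not part of the approach above.

```python
from bs4 import BeautifulSoup

# Hypothetical layout: heading paragraph, newline, definition paragraph.
html = (
    '<div>\n'
    '<p>Definition of Arbitrary</p>\n'
    '<p>not done for any particular reason; chosen or done at random</p>\n'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')

heading = soup.find(
    'p', string=lambda s: bool(s) and s.startswith('Definition of')
)

first = heading.next_sibling                 # the newline text node
second = heading.next_sibling.next_sibling   # the following <p> tag
robust = heading.find_next_sibling('p')      # skips text nodes entirely

print(repr(first))
print(second.get_text())
```

If the real page places other nodes between the heading and the definition, the chained next_sibling calls will land on the wrong node, whereas find_next_sibling('p') keeps walking until it reaches a <p> tag.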