BS4 select() method

Time: 2017-11-09 06:49:15

Tags: python

Here is my code for fetching and parsing the necessary information from wordsinasentence.com, a site that provides useful context sentences for a given word:

# first import requests to fetch the HTML from the target page
# in this case the website is https://wordsinasentence.com

import requests

target = input("The word you want to search : ")

res = requests.get("https://wordsinasentence.com/"+ target+"-in-a-sentence/")

# raise_for_status() raises on an HTTP error status, so a failed request gets reported
try:
    res.raise_for_status()
except Exception as e:
    print("There was a problem connecting to the wordsinasentence server:", e)

# the raw response is hard to read, so it needs to be parsed
# use BeautifulSoup to parse it

import bs4
html_soup = bs4.BeautifulSoup(res.text, 'html.parser')

# check that it has been parsed correctly
# now extract the Definition of the target word

keywords = html_soup.select('Definition')

When I call select('Definition'), it always returns an empty list, even though printing the html_soup variable shows the following:

<p onclick='responsiveVoice.speak("not done for any particular reason; chosen or done at random");' style="font-weight: bold; font-family:Arial; font-size:20px; color:#504A4B;padding-bottom:0px;">Definition of Arbitrary</p>

[]

What could be going wrong?

1 answer:

Answer 0 (score: 0)

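The problem is that you are using the wrong method to search by text (select() is for CSS selectors). You can use the string keyword argument of find_all together with a function to select the tags you are looking for: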

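def has_text_def(s):
    # s is the tag's string, or None if the tag has no single string
    return s and s.startswith('Definition of')

# html_soup is the BeautifulSoup object built in the question
definitions = html_soup.find_all('p', string=has_text_def)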
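To see the difference between the two methods on something small, here is a minimal sketch. The sample_html fragment is made up for illustration and only loosely modeled on the markup in the question; the real page may be structured differently.

```python
import bs4

# Hypothetical fragment, loosely modeled on the markup shown in the question.
sample_html = """
<div>
  <p style="font-weight: bold;">Definition of Arbitrary</p>
  <p>not done for any particular reason; chosen or done at random</p>
</div>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")

# select() takes a CSS selector, so 'Definition' is read as a tag name
# (a <Definition> element), not as text to search for.
by_selector = soup.select("Definition")

def has_text_def(s):
    # s is the tag's string, or None if the tag has no single string
    return s and s.startswith("Definition of")

# find_all() with string= matches <p> tags by their text content.
by_text = soup.find_all("p", string=has_text_def)

print(by_selector)
print([p.get_text() for p in by_text])
```

The selector lookup finds nothing because the fragment has no element named Definition, while the text-based lookup returns the one matching paragraph.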

By the way, to reach the definition you need to move to the next element in the tree (with next_sibling):

for p in definitions:
    print(p.next_sibling.next_sibling.text)
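The double next_sibling is probably needed because the first sibling of a tag is often the whitespace text between two tags rather than the next tag itself. The sketch below illustrates this on a made-up fragment (sample_html is an assumption, not the real page) and shows find_next_sibling() as one alternative that skips text nodes.

```python
import bs4

# Hypothetical fragment; the newline between the two <p> tags matters here.
sample_html = """<div>
<p>Definition of Arbitrary</p>
<p>not done for any particular reason; chosen or done at random</p>
</div>"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")
heading = soup.find("p", string=lambda s: s and s.startswith("Definition of"))

# The immediate sibling is the newline text node between the tags.
first = heading.next_sibling
# The sibling after that is the <p> tag holding the definition.
second = first.next_sibling

# find_next_sibling() only considers tags, so it skips the whitespace.
via_helper = heading.find_next_sibling("p")

print(repr(first))
print(second.get_text())
print(via_helper.get_text())
```

If the page's whitespace changes, chaining next_sibling a fixed number of times can land on the wrong node, so looking up the next tag by name may be the more robust choice.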