Question

I want to get example usage sentences from this site.

Here is the HTML source of the page:

<vcom:examples lang="en" word="creep" count="4" filter="0" class="vcom_examples">

<div class="exampleBrowser hasNext">

<div class="domains">
<a href="javascript:void(0)" title="All Sources" class="selected">All Sources</a><a href="javascript:void(0)" title="Fiction" data-code="F">Fiction</a><a href="javascript:void(0)" title="Arts / Culture" data-code="A">Arts / Culture</a><a href="javascript:void(0)" title="News" data-code="N">News</a><a href="javascript:void(0)" title="Business" data-code="B">Business</a><a href="javascript:void(0)" title="Sports" data-code="S">Sports</a><a href="javascript:void(0)" title="Science / Med" data-code="M">Science / Med</a><a href="javascript:void(0)" title="Technology" data-code="T">Technology</a></div>

<div class="container" style="height: auto;">
<div class="results" style="left: 0px;">

<ul>
<li><div class="sentence">
If you believe their campaigns, it’s the choice between a <strong>creep</strong> and a crook.</div>
<a target="_blank" class="source" href="https://www.theguardian.com/us-news/2016/nov/22/journalists-media-election-2016-donald-trump">
<span class="corpus">The Guardian</span>
<span class="date">Nov 22, 2016</span></a>
</li>

<li>
<div class="sentence">
From stingrays to spy planes, we are seeing the consequences of powerful surveillance technology <strong>creeping</strong> into local law enforcement without adequate limits.
</div>
<a target="_blank" class="source" href="http://www.slate.com/articles/technology/future_tense/2016/11/should_police_bodycams_come_with_facial_recognition_software.html">
<span class="corpus">Slate</span>
<span class="date">Nov 22, 2016</span></a>
</li>

</ul></div></div>

<div class="buttons"><a class="prev ss-navigateleft" title="prev">Prev</a><a class="next ss-navigateright right" title="next">Next</a></div></div></vcom:examples>

Here is my Python code:

import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('https://www.vocabulary.com/dictionary/creep').read()
soup = bs.BeautifulSoup(sauce,'lxml')

for examples in soup.find_all('p',class_ = 'sentence'):
    print(examples.text)

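I successfully scraped the word's definitions in the same way as above. However, when I try to scrape the example sentences this way, it returns nothing.

Why doesn't it return the example sentences?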

Answer 1

You can't get the sentences with beautifulsoup4, because the examples are loaded later from JSON. In the HTML response, that section looks like this:

<vcom:examples lang="en" word="creep" count="4" filter="0" ></vcom:examples>

It is filled in later by code from https://cdn.vocab.com/js/module-esekdz.js.
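To make the difference concrete, here is a minimal sketch that counts elements with class "sentence" in two strings: the empty placeholder that the server returns, and a shortened copy of what the browser shows after the script has run. It uses only the standard library's html.parser so that it runs without extra packages. The names SentenceCounter and count_sentences are made up for this illustration, and the rendered string is a hand-trimmed excerpt of the markup in the question, not a real response.

```python
from html.parser import HTMLParser


class SentenceCounter(HTMLParser):
    """Counts start tags whose class attribute contains 'sentence'."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if "sentence" in classes:
            self.count += 1


def count_sentences(html):
    """Return how many elements with class 'sentence' appear in html."""
    parser = SentenceCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# What urllib/requests receive: the placeholder is empty.
static_html = (
    '<vcom:examples lang="en" word="creep" count="4" filter="0" >'
    '</vcom:examples>'
)

# What the browser shows after the script has filled it in (trimmed excerpt).
rendered_html = (
    '<vcom:examples lang="en" word="creep" count="4" filter="0">'
    '<ul><li><div class="sentence">a <strong>creep</strong> and a crook.'
    '</div></li></ul>'
    '</vcom:examples>'
)

print(count_sentences(static_html))
print(count_sentences(rendered_html))
```

Any parser that only sees the static response, BeautifulSoup included, is in the same position as the first call: there is nothing with class "sentence" to find.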

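Instead of parsing the whole HTML body, you can fetch the example sentences directly from the JSON with this minimal snippet. Just choose a word, a domain, and the maximum number of results: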

import requests

search_word = 'creep'
# Domain codes: None = all sources, then
# "Fiction", "Arts / Culture", "News", "Business", "Sports", "Science / Med", "Technology"
domains = [None, "F", "A", "N", "B", "S", "M", "T"]

link = "https://corpus.vocabulary.com/api/1.0/examples.json"

response = requests.get(link, params={'query': search_word, 'domain': domains[0], 'maxResults': 24})

if response.ok:
    for example in response.json()['result']['sentences']:
        print(example['sentence'])
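For reference, a small offline sketch of the two pieces in that snippet: the URL the GET request resolves to, and the walk through the decoded JSON. The parameter names and the 'result' / 'sentences' / 'sentence' keys are taken from the snippet above; the helper names build_examples_url and extract_sentences and the sample payload are invented for illustration and do not come from the site. A domain of None is simply left out of the query string, which is also what requests does with None values.

```python
from urllib.parse import urlencode

API_URL = "https://corpus.vocabulary.com/api/1.0/examples.json"


def build_examples_url(word, domain=None, max_results=24):
    """Build the GET URL for a word; omit 'domain' to search all sources."""
    params = {"query": word, "maxResults": max_results}
    if domain is not None:
        params["domain"] = domain
    return f"{API_URL}?{urlencode(params)}"


def extract_sentences(payload):
    """Return the sentence strings from a decoded examples.json payload."""
    sentences = payload.get("result", {}).get("sentences", [])
    return [item["sentence"] for item in sentences]


# Made-up payload that only mirrors the keys used in the snippet above.
sample_payload = {
    "result": {
        "sentences": [
            {"sentence": "First example sentence."},
            {"sentence": "Second example sentence."},
        ]
    }
}

print(build_examples_url("creep"))
print(build_examples_url("creep", domain="F"))
print(extract_sentences(sample_payload))
```

Using .get with defaults means an unexpected or empty payload yields an empty list instead of a KeyError, which may be preferable when looping over many words.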

Answer 2

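After a lot of experimenting with urllib.request.urlopen (or requests.get) and bs4, I also failed to find any element with the sentence class. It seems that neither approach sees the full content of the page. In that case, I'm afraid you will have to use another package, such as selenium.webdriver. The code is as follows: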

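from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Path to your local ChromeDriver binary
chrome_driver_path = 'your_working_directory\\chromedriver.exe'
# Selenium 4 takes the driver path through a Service object
mydriver = webdriver.Chrome(service=Service(chrome_driver_path))

mydriver.get('https://www.vocabulary.com/dictionary/creep')
# Selenium 4 replaced find_elements_by_class_name with find_elements(By.CLASS_NAME, ...)
sentences = mydriver.find_elements(By.CLASS_NAME, 'sentence')

for sentence in sentences:
    print(sentence.text)

# Close the browser when done
mydriver.quit()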

Here is a very clear and concise video tutorial on how to use "Selenium / ChromeDriver" for web scraping.

Dictionary web scraper doesn't return example sentences
