Confused by BeautifulSoup.find?

Asked: 2019-10-23 19:04:58

Tags: python html web-scraping beautifulsoup

I'm trying to scrape the universities attended by attorneys at a particular law firm, but I'm not sure how to grab both universities listed at this link: https://www.wlrk.com/attorney/hahn/. As the first linked image shows, the two universities this attorney attended sit under two separate 'li' tags.

When I run my code, I only get the HTML up to the end of the first 'li' tag (as the second linked image shows), not the second li section, so I only get the first university, 'Carleton College:'


[image: html code snippet output]

2 Answers:

Answer 0 (score: 0)

bs only picks up the first li element; I'm not sure why. If you want to try lxml, here is one way:

import requests
from lxml import html


url = 'https://www.wlrk.com/attorney/hahn/'
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'})

# lxml tolerates the stray end tags in the markup, so both <li> entries are kept
tree = html.fromstring(res.content)
education = tree.xpath("//div[@class='attorney--education']//li/a/text()")

print(education)

Output:

['Carleton College', 'New York University School of Law']
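If you also want the year or degree that follows each school name, a small extension of the same approach (a sketch, assuming the page keeps the structure shown above) is to select the whole li elements and read their full text:

import requests
from lxml import html

url = 'https://www.wlrk.com/attorney/hahn/'
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'})
tree = html.fromstring(res.content)

# Each <li> holds the school link plus trailing text such as ", 2013" or ", J.D."
for li in tree.xpath("//div[@class='attorney--education']//li"):
    print(li.text_content().strip())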

Answer 1 (score: 0)

Change the parser. I would also use select and target the a elements directly. 'lxml' is more forgiving and handles the stray a end tags that shouldn't be there. Also, find returns only the first match, whereas find_all returns all matches (see the sketch at the end of this answer).

For example:

<a href="/attorneys/?asf_ugs=257">Carleton College</a></a>

Stray end tag a.
From line 231, column 127; to line 231, column 130
ollege</a></a>, 2013

Stray end tag a.
From line 231, column 239; to line 231, column 242
of Law</a></a>, J.D.

source

import requests
from bs4 import BeautifulSoup as soup

url = 'https://www.wlrk.com/attorney/hahn/'
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'})

# 'lxml' repairs the stray </a> tags, so both education entries are parsed
personal_soup = soup(res.content, "lxml")
educations = [a.text for a in personal_soup.select('.attorney--education a')]
print(educations)
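To see the parser difference in isolation, here is a minimal sketch that runs both parsers over a reduced copy of the problematic markup (the snippet below is an approximation reconstructed from the validator excerpt above, not the live page):

from bs4 import BeautifulSoup

# Reduced approximation of the page's markup, keeping the stray </a> tags
html_doc = """
<div class="attorney--education"><ul>
<li><a href="/attorneys/?asf_ugs=257">Carleton College</a></a>, 2013</li>
<li><a>New York University School of Law</a></a>, J.D.</li>
</ul></div>
"""

for parser in ('html.parser', 'lxml'):
    tree = BeautifulSoup(html_doc, parser)
    div = tree.find('div', class_='attorney--education')
    # find_all returns every matching <a>; find would return only the first
    print(parser, [a.text for a in div.find_all('a')])

With 'html.parser', the stray </a> tends to close the enclosing elements early, so the div typically ends up holding only the first school, which matches the behavior described in the question; 'lxml' discards the stray tags and reports both schools.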