I'm trying to scrape the universities attended by lawyers at a particular law firm, but I'm not sure how to grab the two universities listed at this link: https://www.wlrk.com/attorney/hahn/. As you can see from the image at the first link, the two universities this attorney attended sit under two separate 'li' tags.
When I run my code, I only get the HTML up to the end of the first 'li' tag (as shown in the image at the second link), not the second li section, so I only get the first university, "Carleton College:"
Answer 0 (score: 0)
bs only picks up the first li element; I don't know why. If you want to try lxml, here is one way:
import requests
from lxml import html
url = 'https://www.wlrk.com/attorney/hahn/'
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'})
tree = html.fromstring(res.content)
education = tree.xpath("//div[@class='attorney--education']//li/a/text()")
print(education)
Output:
['Carleton College', 'New York University School of Law']
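The same XPath can be sanity-checked offline. The snippet below is a hypothetical reduction of the live page's markup (hrefs replaced with '#'), deliberately keeping the stray closing a tags that the page serves; libxml2's recovery mode simply drops them:

```python
from lxml import html

# Hypothetical reduction of the page's markup, stray </a> tags included
snippet = ('<div class="attorney--education"><ul>'
           '<li><a href="#">Carleton College</a></a>, 2013</li>'
           '<li><a href="#">New York University School of Law</a></a>, J.D.</li>'
           '</ul></div>')
tree = html.fromstring(snippet)
# Same XPath as above: anchor text inside each li of the education div
education = tree.xpath("//div[@class='attorney--education']//li/a/text()")
print(education)
```

Both li entries come back, so the XPath itself is not the limiting factor; the parser's error recovery is.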
Answer 1 (score: 0)
Change the parser, and I would use select and target the a elements directly. 'lxml' is more forgiving and will handle the stray a end tags that shouldn't be there. Also note that find returns only the first match, whereas find_all returns all matches.
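The find vs find_all difference is easy to see on a small standalone snippet (illustrative markup, not the live page's actual source):

```python
from bs4 import BeautifulSoup

# Illustrative markup only, not the live page's actual source
html_doc = ('<ul><li><a href="#">Carleton College</a>, 2013</li>'
            '<li><a href="#">New York University School of Law</a>, J.D.</li></ul>')
soup = BeautifulSoup(html_doc, 'html.parser')

first = soup.find('a').text                     # find: first match only
all_links = [a.text for a in soup.find_all('a')]  # find_all: every match
print(first)
print(all_links)
```

So code built on find (or on attribute-style access like soup.li.a) silently stops at the first hit, which matches the behavior described in the question.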
For example,
<a href="/attorneys/?asf_ugs=257">Carleton College</a></a>
Stray end tag a.
From line 231, column 127; to line 231, column 130
ollege</a></a>, 2013
Stray end tag a.
From line 231, column 239; to line 231, column 242
of Law</a></a>, J.D.
import requests
from bs4 import BeautifulSoup as soup
url = 'https://www.wlrk.com/attorney/hahn/'
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'})
personal_soup = soup(res.content, "lxml")
educations = [a.text for a in personal_soup.select('.attorney--education a')]
print(educations)
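The select approach can also be verified without hitting the network, against a hypothetical local snippet that reproduces the stray end tags the validator reported:

```python
from bs4 import BeautifulSoup

# Hypothetical reduction of the page's markup, with the stray </a> tags left in
broken = ('<div class="attorney--education"><ul>'
          '<li><a href="#">Carleton College</a></a>, 2013</li>'
          '<li><a href="#">New York University School of Law</a></a>, J.D.</li>'
          '</ul></div>')
personal_soup = BeautifulSoup(broken, 'lxml')
# CSS selector: every anchor anywhere under the education div
educations = [a.text for a in personal_soup.select('.attorney--education a')]
print(educations)
```

The 'lxml' tree builder discards the unmatched closing tags, so select reaches the anchors in both li elements.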