I am trying to scrape texts from an online corpus. The texts are arranged on the website in a tree: clicking A opens page B, and on B, clicking C opens the text. There are about 50 links on A; on B the number varies between 3 and 150; sometimes there are links on C as well, but I am not interested in those.
Here is what I did to achieve this: I opened A, parsed it with BeautifulSoup, collected the links I wanted, and saved them to a .txt file. Then I did the following:
    Url_List = []
    with open("Aramaic_Url_List.txt", "r") as Url_List:
        urls = Url_List.read()
    A_url_list = urls.splitlines()
    Yeni_A_url_list = [showsubtexts for showsubtexts in A_url_list if len(showsubtexts) > 52]
This gave me all the links I need from page A, as a list.
Then I wrote a small script to test whether I could get the links on a B page from an element of the list Yeni_A_url_list. Here is the script:
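As a side note, whitespace can also be removed at load time, so no hidden characters survive into the list. A minimal self-contained sketch of the load-and-filter step (the file name and its contents are invented here so the snippet runs on its own):

```python
# Write a small stand-in for Aramaic_Url_List.txt so the example is
# self-contained; the real file holds the links collected from page A.
with open("demo_url_list.txt", "w") as f:
    f.write("http://cal1.cn.huc.edu/showsubtexts.php?keyword=21200 \n")
    f.write("short\n")

with open("demo_url_list.txt", "r") as f:
    # strip each line as it is read, so trailing spaces/newlines are gone
    A_url_list = [line.strip() for line in f]

# keep only the long showsubtexts links, as in the original filter
Yeni_A_url_list = [u for u in A_url_list if len(u) > 52]
print(Yeni_A_url_list)  # → ['http://cal1.cn.huc.edu/showsubtexts.php?keyword=21200']
```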
    data2 = requests.get(Yeni_A_url_list[1].strip())
    data2.raise_for_status()
    data2_Metin = data2.text
    soup_data2 = BeautifulSoup(data2_Metin, "lxml")
    for link in soup_data2.find_all("a"):
        print(link.get("href"))
The strip may not do anything, but I figured it couldn't hurt. The script works quite well for a single element. So I thought it was time to write a function that gets all the B-level links for every link on page A. Here is my function:
    def ListedenLinkAl(h):
        if h in Yeni_A_url_list:
            print(h)
            g = requests.get(h)
            g.raise_for_status()
            data_mtn = g.text
            data_soup = BeautifulSoup(data_mtn, "lxml")
            oP = [b.get("href") for b in data_soup.find_all("a")]
            tk = list(set(oP))
            sleep(3)
            return tk
The print is there so I can see which link the function has processed, and the sleep is there so as not to overload the server (for some reason, time.sleep gave me a syntax error). The function also works for a single element of the list, i.e. the following works: ListedenLinkAl(Yeni_A_url_list[1])
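The time.sleep error mentioned above is most likely an import mismatch rather than a problem with the function itself; exactly which error appears depends on what was typed, but the two valid spellings are sketched below:

```python
# Two equivalent ways to call sleep; mixing the styles causes errors.
import time
time.sleep(0)  # works: call through the module

from time import sleep
sleep(0)       # works: the name was imported directly

# Writing "time.sleep(3)" after only "from time import sleep" raises
# NameError, because the module name "time" was never bound.
```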
So I thought it was time to apply this function to every element of the list Yeni_A_url_list with a list comprehension:

    Temiz_url_Listesi = [ListedenLinkAl(x) for x in Yeni_A_url_list]
I got the following error:
In [45]: Temiz_url_Listesi=[ListedenLinkAl(x) for x in Yeni_A_url_list]
http://cal1.cn.huc.edu/showsubtexts.php?keyword=21200
Traceback (most recent call last):
File "<ipython-input-45-8e4811c83c3f>", line 1, in <module>
Temiz_url_Listesi=[ListedenLinkAl(x) for x in Yeni_A_url_list]
File "<ipython-input-45-8e4811c83c3f>", line 1, in <listcomp>
Temiz_url_Listesi=[ListedenLinkAl(x) for x in Yeni_A_url_list]
File "<ipython-input-36-390e6ed1eae5>", line 6, in ListedenLinkAl
g=requests.get(h)
File "/home/dk/anaconda3/lib/python3.5/site-packages/requests/api.py", line 67, in get
return request('get', url, params=params, **kwargs)
File "/home/dk/anaconda3/lib/python3.5/site-packages/requests/api.py", line 53, in request
return session.request(method=method, url=url, **kwargs)
File "/home/dk/anaconda3/lib/python3.5/site-packages/requests/sessions.py", line 468, in request
resp = self.send(prep, **send_kwargs)
File "/home/dk/anaconda3/lib/python3.5/site-packages/requests/sessions.py", line 570, in send
adapter = self.get_adapter(url=request.url)
File "/home/dk/anaconda3/lib/python3.5/site-packages/requests/sessions.py", line 644, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
InvalidSchema: No connection adapters were found for 'http://cal1.cn.huc.edu/showsubtexts.php?keyword=21200'
I don't understand why the function works for a single element of the list but not in the list comprehension.
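As the answer below points out, the likely culprit is hidden whitespace around some of the URLs: the one element tested above was passed through .strip(), while the list comprehension passes the raw strings. A quick self-contained check (the list and the stray characters here are made up for illustration) is to compare each URL with its stripped form:

```python
# Hypothetical sample standing in for the real Yeni_A_url_list:
# one clean URL and two with stray whitespace.
Yeni_A_url_list = [
    "http://cal1.cn.huc.edu/showsubtexts.php?keyword=21200",
    "http://cal1.cn.huc.edu/showsubtexts.php?keyword=21300 ",    # trailing space
    "\xa0http://cal1.cn.huc.edu/showsubtexts.php?keyword=21400",  # non-breaking space
]

# URLs that differ from their stripped form contain hidden characters;
# requests then sees something like "\xa0http://..." and raises
# InvalidSchema, even though the URL looks fine when printed.
dirty = [u for u in Yeni_A_url_list if u != u.strip()]
print(len(dirty))  # → 2
```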
Answer 0 (score: 0)
It looks like there are extra characters around the URLs; clean them up with str.strip():

    g = requests.get(h.strip())