Hope you are all doing well! I'm a beginner using Python 2.7. I want to extract emails from a publicly available directory website that doesn't seem to have an API: http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search
The code stops collecting emails at the point on the page where it says "Load More"!
Here is my code:
import requests
import re
from bs4 import BeautifulSoup

file_handler = open('mail.txt', 'w')
soup = BeautifulSoup(
    requests.get('http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search').content,
    'html.parser')
tags = soup('a')
list_new = []
for tag in tags:
    # extract addresses of the form href="mailto:user@host">user@host</a>
    found = re.findall(r'href="mailto:([^"@]+@[^"]+)">\1</a>', '%s' % tag)
    if found:
        list_new = list_new + found
for x in list_new:
    file_handler.write('%s\n' % x)
file_handler.close()
How can I make sure the code runs all the way to the end of the directory and doesn't stop where it says "Load More"? Thanks. Warmest regards.
Answer (score: 1)
You just need to POST some data, specifically incrementing `group_no`, to simulate clicking the "Load More" button:
from bs4 import BeautifulSoup
import requests

# you can change these values to influence the results
data = {"group_no": "1",
        "search": "category",
        "segment": "",
        "activity": "",
        "retail": "",
        "category": "",
        "Bpark": "",
        "alpha": ""}

post = "http://www.tecomdirectory.com/getautocomplete_keyword.php"

with requests.Session() as s:
    soup = BeautifulSoup(
        s.get("http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search").content,
        "html.parser")
    print([a["href"] for a in soup.select("a[href^=mailto:]")])
    for i in range(1, 5):
        data["group_no"] = str(i)
        soup = BeautifulSoup(s.post(post, data=data).content, "html.parser")
        print([a["href"] for a in soup.select("a[href^=mailto:]")])
To go all the way to the end, you can loop until a POST returns no html, which signals that no more pages can be loaded:
def yield_all_mails():
    data = {"group_no": "1",
            "search": "category",
            "segment": "",
            "activity": "",
            "retail": "",
            "category": "",
            "Bpark": "",
            "alpha": ""}
    post = "http://www.tecomdirectory.com/getautocomplete_keyword.php"
    start = "http://www.tecomdirectory.com/companies.php?segment=&activity=&search=category&submit=Search"
    with requests.Session() as s:
        # fetch the first page once and reuse the response
        resp = s.get(start)
        soup = BeautifulSoup(resp.content, "html.parser")
        yield (a["href"] for a in soup.select("a[href^=mailto:]"))
        i = 1
        # keep posting until the server returns an empty body
        while resp.content.strip():
            data["group_no"] = str(i)
            resp = s.post(post, data=data)
            soup = BeautifulSoup(resp.content, "html.parser")
            yield (a["href"] for a in soup.select("a[href^=mailto:]"))
            i += 1
So if we run the function as below, setting `"alpha": "Z"` to iterate only up to the 'Z' companies:
from itertools import chain

for mail in chain.from_iterable(yield_all_mails()):
    print(mail)
We get:
mailto:info@10pearls.com
mailto:fady@24group.ae
mailto:pepe@2heads.tv
mailto:2interact@2interact.us
mailto:gc@worldig.com
mailto:marilyn.pais@3i-infotech.com
mailto:3mgulf@mmm.com
mailto:venkat@4gid.com
mailto:info@4power.biz
mailto:info@4sstudyabroad.com
mailto:fouad@622agency.com
mailto:sahar@7quality.com
mailto:mike.atack@8ack.com
mailto:zyara@emirates.net.ae
mailto:aokasha@zynx.com
Process finished with exit code 0
You should pause between requests so you don't hammer the server and get yourself blocked.
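As a minimal sketch of that advice (the helper name `polite_fetch` and the delay value are my own choices, not part of the original answer), you can wrap the request call in a small helper that sleeps before each call:

```python
import time


def polite_fetch(fetch, delay=1.0):
    """Wrap a request callable so every call waits `delay` seconds first.

    `fetch` is any callable that performs the actual HTTP request,
    e.g. lambda d: s.post(post, data=d). The sleep keeps the request
    rate low so the server is not hammered.
    """
    def wrapper(*args, **kwargs):
        time.sleep(delay)
        return fetch(*args, **kwargs)
    return wrapper


# hypothetical usage inside the while loop above:
# fetch_page = polite_fetch(lambda d: s.post(post, data=d), delay=2.0)
# resp = fetch_page(data)
```

This keeps the throttling logic in one place, so changing the delay (or adding jitter) later only touches the wrapper.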