Extracting email addresses from Craigslist posts

Time: 2020-05-13 03:34:36

Tags: web-scraping python-requests

Is there a way to find the email address in a Craigslist listing without using selenium?

import requests,re
from bs4 import BeautifulSoup as bs
url='https://newyork.craigslist.org/wch/prk/d/hawthorne-10x15-drive-up-storage-unit/7122801839.html' #example url
headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
res=requests.get(url,headers=headers)

The email address changes with every request (I assume). I tried x=re.findall('(\w{32})',res.text), but it doesn't work.

1 answer:

Answer 0 (score: 0)

Craigslist fetches the email address by sending a POST request to this special URL:

https://newyork.craigslist.org/contactinfo/nyc/prk/U_ID

In this case the value of U_ID is 7122801839 (taken from the URL you provided).
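As a rough sketch of that step, the ID can be pulled from the end of the listing URL and dropped into the contact URL. The helper names `posting_id` and `contactinfo_url` are made up for illustration, and the `nyc` and `prk` path segments are simply copied from the URL above; I have not checked how they map to other listings.

```python
import re
from urllib.parse import urlsplit


def posting_id(listing_url: str) -> str:
    """Return the numeric ID that ends a listing URL such as .../7122801839.html."""
    path = urlsplit(listing_url).path
    match = re.search(r"/(\d+)\.html$", path)
    if match is None:
        raise ValueError(f"no posting ID found in {listing_url!r}")
    return match.group(1)


def contactinfo_url(listing_url: str, area: str = "nyc", category: str = "prk") -> str:
    """Build the contact URL; the area/category defaults are assumptions taken from the answer."""
    parts = urlsplit(listing_url)
    return f"{parts.scheme}://{parts.netloc}/contactinfo/{area}/{category}/{posting_id(listing_url)}"


listing = "https://newyork.craigslist.org/wch/prk/d/hawthorne-10x15-drive-up-storage-unit/7122801839.html"
print(posting_id(listing))
print(contactinfo_url(listing))
```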

You can reproduce this request like so:

from bs4 import BeautifulSoup
import requests
import json

U_ID = "7122801839"

URL = f"https://newyork.craigslist.org/contactinfo/nyc/prk/{U_ID}"

COOKIE_VALUE = "cookie" # Replace this with a valid cookie
HEADERS = { 
'Content-Type':'application/x-www-form-urlencoded; charset=UTF-8',
'Accept':'*/*',
'Accept-Language':'en-us',
'Accept-Encoding':'gzip, deflate, br',
'Host':'newyork.craigslist.org',
'Origin':'https',
'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Safari/605.1.15',
'Connection':'keep-alive',
'Referer':'https',
# Content-Length is left out: requests computes it from the request body
'Cookie':COOKIE_VALUE,
'X-Requested-With':'XMLHttpRequest',
 }


PAYLOAD = {
'MIME Type':'application/x-www-form-urlencoded; charset=UTF-8',
}


response = requests.request(
    method='POST',
    url=URL,
    headers=HEADERS,
    data=PAYLOAD
    )

html = json.loads(response.text)['replyContent']

soup = BeautifulSoup(html,'html.parser')

email = soup.find(class_='mailapp').get('href')
email = email.split('?subject')[0].replace('mailto:','')

print(email)
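If splitting the href on '?subject' feels fragile, the same cleanup could be done with the standard library's URL parser. This is only a sketch: `email_from_mailto` is a hypothetical helper, and the sample href is invented rather than a real Craigslist response.

```python
from urllib.parse import urlsplit, unquote


def email_from_mailto(href: str) -> str:
    """Return the address part of a mailto: link, dropping any ?subject=... query."""
    parts = urlsplit(href)
    if parts.scheme.lower() != "mailto":
        raise ValueError(f"not a mailto link: {href!r}")
    # For mailto: links the address is the path; unquote handles %-escapes.
    return unquote(parts.path)


# Invented sample value, standing in for soup.find(class_='mailapp').get('href')
sample_href = "mailto:abc123@example.org?subject=10x15%20storage%20unit"
print(email_from_mailto(sample_href))
```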

Note that this code will not work without a cookie, so you need to copy the cookie from your browser.
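One way to handle the copied value, sketched under the assumption that the browser gives you a single "name=value; name=value" string: split it into a dict, which requests accepts through its `cookies=` argument. `parse_cookie_header` is a hypothetical helper and the cookie names below are placeholders, not real Craigslist cookies.

```python
def parse_cookie_header(raw: str) -> dict:
    """Split a 'name=value; name=value' string into a dict of cookies."""
    cookies = {}
    for pair in raw.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep and name:  # skip fragments that have no '=' or no name
            cookies[name] = value
    return cookies


# Placeholder string; paste the value copied from your browser here.
raw_cookie = "session_id=abc123; region=newyork"
cookies = parse_cookie_header(raw_cookie)
print(cookies)

# The dict could then be passed along instead of a raw 'Cookie' header, e.g.
# requests.post(URL, headers=HEADERS, data=PAYLOAD, cookies=cookies)
```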
