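I am using Python 3 and Beautiful Soup to scrape contact details such as email addresses and phone numbers. The URLs to scrape are found by running a Google search for a given keyword.
I am scraping the email addresses from those URLs correctly, but I cannot scrape the phone numbers accurately.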
from bs4 import BeautifulSoup
import sys
import requests
import urllib.request
import pandas as pd
from urllib.request import urlopen, urlparse, Request, HTTPError
import re
import numpy as np
import csv
import json

def get_keyword(word):
    try:
        from googlesearch import search
    except ImportError:
        print("No module named 'google' found")
    # to search
    query = word
    url = []
    for j in search(query, tld="co.uk", num=10, stop=1, pause=2):
        url.append(j)
    return url, word

def scrape(req1, word):
    req2 = req1  # keep the plain URL string for the output table
    req1 = Request(req1, headers={'User-Agent': 'Mozilla/5.0 Chrome/24.0.1312.27 Safari/537.17 '})
    f = urlopen(req1)
    s = f.read().decode('UTF-8')
    reg = r"((\+\d{1,3}(-| )?\(?\d\)?(-| )?\d{1,3})|(\(?\d{2,3}\)?))(-| )?(\d{3,4})(-| )?(\d{4})(( x| ext)\d{1,5}){0,1}"
    phone = re.findall(reg, s)  # one tuple of capture groups per match
    emails = re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,3}", s)  # Email regex
    ph = []
    for i in phone:
        # join the non-empty capture groups of each match into one string
        g = list(filter(None, i))
        g = ''.join(g)
        ph.append(g)

    def Remove(duplicate):
        # drop duplicates while keeping the original order
        final_list = []
        for num in duplicate:
            if num not in final_list:
                final_list.append(num)
        return final_list

    k = Remove(ph)
    df = pd.DataFrame(k, columns=['phone'])
    df2 = pd.DataFrame(emails, columns=['email'])
    df3 = pd.DataFrame([req2], columns=['url'])
    new_df = df.join([df3, df2])
    return new_df

if __name__ == '__main__':
    df_new = pd.DataFrame(columns=['email', 'url', 'phone'])
    x, y = get_keyword("women entrepreneur")  # keyword
    print(x)
    for i in x:
        k = scrape(i, y)  # i = each link in x, the list of URLs
        df_new = pd.concat([df_new, k], ignore_index=True)
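
I want to get the exact phone numbers from each website, but I am actually getting many other numbers in the output. For example, ("phone": "1761768436145") is not a valid phone number. If no phone number is found, the output should show "phone number not found".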
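
To make the desired behavior concrete, here is a minimal sketch of the phone step. It is an illustration under stated assumptions, not a definitive fix. One likely contributor to the stray numbers is that `re.findall` returns a tuple of capture groups when the pattern contains groups, and because the groups in this pattern are nested, joining them repeats part of the number. The sketch reads the whole match with `match.group(0)` instead, skips matches that sit inside a longer run of digits, and falls back to a placeholder string. The names `extract_phones`, `NOT_FOUND`, `min_digits` and `max_digits` are hypothetical, and the 10 to 15 digit window is an assumption (15 is the E.164 maximum; the lower bound would drop short local numbers).

```python
import re

# Same pattern as in scrape(), written as a raw string.
PHONE_RE = re.compile(
    r"((\+\d{1,3}(-| )?\(?\d\)?(-| )?\d{1,3})|(\(?\d{2,3}\)?))"
    r"(-| )?(\d{3,4})(-| )?(\d{4})(( x| ext)\d{1,5}){0,1}"
)

# Hypothetical placeholder for pages where nothing usable is found.
NOT_FOUND = "phone number not found"


def extract_phones(text, min_digits=10, max_digits=15):
    """Return de-duplicated phone-like strings from text, in order of appearance.

    min_digits and max_digits are assumed bounds, not part of the original code.
    """
    phones = []
    for match in PHONE_RE.finditer(text):
        # group(0) is the full matched text, not the individual capture groups.
        candidate = match.group(0).strip()

        # Count every digit in the match, including any extension.
        digit_count = sum(ch.isdigit() for ch in candidate)
        if not (min_digits <= digit_count <= max_digits):
            continue

        # Skip matches that are only a slice of a longer digit run,
        # such as numeric IDs or timestamps.
        start, end = match.span()
        if start > 0 and text[start - 1].isdigit():
            continue
        if end < len(text) and text[end].isdigit():
            continue

        if candidate not in phones:
            phones.append(candidate)

    return phones or [NOT_FOUND]
```

Inside `scrape`, this could replace the `re.findall(reg, s)` call and the joining loop, for example `k = extract_phones(s)`. It might also help to run it on text extracted with Beautiful Soup (for example `soup.get_text()`) rather than on the raw HTML, since IDs in markup are a plausible source of long digit runs; I have not verified that on these pages.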