My task is to extract the title and meta_description from a list of URLs. I used Goose. Am I doing it right?
from goose import Goose
import urlparse
import numpy as np
import os
import pandas
# raw string so the backslashes in the Windows path are taken literally
os.chdir(r"C:\Users\EDAWES01\Desktop\Cookie profiling")
data = pandas.read_csv('activity_url.csv', delimiter=';')
data_read=np.array(data)
quantity = data_read[0:, 2]
url_data = data_read[quantity==1][0:3,1]
user_id = data_read[quantity==1][0:3,0]
url_data
# remove '~oref=': keep the part of the URL path that follows the first '='
clean_url_data = []  # initialize
for i in xrange(0, len(url_data)):
    clean_url_data.append(urlparse.urlparse(url_data[i])[2].split("=")[1])
# build a 1-D array; wrapping the list in [] would add an extra dimension
clean_url_data = np.array(clean_url_data)
# store title
website_title = []
# store meta_description
website_meta_description = []
g = Goose()
for urlt in xrange(0, len(clean_url_data)):
    # extract each page once and read both fields from the same article
    article = g.extract(url=clean_url_data[urlt])
    website_title.append(article.title)
    website_meta_description.append(article.meta_description)
website_title = np.array(website_title)
website_meta_description = np.array(website_meta_description)
Answer 0 (score: 0)
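You can open the URL and assign it to a handle. When you read from that handle and store the result in a variable, you get the page source, with its HTML tags and values. You can then extract the information you need from that page with a regular expression that matches your search criteria.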
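The variable page (the string you read from the URL) holds all of the page's HTML tags and structure. You can write any regular expression to get the details you need. For example, re.findall(r'https?://.*?/', page) will give you all the URLs. In the same way, you can get the other details you need from the page.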
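As a rough sketch of the approach this answer describes, the following reads a page into a string and pulls out the title and the meta description with regular expressions. It is written for Python 3 and uses only the standard library, whereas the question's code is Python 2. The names fetch_page and extract_title_and_description are hypothetical helpers introduced here for illustration, not part of Goose or of the original answer. Regular expressions are fragile on real-world HTML, so treat this as a starting point rather than a robust parser.

```python
import re
from urllib.request import urlopen

# Patterns are assumptions chosen for illustration; they handle simple markup only.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
META_TAG_RE = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)
ATTR_RE = re.compile(r"""([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def fetch_page(url, timeout=10):
    """Open the URL, read it, and return the page source as text."""
    with urlopen(url, timeout=timeout) as handle:
        charset = handle.headers.get_content_charset() or "utf-8"
        return handle.read().decode(charset, errors="replace")


def extract_title_and_description(page):
    """Return (title, meta description) found in an HTML string, or None for each if absent."""
    title_match = TITLE_RE.search(page)
    title = title_match.group(1).strip() if title_match else None

    description = None
    for tag in META_TAG_RE.findall(page):
        # Collect the quoted attributes of this <meta> tag into a dict.
        attrs = {}
        for name, double_quoted, single_quoted in ATTR_RE.findall(tag):
            attrs[name.lower()] = double_quoted or single_quoted
        if attrs.get("name", "").lower() == "description":
            description = attrs.get("content")
            break
    return title, description
```

To use it on a live page, call extract_title_and_description(fetch_page(url)) for each URL in your list. Both values come back as None when the page has no matching tag, so check for that before storing the results.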