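How do I store titles extracted from URLs using Python?

Asked: 2016-06-23 07:00:31

Tags: python url store meta-tags goose

My task is to extract the title and meta_description from a list of URLs. I used Goose. Am I doing this right?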

from goose import Goose
import urlparse
import numpy as np
import os
import pandas

os.chdir(r"C:\Users\EDAWES01\Desktop\Cookie profiling")
data = pandas.read_csv('activity_url.csv', delimiter=';')
data_read=np.array(data)
quantity = data_read[0:, 2]
url_data = data_read[quantity==1][0:3,1] 
user_id = data_read[quantity==1][0:3,0] 
url_data 

# Remove '~oref=': keep the segment of the URL path that follows the first '='
clean_url_data = []
for url in url_data:
    path = urlparse.urlparse(url)[2]
    clean_url_data.append(path.split("=")[1])

# Store titles and meta descriptions in plain lists while extracting
website_title = []
website_meta_description = []

g = Goose()

# Extract each page once and read both fields from the same article
for url in clean_url_data:
    article = g.extract(url=url)
    website_title.append(article.title)
    website_meta_description.append(article.meta_description)

# Convert to 1-D arrays (no extra list wrapping, which would add a dimension)
website_title = np.array(website_title)
website_meta_description = np.array(website_meta_description)

1 Answer:

Answer 0 (score: 0)

You can open the URL and assign the response to a handle. Reading from that handle into a variable gives you the page source, including the HTML tags and their values. You can then extract the information you need from that source with a regular expression that matches your search criteria.

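Suppose the page source is stored in a variable named page. It then holds all of the page's HTML tags and structure, and you can write any regular expression to get the details you need. For example, re.findall(r'https?://.*?/', page) returns the scheme-and-host prefix of every URL found in the page. In the same way, you can pull any other detail you need out of page.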
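
The steps described in this answer could be sketched roughly as follows. This is a minimal illustration in Python 3 using only the standard library (the question's code is Python 2), not the answerer's original code. The names fetch_page, extract_title_and_description, extract_url_prefixes, and both patterns are assumptions made for this example. The meta pattern also assumes that the name attribute comes directly before content, which real pages do not guarantee, so an HTML parser is usually more robust than regular expressions.

```python
import re
from urllib.request import urlopen

# Illustrative patterns (assumptions): regexes are brittle on real-world HTML.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
META_DESC_RE = re.compile(
    r"""<meta\s+name=["']description["']\s+content=(["'])(.*?)\1""",
    re.IGNORECASE | re.DOTALL,
)


def fetch_page(url, timeout=10):
    """Download a URL and return its source as text (assumes UTF-8)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_title_and_description(page):
    """Return (title, meta_description); either is None if not found."""
    title = TITLE_RE.search(page)
    description = META_DESC_RE.search(page)
    return (
        title.group(1).strip() if title else None,
        description.group(2).strip() if description else None,
    )


def extract_url_prefixes(page):
    """Return the scheme-and-host prefix of each URL in the page source."""
    return re.findall(r"https?://.*?/", page)


# Small inline sample so the sketch can be tried without network access.
sample_page = (
    "<html><head><title> Example Page </title>"
    '<meta name="description" content="A short summary.">'
    '</head><body><a href="http://example.com/a">link</a></body></html>'
)
title, description = extract_title_and_description(sample_page)
```

To process a list of URLs, one option is to call fetch_page on each URL and collect the returned (title, description) pairs in a dictionary keyed by URL, which keeps each result tied to the address it came from.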