Scraping with Python, driven from Excel VBA

Time: 2015-07-07 09:36:19

Tags: python vba web-scraping

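I asked about this before, but that question was posted under the vba tag and the like. So I'm trying again with, I hope, the right tags and title, now that I've picked up some knowledge.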

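Problem: I need to look up about 1,000 dates in a database with plant variety data, which is probably behind a login, so here is a screenshot of the PLUTO database. I could of course fill in this form ~1,000 times, but there must be a smarter way to do it. If it were a plain HTML website, I would know what to do and would have VBA simply pull in the results. I've spent the whole morning reading about these JavaScript pages and Ajax libraries, but it's above my level. So I hope someone can help me out a bit. I also used Firebug to see what happens when I press Search: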

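These parameters are the same as those in the last image posted, which is easier to read. They are left here as text so they can be copied.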

f.cc.facet.limit    
-1
f.cc.facet.mincount 
1
f.end_date.facet.date.end   
2030-01-01T00:00:00Z
f.end_date.facet.date.gap   
+5YEARS
f.end_date.facet.date.oth...    
all
f.end_date.facet.date.sta...    
1945-01-01T00:00:00Z
f.end_type.facet.limit  
20
f.end_type.facet.mincount   
1
f.grant_start_date.facet....    
NOW/YEAR
f.grant_start_date.facet....    
+5YEARS
f.grant_start_date.facet....    
all
f.grant_start_date.facet....    
1900-01-01T00:00:00Z
f.status.facet.limit    
20
f.status.facet.mincount 
1
f.type.facet.limit  
20
f.type.facet.mincount   
1
facet   
true
facet.date  
grant_start_date
facet.date  
end_date
facet.field 
cc
facet.field 
type
facet.field 
status
facet.field 
end_type
fl  
uc,cc,type,latin_name,common_name,common_name_en,common_name_others,app_num,app_date,grant_start_date,den_info,den_final,id
hl  
true
hl.fl   
cc,latin_name,den_info,den_final
hl.fragsize 
5000
hl.requireFieldMatch    
false
json.nl 
map
q   
cc:IT AND latin_name:(Zea Mays) AND den_info:Antilles
qi  
3-9BgbCWwYBd7aIWPU1/onjQ==
rows    
25
sort    
uc asc,score desc
start   
0
type    
upov
wt  
json

Source

fl=uc%2Ccc%2Ctype%2Clatin_name%2Ccommon_name%2Ccommon_name_en%2Ccommon_name_others%2Capp_num%2Capp_date
%2Cgrant_start_date%2Cden_info%2Cden_final%2Cid&hl=true&hl.fragsize=5000&hl.requireFieldMatch=false&json
.nl=map&wt=json&type=upov&sort=uc%20asc%2Cscore%20desc&rows=25&start=0&qi=3-9BgbCWwYBd7aIWPU1%2FonjQ
%3D%3D&hl.fl=cc%2Clatin_name%2Cden_info%2Cden_final&q=cc%3AIT%20AND%20latin_name%3A(Zea%20Mays)%20AND
%20den_info%3AAntilles&facet=true&f.cc.facet.limit=-1&f.cc.facet.mincount=1&f.type.facet.limit=20&f.type
.facet.mincount=1&f.status.facet.limit=20&f.status.facet.mincount=1&f.end_type.facet.limit=20&f.end_type
.facet.mincount=1&f.grant_start_date.facet.date.start=1900-01-01T00%3A00%3A00Z&f.grant_start_date.facet
.date.end=NOW%2FYEAR&f.grant_start_date.facet.date.gap=%2B5YEARS&f.grant_start_date.facet.date.other
=all&f.end_date.facet.date.start=1945-01-01T00%3A00%3A00Z&f.end_date.facet.date.end=2030-01-01T00%3A00
%3A00Z&f.end_date.facet.date.gap=%2B5YEARS&f.end_date.facet.date.other=all&facet.field=cc&facet.field
=type&facet.field=status&facet.field=end_type&facet.date=grant_start_date&facet.date=end_date
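To see which parameters actually vary between searches, the captured query string can be decoded back into names and values. The following is a minimal sketch using Python's standard `urllib.parse.parse_qs`; the `captured` string is only an excerpt of the request above, joined onto one line for illustration.

```python
from urllib.parse import parse_qs

# Excerpt of the captured query string above, joined onto one line.
captured = (
    "wt=json&type=upov&sort=uc%20asc%2Cscore%20desc&rows=25&start=0"
    "&q=cc%3AIT%20AND%20latin_name%3A(Zea%20Mays)%20AND%20den_info%3AAntilles"
    "&facet=true&facet.field=cc&facet.field=type"
)

# parse_qs returns a dict mapping each name to a list of decoded values,
# so repeated names such as facet.field keep all of their values.
params = parse_qs(captured)

for name, values in sorted(params.items()):
    print(name, "=", values)
```

Comparing the decoded output of two or three manual searches should show that, most likely, only `q` (and possibly `qi`) changes from one search to the next.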

This is what it looks like in the HTML, at least according to Firebug:

{"response":{"start":0,"docs":[{"id":"6751513","grant_start_date":"1999-02-04T22:59:59Z","den_final":"Antilles","app_num":"005642_A 005642","latin_name":"Zea mays L.","common_name_others":["MAIS"],"uc":"ZEAAA_MAY","type":"NLI","app_date":"1997-01-10T22:59:59Z","cc":"IT"}],"numFound":1},"qi":"3-9BgbCWwYBd7aIWPU1/onjQ==","facet_counts":{"facet_queries":{},"facet_ranges":{},"facet_dates":{"end_date":{"after":0,"start":"1945-01-01T00:00:00Z","before":0,"2010-01-01T00:00:00Z":1,"between":1,"end":"2030-01-01T00:00:00Z","gap":"+5YEARS"},"grant_start_date":{"after":0,"1995-01-01T00:00:00Z":1,"start":"1900-01-01T00:00:00Z","before":0,"between":1,"end":"2015-01-01T00:00:00Z","gap":"+5YEARS"}},"facet_intervals":{},"facet_fields":{"status":{"approved":1},"end_type":{"ter":1},"type":{"nli":1},"cc":{"it":1}}},"sv":"bswa1.wipo.int","lastUpdated":1435987857572,"highlighting":{"6751513":{"den_final":["Antilles<\/em>"],"latin_name":["Zea<\/em> mays<\/em> L."],"cc":["IT<\/em>"]}}}
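The response above is JSON, so the dates can be read straight out of it without any HTML parsing. Here is a minimal sketch with Python's standard `json` module; `sample` is a trimmed copy of the response above, and the helper name `extract_dates` is hypothetical.

```python
import json

# Trimmed copy of the response shown above (only the "response" part is kept).
sample = (
    '{"response":{"start":0,"docs":[{"id":"6751513",'
    '"grant_start_date":"1999-02-04T22:59:59Z","den_final":"Antilles",'
    '"latin_name":"Zea mays L.","uc":"ZEAAA_MAY","type":"NLI",'
    '"app_date":"1997-01-10T22:59:59Z","cc":"IT"}],"numFound":1}}'
)

def extract_dates(payload):
    """Return one dict per matching variety with its name and date fields."""
    data = json.loads(payload)
    rows = []
    for doc in data["response"]["docs"]:
        rows.append({
            "den_final": doc.get("den_final"),
            "app_date": doc.get("app_date"),
            # .get() returns None when a record has no grant date yet.
            "grant_start_date": doc.get("grant_start_date"),
        })
    return rows

rows = extract_dates(sample)
print(rows)
```

Using `.get()` rather than indexing is a deliberate choice here, on the assumption that not every record carries every field.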

Edit: According to a Firebug screenshot, the request uses the GET method and XMLHttpRequest.

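I have already found out how to run Python from Excel VBA in this topic, and I have also downloaded Beautiful Soup, but Python is not my language, so any help would be much appreciated.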

Image referred to in a comment on Will's answer.

2 answers:

Answer 0 (score: 1)

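1) Use Excel to store the search parameters.

2) Run a few manual searches to work out which parameters need to change for each request.

3) Make an HTTP GET request to the URL you found in Firebug / Fiddler (the URL that is called when you click "Search" manually). See Urllib3: https://urllib3.readthedocs.org/en/latest/

4) Look at Json pickle to help you handle the JSON response and save (serialize) it to a file.

5) Reading and writing the data involves the IO libraries. Google is your friend. (It may be easier to save your Excel file as CSV and then just read the search parameters from the CSV file.)

6) Download PyCharm for your Python development; it really is great.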
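The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the CSV column names (`cc`, `latin_name`, `den_info`) are hypothetical, the query format is copied from the `q` value in the question, and the standard library's `urllib.request` is swapped in for Urllib3 to keep the example dependency-free. `fetch_json` is defined but not called, since the real search URL and any login requirements are not confirmed here.

```python
import csv
import io
import json
from urllib.request import urlopen

def load_searches(csv_file):
    """Read one search per row; the column names are hypothetical."""
    return list(csv.DictReader(csv_file))

def build_query(row):
    """Build the q value in the form seen in the captured request."""
    return "cc:{cc} AND latin_name:({latin_name}) AND den_info:{den_info}".format(**row)

def fetch_json(url):
    """Send a GET request and decode the JSON body (not called in this sketch)."""
    with urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Stand-in for an Excel sheet saved as CSV.
sheet = io.StringIO("cc,latin_name,den_info\nIT,Zea Mays,Antilles\n")
queries = [build_query(row) for row in load_searches(sheet)]
print(queries)
```

In real use, `sheet` would be replaced by `open("searches.csv", newline="")` (a hypothetical file name), and each query would be URL-encoded into the request before calling `fetch_json`.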

Hope this helps.

Answer 1 (score: 0)

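I finally figured it out. I don't need Python; I can just use a URL and import its contents into Excel. With Fiddler I found that the URL should be https://www3.wipo.int/pluto/user/jsp/select.jsp? followed by the query string from the question.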
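For readers who do want to assemble that URL programmatically, here is a minimal Python sketch. The base URL is the one given above; the reduced parameter set is an assumption, and it is not confirmed that the server accepts a request without the facet parameters or the `qi` value seen in the captured request, which may be session-specific.

```python
from urllib.parse import urlencode, quote

BASE_URL = "https://www3.wipo.int/pluto/user/jsp/select.jsp"

def build_url(q, rows=25, start=0):
    """Join the base URL with an encoded query string (reduced parameter set)."""
    params = [
        ("wt", "json"),
        ("type", "upov"),
        ("rows", rows),
        ("start", start),
        ("q", q),
    ]
    # quote_via=quote encodes spaces as %20, as in the captured request.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)

url = build_url("cc:IT AND latin_name:(Zea Mays) AND den_info:Antilles")
print(url)
```

Changing only the `q` argument for each of the ~1,000 searches yields one URL per lookup, which can then be opened from VBA or any HTTP client.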

The rest of my solution can be found in another question I had. It doesn't use Python, only VBA, which instructs IE to open the site and copy its contents.