Question

I want to extract all the full-size images from the "Google" page on Wikipedia.

I have tried:

http://en.wikipedia.org/w/api.php?action=query&titles=Google&generator=images&gimlimit=10&prop=imageinfo&iiprop=url|dimensions|mime&format=json

However, this way I also get images that are not related to Google, for example:

http://upload.wikimedia.org/wikipedia/en/a/a4/Flag_of_the_United_States.svg
http://upload.wikimedia.org/wikipedia/en/4/4a/Commons-logo.svg
http://upload.wikimedia.org/wikipedia/commons/f/fe/Crystal_Clear_app_browser.png

How can I extract only the images that I actually see on the Google page?

Answer 1

1. Retrieve the page source: https://en.wikipedia.org/w/index.php?title=Google&action=raw
2. Scan it for strings like [[File:Google web search.png|thumb|left|On February 14, 2012, Google updated its homepage with a minor twist. There are no red lines above the options in the black bar, and there is a tab space before the "+You". The sign-in button has also changed, it is no longer in the black bar, instead under it as a button.]]
3. Retrieve the imageinfo for all images used on the page: http://en.wikipedia.org/w/api.php?action=query&titles=Google&generator=images&gimlimit=10&prop=imageinfo&iiprop=url|dimensions|mime&format=json
4. Filter the returned URLs, keeping only those whose image names match the names found in step 2.
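
Steps 1 and 3 could be sketched in Python roughly as follows. This is only an illustration, not part of the original recipe: the helper names (raw_url, imageinfo_url, fetch, load_imageinfo) and the User-Agent string are made up for the example, and https is used instead of http.

```python
import json
import urllib.parse
import urllib.request

INDEX = "https://en.wikipedia.org/w/index.php"
API = "https://en.wikipedia.org/w/api.php"


def raw_url(title):
    """Build the URL of the page's raw wikitext (step 1)."""
    return INDEX + "?" + urllib.parse.urlencode({"title": title, "action": "raw"})


def imageinfo_url(title, limit=10):
    """Build the imageinfo query URL for the images used on the page (step 3)."""
    params = {
        "action": "query",
        "titles": title,
        "generator": "images",
        "gimlimit": limit,
        "prop": "imageinfo",
        "iiprop": "url|dimensions|mime",
        "format": "json",
    }
    return API + "?" + urllib.parse.urlencode(params)


def fetch(url):
    """Download a URL and return its body as text."""
    # The User-Agent value is a placeholder for this sketch; use your own.
    request = urllib.request.Request(url, headers={"User-Agent": "image-filter-sketch/0.1"})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def load_imageinfo(title):
    """Fetch and decode the JSON answer of the imageinfo query."""
    return json.loads(fetch(imageinfo_url(title)))
```

With these helpers, fetch(raw_url("Google")) would return the wikitext to scan in step 2, and load_imageinfo("Google") the data to filter in step 4. Note that gimlimit=10 returns only the first ten images; a larger limit or API continuation would be needed for pages with more.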

Steps 2 and 4 need more explanation.

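@2. The regexp /\b(File|Image):[^]|\n\r]+/ should be enough. In Ruby's regexps, \b denotes a word boundary, which may not be supported in the language of your choice. The regexp I proposed will match all the cases that come to my mind: [[File:something.jpg]], gallery tags: <gallery>\nFile:one.jpg\nFile:two.jpg\n</gallery>, and templates: {{Infobox|pic = File:something.jpg}}. However, it won't match filenames containing ]. I'm not sure whether they are legal, but if they are, they must be very rare, so it shouldn't be a big deal.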
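
For illustration, here is a minimal Python port of that pattern (Python's re module also understands \b). The function name find_file_names is hypothetical. One thing the sketch adds on its own: in the template case the match runs on into the closing }}, so trailing braces are stripped here.

```python
import re

# Python port of the pattern above; "]" placed right after "^" is a
# literal "]" inside the character class.
FILE_RE = re.compile(r"\b(?:File|Image):[^]|\n\r]+")


def find_file_names(wikitext):
    """Return the file names referenced in the wikitext, without prefix."""
    names = []
    for match in FILE_RE.finditer(wikitext):
        # Drop the "File:" / "Image:" prefix.
        name = match.group(0).split(":", 1)[1]
        # Addition for this sketch: remove whitespace and the "}}" that
        # gets captured when the name is the last template argument.
        names.append(name.strip().rstrip("}"))
    return names
```

Running it on a text that contains [[File:something.jpg]], a <gallery> block and an {{Infobox}} argument returns the bare file names from all three.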

If you want to match only constructs like this: [[File:something.jpg|thumb|description]], then the following regexp would be better: /\[\[(File|Image):[^]|]+/
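
A small Python sketch of the stricter variant, again with a made-up function name (find_linked_file_names). The only change to the pattern is a capturing group around the file name, so that findall returns the names directly.

```python
import re

# Stricter pattern: only direct [[File:...]] / [[Image:...]] links.
# The group captures the file name up to the first "]" or "|".
LINK_RE = re.compile(r"\[\[(?:File|Image):([^]|]+)")


def find_linked_file_names(wikitext):
    """Return file names that are attached directly with [[File:...]]."""
    return [name.strip() for name in LINK_RE.findall(wikitext)]
```

Unlike the looser pattern, this one ignores files passed as template arguments, which is where icons usually come from.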

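@4. Before comparing the names, I would remove from them all characters matching /[^A-Za-z0-9]/. It's easier than escaping them and, in most cases, it's enough.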
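
As a rough sketch of step 4 in Python: the function names normalize and filter_image_urls are hypothetical, and the code assumes the default JSON layout of the API answer, where query.pages is an object keyed by page id and each page has a title and an imageinfo list.

```python
import re


def normalize(name):
    """Drop the File:/Image: prefix and every non-alphanumeric character."""
    name = re.sub(r"^(?:File|Image):", "", name)
    return re.sub(r"[^A-Za-z0-9]", "", name)


def filter_image_urls(api_response, wanted_names):
    """Keep only the URLs of images whose names were found in step 2."""
    wanted = {normalize(name) for name in wanted_names}
    urls = []
    pages = api_response.get("query", {}).get("pages", {})
    for page in pages.values():
        if normalize(page.get("title", "")) not in wanted:
            continue
        for info in page.get("imageinfo", []):
            if "url" in info:
                urls.append(info["url"])
    return urls
```

Stripping the characters means that "Google_web_search.png" from the wikitext and "File:Google web search.png" from the API compare equal. The price is that two different names can collapse to the same key, which is the "in most cases, enough" trade-off.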

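Icons are usually attached through templates, as opposed to pictures related to the subject of the article, which are usually attached directly ([[File:…]]). There are exceptions, though: in some articles, for example, pictures are attached with the {{Gallery}} template. There is also the <gallery> tag, which introduces a special syntax for galleries. You will have to adjust my solution to your needs, and even then it won't be perfect, but it should be good enough.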
