Question

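I'm thinking about starting a project so that I can learn more and keep what I've learned so far from getting rusty.

A lot of the project will be new to me, so I thought I would come here and ask for advice on what to do and how to go about it.

I enjoy Photoshop and playing around with it, so I thought I would mix it into my project. I decided that my program will fetch new resources for Photoshop and place them in their own folders on my computer. (From deviantART for now.)

For now I want to focus on a page like this: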

http://browse.deviantart.com/resources/applications/psbrushes/?order=9

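I'm not fluent in reading HTML source, so it's hard for me to see exactly what is going on in the page.

But let's say I'm on that page and I have the following options selected: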

Sorted by Popular
Sorted by All Time 
Sorted by 24 Items Per Page

My goal is to go to each thumbnail individually and grab the following:

The Author
The Title
The Description
Download the File (create folder based on title name)
Download the Image (place in folder with the file above)
Create text file with the author, title, and description in it

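I would like to do this for each of the 24 items on the page, then move on to the next page and do the same. (I'm thinking of going through the first five pages, since I have little interest in trying out the less popular brushes.)

So, I'm posting this to get a sense of direction, and perhaps some pointers on how to parse a page like this to get what I'm looking for. I'm sure this project will keep me busy for a while, but I hope it will teach me something.

Any help and advice is always appreciated.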


EDIT

Each page consists of 24 of the following:

<div class="tt-a" usericon="http://a.deviantart.net/avatars/s/h/shad0w-gfx.gif" collect_rid="1:19982524">
 <span class="shad0w" style="background-image: url("http://sh.deviantart.net/shad0w/x/107/150/logo3.png");">
  <a class="t" title="Shad0ws Blood Brush Set by ~Shad0w-GFX, Jun 28, 2005" href="http://Shad0w-GFX.deviantart.com/art/Shad0ws-Blood-Brush-Set-19982524?q=boost%3Apopular+in%3Aresources%2Fapplications%2Fpsbrushes&qo=0">Shad0ws Blood Brush Set</a>

My assumption is that I want to grab all of my information from:

<a class="t" ... >

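because it contains the title, the author, and the link to the page where the download URL and the large image are located.

If this sounds wrong, then how would I grab the information for each object on the page (24 per page)? I would assume by using CyberNeko. I'm just not sure how to get to where that element is located, and to each one on the page.

EDIT #2

I have some test code that looks like this: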

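import com.gargoylesoftware.htmlunit.BrowserVersion
import com.gargoylesoftware.htmlunit.WebClient

// JavaScript is switched off; only the static markup is needed
client = new WebClient(BrowserVersion.FIREFOX_3)
client.javaScriptEnabled = false

page = client.getPage("http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0")

// select every <a class="t"> thumbnail link on the page
divs = page.getByXPath("//html/body/div[2]/div/div/table/tbody/tr/td[2]/div/div[5]/div/div[2]/span/a[@class='t']")

// prints each matched anchor node as a whole element, not just its href
divs.each { println it }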

The XPath is correct, but it prints out:

<?xml version="1.0" encoding="UTF-8"?><a href="http://Shad0w-GFX.deviantart.com/art/Shad0ws-Blood-Brush-Set-19982524?q=boost%3Apopular+in%3Aresources%2Fapplications%2Fpsbrushes&amp;qo=0" class="t" title="Shad0ws Blood Brush Set by ~Shad0w-GFX, Jun 28, 2005">Shad0ws Blood Brush Set

Can you explain what I need to do to get just the href out of there? Is there a simple way to do it with HtmlUnit?

Answer 1

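Meeting the requirements you've listed above is actually quite easy. You can do it with a simple Groovy script of about 50 lines. Here's how I would approach it:

The URL of the first page is http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0

To get the next page, just increase the value of the offset parameter by 24: http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=24

Now you know how to build the URLs of the pages you need to work with. To download the content of a page, use: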
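
The offset arithmetic above can be sketched as follows, assuming 24 items per page and the first five pages mentioned in the question. The helper name pageUrl is made up for illustration and is not part of any library.

```groovy
// Hypothetical helper: builds the listing URL for a zero-based page index.
// Assumes the offset parameter counts items, 24 per page, as described above.
String pageUrl(int pageIndex, int itemsPerPage = 24) {
    def base = 'http://browse.deviantart.com/resources/applications/psbrushes/'
    return "${base}?order=9&offset=${pageIndex * itemsPerPage}".toString()
}

// URLs for the first five pages (offsets 0, 24, 48, 72, 96)
def urls = (0..<5).collect { pageUrl(it) }
urls.each { println it }
```

Each of these URLs can then be fetched in turn.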

def pageUrl = 'http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0'

// get the content as a byte array
byte[] pageContent = new URL(pageUrl).bytes

// or get the content as a String
String pageContentAsString = new URL(pageUrl).text
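
The bytes property should work the same way for binary files, so the brush file and the preview image could be fetched like this and written into a folder named after the title. The following is only a sketch: folderNameFor and saveTo are hypothetical helpers, the file name brushes.abr is invented, and a local temporary file stands in for the remote download so the example runs without network access.

```groovy
import java.nio.file.Files

// Hypothetical helper: turns a title into a folder name that is safe on most file systems.
String folderNameFor(String title) {
    title.replaceAll(/[^A-Za-z0-9 _-]/, '').trim().replaceAll(/\s+/, '_')
}

// Hypothetical helper: downloads fileUrl into baseDir/<folder for title>/<fileName>.
File saveTo(File baseDir, String title, String fileName, URL fileUrl) {
    def folder = new File(baseDir, folderNameFor(title))
    folder.mkdirs()
    def target = new File(folder, fileName)
    target.bytes = fileUrl.bytes   // same GDK property as used for the page content
    return target
}

// Stand-in for a remote file (assumption: a real run would pass the download URL instead)
def source = File.createTempFile('brushes', '.abr')
source.bytes = [1, 2, 3] as byte[]

def baseDir = Files.createTempDirectory('resources').toFile()
def saved = saveTo(baseDir, 'Shad0ws Blood Brush Set', 'brushes.abr', source.toURI().toURL())
println saved
```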

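Now all you need to do is parse out the content elements you're interested in and save them to files. For the parsing, you should use an HTML parser such as CyberNeko or Jericho.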
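
Once a parser has handed you the anchor element, what remains is reading its attributes. As a rough illustration, the sketch below feeds the anchor printed in the question (with a closing tag added) to the JDK's XML parser, which stands in here for CyberNeko or HtmlUnit. With HtmlUnit, the HtmlAnchor methods getHrefAttribute() and getAttribute('title') should return the same values. The "title by author, date" layout of the title attribute is an assumption based on that single example, and the info.txt file name is invented.

```groovy
import java.nio.file.Files
import javax.xml.parsers.DocumentBuilderFactory

// The anchor from the question, serialized as well-formed XML
def xml = '<a href="http://Shad0w-GFX.deviantart.com/art/Shad0ws-Blood-Brush-Set-19982524?q=boost%3Apopular+in%3Aresources%2Fapplications%2Fpsbrushes&amp;qo=0" class="t" title="Shad0ws Blood Brush Set by ~Shad0w-GFX, Jun 28, 2005">Shad0ws Blood Brush Set</a>'

def builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
def anchor = builder.parse(new ByteArrayInputStream(xml.getBytes('UTF-8'))).documentElement

// Attribute values come back unescaped, so &amp; becomes &
def href = anchor.getAttribute('href')
def titleAttr = anchor.getAttribute('title')

// Assumed layout: "<title> by <symbol><author>, <date>"
def m = titleAttr =~ /(.+) by \W?([^,]+), (.+)/
assert m.matches() : "unexpected title format: ${titleAttr}"
def (title, author, date) = [m.group(1), m.group(2), m.group(3)]

// Write the details to a text file; the description is not in the anchor
// and would have to be read from the page that href points to.
def info = new File(Files.createTempDirectory('brush').toFile(), 'info.txt')
info.text = ["Author: $author", "Title: $title", "Date: $date", "Link: $href"].join('\n')
println info.text
```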

Groovy project (HTML parsing, file download, file creation)
