Using PHP to scrape and use data from a website that requires login (Reddit)?

Asked: 2010-02-13 05:52:51

Tags: php screen-scraping reddit

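I want to create a web page that, given two reddit usernames and their passwords, subscribes user2 to all the subreddits that user1 is subscribed to. So I need to:

  1. Get the subreddits user1 is subscribed to.
  2. Subscribe user2 to those subreddits.

I have experience with PHP, but none with scraping (especially when the user has to be logged in), nor with submitting the kind of information needed to "subscribe" a user to a subreddit. Does anyone have any ideas on how to do this?

Regards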

2 answers:

Answer 0 (score: 1)

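Assuming this isn't against reddit's terms of service, use cURL to log in; after that it should be fairly easy to pull out the necessary information with a regex. From there, check how reddit handles subscribing, and either navigate to the proper URL or post the form data.

I'd call this an intermediate-level task, as long as it doesn't violate reddit's terms of service.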
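The cURL-plus-regex approach described above can be sketched roughly as follows. This is a minimal illustration, not a tested reddit client: the helper names `fetchWithSession` and `extractSubredditNames` are made up for this example, the URLs and form field names in the commented usage are placeholders rather than reddit's real endpoints, and the regex assumes subreddit links appear in the page as `/r/<name>`.

```php
<?php
// Sketch only: helper names, URLs and form fields are assumptions for illustration.

/**
 * Send a request with cURL, storing and re-sending session cookies via $cookieJar.
 * When $postFields is given, the request is sent as a form POST.
 */
function fetchWithSession(string $url, string $cookieJar, ?array $postFields = null): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,       // follow redirects after login
        CURLOPT_COOKIEJAR      => $cookieJar, // write cookies here when the handle is freed
        CURLOPT_COOKIEFILE     => $cookieJar, // read cookies from here on later requests
        CURLOPT_USERAGENT      => 'subscription-copier-sketch/0.1',
    ]);
    if ($postFields !== null) {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postFields));
    }
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException("Request to $url failed: " . curl_error($ch));
    }
    return $body;
}

/**
 * Collect unique subreddit names from links whose href ends in /r/<name> or /r/<name>/.
 */
function extractSubredditNames(string $html): array
{
    preg_match_all('~href="[^"]*/r/([A-Za-z0-9_]+)/?"~', $html, $matches);
    return array_values(array_unique($matches[1]));
}

// Hypothetical usage; the URLs and field names are placeholders, not reddit's real endpoints:
// $jar = tempnam(sys_get_temp_dir(), 'cookies');
// fetchWithSession('https://example.com/login', $jar, ['user' => 'user1', 'passwd' => 'secret']);
// $names = extractSubredditNames(fetchWithSession('https://example.com/my-subreddits', $jar));
```

Using a separate cookie jar file per user keeps the two logged-in sessions apart, so the same helper could then post the subscribe form once per name as user2. A regex is brittle against markup changes, so an HTML parser or a JSON endpoint, if the site offers one, would likely be more robust.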

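Answer 1 (score: 0)

The open-source product TestPlan is very good at this kind of thing. Using a simple language, you can log in to the site as one user, get the names of the subreddits, and then log in as the other user to subscribe to those groups.

For example, if you only wanted the titles of the top entries, you could use this code: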

GotoURL http://www.reddit.com/top/

set %Topics% as response //p[@class='title']
foreach %Topic% in %Topics%
    set %Title% as selectIn %Topic% string(.)
    Notice %Title%
end

This produces output like the following:

00000000-00 GOTOURL http://www.reddit.com/top/
00000001-00 NOTICE LEGAL DVD vs. PIRATED COPY (i.imgur.com)
00000002-00 NOTICE Don't just shorten your URL, make it suspicious and frightening. - ShadyURL (shadyurl.com)
00000003-00 NOTICE HOLY CRAP! IS THAT A ROOM FOR RENT ON MY CRAIGSLIST??!?!? (houston.craigslist.org)
00000004-00 NOTICE Years from now when our children ask us, "What did we do after 9/11?" we shall explain it to them using this... (4gifs.com)
00000005-00 NOTICE TSA forces disabled boy to remove leg braces and walk through the metal detector. "I told him, 'This is overkill. He's 4 years old. I don't think he's a terrorist.' " (philly.com)
00000006-00 NOTICE This picture scares the shit out of me. (imgur.com)
00000007-00 NOTICE Civilization V Announced, in Development at Firaxis Games (hellforge.gameriot.com)
00000008-00 NOTICE I don't know, the price seems a little steep... [pic] (i.imgur.com)
00000009-00 NOTICE Reddit, last week we saw the depth of the ocean scaled relative to human size. I made a figure of the depth of the ocean accurately scaled to the width. It's really very shallow from this perspective. (i.imgur.com)
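
Since the question is about PHP, the same title extraction can also be sketched with PHP's built-in `DOMDocument` and `DOMXPath`, reusing the XPath expression `//p[@class='title']` from the TestPlan script. The function name `extractTitles` and the sample markup are made up for illustration; a real page would be fetched first (for example with cURL) and may be structured differently.

```php
<?php
/**
 * Return the trimmed text of every <p class="title"> element found in $html.
 */
function extractTitles(string $html): array
{
    $doc = new DOMDocument();
    // Real pages are rarely valid markup, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $titles = [];
    foreach ($xpath->query("//p[@class='title']") as $node) {
        $titles[] = trim($node->textContent);
    }
    return $titles;
}

// Inline sample markup (invented for this example) so the sketch runs without network access.
$sample = '<html><body>'
    . '<p class="title"><a href="#">First entry</a> (example.com)</p>'
    . '<p class="other">ignored</p>'
    . '<p class="title"><a href="#">Second entry</a> (example.org)</p>'
    . '</body></html>';

foreach (extractTitles($sample) as $title) {
    echo $title, PHP_EOL;
}
```

Note that `@class='title'` only matches elements whose class attribute is exactly `title`; elements carrying several classes would need a `contains(...)` test instead.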