Question

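I am currently writing a Python script that lets the user enter a torrent's hash (via the terminal) and checks a website for more trackers. I am new to Python programming, so I am at a loss and would appreciate some advice. My problem is that the page I fetch into html_page contains another link that I need to follow. My program assigns html_page "http://torrentz.eu/********", but now I am trying to get it to follow the other link on that page to reach http://torrentz.eu/announcelist_********. I have found that the link can probably be retrieved from this element (as it appears when viewing the page source):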

    <a href="/announcelist_********" rel="e">&#181;Torrent compatible list here</a>

or possibly from here, since the value is the same one that appears in /announcelist_********:

    <a name="post-comment"></a>
    <input type="hidden" name="torrent" value="******" />

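Since /announcelist_******** is served as plain text, I would also like to know how to save the resulting tracker list to a .txt file. This is my progress on the Python script so far: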

    from BeautifulSoup import BeautifulSoup
    import urllib2
    import re
    var = raw_input("Enter hash:")
    html_page = urllib2.urlopen("http://torrentz.eu/" +var)
    soup = BeautifulSoup(html_page)
    for link in soup.findAll('a'):
            print link.get('href')

Thank you all in advance for your support, knowledge, advice, and skills.

Edit: I have changed the code to the following:

    from BeautifulSoup import BeautifulSoup
    import urllib2
    import re
    hsh = raw_input("Enter Hash:")
    html_data = urllib2.urlopen("http://torrentz.eu/" + hsh).read()
    soup = BeautifulSoup(html_data)
    announce = soup.find('a', attrs={'href': re.compile("^/announcelist")})
    print announce

The result is:

    <a href="/announcelist_00000" rel="e">&#181;Torrent compatible list here</a>

So now I just need a way to get only the /announcelist_00000 part of that output.

Answer 1

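Once you have opened the URL, you can find the href as you pointed out. Next, open that href with urlopen. When you reach the file you want to copy, open it and write its contents to a local file like this: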

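    import urllib2

    # file_url and path_to_local_file are placeholders for the remote URL
    # and the local output path.
    remote_file = urllib2.urlopen(file_url)
    local_file = open(path_to_local_file, 'w')

    local_file.write(remote_file.read())
    local_file.close()
    remote_file.close()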

Here is how you could put it together:

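    # insert code that you've already written
    for link in soup.findAll('a', attrs={'href': re.compile("^/announcelist")}):
        href = link.get('href')
        print href
        # The href is relative, so prepend the site root before opening it.
        remote_file = urllib2.urlopen("http://torrentz.eu" + href)
        # path_to_local_file is a placeholder for the output .txt path.
        local_file = open(path_to_local_file, 'w')
        local_file.write(remote_file.read())
        local_file.close()
        remote_file.close()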

I have not tested this code, but I think it should work.

Hope this helps.
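
For readers on Python 3, the same download-and-save idea might look roughly like the sketch below. It is only an illustration under some assumptions: `save_tracker_list`, `fake_fetch`, the `fetch` parameter, and the example tracker URLs are hypothetical names and data invented here, and the site root is the one given in the question. In Python 3, `urllib2.urlopen` lives at `urllib.request.urlopen`. The demonstration at the bottom uses a stand-in fetcher so it runs without network access.

```python
import os
import tempfile
from urllib.parse import urljoin
from urllib.request import urlopen

BASE_URL = "http://torrentz.eu/"  # site root taken from the question


def save_tracker_list(href, out_path, fetch=None):
    """Download the page behind a relative href and save it as text.

    `fetch` is a hypothetical hook: any callable that maps a URL to a string.
    """
    if fetch is None:
        # Default: a real HTTP GET (needs network access).
        def fetch(url):
            with urlopen(url) as response:
                return response.read().decode("utf-8", errors="replace")

    # Turn the relative href ("/announcelist_00000") into an absolute URL.
    url = urljoin(BASE_URL, href)
    text = fetch(url)
    with open(out_path, "w", encoding="utf-8") as out_file:
        out_file.write(text)
    return url


# Offline demonstration with a stand-in fetcher and made-up tracker URLs.
def fake_fetch(url):
    return ("udp://tracker.example.org:80/announce\n\n"
            "http://tracker.example.com/announce\n")


demo_path = os.path.join(tempfile.mkdtemp(), "trackers.txt")
demo_url = save_tracker_list("/announcelist_00000", demo_path, fetch=fake_fetch)
print(demo_url)
```

Calling `save_tracker_list(href, "trackers.txt")` without the `fetch` argument would perform the real request. Using `with` blocks closes both the response and the output file even if an error occurs, which replaces the explicit `close()` calls above.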

Answer 2

If what you are looking for is the value of the href attribute, add this line and see what you get:

    print announce['href']
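
To show what that attribute lookup returns without needing BeautifulSoup installed, here is a small Python 3 sketch that does a comparable extraction with the standard library's `html.parser`. This is a swapped-in technique, not the answer's method, and `AnnounceLinkFinder` is a hypothetical helper name. The sample markup is copied from the question.

```python
from html.parser import HTMLParser


class AnnounceLinkFinder(HTMLParser):
    """Collect href values of <a> tags that start with /announcelist."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; anchors without an
        # href (such as <a name="post-comment">) are skipped.
        href = dict(attrs).get("href")
        if href and href.startswith("/announcelist"):
            self.hrefs.append(href)


# Sample markup copied from the question.
sample_html = (
    '<a name="post-comment"></a>'
    '<a href="/announcelist_00000" rel="e">&#181;Torrent compatible list here</a>'
)

finder = AnnounceLinkFinder()
finder.feed(sample_html)
print(finder.hrefs[0])  # prints /announcelist_00000
```

With BeautifulSoup, `announce['href']` gives the same string directly, so the parser class is only needed if you want to avoid the extra dependency.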
