find() in Python only captures the first tag

Time: 2016-06-12 15:07:20

Tags: python find text-manipulation

I am trying to get the magnet link out of the following code

rawdata = ''' <div class="iaconbox center floatright">
            <a rel="12624681,0" class="icommentjs kaButton smallButton rightButton" href="https://kat.cr/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681.html#comment">209 <i class="ka ka-comment"></i></a>               <a class="icon16" href="https://kat.cr/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681.html" title="Verified Torrent"><i class="ka ka16 ka-verify ka-green"></i></a>                                <div data-sc-replace="" data-sc-slot="_ae58c272c09a10c792c6b17d55c20208" class="none" data-sc-params="{ &#39;name&#39;: &#39;Zootopia%202016%201080p%20HDRip%20x264%20AC3-JYK&#39;, &#39;extension&#39;: &#39;mkv&#39;, &#39;magnet&#39;: &#39;magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&amp;dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&amp;tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&amp;tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&amp;tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&amp;tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&#39; }"></div>
            <a data-nop="" title="Torrent magnet link" href="magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&amp;dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&amp;tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&amp;tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&amp;tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&amp;tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce" class="icon16 askFeedbackjs" data-id="CE8357DED670F06329F6028D2F2CEA6F514646E0"><i class="ka ka16 ka-magnet"></i></a>
            <a data-download="" title="Download torrent file" href="https://kat.cr/torrents/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681/" class="icon16 askFeedbackjs"><i class="ka ka16 ka-arrow-down"></i></a>
        </div> '''

Using this command

rawdata[rawdata.find("<")+1:rawdata.find(">")]

gives me

div class="iaconbox center floatright"

But when I try to find the magnet link

rawdata[rawdata.find("href="magnet:?")+1:rawdata.find(""")]

it gives me

''

What I actually want it to give me is

magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&amp;dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&amp;tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&amp;tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&amp;tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&amp;tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce

This is very easy with the shell, but it has to be done in Python itself.

4 answers:

Answer 0 (score: 1)

Try rawdata[rawdata.find('href="magnet:?')+1:rawdata.find('"')]

Answer 1 (score: 1)

It is better to use a regular expression.

import re

rawdata = '''your rawdata......'''
regex = re.compile('href="(.+)" class="icon16')
magnet_href = regex.search(rawdata).group(1)
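
A more targeted pattern may reduce the chance of matching the wrong attribute. The following is only a sketch and not part of the original answer: it assumes the link always sits in a double-quoted href attribute, it uses a shortened stand-in for rawdata with made-up URLs, and the helper name find_magnet_href is hypothetical. html.unescape turns &amp; back into & so the result is a usable link.

```python
import html
import re

# Shortened stand-in for the question's rawdata (assumption: same attribute layout).
sample = '''<div class="iaconbox center floatright">
    <a class="icon16" href="https://example.com/page.html" title="Verified Torrent"></a>
    <a data-nop="" title="Torrent magnet link" href="magnet:?xt=urn:btih:ABC123&amp;dn=some+name&amp;tr=udp%3A%2F%2Ftracker.example.org%3A1337%2Fannounce" class="icon16 askFeedbackjs"></a>
</div>'''

# [^"]+ stops at the closing quote, so the match cannot run into later attributes.
MAGNET_RE = re.compile(r'href="(magnet:\?[^"]+)"')


def find_magnet_href(text):
    """Return the first magnet link found in text, or None if there is none."""
    match = MAGNET_RE.search(text)
    if match is None:
        return None
    # Convert HTML entities such as &amp; back into plain characters.
    return html.unescape(match.group(1))


print(find_magnet_href(sample))
```

Checking for None before calling group() avoids an AttributeError when the page contains no magnet link.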

Answer 2 (score: 1)

First, as HenryM pointed out, you need to use single quotes or escape the " so that the string is valid.

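Second, find() always returns the index of the first match. You will therefore find the first " in the string rather than the one that closes the link. To fix this, pass the position where the search should begin as the second positional argument of find() (str.find does not accept keyword arguments).

In addition, you need to add the length of the prefix you want to skip to the start index, because find gives you the index where the match begins, not where it ends. Only href=" is skipped here, so the result still begins with magnet:?. The code would look something like this:

start = rawdata.find('href="magnet:?') + len('href="')  # keep the magnet:? prefix
end = rawdata.find('"', start)  # first quote at or after start, i.e. the closing one
link = rawdata[start:end]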
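
To make the behaviour of find() concrete, here is a minimal sketch of the same idea wrapped in a function. It is an illustration under assumptions, not the answerer's code: the sample string is a shortened stand-in for rawdata, and extract_magnet is a hypothetical helper name. It also reproduces the empty result from the question, which happens because the end index (the first quote in the text) lies before the start index.

```python
import html

# Shortened stand-in for the question's rawdata (assumption: same layout).
sample = ''' <div class="iaconbox center floatright">
    <a class="icon16" href="https://example.com/page.html" title="Verified Torrent"></a>
    <a data-nop="" title="Torrent magnet link" href="magnet:?xt=urn:btih:ABC123&amp;dn=some+name" class="icon16 askFeedbackjs"></a>
</div> '''


def extract_magnet(text):
    """Return the first magnet href in text using only str.find, or None."""
    prefix = 'href="'
    start = text.find(prefix + 'magnet:?')
    if start == -1:              # find() returns -1 when nothing matches
        return None
    start += len(prefix)         # skip href=" but keep magnet:?
    end = text.find('"', start)  # first quote at or after start
    if end == -1:
        return None
    return html.unescape(text[start:end])  # &amp; becomes &


# The slice from the question: the end index is smaller than the start
# index, so the result is an empty string.
broken = sample[sample.find('href="magnet:?') + 1:sample.find('"')]
print(repr(broken))
print(extract_magnet(sample))
```

The -1 checks matter because slicing with -1 would silently return the wrong text instead of raising an error.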

Answer 3 (score: 1)

The input data is an HTML fragment. You should not be using regular expressions to parse it.

Use a parser instead. Here is a working example using the BeautifulSoup HTML parser:

from bs4 import BeautifulSoup


rawdata = ''' <div class="iaconbox center floatright">
    <a rel="12624681,0" class="icommentjs kaButton smallButton rightButton" href="https://kat.cr/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681.html#comment">209 <i class="ka ka-comment"></i></a>               <a class="icon16" href="https://kat.cr/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681.html" title="Verified Torrent"><i class="ka ka16 ka-verify ka-green"></i></a>                                <div data-sc-replace="" data-sc-slot="_ae58c272c09a10c792c6b17d55c20208" class="none" data-sc-params="{ &#39;name&#39;: &#39;Zootopia%202016%201080p%20HDRip%20x264%20AC3-JYK&#39;, &#39;extension&#39;: &#39;mkv&#39;, &#39;magnet&#39;: &#39;magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&amp;dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&amp;tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&amp;tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&amp;tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&amp;tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&#39; }"></div>
    <a data-nop="" title="Torrent magnet link" href="magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&amp;dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&amp;tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&amp;tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&amp;tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&amp;tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce" class="icon16 askFeedbackjs" data-id="CE8357DED670F06329F6028D2F2CEA6F514646E0"><i class="ka ka16 ka-magnet"></i></a>
    <a data-download="" title="Download torrent file" href="https://kat.cr/torrents/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681/" class="icon16 askFeedbackjs"><i class="ka ka16 ka-arrow-down"></i></a>
</div> '''

soup = BeautifulSoup(rawdata, "html.parser")
print(soup.find("a", title="Torrent magnet link")["href"])

Prints:

magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce
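
If installing BeautifulSoup is not an option, the standard library's html.parser module can probably do the same job with a little more code. The following is a minimal sketch rather than part of the original answer: the class name MagnetLinkParser and the shortened sample string are assumptions made for illustration.

```python
from html.parser import HTMLParser


class MagnetLinkParser(HTMLParser):
    """Collect href values of <a> tags that start with 'magnet:?'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; the parser has already
        # decoded entities such as &amp; in the values.
        href = dict(attrs).get("href")
        if href and href.startswith("magnet:?"):
            self.links.append(href)


# Shortened stand-in for the question's rawdata (assumption: same layout).
sample = '''<div class="iaconbox center floatright">
    <a class="icon16" href="https://example.com/page.html" title="Verified Torrent"></a>
    <a data-nop="" title="Torrent magnet link" href="magnet:?xt=urn:btih:ABC123&amp;dn=some+name" class="icon16 askFeedbackjs"></a>
</div>'''

parser = MagnetLinkParser()
parser.feed(sample)
parser.close()
print(parser.links)
```

Like the BeautifulSoup version, this returns the link with & instead of &amp;, because attribute values are unescaped during parsing.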