Question

I'm trying to filter a link out of some JavaScript. The JavaScript part no longer matters, because I have already converted it to a string (text).

Here is the script part:

<script>                
                					
					setTimeout("location.href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe';", 2000);
                
    
                $(function() {
                    $("#whats_new_panels").bxSlider({
                        controls: false,
                        auto: true,
                        pause: 15000
                    });
                });
                setTimeout(function(){
                    $("#download_messaging").hide();
                    $("#next_button").show();
                }, 10000);
            </script>

Here is my attempt:

import re

def get_link_from_text(text):
   text = text.replace('\n', '')
   text = text.replace('\t', '')
   text = re.sub(' +', ' ', text)

   search_for = re.compile("href[ ]*=[ ]*'[^;]*")
   debug = re.search(search_for, text)

   return debug

What I want is the href link. I sort of get it, but for some reason only like this:

<_sre.SRE_Match object; span=(30, 112), match="href = 'https://airdownload.adobe.com/air/win/dow>

instead of this, which is what I want:

<_sre.SRE_Match object; span=(30, 112), match="href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe'">

So my question is: how do I get the full link, and not just part of it?

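Could the problem be that re.search does not return longer strings? I tried changing the regex, and I even tried matching the link literally, one-to-one, but it still returns only the part shown above.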

Answer 1

I modified it slightly, but for me it returns the full string you are after.

import re

text = """
<script>                

setTimeout("location.href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe';", 2000);


    $(function() {
        $("#whats_new_panels").bxSlider({
            controls: false,
            auto: true,
            pause: 15000
        });
    });

    setTimeout(function(){
        $("#download_messaging").hide();
         $("#next_button").show();
    }, 10000);
</script>
"""

def get_link_from_text(text):
   text = text.replace('\n', '')
   text = text.replace('\t', '')
   text = re.sub(' +', ' ', text)

   search_for = re.compile("href[ ]*=[ ]*'[^;]*")
   debug = search_for.findall(text)

   print(debug)

get_link_from_text(text)

Output:

["href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe'"]
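For what it's worth, the original regex was probably matching the whole link all along. The span=(30, 112) in the question covers 82 characters, which is the length of the full href text. The repr of a match object only displays the first 50 characters of the match, so the printed form looks cut off. Calling .group() on the match returns the complete string. Below is a minimal sketch, assuming the input is the one-line setTimeout call from the question. The capturing-group pattern is my own variation, not the asker's regex; it returns only the URL, without the leading href = ' part.

```python
import re

# Assumed input: the setTimeout line from the question, already as a string.
text = ("setTimeout(\"location.href = 'https://airdownload.adobe.com/air/win/"
        "download/30.0/AdobeAIRInstaller.exe';\", 2000);")

# Variation on the original pattern (an assumption, not the asker's regex):
# the parentheses capture only what sits between the single quotes.
pattern = re.compile(r"href\s*=\s*'([^']*)'")

match = pattern.search(text)

# group(0) is the entire matched text, group(1) is the captured URL.
full_match = match.group(0) if match else None
link = match.group(1) if match else None

print(full_match)
print(link)
```

Guarding against None matters here because search returns None when nothing matches, and calling .group() on None raises an AttributeError.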
