Question

I need to parse a file and detect empty URLs. These are the scenarios:

href = ''(ideally)
href     = '    '

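Both cases have the same effect, even though the second one contains whitespace. What I have done is read all the text in the file into a single string variable, 'searchstring'. For the first case I check that searchstring.find("href = ''") is not equal to -1, but when the amount of whitespace varies, as in the second case, I don't know what I need to do to make sure I catch those scenarios too... Initially I thought of using index to capture the position and then iterating from there, but that seems like a laborious solution to me.... This may seem silly, but I am new to Python and only started learning yesterday. Can anyone share some insight?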

Many thanks in advance, Philip

Answer 1

I would first install BeautifulSoup ... then I would loop over your files and let it parse them for you.

From there you can do something like:

## import re ## Don't actually need a regex here:

for link in soup.find_all('a'):
    # link.get('href') returns None when the attribute is missing, so fall back to ''
    if not (link.get('href') or '').strip():
        print(link, "... is empty or spacey")
    ## elif re.search(r'^\s*$', link.get('href')):
        ## print(link, "... is spacey")
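If installing BeautifulSoup is not an option, roughly the same check can be sketched with the standard library's html.parser module. This is a swapped-in alternative rather than the approach above, and the names EmptyHrefFinder and find_empty_hrefs, as well as the sample markup, are made up for illustration. Note that this sketch also reports anchors that have no href at all, which may or may not be what you want.

```python
from html.parser import HTMLParser


class EmptyHrefFinder(HTMLParser):
    """Collect href values of <a> tags whose href is missing, empty, or only whitespace."""

    def __init__(self):
        super().__init__()
        self.empty_hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value is None for a bare attribute
        href = dict(attrs).get("href")
        if not (href or "").strip():
            self.empty_hrefs.append(href)


def find_empty_hrefs(html):
    """Hypothetical helper: feed the markup to the parser and return what it collected."""
    finder = EmptyHrefFinder()
    finder.feed(html)
    finder.close()
    return finder.empty_hrefs


sample = "<a href=''>one</a> <a href  = '    '>two</a> <a href='http://example.com'>three</a>"
print(find_empty_hrefs(sample))
```

Because the parser handles the attribute syntax itself, whitespace around the equals sign does not need any special treatment here.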

Answer 2

Check the length (or better, the bool) of href.strip():

In [47]: href = ''

In [48]: len(href.strip())
Out[48]: 0

In [49]: bool(href.strip())
Out[49]: False

In [50]: href = '    '

In [51]: len(href.strip())
Out[51]: 0

In [52]: bool(href.strip())
Out[52]: False
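The session above could be wrapped in a small helper, roughly like this. The name is_empty_href is an assumption for illustration, and treating None as empty is an extra choice that is not part of the answer itself.

```python
def is_empty_href(href):
    """Return True if href is None, empty, or contains only whitespace."""
    # str.strip() with no arguments removes leading and trailing whitespace,
    # so a whitespace-only string becomes '' which is falsy.
    return not (href or "").strip()


for candidate in ["", "    ", "\t\n", "http://example.com", None]:
    print(repr(candidate), is_empty_href(candidate))
```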

Answer 3

Why not just strip href?

href = href.strip()

or

if href.strip():
    print("not empty")
else:
    print("empty")

Answer 4

You can use re. You'd better read the documentation.

>>> import re
>>> s='href=""adjfweofhref="   "'
>>> pattern = re.compile(r'href=[\"\']\s*[\"\']')
>>> pattern.findall(s)
['href=""', 'href="   "']
>>>
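The pattern above expects href= with no spaces around the equals sign, while the examples in the question do have spaces there. A possible variant that tolerates that whitespace, and that requires the closing quote to match the opening one, might look like the sketch below. The name EMPTY_HREF and the sample text are assumptions for illustration; searchstring is the variable name from the question.

```python
import re

# \bhref, optional whitespace, '=', optional whitespace, an opening quote,
# only whitespace (possibly none), then the same quote character again (\1).
EMPTY_HREF = re.compile(r"""\bhref\s*=\s*(["'])\s*\1""")

searchstring = """<a href = ''>a</a> <a href     = '    '>b</a> <a href="x.html">c</a>"""

# finditer is used instead of findall because findall would return only
# the captured quote character once the pattern contains a group.
matches = [m.group(0) for m in EMPTY_HREF.finditer(searchstring)]
print(matches)
```

A regex like this works on the raw text only, so it will also match href-like text inside comments or scripts; a real HTML parser avoids that.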

Python - Detecting empty URLs - Using string operations
