Question

I'm using Python 3.7 and Django. I want to search for a string in an HTML page. I tried this...

req = urllib2.Request(article.path, headers=settings.HDR)
html = urllib2.urlopen(req, timeout=settings.SOCKET_TIMEOUT_IN_SECONDS).read()
is_present = html.find(token_str) >= 0

But this results in the error

TypeError: argument should be integer or bytes-like object, not 'str'

complaining about the last line, where I do the "find". What is the right way to search for a string in HTML?

Answer 1

Dave!

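To extract data from HTML files, I really recommend the Beautiful Soup library. Right now you may just be searching for the token across all the tags in the HTML file, but at other times you may be looking for something more complex, such as finding a string only within a certain paragraph tag. To install it via pip: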

pip install beautifulsoup4

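For your case, here is a tested snippet that solves your problem. It uses a simple regular-expression pattern for the token you are looking for. It returns True if the token matches anywhere in the text of the HTML tags, and False otherwise.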

I hope this function helps you get started with Beautiful Soup. It is a very powerful library.

import re

from bs4 import BeautifulSoup

html_doc = """
<html>
 <head>
  <title>
   Here goes some title
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    Hello World!
   </b>
  </p>
  <p class="class1">
   Once upon a time..... there was a my_token here....
   <a class="token" href="http://example.com/token" id="link1">
  </p>

  <p class="class2">
   Nope....
  </p>
 </body>
</html>
"""


def search_inside_whole_html_tags_for(html_doc, str_lookup):
    """
    Looks for a str_lookup using a simple regexp pattern. Returns
    True if the str_lookup was found in the whole HTML text. Otherwise,
    returns False.
    """
    html_soup = BeautifulSoup(html_doc, "html.parser")

    # simple regexp pattern: the fixed str lookup
    rslt = html_soup.find_all(string=re.compile(str_lookup))

    return bool(rslt)


print(search_inside_whole_html_tags_for(html_doc, str_lookup="my_tokenx"))
print(search_inside_whole_html_tags_for(html_doc, str_lookup="my_token"))  # this is the token

>>> False
>>> True
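
If a plain substring check is all you need, something like the sketch below may be enough without Beautiful Soup. It assumes the TypeError comes from read() returning bytes in Python 3 while token_str is a str, and that the page is UTF-8 encoded (the real charset should come from the response headers or a meta tag). The helper name contains_token and the sample bytes are made up for illustration; no network call is made.

```python
def contains_token(html_bytes, token_str, encoding="utf-8"):
    """Return True if token_str occurs in the raw page bytes.

    Hypothetical helper: the encoding default is an assumption,
    not something taken from the original question.
    """
    # Decode the bytes to str so both operands of `in` have the same type.
    text = html_bytes.decode(encoding, errors="replace")
    return token_str in text


# Stand-in for urllib's urlopen(...).read(), which returns bytes in Python 3.
html = b"<html><body><p>there was a my_token here</p></body></html>"
token_str = "my_token"

# Reproduce the error from the question: bytes.find() rejects a str argument.
try:
    html.find(token_str)
    raised_type_error = False
except TypeError:
    raised_type_error = True

# Option 1: decode the page, then search text in text.
found_decoded = contains_token(html, token_str)

# Option 2: encode the token, then search bytes in bytes.
found_encoded = html.find(token_str.encode("utf-8")) >= 0

print(raised_type_error, found_decoded, found_encoded)
```

Either option works for a literal match; decoding is usually the more convenient one if you go on to process the page as text.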

Answer 2

You are comparing a string with an integer, which causes the type error. You need to either convert the string to an integer or test for None.

Test:

>>> token_str = 'test'
>>> token_str >= 0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '>=' not supported between instances of 'str' and 'int'
>>> token_str != None
True

Suggested solutions:

is_present = html.find(int(token_str)) >= 0

or

is_present = html.find(token_str) != None

Getting "TypeError: argument should be integer or bytes-like object, not 'str'" when searching for a string in a web page
