Question

So, I have this code:

import re
import urllib.request

url = 'http://google.com'
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')
m = urllib.request.urlopen(url)
msg = m.read()
links = linkregex.findall(msg)

Then Python returns this error:

links = linkregex.findall(msg)
TypeError: can't use a string pattern on a bytes-like object

What am I doing wrong?

Answer 1

TypeError: can't use a string pattern on a bytes-like object

What am I doing wrong?

You used a string pattern on a bytes object. Use a bytes pattern instead:

linkregex = re.compile(b'<a\s*href=[\'|"](.*?)[\'"].*?>')
                       ^
            Add the b there, it makes it into a bytes object
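
As a minimal sketch of what the b prefix changes, the snippet below runs a bytes pattern against a small hard-coded bytes value, so no network access is needed. Here sample_html is a made-up stand-in for what m.read() returns, and the UTF-8 decoding at the end is an assumption. Note that with a bytes pattern the captured groups are bytes too.

```python
import re

# Hypothetical stand-in for the bytes returned by m.read()
sample_html = b'<a href="http://example.com/">one</a> <a href=\'/two\'>two</a>'

# Bytes pattern; the rb prefix also makes it raw, so \s reaches the regex engine intact
linkregex = re.compile(rb'<a\s*href=[\'"](.*?)[\'"].*?>')

links = linkregex.findall(sample_html)
print(links)  # the matches are bytes objects, not str

# Decode each match if str values are needed (UTF-8 is an assumption here)
links_text = [link.decode("utf-8") for link in links]
print(links_text)
```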

(P.S.:

 >>> from disclaimer include dont_use_regexp_on_html
 "Use BeautifulSoup or lxml instead."

)
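
The P.S. recommends a real HTML parser over a regex. BeautifulSoup and lxml are third-party packages; as a dependency-free sketch of the same idea, the example below uses the standard library's html.parser instead, which is a swapped-in technique and not what the answer names. LinkCollector and sample_html are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute names arrive lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Hypothetical sample; HTMLParser expects str, so the bytes are decoded first
sample_html = b'<p><a class="x" href="http://example.com/">one</a> <A HREF="/two">two</A></p>'

collector = LinkCollector()
collector.feed(sample_html.decode("utf-8"))
collector.close()
print(collector.links)
```

Unlike the regex, this does not depend on href being the first attribute or on the case of the tag name.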

Answer 2

If you are running Python 2.6, there is no "request" in "urllib". So the third line becomes:

m = urllib.urlopen(url)

In version 3 you should use this:

links = linkregex.findall(str(msg))

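because 'msg' is a bytes object, not the string that findall() expects. Note that str(msg) returns the repr of the bytes object (text of the form b'...'), so decoding is generally the more reliable option. You can decode it using the correct encoding. For example, if the encoding is "latin1":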

links = linkregex.findall(msg.decode("latin1"))
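
To make the difference between the two suggestions concrete, here is a small sketch with a hard-coded bytes value (msg is a made-up sample, not a real response, and Latin-1 is assumed to be its encoding). Calling str() on a bytes object returns its repr, including the b'...' wrapper and escape sequences, whereas decode() returns the actual text.

```python
import re

# Hypothetical sample standing in for m.read(); "é" is encoded as Latin-1 (0xE9)
msg = b'<a href="/caf\xe9">menu</a>\n'

as_repr = str(msg)              # repr of the bytes object, not the page text
as_text = msg.decode("latin1")  # the actual text, given the right encoding

print(as_repr)
print(as_text)

linkregex = re.compile(r'<a\s*href=[\'"](.*?)[\'"].*?>')
links_from_repr = linkregex.findall(as_repr)  # escape sequences leak into the match
links_from_text = linkregex.findall(as_text)  # the decoded character is matched
print(links_from_repr)
print(links_from_text)
```

For a real response, m.headers.get_content_charset() may report the encoding the server declared; it returns None when no charset is given, so some fallback is still needed.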

Answer 3

Well, my version of Python doesn't have a urllib with a request attribute, but if I use "urllib.urlopen(url)" I don't get back a string, I get an object. That is the type error.

Answer 4

The Google URL you used didn't work for me, so I substituted http://www.google.com/ig?hl=en, which does work for me.

Try this:

import re
import urllib.request

url="http://www.google.com/ig?hl=en"
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')
m = urllib.request.urlopen(url)
msg = m.read()
links = linkregex.findall(str(msg))
print(links)

Hope this helps.

Answer 5

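The regular expression pattern and the string must be of the same type. If you're matching a regular string, you need a string pattern. If you're matching a byte string, you need a bytes pattern.

In this case, m.read() returns a byte string, so you need a bytes pattern. In Python 3, regular strings are Unicode strings, and you need the b modifier to specify a byte string literal: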

linkregex = re.compile(b'<a\s*href=[\'|"](.*?)[\'"].*?>')
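
The rule can be checked directly. This sketch tries each combination of pattern type and subject type on a tiny hard-coded input and records which combinations raise TypeError. The names data_text, data_bytes, patterns, subjects and results are illustrative only.

```python
import re

data_text = '<a href="/x">x</a>'
data_bytes = data_text.encode("ascii")

patterns = {
    "str pattern": re.compile(r'href="(.*?)"'),
    "bytes pattern": re.compile(rb'href="(.*?)"'),
}
subjects = {"str subject": data_text, "bytes subject": data_bytes}

results = {}
for pattern_name, pattern in patterns.items():
    for subject_name, subject in subjects.items():
        try:
            results[(pattern_name, subject_name)] = pattern.findall(subject)
        except TypeError as exc:
            # Mixed pattern/subject types are rejected by the re module
            results[(pattern_name, subject_name)] = f"TypeError: {exc}"

for combo, outcome in results.items():
    print(combo, "->", outcome)
```

Only the two matching combinations produce results; both mixed combinations fail with a TypeError like the one in the question.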

Answer 6

This works for me in Python 3. Hope this helps.

import urllib.request
import re
urls = ["https://google.com","https://nytimes.com","http://CNN.com"]
i = 0
regex = '<title>(.+?)</title>'
pattern = re.compile(regex)

while i < len(urls) :
    htmlfile = urllib.request.urlopen(urls[i])
    htmltext = htmlfile.read()
    titles = re.search(pattern, str(htmltext))
    print(titles)
    i+=1

Alternatively, I added b before the regex, which turns it into a bytes object, so the response no longer needs to be converted:

import urllib.request
import re
urls = ["https://google.com","https://nytimes.com","http://CNN.com"]
i = 0
regex = b'<title>(.+?)</title>'
pattern = re.compile(regex)

while i < len(urls) :
    htmlfile = urllib.request.urlopen(urls[i])
    htmltext = htmlfile.read()
    titles = re.search(pattern, htmltext)
    print(titles)
    i+=1

Python TypeError on regex
