Question

This has been bugging me for a while: I can't use a regular expression to find a string with BeautifulSoup, and I don't know why.

This is the line I'm having trouble with:

data = soup.find(text=re.compile('Överförda data (skickade/mottagna)

Here is the full code, if needed:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
from bs4 import BeautifulSoup

import re
import urllib2

# Fetch URL
url = 'http://192.168.1.254/cgi/b/bb/?be=0&l0=1&l1=-1'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')

# Response has UTF-8 charset header,
# and HTML body which is UTF-8 encoded
response = urllib2.urlopen(request)

soup = BeautifulSoup(response)

time = soup.find(text="Aktiv tid:").findNext('td').contents[0]
data = soup.find(text=re.compile('Överförda data (skickade/mottagna) [GB/GB]:')).findNext('td').contents[0] # complains about this line

f=open('/var/www/log.txt', 'a')
print(time + ";" + data,file=f)
f.close()

Whenever I run it, I get an AttributeError saying that the 'NoneType' object has no attribute 'findNext'.

The string I am looking for can be either of these:

Överförda data (skickade/mottagna) [GB/GB]:
Överförda data (skickade/mottagna) [MB/MB]:

So I need a regular expression to check whether the text matches either of them.

Thanks in advance!

(Edit: I have now changed my code (see the answer below), but it still gives me the same error:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
from bs4 import BeautifulSoup

import re
import urllib2

# Fetch URL
url = 'http://192.168.1.254/cgi/b/bb/?be=0&l0=1&l1=-1'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')

# Response has UTF-8 charset header,
# and HTML body which is UTF-8 encoded
response = urllib2.urlopen(request)

soup = BeautifulSoup(response)

time = soup.find(text="Aktiv tid:").findNext('td').contents[0]
data = soup.find(text=re.compile(re.escape(u'Överförda data (skickade/mottagna) [GB/GB]:'))).findNext('td').contents[0]

f=open('/var/www/log.txt', 'a')
print(time + ";" + data,file=f)
f.close()

Here is the relevant part of the HTML file:

<table width='100%' class='datatable' cellspacing='0' cellpadding='0'>
  <tr>
    <td>
    </td>
    <td width='30px'>
    </td>
    <td width='220px'>
    </td>
    <td width='50px'>
    </td>
  </tr>
  <tr>
    <td height='7' colspan='4'>
      <img src='/images/spacer.gif' width='1' height='7' border='0' alt=''>
    </td>
  </tr>
  <tr>
    <td width='170'>
      Aktiv tid: <!--This is a string I will search for.-->
    </td>
    <td colspan='3'>
      1 dag, 17:03:46 <!--This is a piece of information I need to obtain.-->
    </td>
  </tr>
  <tr>
    <td height='7' colspan='4'>
      <img src='/images/spacer.gif' width='1' height='7' border='0' alt=''>
    </td>
  </tr>
  <tr>
    <td width='170'>
      Bandbredd (upp/ned) [kbps/kbps]:
    </td>
    <td colspan='3'>
      1.058 / 21.373
    </td>
  </tr>
  <tr>
    <td height='7' colspan='4'>
      <img src='/images/spacer.gif' width='1' height='7' border='0' alt=''>
    </td>
  </tr>
  <tr>
    <td width='170'>
      Överförda data (skickade/mottagna) [GB/GB]: <!--This is another string I will search for.-->
    </td>
    <td colspan='3'>
      1,67 / 42,95 <!--This is another piece of information I need to obtain.-->
    </td>
  </tr>
</table>

)

Answer 1

BeautifulSoup operates on Unicode strings, but you are passing in a regular expression built from a byte string. Use a Unicode literal for the expression:

re.compile(re.escape(u'Överförda data (skickade/mottagna) [GB/GB]:'))

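I also used re.escape() to escape the metacharacters (the parentheses and square brackets) so that they are not interpreted as regular expression syntax.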
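To illustrate why the escaping matters, here is a minimal sketch that compares the unescaped and escaped patterns against the label text itself. It assumes Python 3, where string literals are already Unicode; the variable names are illustrative only and do not come from the question.

```python
import re

# The label text from the question, written with escapes so the example
# does not depend on the source file encoding.
label = '\u00d6verf\u00f6rda data (skickade/mottagna) [GB/GB]:'

# Unescaped: "(...)" is read as a capture group and "[GB/GB]" as a
# character class, so the pattern no longer describes the literal label.
unescaped = re.compile(label)

# Escaped: every metacharacter is matched literally.
escaped = re.compile(re.escape(label))

unescaped_hit = unescaped.search(label) is not None
escaped_hit = escaped.search(label) is not None
print(unescaped_hit, escaped_hit)
```

The unescaped pattern fails to match its own source text, because after "data " it expects the letter "s" rather than a literal "(".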

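In a byte string, Ö and ö are stored as their UTF-8 encodings, so such a pattern only matches that exact byte sequence and not the corresponding Unicode characters: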

>>> 'Överförda'
'\xc3\x96verf\xc3\xb6rda'
>>> u'Överförda'
u'\xd6verf\xf6rda'
>>> print u'Överförda'
Överförda
>>> import re
>>> re.search('Överförda', u'Överförda data (skickade/mottagna) [GB/GB]')
>>> re.search(u'Överförda', u'Överförda data (skickade/mottagna) [GB/GB]')
<_sre.SRE_Match object at 0x107d47ed0>

This does require a correct source code encoding declaration at the top of your file; see PEP 263.
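For readers on Python 3, the same distinction can be sketched as follows. This is an assumption-laden aside, since the question targets Python 2: in Python 3 a plain string literal is already Unicode, and the byte form only appears after an explicit encode. The names `text` and `encoded` are illustrative.

```python
# Unicode text: one code point per character.
text = '\u00d6verf\u00f6rda'

# UTF-8 bytes: the two non-ASCII characters take two bytes each.
encoded = text.encode('utf-8')

print(len(text), len(encoded))
print(encoded)
```

The byte form is two bytes longer than the text form, which is why a byte-string pattern and a Unicode document never line up character for character.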

Answer 2

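Square brackets and parentheses are special in regular expressions. If you want to match them as literal characters (as opposed to defining capture groups, character classes, and so on), you need to escape them with backslashes.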
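As a rough sketch of what that could look like for the two label variants mentioned in the question, the pattern below escapes the brackets by hand and accepts either unit pair. It assumes Python 3 string semantics; the name `LABEL_RE` and the sample list are hypothetical and only serve the illustration.

```python
import re

# Parentheses and square brackets are escaped with backslashes;
# (?:GB/GB|MB/MB) is a non-capturing group that accepts either unit pair.
LABEL_RE = re.compile(
    '\u00d6verf\u00f6rda data \\(skickade/mottagna\\) \\[(?:GB/GB|MB/MB)\\]:'
)

samples = [
    '\u00d6verf\u00f6rda data (skickade/mottagna) [GB/GB]:',
    '\u00d6verf\u00f6rda data (skickade/mottagna) [MB/MB]:',
    'Bandbredd (upp/ned) [kbps/kbps]:',
]

# True for the two transfer labels, False for the bandwidth label.
matches = [LABEL_RE.search(s) is not None for s in samples]
print(matches)
```

A compiled pattern like this could be passed as the text= argument to soup.find, in the same way the question passes its single-unit pattern.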
