Question

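Please be prepared for a long read. I'm stuck and don't know where to look for answers or what else to try. Needless to say, I'm fairly new to programming. I've been hacking away at this project for the past few weeks.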

Problem

I have a table with 25 rows and 2 columns. Each row is structured as follows:

Wanted event

<td align=center>19/11/11<br>12:01:21 AM</td>
<td align=center><font color=#006633><a href=profiles.php?XID=1><font color=#006633>player1</font></a> hospitalized <a href=profiles.php?XID=2><font color=#006633>player2</font></a></font></td>

Unwanted event, case A

<td align="center">19/11/11<br />12:58:03 AM</td>
<td align="center"><font color="#AA0000">Someone hospitalized <a href=profiles.php?XID=1><font color="#AA0000">player1</font></a></font></td>

Unwanted event, case B

<td align="center">19/11/11<br />12:58:03 AM</td>
<td align=center><font color=#006633><a href=profiles.php?XID=3><font color=#006633>player3</font></a> attacked <a href=profiles.php?XID=1><font color=#006633>player1</font></a> and lost </font></td>

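I use regular expressions to scrape the data I need. My problem is that the 2 resulting lists don't line up event by event: the date and time do not always belong to the event they are paired with.

First attempt at solving the problem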

import mechanize  
import re

htmlA1 = br.response().read()

patAttackDate = re.compile('<td align=center>(\d+/\d+/\d+)<br>(\d+:\d+:\d+ \w+)')
patAttackName = re.compile('<font color=#006633>(\w+)</font></a> hospitalized ')
searchAttackDate = re.findall(patAttackDate, htmlA1)
searchAttackName = re.findall(patAttackName, htmlA1)

pairs = zip(searchAttackDate, searchAttackName)

for i in pairs:
    print (i)

But this gives me a list of the "wrong time - correct event" kind.

For example:

(('19/11/11', '9:47:51 PM'), 'user1') <- mismatch 
(('19/11/11', '8:21:18 PM'), 'user1') <- mismatch
(('19/11/11', '7:33:00 PM'), 'user1') <- As a consequence of the below, the rest upwards are mismatched 
(('19/11/11', '7:32:38 PM'), 'user2') <- NOT a match, case B
(('19/11/11', '7:32:22 PM'), 'user2') <- match ok
(('19/11/11', '7:26:53 PM'), 'user2') <- match ok
(('19/11/11', '7:25:24 PM'), 'user3') <- match ok
(('19/11/11', '7:24:22 PM'), 'user3') <- match ok
(('19/11/11', '7:23:25 PM'), 'user3') <- match ok
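The mismatch above can be reproduced in a few lines. This is a minimal sketch in Python 3 on made-up rows (the `<tr>` wrappers, player names and XIDs are assumptions, not the real page): the date pattern matches every row, while the name pattern matches only the "hospitalized" rows, so `zip` pairs items that never belonged together.

```python
import re

# Made-up sample: a wanted row, a case-B row, then another wanted row.
html = """
<tr><td align=center>19/11/11<br>7:33:00 PM</td>
<td align=center><font color=#006633><a href=profiles.php?XID=1><font color=#006633>user1</font></a> hospitalized <a href=profiles.php?XID=9><font color=#006633>victim</font></a></font></td></tr>
<tr><td align=center>19/11/11<br>7:32:38 PM</td>
<td align=center><font color=#006633><a href=profiles.php?XID=3><font color=#006633>user3</font></a> attacked <a href=profiles.php?XID=1><font color=#006633>user1</font></a> and lost </font></td></tr>
<tr><td align=center>19/11/11<br>7:32:22 PM</td>
<td align=center><font color=#006633><a href=profiles.php?XID=2><font color=#006633>user2</font></a> hospitalized <a href=profiles.php?XID=9><font color=#006633>victim</font></a></font></td></tr>
"""

pat_date = re.compile(r'<td align=center>(\d+/\d+/\d+)<br>(\d+:\d+:\d+ \w+)')
pat_name = re.compile(r'<font color=#006633>(\w+)</font></a> hospitalized ')

dates = pat_date.findall(html)   # one entry per row, wanted or not
names = pat_name.findall(html)   # one entry per "hospitalized" row only

# zip() pairs by position, so the case-B timestamp gets attached to user2.
pairs = list(zip(dates, names))
for pair in pairs:
    print(pair)
```

Because the two lists have different lengths, every pair after the first unwanted row is shifted by one.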

Second attempt at solving the problem

So I thought I would strip the newlines out of the whole page and scrape just the table, but:

import mechanize
import re
from BeautifulSoup import BeautifulSoup

htmlA1 = br.response().read()

stripped = htmlA1.replace(">\n<","><") #Removed all '\n' from code

soup = BeautifulSoup(stripped)

table = soup.find('table', width='90%')
table2 = table.findNext('table', width='90%')
table3 = table2.findNext('table', width='90%') #this is the table I need to work with

patAttackDate = re.compile('<td align="center">(\d+/\d+/\d+)<br />(\d+:\d+:\d+ \w+)')
searchAttackDate = re.findall(patAttackDate, table3)
print searchAttackDate

This gives me an error:

return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer

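What am I missing?

Bonus question: is there a way to account for XID being a dynamic value, but skip over it when using regex / BeautifulSoup (or another scraping method)? As the project "grows" I may need to include the XID part of the markup in the pattern, but I don't want to match against a specific value. (Not sure if that is clear.)

Thanks for your time

Edit 1: added a list example
Edit 2: made the code separation more visible
Edit 3: added sample code for a suggested solution that doesn't seem to work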

Test = '''<table><tr><td>date</td></tr></table>'''
soupTest = BeautifulSoup(Test)
test2 = soupTest.find('table')
patTest = re.compile('<td>(.*)</td>')
searchTest = patTest.findall(test2.getText())
print test2 # gives: <table><tr><td>date</td></tr></table> 
print type(test2) # gives: <class 'BeautifulSoup.Tag'>
print searchTest #gives: []
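A likely reason the test above prints an empty list: getText() returns only the text inside the tag, with the markup removed, so a pattern that contains `<td>` has nothing to match. The sketch below (Python 3) uses plain strings instead of BeautifulSoup; `text_only` is an assumed stand-in for what getText() returns for this table.

```python
import re

markup = '<table><tr><td>date</td></tr></table>'  # what str(tag) yields
text_only = 'date'  # assumption: getText() result, tags stripped

pat_test = re.compile(r'<td>(.*)</td>')

from_text = pat_test.findall(text_only)    # no tags left to match
from_markup = pat_test.findall(markup)     # tags still present

print(from_text)
print(from_markup)
```

To run a tag-based pattern, search the markup (for example `str(tag)`) rather than the extracted text.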

Edit 4 - solution

import re
import mechanize
from BeautifulSoup import BeautifulSoup

htmlA1 = br.response().read()
stripped = htmlA1.replace(">\n<","><") #stripped '\n' from html
soup = BeautifulSoup(stripped)

table = soup.find('table', width='90%')
table2 = table.findNext('table', width='90%')
table3 = table2.findNext('table', width='90%') #table I need to work with

print type(table3) # gives <class 'BeautifulSoup.Tag'>
strTable3 = str(table3) #convert table3 to string type so i can regex it

patFinal = re.compile(('(\d+/\d+/\d+)<br />(\d+:\d+:\d+ \w+)</td><td align="center">'
                      '<font color="#006633"><a href="profiles.php\?XID=(\d+)">'
                      '<font color="#006633">(\w+)</font></a> hospitalized <a'), re.IGNORECASE)
searchFinal = re.findall(patFinal, strTable3)

for i in searchFinal:
    print (i)

Sample output

('19/11/11', '1:08:07 AM', 'ID_user1', 'user1')
('19/11/11', '1:06:55 AM', 'ID_user1', 'user1')
('19/11/11', '1:05:46 AM', 'ID_user1', 'user1')
('19/11/11', '1:04:33 AM', 'ID_user1', 'user1')
('19/11/11', '1:03:32 AM', 'ID_user1', 'user1')
('19/11/11', '1:02:37 AM', 'ID_user1', 'user1')
('19/11/11', '1:00:43 AM', 'ID_user1', 'user1')
('19/11/11', '12:55:35 AM', 'ID_user2', 'user2')
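On the bonus question, one possible approach is to match the XID with `\d+` but leave it outside any capturing group, so it is tolerated without appearing in the results; a named group can bring it back later if the project needs it. This is a sketch in Python 3 on a made-up row (the XID value 12345 and the group names `xid` and `player` are assumptions).

```python
import re

row = ('<td align="center">19/11/11<br />1:08:07 AM</td><td align="center">'
       '<font color="#006633"><a href="profiles.php?XID=12345">'
       '<font color="#006633">user1</font></a> hospitalized <a')

# XID is matched by \d+ but not captured, so findall() skips it.
pat_skip_id = re.compile(
    r'(\d+/\d+/\d+)<br />(\d+:\d+:\d+ \w+)</td><td align="center">'
    r'<font color="#006633"><a href="profiles\.php\?XID=\d+">'
    r'<font color="#006633">(\w+)</font></a> hospitalized ')

# Same pattern with named groups, for when the XID is needed after all.
pat_named = re.compile(
    r'(?P<date>\d+/\d+/\d+)<br />(?P<time>\d+:\d+:\d+ \w+)</td>'
    r'<td align="center"><font color="#006633">'
    r'<a href="profiles\.php\?XID=(?P<xid>\d+)">'
    r'<font color="#006633">(?P<player>\w+)</font></a> hospitalized ')

without_id = pat_skip_id.findall(row)
named = pat_named.search(row).groupdict()

print(without_id)
print(named['xid'], named['player'])
```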

Edit 5 - a simpler solution (based on the first attempt, without BeautifulSoup)

import re

reAttack = (r'<td\s+align=center>(\d+/\d+/\d+)<br>(\d+:\d+:\d+\s+\w+)</td>\s*' # \s* skips the '\n' between cells
            r'<td.*?' # skips ahead to the attacker's <font> tag on the same line
            r'<font\s+color=#006633>(\w+)</font></a>\s+hospitalized\s+')

for m in re.finditer(reAttack, htmlA1):
    print 'date: %s; time: %s; player: %s' % (m.group(1), m.group(2), m.group(3))

Sample output

date: 19/11/11; time: 1:08:07 AM; player: user1
date: 19/11/11; time: 1:06:55 AM; player: user1
date: 19/11/11; time: 1:05:46 AM; player: user1
date: 19/11/11; time: 1:04:33 AM; player: user1
date: 19/11/11; time: 1:03:32 AM; player: user1
date: 19/11/11; time: 1:02:37 AM; player: user1
date: 19/11/11; time: 1:00:43 AM; player: user1
date: 19/11/11; time: 12:55:35 AM; player: user2

Answer 1

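From your description, I haven't figured out what you are trying to do. But I can tell you one thing right now: with regular expressions, Python raw strings are your friend.

Try using r'pattern' instead of just 'pattern' in your BeautifulSoup program.

Also, when you work with regular expressions, it is sometimes valuable to start with simple patterns, verify that they work, and then build them up. You have gone straight to complex patterns, and I'm sure they aren't working because you didn't use raw strings and the backslashes aren't correct.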
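To illustrate the raw-string advice, here is a small sketch in Python 3 (the sample text is made up). In a normal string literal `\b` is turned into a backspace character before the regex engine ever sees it, while a raw string keeps the backslash. Escapes such as `\d` happen to pass through unchanged because Python does not recognise them as string escapes, which is why the habit matters more for some patterns than others.

```python
import re

plain = '\bcat\b'   # '\b' becomes a backspace character (1 char each)
raw = r'\bcat\b'    # backslash + 'b' survive, so the regex sees \b

text = 'cat concat cat'

plain_hits = re.findall(plain, text)  # looks for backspace-cat-backspace
raw_hits = re.findall(raw, text)      # looks for the whole word "cat"

# A raw string is equivalent to doubling every backslash by hand.
same_digit_pattern = (r'\d+' == '\\d+')

print(len(plain), len(raw))
print(plain_hits, raw_hits, same_digit_pattern)
```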

Answer 2

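The .findNext() method returns a BeautifulSoup.Tag object, which cannot be passed to re.findall. So you need to use .getText() (or a similar method) to get the text from the Tag object, or .contents to get the HTML inside that tag. Also, when you use re.compile, you need to call findall on the returned object.

This: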

soup = BeautifulSoup(stripped)

table = soup.find('table', width='90%')
table2 = table.findNext('table', width='90%')
table3 = table2.findNext('table', width='90%') #this is the table I need to work with

patAttackDate = re.compile('<td align="center">(\d+/\d+/\d+)<br />(\d+:\d+:\d+ \w+)')
searchAttackDate = re.findall(patAttackDate, table3)

should be written like this (the last line is the only thing that needs to change):

soup = BeautifulSoup(stripped)

table = soup.find('table', width='90%')
table2 = table.findNext('table', width='90%')
table3 = table2.findNext('table', width='90%')

patAttackDate = re.compile('<td align="center">(\d+/\d+/\d+)<br />(\d+:\d+:\d+ \w+)')
searchAttackDate = patAttackDate.findall(table3.getText())

# or, to search the html inside table3 and not just the text
# searchAttackDate = patAttackDate.findall(str(table3.contents[0]))
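The TypeError itself can be reproduced without BeautifulSoup. In this Python 3 sketch, `FakeTag` is a hypothetical stand-in for a parsed tag (it is not a BeautifulSoup class): the regex functions accept only strings or bytes, so any other object has to be converted first.

```python
import re

class FakeTag(object):
    """Hypothetical stand-in for a parsed tag; not a BeautifulSoup class."""
    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup

tag = FakeTag('<td align="center">19/11/11<br />12:58:03 AM</td>')
pat = re.compile(r'<td align="center">(\d+/\d+/\d+)<br />(\d+:\d+:\d+ \w+)')

try:
    pat.findall(tag)          # an object, not a string
    raised = False
except TypeError:
    raised = True

found = pat.findall(str(tag))  # converting to a string first works

print(raised, found)
```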

BeautifulSoup Documentation

From the re docs:

re.compile(pattern, flags=0)
Compile a regular expression pattern into a regular expression object.

This:
  result = re.match(pattern, string)

is equivalent to:
  prog = re.compile(pattern)
  result = prog.match(string)
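The same equivalence holds for findall. A short Python 3 check on one made-up cell, which also shows that the module-level function accepts an already compiled pattern:

```python
import re

pattern = r'(\d+/\d+/\d+)<br>(\d+:\d+:\d+ \w+)'
cell = '<td align=center>19/11/11<br>12:01:21 AM</td>'

prog = re.compile(pattern)

via_module = re.findall(pattern, cell)           # pattern given as a string
via_object = prog.findall(cell)                  # method on the compiled object
via_module_with_object = re.findall(prog, cell)  # compiled object passed in

print(via_module == via_object == via_module_with_object)
```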

Answer 3

This works for me:

reAttack = r'<td\s+align=center>(\d+/\d+/\d+)<br>(\d+:\d+:\d+\s+\w+)</td>\s*<td.*?<font\s+color=#006633>(\w+)</font></a>\s+hospitalized\s+'

for m in re.finditer(reAttack, htmlA1):
  print 'date: %s; time: %s; player: %s' % (m.group(1), m.group(2), m.group(3))

live demo

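Doing everything in one regex makes for a messier regex, but it's much easier than matching each TD separately and trying to sync them up afterwards, as you are doing. The .*? near the middle of the regex assumes that all the elements are separated by newlines, as in your original sample. If you can't assume that, you should replace .*? with (?:(?!/?td>).)* to keep the match inside the current TD element.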
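The difference between the two variants shows up once the newlines are gone. This Python 3 sketch uses a made-up single-line sample (a case-B row followed by a wanted row); the names `head`, `tail`, `lazy` and `tempered` are only for illustration.

```python
import re

# Made-up sample with no newlines between the cells.
one_line = (
    '<td align=center>19/11/11<br>7:32:38 PM</td>'
    '<td align=center><font color=#006633><a href=profiles.php?XID=3>'
    '<font color=#006633>user3</font></a> attacked <a href=profiles.php?XID=1>'
    '<font color=#006633>user1</font></a> and lost </font></td>'
    '<td align=center>19/11/11<br>7:32:22 PM</td>'
    '<td align=center><font color=#006633><a href=profiles.php?XID=2>'
    '<font color=#006633>user2</font></a> hospitalized <a href=profiles.php?XID=9>'
    '<font color=#006633>victim</font></a></font></td>'
)

head = r'<td\s+align=center>(\d+/\d+/\d+)<br>(\d+:\d+:\d+\s+\w+)</td>\s*<td'
tail = r'<font\s+color=#006633>(\w+)</font></a>\s+hospitalized\s+'

lazy = re.compile(head + r'.*?' + tail)                 # may cross cells
tempered = re.compile(head + r'(?:(?!/?td>).)*' + tail)  # stays in one cell

lazy_hits = lazy.findall(one_line)
tempered_hits = tempered.findall(one_line)

print(lazy_hits)      # case-B timestamp attached to user2
print(tempered_hits)  # user2 with its own timestamp
```

With `.*?` the match is free to run past the closing `</td>` of the case-B cell until it finds "hospitalized" in a later row; the lookahead version cannot cross a `td>` boundary, so the case-B row simply fails to match.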

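FYI, your sample data has some inconsistencies. Some attribute values are quoted while most are not, and you mix <br> and <br /> tags. I normalized everything for my demo, but if this is representative of your real data, you will need a more complex regex. Or you could switch to a pure DOM solution, which would probably have been easier in the first place. ;)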
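As a rough idea of what a parser-based approach could look like, here is a sketch in Python 3 using only the standard library's html.parser rather than BeautifulSoup. It assumes each row is wrapped in `<tr>...</tr>` (the question only shows the cells) and the class name `RowCollector` is made up. Because the parser handles quoted and unquoted attributes as well as both `<br>` forms, the inconsistencies mentioned above stop mattering.

```python
from html.parser import HTMLParser

class RowCollector(HTMLParser):
    """Collect the visible text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []
        elif tag == 'br' and self._cell is not None:
            self._cell.append(' ')   # keep date and time apart

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(' '.join(''.join(self._cell).split()))
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

# The three rows from the question, wrapped in assumed <tr> tags.
sample = '''
<table>
<tr><td align=center>19/11/11<br>12:01:21 AM</td>
<td align=center><font color=#006633><a href=profiles.php?XID=1><font color=#006633>player1</font></a> hospitalized <a href=profiles.php?XID=2><font color=#006633>player2</font></a></font></td></tr>
<tr><td align="center">19/11/11<br />12:58:03 AM</td>
<td align="center"><font color="#AA0000">Someone hospitalized <a href=profiles.php?XID=1><font color="#AA0000">player1</font></a></font></td></tr>
<tr><td align="center">19/11/11<br />12:58:03 AM</td>
<td align=center><font color=#006633><a href=profiles.php?XID=3><font color=#006633>player3</font></a> attacked <a href=profiles.php?XID=1><font color=#006633>player1</font></a> and lost </font></td></tr>
</table>
'''

parser = RowCollector()
parser.feed(sample)
parser.close()

wanted = []
for row in parser.rows:
    if len(row) != 2:
        continue                      # not a date/event row
    when, event = row
    attacker, sep, victim = event.partition(' hospitalized ')
    if sep and attacker != 'Someone':  # drops case A and case B
        wanted.append((when, attacker))

print(wanted)
```

Since date and event are read from the same row, they cannot drift apart the way two separately scraped lists can.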

Answer 4

For a BeautifulSoup solution you can use this (I haven't checked the regex, and I'm also sure @steveha is right about the raw strings):

searchAttackDate = table3.findAll(patAttackDate)
for row in searchAttackDate:
   print row

Python, matching scraped lists of uneven length
