I have 2 HTML strings with several subtle differences:
<tbody class="Expanded4" id="divisionG_area24_clubs"><!--<tr><th class='noBorderLeftRight'></th>--><th class="noBorderLeftRight" colspan="6"></th><th colspan="6"><table style="margin-bottom:auto;" width="100%"><tr><th class="noBorderTopLeft" style="width:auto;"></th><th class="Grid_top_blue Grid_Table" colspan="2">Membership</th><th class="Grid_top_blue Grid_Table" colspan="1">Goal4s</th><th class="Grid_Title_top_black grid_blue_border" colspan="6">Education</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Mem.</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Trn.</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Rn.|Lst.</th></tr><tr><th class="noBorderTopLeft" style="width:auto;"></th><th class="Grid_top_black_max Grid_Table">Base</th><th class="Grid_top_black_max Grid_Table">To Date</th><th class="Grid_top_black_max Grid_Table blue_border_right">Met</th><th class="Grid_top_black" title="Four Level 1 awards">1</th><th class="Grid_top_black" title="Two Level 2 awards">2</th><th class="Grid_top_black" title="Two more Level 2 awards">3</th><th class="Grid_top_black" title="Two Level 3 awards">1</th><th class="Grid_top_black" title="One Level 4, Level 5, or DTM award">5</th><th class="Grid_top_black" title="One more Level 4, Level 5, or DTM award">6</th><th class="Grid_top_black max22" title="4 New members">7</th><th class="Grid_top_black max22" title="4 More new members">9</th><th class="Grid_top_black max22" title="4 Officers trained first training period">9a</th><th class="Grid_top_black max22" title="4 Officers trained second training period">9b</th><th class="Grid_top_black max22" title="1 Dues-renewal on time">10a</th><th class="Grid_top_black max22" title="1 officer list on time">10b</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=01448795'"><td class="Grid_Title_top5 min280 crop" title="Advanced Speakers on the Hill"> <span class="redFont">01448795</span> Advanced 
Speakers on the Hill</td><th class="Grid_Table_yellow"><span>29<span></span></span></th><td class="Grid_Table title_gray"><span>30<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">2</span></td><th class="Grid_Title_goal" title="3 Level 1s needed">1</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goal" title="3 New Members needed">1</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">7</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=02194262'"><td class="Grid_Title_top5 min280 crop" title="Inclusive Toastmasters"> <span class="redFont">02194262</span> Inclusivey Toastmasters</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>21<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">7</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">1</th><th class="Grid_Title_goal" title="1 Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="4 New Members 
needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">5</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=02785335'"><td class="Grid_Title_top5 min280 crop" title="Club Toastmasters FrancoFun"> <span class="redFont">02785335</span> Club Toastmsasters FrancoFun</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>21<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">1</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">6</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=04437661'"><td class="Grid_Title_top5 min280 crop" title="Feel Good Toastmasters"> <span class="redFont">04437661</span> Feel Good Toastmasters</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td 
class="Grid_Table title_gray"><span>22<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">0</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goal" title="1 Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="3 New Members needed">1</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">6</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr></table></th></tbody>
and
<tbody class="Expanded4" id="divisionG_area24_clubs"><!--<tr><th class='noBorderLeftRight'></th>--><th class="noBorderLeftRight" colspan="6"></th><th colspan="6"><table style="margin-bottom:auto;" width="100%"><tr><th class="noBorderTopLeft" style="width:auto;"></th><th class="Grid_top_blue Grid_Table" colspan="2">Membership</th><th class="Grid_top_blue Grid_Table" colspan="1">Goals</th><th class="Grid_Title_top_black grid_blue_border" colspan="6">Education</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Mem.</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Trn.</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Rn.|Lst.</th></tr><tr><th class="noBorderTopLeft" style="width:auto;"></th><th class="Grid_top_black_max Grid_Table">Base</th><th class="Grid_top_black_max Grid_Table">To Date</th><th class="Grid_top_black_max Grid_Table blue_border_right">Met</th><th class="Grid_top_black" title="Four Level 1 awards">1</th><th class="Grid_top_black" title="Two Level 2 awards">2</th><th class="Grid_top_black" title="Two more Level 2 awards">3</th><th class="Grid_top_black" title="Two Level 3 awards">4</th><th class="Grid_top_black" title="One Level 4, Level 5, or DTM award">5</th><th class="Grid_top_black" title="One more Level 4, Level 5, or DTM award">6</th><th class="Grid_top_black max22" title="4 New members">7</th><th class="Grid_top_black max22" title="4 More new members">8</th><th class="Grid_top_black max22" title="4 Officers trained first training period">9a</th><th class="Grid_top_black max22" title="4 Officers trained second training period">9b</th><th class="Grid_top_black max22" title="1 Dues-renewal on time">10a</th><th class="Grid_top_black max22" title="1 officer list on time">10b</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=01448795'"><td class="Grid_Title_top5 min280 crop" title="Advanced Speakers on the Hill"> <span class="redFont">01448795</span> Advanced 
Speakers on the Hill</td><th class="Grid_Table_yellow"><span>29<span></span></span></th><td class="Grid_Table title_gray"><span>30<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">2</span></td><th class="Grid_Title_goal" title="3 Level 1s needed">1</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goal" title="3 New Members needed">1</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">7</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=02194262'"><td class="Grid_Title_top5 min280 crop" title="Inclusive Toastmasters"> <span class="redFont">02194262</span> Inclusive Toastmasters</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>21<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">0</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goal" title="1 Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="4 New Members 
needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">5</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=02785335'"><td class="Grid_Title_top5 min280 crop" title="Club Toastmasters FrancoFun"> <span class="redFont">02785335</span> Club Toastmasters FrancoFun</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>21<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">1</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">6</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=04437661'"><td class="Grid_Title_top5 min280 crop" title="Feel Good Toastmasters"> <span class="redFont">04437661</span> Feel Good Toastmasters</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td 
class="Grid_Table title_gray"><span>22<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">0</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goal" title="1 Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="3 New Members needed">1</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">6</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr></table></th></tbody>
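I'm trying to find the differences between the two strings. I need to return the second string with all the differences highlighted using <mark> tags.
This is a little hard to explain, so here are some examples:
If one string contains the text <span>This is a string</span> and the second string contains <span>Thiss is a string</span>, I want to return <span><mark>Thiss is a string</mark></span>.
If one string contains the text <p>36</p> and the second string contains <p>3</p>, I want to return <p><mark>3</mark></p>.
Note that the <mark> tag is inserted right after the nearest > to the left of the difference, and the </mark> tag is inserted right before the nearest < to the right of the difference.
I'm sure this is possible, but I can't seem to find a way to do it. Here is what I have so far: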
    import difflib

    highlight_beginning = '<mark>'
    highlight_end = '</mark>'
    skew = 0      # how far earlier insertions have shifted the output string
    prev_i = []   # positions already covered by the previous highlight
    highlighted_area_info = my_second_html_string
    diff = difflib.ndiff(my_first_html_string, my_second_html_string)
    for i, s in enumerate(diff, start=0):
        if s[0] == ' ':
            continue
        else:
            if i in prev_i:
                continue
            # distance to the next '<' on the right
            count_right = my_second_html_string[i].find('<')
            # distance to the nearest '>' on the left
            count_left = 0
            for a, b in reversed(list(enumerate(my_second_html_string))):
                if a < i:
                    if b == ">":
                        break
                    else:
                        count_left += 1
            highlighted_area_info2 = highlighted_area_info[:i-count_left+skew]
            highlighted_area_info2 += highlight_beginning
            highlighted_area_info2 += highlighted_area_info[i-count_left+skew:i+count_right+skew]
            highlighted_area_info2 += highlight_end
            highlighted_area_info2 += highlighted_area_info[i+count_right+skew:]
            skew += len(highlight_beginning)+len(highlight_end)
            highlighted_area_info = highlighted_area_info2
            prev_i = list(range(i-count_left+skew, i+count_right+skew))
    print(highlighted_area_info)
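Unfortunately, the <mark> and </mark> tags are inserted at the wrong positions, which produces output like this:
<td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder"><mark>0</</ma<mark>rk>s</mark>pan></td>
instead of what I expect:
<td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder"><mark>0</mark></span></td>
I've spent several days on this, but I still can't tell what I'm doing wrong, although something clearly is. My code may also not be the most efficient way to achieve my goal.
I need working code within a few days, so any help is greatly appreciated.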
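Answer 0 (score: 2)
I used print() to check the values of the variables in your code, and I found that you call ndiff(string1, string2), but it expects ndiff(list_of_lines1, list_of_lines2). As a result, it treats each string as a list of characters and compares them character by character. That way it puts a <mark> around every changed character instead of a single <mark> around the whole word.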
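To see the character-by-character behaviour in isolation, here is a small sketch using the <p>36</p> example from the question. The variable names are only for illustration. It also shows that the diff has more entries than the second string has characters, which is probably one reason why using the index from enumerate(diff) as a position in the second string drifts.

```python
import difflib

first = "<p>36</p>"
second = "<p>3</p>"

# Plain strings are iterated one character at a time,
# so every diff entry describes a single character.
char_diff = list(difflib.ndiff(first, second))
changed_chars = [entry for entry in char_diff if not entry.startswith("  ")]
print(changed_chars)

# The deleted character still occupies a slot in the diff, so the diff
# is longer than the second string and the indices no longer line up.
print(len(char_diff), len(second))

# Lists of lines are compared one whole line at a time.
line_diff = list(difflib.ndiff([first], [second]))
print(line_diff[0])
print(line_diff[-1])
```

With whole lines, ndiff reports the old line and the new line as units, which is closer to "the whole text changed" but still gives no positions inside the HTML.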
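I tried to change this by passing single-line lists, ndiff([string1], [string2]), along with other changes, but in the end I gave up because it didn't make sense. You would be better off using lxml or Beautifulsoup to parse the HTML, work with the tags as a tree of nodes, and then compare the text in the nodes.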
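As a rough illustration of comparing text per node rather than per character, here is a minimal sketch. It uses the standard library's html.parser instead of lxml or Beautifulsoup, and it walks the text nodes in document order rather than building a real tree. The names TextCollector, MarkChanged and highlight_changes are made up for this example. It assumes both strings have the same tag structure, so the n-th text node of one string corresponds to the n-th text node of the other.

```python
from html import escape
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text nodes of a fragment in document order."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        self.texts.append(data)


class MarkChanged(HTMLParser):
    """Re-emit the markup, wrapping text nodes that differ from old_texts."""

    def __init__(self, old_texts):
        super().__init__()
        self.old_texts = old_texts
        self.position = 0  # index of the next text node
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Reuse the tag exactly as written, attributes included.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_data(self, data):
        has_old = self.position < len(self.old_texts)
        old = self.old_texts[self.position] if has_old else None
        self.position += 1
        text = escape(data, quote=False)  # data arrives unescaped
        self.parts.append(f"<mark>{text}</mark>" if data != old else text)


def highlight_changes(old_html, new_html):
    """Return new_html with every changed text node wrapped in <mark>."""
    collector = TextCollector()
    collector.feed(old_html)
    collector.close()
    marker = MarkChanged(collector.texts)
    marker.feed(new_html)
    marker.close()
    return "".join(marker.parts)


print(highlight_changes("<p>36</p>", "<p>3</p>"))
```

If a text node is added or removed in only one of the strings, the positions stop lining up, so this is only suitable for inputs like the ones in the question, where just the text differs.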
I found the module xmldiff, which uses lxml and generates a list of changes between two XML or HTML documents.
    import xmldiff.main

    all_changes = xmldiff.main.diff_texts(my_first_html_string, my_second_html_string)
Each change provides an xpath, so I use lxml to find the node and replace its text with <mark>text</mark>.
It can report different kinds of changes, but I only need UpdateTextIn (when the text is inside a tag, i.e. <a>new text</a>) and UpdateTextAfter (when the text comes after a tag, i.e. <a>...</a>new text).
    import lxml.etree
    import xmldiff.actions

    highlighted_tree = lxml.etree.fromstring(my_second_html_string)
    for item in all_changes:
        # locate the changed node in the second document
        highlighted_node = highlighted_tree.xpath(item.node)[0]
        if isinstance(item, xmldiff.actions.UpdateTextIn):
            highlighted_node.text = ''  # remove
            highlighted_node.insert(0, lxml.etree.fromstring('<mark>' + item.text + '</mark>'))
        if isinstance(item, xmldiff.actions.UpdateTextAfter):
            highlighted_node.tail = ''  # remove # has to be before addnext
            highlighted_node.addnext(lxml.etree.fromstring('<mark>' + item.text + '</mark>'))
After that, I convert the tree back to HTML:
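    html = lxml.etree.tostring(highlighted_tree)
    print(html.decode())
The only problem is that the old and new text may sometimes be the same except for a different number of spaces, tabs, or newlines, and this is also treated as a change. Such changes should be skipped, but that would need extra code.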
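One possible shape for that extra code is a whitespace-insensitive comparison, sketched below. The helper names normalize_whitespace and is_real_text_change are hypothetical. How you obtain the old text is also an assumption: for example, you could look up the same xpath in a tree parsed from the first string.

```python
def normalize_whitespace(text):
    """Collapse runs of spaces, tabs and newlines; treat None as empty."""
    return " ".join((text or "").split())


def is_real_text_change(old_text, new_text):
    """Return True when the texts still differ after whitespace is normalized."""
    return normalize_whitespace(old_text) != normalize_whitespace(new_text)


# Differs only by a line break: not a real change.
print(is_real_text_change("Advanced \nSpeakers on the Hill", "Advanced Speakers on the Hill"))
# Differs in content: a real change.
print(is_real_text_change("36", "3"))
```

A change would then be highlighted only when is_real_text_change(old_text, item.text) is true, and skipped otherwise.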