Highlight the differences between two HTML strings

Time: 2020-09-07 00:50:47

Tags: python html python-3.x

I have two HTML strings with several subtle differences:

<tbody class="Expanded4" id="divisionG_area24_clubs"><!--<tr><th class='noBorderLeftRight'></th>--><th class="noBorderLeftRight" colspan="6"></th><th colspan="6"><table style="margin-bottom:auto;" width="100%"><tr><th class="noBorderTopLeft" style="width:auto;"></th><th class="Grid_top_blue Grid_Table" colspan="2">Membership</th><th class="Grid_top_blue Grid_Table" colspan="1">Goal4s</th><th class="Grid_Title_top_black grid_blue_border" colspan="6">Education</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Mem.</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Trn.</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Rn.|Lst.</th></tr><tr><th class="noBorderTopLeft" style="width:auto;"></th><th class="Grid_top_black_max Grid_Table">Base</th><th class="Grid_top_black_max Grid_Table">To Date</th><th class="Grid_top_black_max Grid_Table blue_border_right">Met</th><th class="Grid_top_black" title="Four Level 1 awards">1</th><th class="Grid_top_black" title="Two Level 2 awards">2</th><th class="Grid_top_black" title="Two more Level 2 awards">3</th><th class="Grid_top_black" title="Two Level 3 awards">1</th><th class="Grid_top_black" title="One Level 4, Level 5, or DTM award">5</th><th class="Grid_top_black" title="One more Level 4, Level 5, or DTM award">6</th><th class="Grid_top_black max22" title="4 New members">7</th><th class="Grid_top_black max22" title="4 More new members">9</th><th class="Grid_top_black max22" title="4 Officers trained first training period">9a</th><th class="Grid_top_black max22" title="4 Officers trained second training period">9b</th><th class="Grid_top_black max22" title="1 Dues-renewal on time">10a</th><th class="Grid_top_black max22" title="1 officer list on time">10b</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=01448795'"><td class="Grid_Title_top5 min280 crop" title="Advanced Speakers on the Hill"> <span class="redFont">01448795</span> Advanced 
Speakers on the Hill</td><th class="Grid_Table_yellow"><span>29<span></span></span></th><td class="Grid_Table title_gray"><span>30<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">2</span></td><th class="Grid_Title_goal" title="3 Level 1s needed">1</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goal" title="3 New Members needed">1</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">7</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=02194262'"><td class="Grid_Title_top5 min280 crop" title="Inclusive Toastmasters"> <span class="redFont">02194262</span> Inclusivey Toastmasters</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>21<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">7</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">1</th><th class="Grid_Title_goal" title="1 Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="4 New Members 
needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">5</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=02785335'"><td class="Grid_Title_top5 min280 crop" title="Club Toastmasters FrancoFun"> <span class="redFont">02785335</span> Club Toastmsasters FrancoFun</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>21<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">1</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">6</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=04437661'"><td class="Grid_Title_top5 min280 crop" title="Feel Good Toastmasters"> <span class="redFont">04437661</span> Feel Good Toastmasters</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td 
class="Grid_Table title_gray"><span>22<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">0</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goal" title="1 Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="3 New Members needed">1</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">6</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr></table></th></tbody>

<tbody class="Expanded4" id="divisionG_area24_clubs"><!--<tr><th class='noBorderLeftRight'></th>--><th class="noBorderLeftRight" colspan="6"></th><th colspan="6"><table style="margin-bottom:auto;" width="100%"><tr><th class="noBorderTopLeft" style="width:auto;"></th><th class="Grid_top_blue Grid_Table" colspan="2">Membership</th><th class="Grid_top_blue Grid_Table" colspan="1">Goals</th><th class="Grid_Title_top_black grid_blue_border" colspan="6">Education</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Mem.</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Trn.</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Rn.|Lst.</th></tr><tr><th class="noBorderTopLeft" style="width:auto;"></th><th class="Grid_top_black_max Grid_Table">Base</th><th class="Grid_top_black_max Grid_Table">To Date</th><th class="Grid_top_black_max Grid_Table blue_border_right">Met</th><th class="Grid_top_black" title="Four Level 1 awards">1</th><th class="Grid_top_black" title="Two Level 2 awards">2</th><th class="Grid_top_black" title="Two more Level 2 awards">3</th><th class="Grid_top_black" title="Two Level 3 awards">4</th><th class="Grid_top_black" title="One Level 4, Level 5, or DTM award">5</th><th class="Grid_top_black" title="One more Level 4, Level 5, or DTM award">6</th><th class="Grid_top_black max22" title="4 New members">7</th><th class="Grid_top_black max22" title="4 More new members">8</th><th class="Grid_top_black max22" title="4 Officers trained first training period">9a</th><th class="Grid_top_black max22" title="4 Officers trained second training period">9b</th><th class="Grid_top_black max22" title="1 Dues-renewal on time">10a</th><th class="Grid_top_black max22" title="1 officer list on time">10b</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=01448795'"><td class="Grid_Title_top5 min280 crop" title="Advanced Speakers on the Hill"> <span class="redFont">01448795</span> Advanced 
Speakers on the Hill</td><th class="Grid_Table_yellow"><span>29<span></span></span></th><td class="Grid_Table title_gray"><span>30<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">2</span></td><th class="Grid_Title_goal" title="3 Level 1s needed">1</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goal" title="3 New Members needed">1</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">7</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=02194262'"><td class="Grid_Title_top5 min280 crop" title="Inclusive Toastmasters"> <span class="redFont">02194262</span> Inclusive Toastmasters</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>21<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">0</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goal" title="1 Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="4 New Members 
needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">5</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=02785335'"><td class="Grid_Title_top5 min280 crop" title="Club Toastmasters FrancoFun"> <span class="redFont">02785335</span> Club Toastmasters FrancoFun</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>21<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">1</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">6</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=04437661'"><td class="Grid_Title_top5 min280 crop" title="Feel Good Toastmasters"> <span class="redFont">04437661</span> Feel Good Toastmasters</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td 
class="Grid_Table title_gray"><span>22<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">0</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goal" title="1 Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="3 New Members needed">1</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">6</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr></table></th></tbody>

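I'm trying to find the differences between the two strings. I need to return the second string with every difference highlighted using <mark> tags.

This is a bit hard to explain, so here are some examples:

If one string contains <span>This is a string</span> and the second contains <span>Thiss is a string</span>, I want to return <span><mark>Thiss is a string</mark></span>. If one string contains <p>36</p> and the second contains <p>3</p>, I want to return <p><mark>3</mark></p>.

Note that the <mark> tag is inserted right after the nearest > to the left of the difference, and </mark> right before the nearest < to its right.

I'm sure this is possible, but I can't seem to find a way to do it. Here is what I have so far: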

    skew = 0
    prev_i = []
    highlighted_area_info = my_second_html_string
    diff = difflib.ndiff(my_first_html_string, my_second_html_string)
    for i, s in enumerate(diff, start=0):
        if s[0] == ' ':
            continue
        else:
            if i in prev_i:
                continue
            count_right = my_second_html_string[i].find('<')
            count_left = 0
            for a, b in reversed(list(enumerate(my_second_html_string))):
                if a < i:
                    if b == ">":
                        break
                    else:
                        count_left += 1
            highlighted_area_info2 = highlighted_area_info[:i-count_left+skew]
            highlighted_area_info2 += highlight_beginning
            highlighted_area_info2 += highlighted_area_info[i-count_left+skew:i+count_right+skew]
            highlighted_area_info2 += highlight_end
            highlighted_area_info2 += highlighted_area_info[i+count_right+skew:]
            skew += len(highlight_beginning)+len(highlight_end)
            highlighted_area_info = highlighted_area_info2
            prev_i = list(range(i-count_left+skew, i+count_right+skew))
    print(highlighted_area_info)

Unfortunately, the <mark> tags are inserted in the wrong positions, which produces output like this:

    <td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder"><mark>0</</ma<mark>rk>s</mark>pan></td>

instead of what I expect:

    <td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder"><mark>0</mark></span></td>

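I've spent several days on this and still can't tell what I'm doing wrong, although something clearly is. My code may also not be the most efficient way to reach my goal.

I need working code within a few days, so any help is greatly appreciated.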

1 answer:

Answer 0: (score: 2)

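I used print() to check the values of the variables in your code and found that you call ndiff(string1, string2), whereas it expects ndiff(list_of_lines1, list_of_lines2). It therefore treats each string as a list of characters and compares them character by character. As a result, it puts a <mark> around every changed character instead of a single <mark> around the whole word.

I tried to change this by using single-line lists, ndiff([string1], [string2]), along with other changes, but in the end I gave up because it makes no sense. You would be better off using lxml or BeautifulSoup to parse the HTML into a tree with the tags as nodes, and then comparing the text in the nodes.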
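To make the character-by-character behaviour concrete, here is a small sketch that uses only the standard library. The strings `old` and `new` come from the second example in the question; the variable names are purely illustrative.

```python
import difflib

old = "<p>36</p>"
new = "<p>3</p>"

# Two plain strings: ndiff iterates over them, so every character is one "line".
char_diff = [d for d in difflib.ndiff(old, new) if d[0] != " "]

# Two one-element lists: each whole string is compared as a single line.
# Lines starting with "?" are intraline hints and are filtered out here.
line_diff = [d for d in difflib.ndiff([old], [new]) if d[0] in "+-"]

print(char_diff)
print(line_diff)
```

With two strings, the only reported difference is the single character 6. With one-element lists, the whole string is reported as one removed line and one added line. Neither form says which text node the change belongs to.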
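As a rough illustration of comparing text between tags rather than single characters, here is a standard-library sketch. It is not the parser-based approach recommended above: it assumes that splitting on a simple `<...>` regular expression is good enough, which fails for `>` inside attribute values and splits comments awkwardly. The names `TAG_SPLIT` and `highlight_changed_text` are hypothetical.

```python
import difflib
import re

# Assumption: a tag is anything between "<" and the next ">".
TAG_SPLIT = re.compile(r"(<[^>]*>)")


def highlight_changed_text(old_html, new_html):
    """Wrap text runs of new_html that differ from old_html in <mark> tags."""
    # The capturing group keeps the tags as separate tokens; drop empty strings.
    old_tokens = [t for t in TAG_SPLIT.split(old_html) if t]
    new_tokens = [t for t in TAG_SPLIT.split(new_html) if t]

    matcher = difflib.SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)
    out = []
    for op, _i1, _i2, j1, j2 in matcher.get_opcodes():
        changed = op in ("replace", "insert")
        for token in new_tokens[j1:j2]:
            # Only non-blank text tokens are marked; changed tags are left alone.
            is_text = not token.startswith("<") and bool(token.strip())
            out.append(f"<mark>{token}</mark>" if changed and is_text else token)
    return "".join(out)


print(highlight_changed_text("<p>36</p>", "<p>3</p>"))
```

For the two examples in the question this yields `<p><mark>3</mark></p>` and `<span><mark>Thiss is a string</mark></span>`. For real pages, a proper parser is the safer choice.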


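I found the module xmldiff, which uses lxml and generates a list of changes between two XML documents.

    import xmldiff.main

    all_changes = xmldiff.main.diff_texts(my_first_html_string, my_second_html_string)

Each change carries an XPath, so I use lxml to find the node and replace its text with <mark>text</mark>.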

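xmldiff can report several kinds of changes, but I only need UpdateTextIn (when the text is inside a tag, e.g. <a>new text</a>) and UpdateTextAfter (when the text follows a tag, e.g. <a>...</a>new text).

    import lxml.etree
    import xmldiff.actions

    highlighted_tree = lxml.etree.fromstring(my_second_html_string)

    for item in all_changes:

        # the XPath stored in the change locates the affected node
        highlighted_node = highlighted_tree.xpath(item.node)[0]

        if isinstance(item, xmldiff.actions.UpdateTextIn):
            highlighted_node.text = ''  # remove the plain text
            highlighted_node.insert(0, lxml.etree.fromstring('<mark>' + item.text + '</mark>'))

        if isinstance(item, xmldiff.actions.UpdateTextAfter):
            highlighted_node.tail = ''  # remove the tail text; this has to happen before addnext
            highlighted_node.addnext(lxml.etree.fromstring('<mark>' + item.text + '</mark>'))

After that, I convert the tree back to HTML:

    html = lxml.etree.tostring(highlighted_tree)
    print(html.decode())

Minimal working example with data: the snippets above, run on the two HTML strings from the question.

Result: [screenshot]

The only problem is that the old and new text sometimes have the same content but a different number of spaces, tabs, or newlines, and that is also treated as a change. Such cases should be skipped instead, but that would need additional code.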
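One possible shape for that additional code is a small check that collapses whitespace before comparing. This is only a sketch: the helper name `differs_ignoring_whitespace` is hypothetical, and how the old text for a given change is looked up is left open here.

```python
def differs_ignoring_whitespace(old_text, new_text):
    """Return True only if the texts still differ after collapsing whitespace."""
    def normalize(text):
        # str.split() with no arguments splits on any run of spaces, tabs, or
        # newlines; None (an absent .text or .tail) is treated as an empty string.
        return " ".join((text or "").split())

    return normalize(old_text) != normalize(new_text)


# A line break inside the club name is not a real change.
print(differs_ignoring_whitespace("Advanced \nSpeakers on the Hill",
                                  "Advanced Speakers on the Hill"))
# A misspelled name is a real change.
print(differs_ignoring_whitespace("Inclusivey Toastmasters",
                                  "Inclusive Toastmasters"))
```

A change would then be wrapped in <mark> only when this check returns True.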