The title alone doesn't state the problem clearly, so here is an example.
I have a sample string:
Created and managed websites for clients to communicate securely
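and many "versions" of it. In each version, a word or phrase of the string is wrapped in an HTML div tag, e.g. <div style="font-size: 0.1000000">foo bar</div>. (The tags are arbitrary; the number given to the font-size property is a score that will later drive other CSS properties, which is not relevant here.) Here are 4 versions of the string: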
Created and <div style="font-size: 1">managed</div> websites for clients to communicate securely
Created and <div style="font-size: 2">managed websites</div> for clients to communicate securely
Created and managed websites for clients to <div style="font-size: 3">communicate</div> securely
<div style="font-size: 4">Created</div> and managed websites for clients to communicate securely
I want to merge all of these versions into this:
<div style="font-size: 4">Created</div> and <div style="font-size: 2"><div style="font-size: 1">managed</div> websites</div> for clients to <div style="font-size: 3">communicate</div> securely
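As we can see here, there are overlapping tags (the font-size: 1 tag sits inside the font-size: 2 tag). The number of versions of the string can range from 1 to 50, so multiple overlaps are possible.
Here is the regex-based code I have so far: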
import re

div_str = "<div style=.*</div>"  # the div tags
div_text_str = "(?<=(>)).*(?=(</div>))"  # the content inside the div tags

# compile the regexes
div_regex = re.compile(div_str)
div_text_regex = re.compile(div_text_str)

def merge_strings(str1, str2):
    # grab the div tag off the first version
    div = div_regex.search(str1).group()
    # grab the contents of that div tag
    div_text = div_text_regex.search(div).group()
    # find the div content in the second version, then substitute
    # with the div tag
    return re.sub(div_text, div, str2)
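I run this function in a loop, merging two strings at a time until I get the final output. The problem is that overlapping tags don't work with this function, because the regex pattern doesn't match them. Substituting several div tags at once also fails.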
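A minimal reproduction of both failures, for reference. It reuses the patterns from the code above with the redundant capture groups dropped; the names v1, v2, v3, overlap_result, two_divs and greedy_match are only illustrative.

```python
import re

# Same patterns as above, minus the capture groups inside the lookarounds.
div_regex = re.compile(r"<div style=.*</div>")
div_text_regex = re.compile(r"(?<=>).*(?=</div>)")

def merge_strings(str1, str2):
    div = div_regex.search(str1).group()
    div_text = div_text_regex.search(div).group()
    return re.sub(div_text, div, str2)

v1 = 'Created and <div style="font-size: 1">managed</div> websites for clients to communicate securely'
v2 = 'Created and <div style="font-size: 2">managed websites</div> for clients to communicate securely'
v3 = 'Created and managed websites for clients to <div style="font-size: 3">communicate</div> securely'

# Overlap: in v1 the phrase "managed websites" is interrupted by "</div>",
# so the literal search finds nothing and v1 comes back unchanged.
overlap_result = merge_strings(v2, v1)
print(overlap_result == v1)

# Multiple tags: once a string holds two divs, the greedy ".*" runs from
# the first "<div" to the last "</div>" and swallows everything in between.
two_divs = merge_strings(v3, v1)
greedy_match = div_regex.search(two_divs).group()
print(greedy_match)
```

The second case is why merging in a loop breaks after the first successful merge: the "div" grabbed from the partially merged string is no longer a single tag.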
Any help with this would be greatly appreciated!
Answer 0 (score: 0)
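This is not a complete answer.
I should mention that parsing HTML with regular expressions generally makes life unnecessarily hard. It is better to use a parser-based tool such as BeautifulSoup, lxml, or scrapy.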
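It is easy to recover the text from each of the lines you gave as examples. I assume each one is part of a larger structure, so I have wrapped each of them in a div.
Here I use BeautifulSoup to get the text from each of your lines.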
>>> import bs4
>>> for line in open('temp.htm').readlines():
...     line = line.strip()
...     print(line)
...     soup = bs4.BeautifulSoup(line, 'lxml')
...     soup.find('div').text
...
<div>Created and <div style="font-size: 1">managed</div> websites for clients to communicate securely</div>
'Created and managed websites for clients to communicate securely'
<div>Created and <div style="font-size: 2">managed websites</div> for clients to communicate securely</div>
'Created and managed websites for clients to communicate securely'
<div>Created and managed websites for clients to <div style="font-size: 3">communicate</div> securely</div>
'Created and managed websites for clients to communicate securely'
<div><div style="font-size: 4">Created</div> and managed websites for clients to communicate securely</div>
'Created and managed websites for clients to communicate securely'
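Unfortunately, I don't understand how you map the input lines to the output HTML.
Answer 1 (score: 0)
I figured it out. Replacing the regexes with BeautifulSoup made parsing much easier, and I sorted the versions by the length of the text inside their div tags to avoid problems with substrings (a shorter phrase such as "managed" is contained in a longer one such as "managed websites").
Using the same sample: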
Created and <div style="font-size: 1">managed</div> websites for clients to communicate securely
Created and <div style="font-size: 2">managed websites</div> for clients to communicate securely
Created and managed websites for clients to <div style="font-size: 3">communicate</div> securely
<div style="font-size: 4">Created</div> and managed websites for clients to communicate securely
These lines are held in a list and sorted, using BeautifulSoup, by the length of the text inside each one's div tag. Here is the code:
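from bs4 import BeautifulSoup

def __merge_strings(final_str, version):
    # Substitute as soon as an existing div with different text is found;
    # if every div already holds this text, leave the string unchanged
    soup = BeautifulSoup(final_str, "html.parser")
    for fixed_div in soup.find_all("div"):
        if not fixed_div.text == version.text:
            return final_str.replace(
                version.text, str(version)
            )
    return final_str

# the four sample versions shown above
found_terms = [
    'Created and <div style="font-size: 1">managed</div> websites for clients to communicate securely',
    'Created and <div style="font-size: 2">managed websites</div> for clients to communicate securely',
    'Created and managed websites for clients to <div style="font-size: 3">communicate</div> securely',
    '<div style="font-size: 4">Created</div> and managed websites for clients to communicate securely',
]
found_terms = (
    (i, BeautifulSoup(i, "html.parser").find("div"))
    for i in found_terms
)  # pairs of (version string, its div tag)
found_terms = sorted(
    found_terms, key=lambda x: len(x[-1].text), reverse=True
)  # sort on the length of the div text to avoid issues with substrings
current_div = found_terms[0][0]  # version with the largest div text
for i in range(1, len(found_terms)):
    current_div = __merge_strings(current_div, found_terms[i][-1])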
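One caveat with the code above: str.replace substitutes every occurrence of the phrase, so it can misfire when the same word appears more than once in the sentence. A possible alternative, sketched here with only the standard library, is to record each tag as character offsets into the plain text and then emit all tags in one pass. The names TAG_RE, extract_spans and merge_versions are hypothetical, and the sketch assumes that every tag has exactly the form <div style="font-size: N">, that all versions share the same plain text, and that tagged spans are either nested or disjoint (spans that cross each other would produce mis-nested HTML).

```python
import re

# Matches an opening tag (capturing the score) or a closing tag.
TAG_RE = re.compile(r'<div style="font-size: ([^"]*)">|</div>')

def extract_spans(version):
    """Return (plain_text, [(start, end, size), ...]) for one tagged version."""
    plain, spans, stack = [], [], []
    pos = 0   # current offset in the plain text
    last = 0  # current offset in the tagged string
    for m in TAG_RE.finditer(version):
        chunk = version[last:m.start()]
        plain.append(chunk)
        pos += len(chunk)
        last = m.end()
        if m.group(0) == "</div>":
            start, size = stack.pop()
            spans.append((start, pos, size))
        else:
            stack.append((pos, m.group(1)))
    plain.append(version[last:])
    return "".join(plain), spans

def merge_versions(versions):
    """Merge all versions by re-inserting every span into the plain text."""
    text, opens, closes = None, {}, {}
    for version in versions:
        plain, spans = extract_spans(version)
        if text is None:
            text = plain
        elif plain != text:
            raise ValueError("versions differ in their plain text")
        for start, end, size in spans:
            opens.setdefault(start, []).append((end, size))
            closes[end] = closes.get(end, 0) + 1
    out = []
    for i in range(len(text) + 1):
        out.append("</div>" * closes.get(i, 0))  # close before opening
        # Open the longest span first so shorter ones nest inside it.
        for end, size in sorted(opens.get(i, []), key=lambda t: -t[0]):
            out.append('<div style="font-size: %s">' % size)
        if i < len(text):
            out.append(text[i])
    return "".join(out)

versions = [
    'Created and <div style="font-size: 1">managed</div> websites for clients to communicate securely',
    'Created and <div style="font-size: 2">managed websites</div> for clients to communicate securely',
    'Created and managed websites for clients to <div style="font-size: 3">communicate</div> securely',
    '<div style="font-size: 4">Created</div> and managed websites for clients to communicate securely',
]
merged = merge_versions(versions)
print(merged)
```

For the four sample versions this yields the merged string asked for in the question. Because the merge works on offsets rather than on text search, no sorting by phrase length is needed.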