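I need to insert CSS ids into an HTML document to mark its paragraphs and sentences.
There are many different ways to format HTML, and it is hard to find one consistent way to parse them all. For example, some poorly written HTML uses <table>, some uses <P>, some uses <div>, and so on. Some pages use a combination.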
INPUT:
<p> This is a sentence, with stuff. Mr. John doe was walking down the street. Mrs. Daisy knows how to drive but does not drive. The car is fast, but is an ugly color. This is an example of a paragraph. </P>
<br>
<div> However, sometimes, paragraphs on HTML pages are not tagged as with a consistent format. This makes it hard to identify paragraphs and sentences. I need a solution to tag them with CSS id's</div>
OUTPUT:
<p><span id="paragraph1"> <span id="sentence1">This is a sentence, with stuff.</span><span id="sentence2"> Mr. John doe was walking down the street. </span><span id="sentence3"> Mrs. Daisy knows how to drive but does not drive. </span> <span id="sentence4"> The car is fast, but is an ugly color.</span> <span id="sentence5"> This is an example of a paragraph.</span> </span> </P>
<br>
<div><span id="paragraph2"> <span id="sentence6">However, sometimes, paragraphs on HTML pages are not tagged as with a consistent format.</span><span id="sentence7"> This makes it hard to identify paragraphs and sentences.</span> <span id="sentence8"> I need a solution to tag them with CSS id's</span></span></div>
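1) What solution can I use to identify the paragraphs in HTML and tag them?
2) OpenNLP is good at identifying sentences, but I have not found an HTML stripper for it.
I thought I could use Tika to strip the HTML and feed the plain text into OpenNLP to identify the sentences, but then I lose all the formatting and no longer know where to put the tags back in the original HTML.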
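
To make the goal concrete, here is a minimal Python sketch of one possible direction: instead of stripping the HTML first, walk it with the standard library's `html.parser`, re-emit every tag as it is seen, and wrap each text run in spans in place, so positions are never lost. Everything here is an assumption for illustration: the `BLOCK_TAGS` set, the tiny `ABBREVIATIONS` list, and the regex-based `split_sentences` are stand-ins for a real sentence detector such as OpenNLP, and the names `SentenceTagger` and `tag_html` are hypothetical. The sketch drops comments and doctype declarations, normalizes end-tag case, and does not handle a sentence that spans inline tags such as `<b>`.

```python
import html
import re
from html.parser import HTMLParser

# Assumption: these tags are treated as paragraph containers.
BLOCK_TAGS = {"p", "div", "td"}
# Assumption: a tiny abbreviation list stands in for a real sentence detector.
ABBREVIATIONS = ("Mr.", "Mrs.", "Ms.", "Dr.")


def split_sentences(text):
    """Naive splitter: break after . ! ? plus whitespace, except after abbreviations."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.start() + 1]
        if candidate.endswith(ABBREVIATIONS):
            continue  # "Mr. " is not a sentence boundary
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


class SentenceTagger(HTMLParser):
    """Re-emits the HTML it is fed, wrapping text inside block tags in id'd spans."""

    def __init__(self):
        super().__init__()  # default convert_charrefs=True: text arrives unescaped
        self.out = []
        self.block_depth = 0
        self.paragraph_open = False
        self.paragraphs = 0
        self.sentences = 0

    def close_paragraph(self):
        if self.paragraph_open:
            self.out.append("</span>")
            self.paragraph_open = False

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.close_paragraph()
            self.block_depth += 1
        self.out.append(self.get_starttag_text())  # original tag text, unchanged

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.close_paragraph()
            self.block_depth = max(0, self.block_depth - 1)
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text outside a block, or whitespace only, is passed through untouched.
        if self.block_depth == 0 or not data.strip():
            self.out.append(html.escape(data, quote=False))
            return
        if not self.paragraph_open:
            self.paragraphs += 1
            self.out.append(f'<span id="paragraph{self.paragraphs}">')
            self.paragraph_open = True
        spans = []
        for sentence in split_sentences(data):
            self.sentences += 1
            spans.append(f'<span id="sentence{self.sentences}">'
                         f'{html.escape(sentence, quote=False)}</span>')
        self.out.append(" ".join(spans))


def tag_html(source):
    tagger = SentenceTagger()
    tagger.feed(source)
    tagger.close()            # flush any buffered text
    tagger.close_paragraph()  # in case the input ended inside a block
    return "".join(tagger.out)


if __name__ == "__main__":
    print(tag_html("<p> Mr. Doe walks. Mrs. Daisy drives. </P>\n<br>\n<div> One more</div>"))
```

The paragraph span is opened lazily on the first non-blank text inside a block and closed whenever a block tag starts or ends, which keeps nested containers such as a `<div>` holding several `<p>` elements from being wrapped as one paragraph. Swapping `split_sentences` for a call into a proper sentence detector would be the natural next step.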