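I am scraping an ePUB Bible (each chapter has its own HTML page), and I want to preserve the order of certain tags that are spread across each page. All the chapters are similar and are written like this, or as a variation of it: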
<div>
<p class="p">
<a class="v">1</a> This is a verse.
<a class="v">2</a> This is a verse.
<a class="v">3</a> This is a verse.
</p>
<h3 class="s1">This is a pericope</h3>
<h4 class="r">This is a reference for this pericope.</h4>
<p class="p">
<a class="v">4</a> This is a verse.
<a class="v">5</a> This is a verse.
</p>
<p class="p">
<a class="v">6</a> This is a verse with a quote:
</p>
<p class="q">"This is the content</p>
<p class="q">of a quote;</p>
<p class="q">or a spoken word."</p>
<h3 class="s1">This is another pericope</h3>
...
</div>
In essence:
<class="v"> is an "a" element with the number of the verse;
<class="p"> is a "p" element with a verse or a collection of verses;
<class="q"> is a block quote;
<class="s1"> is a pericope;
<class="r"> is a reference for a pericope;
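Getting all the p elements gives me good results (this is related to another question on SO), but is it possible to scrape the page while preserving the order in which the elements appear?
For the example above, I would like to fetch the elements in order, so that I can rewrite the chapter's content as text: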
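1 This is a verse. 2 This is a verse. 3 This is a verse. This is a pericope.
This is a reference for this pericope.
4 This is a verse. 5 This is a verse. 6 This is a verse with a quote:
"This is the content
of a quote;
or a spoken word."
This is another pericope...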
I can't seem to find a way to do this with Selenium. If there is one, what would it be?
Answer 0 (score: 0)
Try this code block and let me know the result:
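using System;
using System.Collections.Generic;
using OpenQA.Selenium;
// "driver" is assumed to be an initialized IWebDriver with the chapter page already loaded.
IList<IWebElement> elements = driver.FindElements(By.XPath("//*[self::p or self::h3 or self::h4]"));
foreach (IWebElement element in elements)
{
    // innerHTML keeps inline markup such as <a class="v">; element.Text gives the rendered text only.
    string my_text = element.GetAttribute("innerHTML");
    Console.WriteLine(my_text);
}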
The output on my console is as follows:
1 This is a verse.
2 This is a verse.
3 This is a verse.
This is a pericope
This is a reference for this pericope.
4 This is a verse.
5 This is a verse.
6 This is a verse with a quote:
"This is the content
of a quote;
or a spoken word."
This is another pericope
As you can see, the XPath is correct and returns the expected result. Here is the snapshot:
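If the chapter pages are well-formed XHTML (which ePUB content normally is), the same "walk p, h3 and h4 in document order" idea can also be sketched without a browser, using LINQ to XML. This is only an illustration under stated assumptions: the class name ChapterText and the method Extract are hypothetical, the input is assumed to parse as XML (no undefined entities such as &nbsp;), and each matched element is flattened to a single line of text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper, not part of Selenium or of the original answer.
public static class ChapterText
{
    // Element names to keep. LocalName is compared so that pages declaring
    // the XHTML namespace are handled the same way as pages without it.
    private static readonly HashSet<string> Wanted =
        new HashSet<string> { "p", "h3", "h4" };

    public static List<string> Extract(string xhtml)
    {
        XElement root = XElement.Parse(xhtml);

        // Descendants() enumerates elements in document order,
        // so the paragraphs and headings stay interleaved as in the page.
        return root.Descendants()
            .Where(e => Wanted.Contains(e.Name.LocalName))
            // Value concatenates all nested text (verse numbers included);
            // runs of whitespace are collapsed to single spaces.
            .Select(e => Regex.Replace(e.Value, @"\s+", " ").Trim())
            .ToList();
    }

    public static void Main()
    {
        // Shortened version of the sample markup from the question.
        string sample = @"<div>
<p class=""p"">
<a class=""v"">1</a> This is a verse.
<a class=""v"">2</a> This is a verse.
</p>
<h3 class=""s1"">This is a pericope</h3>
<h4 class=""r"">This is a reference for this pericope.</h4>
<p class=""q"">""This is the content</p>
</div>";

        foreach (string line in Extract(sample))
            Console.WriteLine(line);
    }
}
```

Each p element comes back as one line here, so consecutive p class="p" blocks would still need to be joined (for example by checking the class attribute) to get verses 4 to 6 on a single line as in the question.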