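Is there a way to use AngleSharp to count all the tags in the body and then assign each one a unique ID attribute, such as data-id="1", data-id="2", and so on?
I want to test this for translating a static website. I would then use the query selector [data-id], get the TextContent of each tag, translate the text with a translation API, and set the translated text back on the tag with the ID it was originally extracted from.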
C#
var html = File.ReadAllText(@"C:\example.html");
var parser = new HtmlParser();
var document = parser.Parse(html);
var elements = document.All.Where(o => o.NodeType == AngleSharp.Dom.NodeType.Text && o.TextContent.Trim() != ""); // If text, assign id.
if (elements != null)
{
    int number = 0;
    foreach (var element in elements)
    {
        element.SetAttribute("data-id", number.ToString());
        number++;
        element.OuterHtml.Dump();
    }
}
HTML
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<h1>My page heading</h1>
<h2>This is an example static page to get all the HTML tags and their <strong>children's content </strong> and then <span>translate</span>
that into <br> another language.
</h2>
<p>Something in footer</p>
</body>
</html>
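Answer 0 (score: 0)
Something like the following seems to work. Basically, it collects all descendant elements starting from the Body element, loops over them, and adds the attribute. Perhaps you were not traversing all the descendants.
I write the updated inner HTML of the body to a text file so that you can see the 'data-id' attributes in it.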
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using AngleSharp.Dom;
using AngleSharp.Parser.Html; // AngleSharp 0.9.x; in 0.10+ use AngleSharp.Html.Parser and ParseDocument

class Program
{
    static void Main(string[] args)
    {
        // Block on the async download; acceptable in a small console program
        var html = getHtml("http://www.bbc.com").Result;
        var htmlParser = new HtmlParser();
        var parsedDoc = htmlParser.Parse(html);
        var body = parsedDoc.Body;
        // Body itself plus every descendant element, in document order
        var elements = getAllElements(body);
        for (var i = 0; i < elements.Count; i++)
        {
            var child = elements[i];
            // Writes data-id="data-id1", data-id="data-id2", ...;
            // use (i + 1).ToString() instead for plain numbers
            child.SetAttribute("data-id", $"data-id{i + 1}");
        }
        File.WriteAllText("E:/soQuestion.txt", body.InnerHtml);
    }

    static async Task<string> getHtml(string url)
    {
        using (var httpClient = new HttpClient())
        {
            var response = await httpClient.GetAsync(url);
            // If the HTTP request did not succeed, return empty HTML
            if (!response.IsSuccessStatusCode) return string.Empty;
            return await response.Content.ReadAsStringAsync();
        }
    }

    // Depth-first walk: returns the element followed by all of its descendants
    static List<IElement> getAllElements(IElement element)
    {
        var elements = new List<IElement>();
        // Add the element itself
        elements.Add(element);
        foreach (var child in element.Children)
        {
            // Add each child element and its own descendants
            elements.AddRange(getAllElements(child));
        }
        return elements;
    }
}
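Building on the answer above, here is a rough sketch of the full round trip described in the question: tag, read the text, translate, write it back. It rests on a few assumptions that are not part of the original posts. It targets AngleSharp 0.10 or later, where the parser lives in AngleSharp.Html.Parser and the method is ParseDocument (on older versions, Parse as used above). The names TranslationSketch, TagAndTranslate and IsRealText are made up for illustration, and the translate delegate is a hypothetical stand-in for whatever translation API you call. Note that document.All and Children yield elements only, so text has to be reached through ChildNodes. The sketch tags only elements that directly contain non-whitespace text, and it translates each text node separately so that inline children such as <strong> and <span> survive.

```csharp
using System;
using System.Linq;
using AngleSharp.Dom;
using AngleSharp.Html.Parser; // assumes AngleSharp 0.10+

static class TranslationSketch
{
    // True if the node is a text node with non-whitespace content.
    static bool IsRealText(INode node) =>
        node.NodeType == NodeType.Text && !string.IsNullOrWhiteSpace(node.TextContent);

    // Tags every element under <body> that directly contains text, then
    // replaces each of those text nodes with translate(text).
    // `translate` is a hypothetical stand-in for a real translation API call.
    public static string TagAndTranslate(string html, Func<string, string> translate)
    {
        var document = new HtmlParser().ParseDocument(html);

        // Step 1: number the text-bearing elements 1, 2, 3, ... in document order.
        // QuerySelectorAll("*") on Body returns its descendants, not Body itself.
        var id = 0;
        foreach (var element in document.Body.QuerySelectorAll("*"))
        {
            if (element.ChildNodes.Any(IsRealText))
                element.SetAttribute("data-id", (++id).ToString());
        }

        // Step 2: find the tagged elements again and translate their own text nodes.
        foreach (var element in document.QuerySelectorAll("[data-id]"))
        {
            // Snapshot the matching nodes before changing their content.
            foreach (var textNode in element.ChildNodes.Where(IsRealText).ToList())
                textNode.TextContent = translate(textNode.TextContent);
        }

        return document.DocumentElement.OuterHtml;
    }
}
```

For a quick check without a real API, something like TranslationSketch.TagAndTranslate(html, s => s.ToUpperInvariant()) makes the replaced text easy to spot. Translating text nodes one by one keeps the markup intact but splits sentences at inline tags, which may hurt translation quality; setting TextContent on the whole element would avoid that at the cost of dropping the child elements.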