Question

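I am scraping a table that will eventually be exported as CSV. There are several cases I may need to handle later, such as nested tables and cells that span multiple rows or columns, but for now I am ignoring them and assuming a very simple table. By "simple" I mean there are only rows and cells; rows may have different numbers of cells, but the structure is otherwise basic.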

<table>
  <tr>
    <td>text </td>
    <td>text </td>
  </tr>
  <tr>
    <td>text </td>
  </tr>
</table>

My approach is simply to iterate over the rows and columns:

String[] rowTxt;
WebElement table = driver.findElement(By.xpath(someLocator));
for (WebElement rowElmt : table.findElements(By.tagName("tr")))
{
    List<WebElement> cols = rowElmt.findElements(By.tagName("td"));
    rowTxt = new String[cols.size()];
    for (int i = 0; i < rowTxt.length; i++)
    {
        rowTxt[i] = cols.get(i).getText();
    }
}

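However, this is slow. For a CSV file with 218 lines (that is, my table has 218 rows), each with no more than 5 columns, scraping the table takes 45 seconds.

I tried to avoid iterating over each cell by calling getText on the row element, hoping the output would be delimited by something, but it was not.

Is there a better way to scrape a table?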

Answer 1

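I use Jsoup rather than Selenium to parse the HTML. Selenium can traverse a table, but Jsoup is much more efficient. I decided to use Selenium only for browser automation and to delegate all parsing to Jsoup.

My approach is as follows:

1. Get the HTML source of the required element.
2. Pass it to Jsoup as a string for parsing.

The code I ended up with is very similar to the Selenium version: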

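// Wrap the fragment in a matching <table>...</table> pair so Jsoup parses the rows as a table
String source = "<table>" + driver.findElement(By.xpath(locator)).getAttribute("innerHTML") + "</table>";
// Jsoup.parse(String, String) takes a base URI as its second argument, not a charset
Document doc = Jsoup.parse(source);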
for (Element rowElmt : doc.getElementsByTag("tr"))
{
    Elements cols = rowElmt.getElementsByTag("th");
    if (cols.size() == 0 )
        cols = rowElmt.getElementsByTag("td");

    rowTxt = new String[cols.size()];
    for (int i = 0; i < rowTxt.length; i++)
    {
        rowTxt[i] = cols.get(i).text();
    }
    csv.add(rowTxt);
}

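The Selenium parser took 5 minutes to read a 1000-row table, while the Jsoup parser took less than 10 seconds. I did not spend much time on benchmarking, but I am very happy with the result.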
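
The loop above collects each row into csv, but the answer does not show how that collection is written out. Here is a minimal sketch of that last step, assuming csv is a List<String[]> and that rows with fewer cells should be padded with empty fields. The names CsvRows, escape, toLine and toLines are hypothetical and not part of the answer; quoting follows the usual RFC 4180 convention.

```java
import java.util.ArrayList;
import java.util.List;

public class CsvRows {
    // Quote a field only when it contains a comma, a double quote, or a line break.
    static String escape(String field) {
        if (field == null) return "";
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        String escaped = field.replace("\"", "\"\""); // double any embedded quotes
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Join one row into a CSV line; rows shorter than `width` get empty trailing fields.
    static String toLine(String[] row, int width) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < width; i++) {
            if (i > 0) sb.append(',');
            if (i < row.length) sb.append(escape(row[i]));
        }
        return sb.toString();
    }

    // Convert all rows, using the widest row to decide the column count.
    static List<String> toLines(List<String[]> rows) {
        int width = 0;
        for (String[] row : rows) width = Math.max(width, row.length);
        List<String> lines = new ArrayList<>();
        for (String[] row : rows) lines.add(toLine(row, width));
        return lines;
    }

    public static void main(String[] args) {
        // Assumption: this stands in for the `csv` list filled by the scraping loop.
        List<String[]> csv = new ArrayList<>();
        csv.add(new String[] {"text", "a, b"});
        csv.add(new String[] {"text"});
        for (String line : toLines(csv)) System.out.println(line);
    }
}
```

Padding short rows is a design choice; if the consumer of the file tolerates ragged rows, toLine can be called with row.length instead.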

Answer 2

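Whether you locate elements by XPath, id, or CSS makes little difference to how slow this is. That said, if you use the page object pattern, you can use the @CacheLookup annotation. From the source:

By default, the element or the list is looked up each and every time a method is called upon it.
To change this behaviour, simply annotate the field with the {@link CacheLookup}

I tested with a table of 100 rows and 6 columns; the test read the text of every td element. Without @CacheLookup (elements located by XPath) it took about 40 seconds. With cached lookup it dropped to about 20 seconds, which is still too long.

In any case, if you drop the Firefox driver and run the test headless (using HtmlUnit), the speed increases dramatically. Running the same test headless took between 100 and 200 ms, so it may even be faster than Jsoup.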
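
To put these timings on a common scale, the totals can be divided by the number of cells read. This is only a rough sketch: it assumes the time is spread evenly over the 100 x 6 = 600 cell reads and ignores fixed costs such as browser startup. The class and method names are hypothetical.

```java
public class PerCellCost {
    // Average milliseconds per cell read, assuming the total is spread evenly.
    static double msPerCell(double totalMs, int rows, int cols) {
        return totalMs / (rows * cols);
    }

    public static void main(String[] args) {
        int rows = 100, cols = 6; // table size used in the test described above

        System.out.println("no cache (ms/cell):      " + msPerCell(40_000, rows, cols));
        System.out.println("@CacheLookup (ms/cell):  " + msPerCell(20_000, rows, cols));
        System.out.println("headless, low (ms/cell): " + msPerCell(100, rows, cols));
        System.out.println("headless, high (ms/cell):" + msPerCell(200, rows, cols));
    }
}
```

Under that assumption each cell read costs roughly 67 ms without caching, about 33 ms with cached lookup, and well under 1 ms when running headless.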

You can view and try my test code here.

Answer 3

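I use HtmlAgilityPack, installed as a NuGet package, to parse dynamic HTML tables. It is very fast, and as per this answer you can query the results with LINQ. I have used it to store the results as a DataTable. Here is the public extension method class: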

public static class HtmlTableExtensions
{
    private static readonly ILog Log = LogManager.GetLogger(typeof(HtmlTableExtensions));

    /// <summary>
    ///     based on an idea from https://stackoverflow.com/questions/655603/html-agility-pack-parsing-tables
    /// </summary>
    /// <param name="tableBy"></param>
    /// <param name="driver"></param>
    /// <returns></returns>
    public static HtmlTableData GetTableData(this By tableBy, IWebdriverCore driver)
    {
        try
        {
            var doc = tableBy.GetTableHtmlAsDoc(driver);
            var columns = doc.GetHtmlColumnNames();
            return doc.GetHtmlTableCellData(columns);
        }
        catch (Exception e)
        {
            Log.Warn(String.Format("unable to get table data from {0} using driver {1} ",tableBy ,driver),e);
            return null;
        }
    }

    /// <summary>
    ///     Take an HtmlTableData object and convert it into an untyped data table,
    ///     assume that the row key is the sole primary key for the table,
    ///     and the key in each of the rows is the column header
    ///     Hopefully this will make more sense when it's written!
    ///     Expecting overloads for switching column and headers,
    ///     multiple primary keys, non standard format html tables etc
    /// </summary>
    /// <param name="htmlTableData"></param>
    /// <param name="primaryKey"></param>
    /// <param name="tableName"></param>
    /// <returns></returns>
    public static DataTable ConvertHtmlTableDataToDataTable(this HtmlTableData htmlTableData,
        string primaryKey = null, string tableName = null)
    {
        if (htmlTableData == null) return null;
        var table = new DataTable(tableName);

        foreach (var colName in htmlTableData.Values.First().Keys)
        {
            table.Columns.Add(new DataColumn(colName, typeof (string)));
        }
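        // Only set a primary key when one was supplied; the parameter defaults to null
        if (primaryKey != null) table.SetPrimaryKey(new[] { primaryKey });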
        foreach (var values in htmlTableData
            .Select(row => row.Value.Values.ToArray<object>()))
        {
            table.Rows.Add(values);
        }

        return table;
    }


    private static HtmlTableData GetHtmlTableCellData(this HtmlDocument doc, IReadOnlyList<string> columns)
    {
        var data = new HtmlTableData();
        foreach (
            var rowData in doc.DocumentNode.SelectNodes(XmlExpressions.AllDescendants + HtmlAttributes.TableRow)
                .Skip(1)
                .Select(row => row.SelectNodes(HtmlAttributes.TableCell)
                    .Select(n => WebUtility.HtmlDecode(n.InnerText)).ToList()))
        {
            data[rowData.First()] = new Dictionary<string, string>();
            for (var i = 0; i < columns.Count; i++)
            {
                data[rowData.First()].Add(columns[i], rowData[i]);
            }
        }
        return data;
    }

    private static List<string> GetHtmlColumnNames(this HtmlDocument doc)
    {
        var columns =
            doc.DocumentNode.SelectNodes(XmlExpressions.AllDescendants + HtmlAttributes.TableRow)
                .First()
                .SelectNodes(XmlExpressions.AllDescendants + HtmlAttributes.TableHeader)
                .Select(n => WebUtility.HtmlDecode(n.InnerText).Trim())
                .ToList();
        return columns;
    }

    private static HtmlDocument GetTableHtmlAsDoc(this By tableBy, IWebdriverCore driver)
    {
        var webTable = driver.FindElement(tableBy);
        var doc = new HtmlDocument();
        doc.LoadHtml(webTable.GetAttribute(HtmlAttributes.InnerHtml));
        return doc;
    }
}

The HTML data object is simply a subclass of Dictionary:

public class HtmlTableData : Dictionary<string,Dictionary<string,string>>
{

}
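
For readers following along in Java, the language of the question, the same shape can be sketched with nested maps: each row is keyed by its first cell and maps column names to cell text. This is an illustrative analogue under assumptions, not a port of the C# class. TableData and build are hypothetical names, and filling missing trailing cells with an empty string is a choice made here for rows with fewer cells than columns.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TableData {
    // Key each row by its first cell, then map column name -> cell text.
    static Map<String, Map<String, String>> build(List<String> columns,
                                                  List<List<String>> rows) {
        Map<String, Map<String, String>> data = new LinkedHashMap<>();
        for (List<String> row : rows) {
            if (row.isEmpty()) continue; // no first cell to use as the row key
            Map<String, String> cells = new LinkedHashMap<>();
            for (int i = 0; i < columns.size(); i++) {
                // Assumption: a row shorter than the header gets empty strings.
                cells.put(columns.get(i), i < row.size() ? row.get(i) : "");
            }
            data.put(row.get(0), cells); // a repeated key replaces the earlier row
        }
        return data;
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("id", "name");
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("1", "alpha"),
                Arrays.asList("2")); // shorter row, as allowed in the question
        System.out.println(build(columns, rows));
    }
}
```

LinkedHashMap keeps rows and columns in document order, which matters if the result is later written out as CSV.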

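The IWebdriverCore driver is a wrapper around IWebDriver or IRemoteWebdriver that exposes either of these interfaces as a read-only property, but you can simply replace it with IWebDriver.

HtmlAttributes is a static class that holds const values for common HTML attributes, to avoid typos when referring to HTML elements, attributes, tags and so on in C# code: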

/// <summary>
/// config class holding common Html Attributes and tag names etc
/// </summary>
public static class HtmlAttributes
{
    public const string InnerHtml = "innerHTML";
    public const string TableRow = "tr";
    public const string TableHeader = "th";
    public const string TableCell = "th|td";
    public const string Class = "class";

... }

And SetPrimaryKey is an extension method for DataTable that makes it easy to set the primary key of a data table:

    public static void SetPrimaryKey(this DataTable table,string[] primaryKeyColumns)
    {
        int size = primaryKeyColumns.Length;
        var keyColumns = new DataColumn[size];
        for (int i = 0; i < size; i++)
        {
            keyColumns[i] = table.Columns[primaryKeyColumns[i]];
        }
        table.PrimaryKey = keyColumns;
    }

I have found this to be quite performant: < 2 ms to parse a 30 * 80 table, and it is easy to use.

Extracting all rows and columns from a given table element using Selenium WebDriver
