Question

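How can I convert the contents of an HTML table (<table>) to CSV format? Is there a library or Linux program that does this? This is similar to copying tables in Internet Explorer and pasting them into Excel.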

Answer 1

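This method is not really a library or a program, but for ad hoc conversions you can

put the HTML for the table in a text file called something.xls
open it with a spreadsheet
save it as CSV.

I know this works with Excel, and I believe I've done it with the OpenOffice spreadsheet.

But you would probably prefer a Perl or Ruby script...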

Answer 2

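Sorry to resurrect an ancient thread, but I recently wanted to do this, and I wanted a 100% portable bash script to do it. So here's my solution, using only grep and sed.

The below was mashed together very quickly, so it could be made more elegant, but I'm only just getting started with sed/awk etc...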

curl "http://www.webpagewithtableinit.com/" 2>/dev/null | grep -i -e '</\?TABLE\|</\?TD\|</\?TR\|</\?TH' | sed 's/^[\ \t]*//g' | tr -d '\n\r' | sed 's/<\/TR[^>]*>/\n/Ig'  | sed 's/<\/\?\(TABLE\|TR\)[^>]*>//Ig' | sed 's/^<T[DH][^>]*>\|<\/\?T[DH][^>]*>$//Ig' | sed 's/<\/T[DH][^>]*><T[DH][^>]*>/,/Ig'

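As you can see, I've got the page source using curl, but you could just as easily feed in the table source from elsewhere.

Here's the explanation:

Get the contents of the URL using cURL, dumping stderr to null (no progress meter)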

curl "http://www.webpagewithtableinit.com/" 2>/dev/null

I want only the table elements (return only lines with TABLE, TR, TH, TD tags)

| grep -i -e '</\?TABLE\|</\?TD\|</\?TR\|</\?TH'

Remove any whitespace at the beginning of the line.

| sed 's/^[\ \t]*//g'

Remove newlines

| tr -d '\n\r'

Replace </TR> with a newline

| sed 's/<\/TR[^>]*>/\n/Ig'

Remove TABLE and TR tags

| sed 's/<\/\?\(TABLE\|TR\)[^>]*>//Ig'

Remove ^<TD>, ^<TH>, </TD>$, </TH>$

| sed 's/^<T[DH][^>]*>\|<\/\?T[DH][^>]*>$//Ig'

Replace </TD><TD> with a comma

| sed 's/<\/T[DH][^>]*><T[DH][^>]*>/,/Ig'

Note that if any of the table cells contain commas, you may need to escape them first, or use a different delimiter.

Hope this helps someone!
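
For cells that contain commas or quotes, one option is to let a CSV writer do the escaping instead of joining on a bare comma. The following is a minimal sketch in Python, not part of the original pipeline: it follows the same regex-based idea as the grep/sed steps (so it is not a real HTML parser), and the function name `table_html_to_csv` and the sample table are made up for illustration.

```python
import csv
import io
import re
from html import unescape


def table_html_to_csv(source):
    """Regex-based conversion in the spirit of the grep/sed steps; not a real HTML parser."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # One row per <tr>...</tr>, matched case-insensitively across newlines
    for row in re.findall(r"<tr\b[^>]*>(.*?)</tr\s*>", source, flags=re.I | re.S):
        # \b keeps <thead> from being mistaken for a <th> cell
        cells = re.findall(r"<t[dh]\b[^>]*>(.*?)</t[dh]\s*>", row, flags=re.I | re.S)
        # Drop remaining inline tags, collapse whitespace, decode entities
        cleaned = [
            unescape(re.sub(r"\s+", " ", re.sub(r"<[^>]+>", "", cell))).strip()
            for cell in cells
        ]
        # csv.writer quotes fields containing the delimiter or a double quote
        writer.writerow(cleaned)
    return out.getvalue()


sample = ('<table><tr><th>Name</th><th>Note</th></tr>'
          '<tr><td>Smith, J.</td><td>said "hi"</td></tr></table>')
print(table_html_to_csv(sample), end="")
```

With this approach the comma in "Smith, J." stays inside one quoted field, so no alternative delimiter is needed.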

Answer 3

Here's a Ruby script using nokogiri - http://nokogiri.rubyforge.org/nokogiri/

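require 'nokogiri'

# table_string is assumed to already hold the HTML source
doc = Nokogiri::HTML(table_string)

doc.xpath('//table//tr').each do |row|
  row.xpath('td').each do |cell|
    # Flatten newlines, double embedded quotes (CSV escaping), collapse whitespace runs
    print '"', cell.text.gsub("\n", ' ').gsub('"', '""').gsub(/(\s){2,}/m, '\1'), "\", "
  end
  print "\n"
end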

Worked for my basic test case.

Answer 4

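Here's a short Python program I wrote to do this. It was written in a couple of minutes, so it can probably be improved. I'm not sure how it will handle nested tables (it will probably do bad things) or multiple tables (they will probably just appear one after another). It doesn't handle colspan or rowspan. Enjoy.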

from HTMLParser import HTMLParser
import sys
import re


class HTMLTableParser(HTMLParser):
    def __init__(self, row_delim="\n", cell_delim="\t"):
        HTMLParser.__init__(self)
        self.despace_re = re.compile(r'\s+')
        self.data_interrupt = False
        self.first_row = True
        self.first_cell = True
        self.in_cell = False
        self.row_delim = row_delim
        self.cell_delim = cell_delim

    def handle_starttag(self, tag, attrs):
        self.data_interrupt = True
        if tag == "table":
            self.first_row = True
            self.first_cell = True
        elif tag == "tr":
            if not self.first_row:
                sys.stdout.write(self.row_delim)
            self.first_row = False
            self.first_cell = True
            self.data_interrupt = False
        elif tag == "td" or tag == "th":
            if not self.first_cell:
                sys.stdout.write(self.cell_delim)
            self.first_cell = False
            self.data_interrupt = False
            self.in_cell = True

    def handle_endtag(self, tag):
        self.data_interrupt = True
        if tag == "td" or tag == "th":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            #if self.data_interrupt:
            #   sys.stdout.write(" ")
            sys.stdout.write(self.despace_re.sub(' ', data).strip())
            self.data_interrupt = False


parser = HTMLTableParser() 
parser.feed(sys.stdin.read())
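
For the nested and multiple table cases mentioned above, one possible approach is to track the `<table>` nesting depth and keep a separate list of rows per top-level table. This is only a sketch for Python 3 (`html.parser`), not the program above: the class name `MultiTableParser` is made up, and the choice to skip the content of nested tables is an assumption about what is wanted.

```python
from html.parser import HTMLParser


class MultiTableParser(HTMLParser):
    """Collect rows per top-level table; content of nested tables is skipped."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per top-level table
        self.depth = 0     # current <table> nesting level
        self.cell = None   # text pieces of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            if self.depth == 1:
                self.tables.append([])
        elif self.depth == 1:
            if tag == "tr":
                self.tables[-1].append([])
            elif tag in ("td", "th") and self.tables[-1]:
                self.cell = []

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1
        elif tag in ("td", "th") and self.depth == 1 and self.cell is not None:
            # Join the collected text and collapse whitespace runs
            self.tables[-1][-1].append(" ".join("".join(self.cell).split()))
            self.cell = None

    def handle_data(self, data):
        if self.depth == 1 and self.cell is not None:
            self.cell.append(data)


parser = MultiTableParser()
parser.feed("<table><tr><td>a</td><td>b<table><tr><td>inner</td></tr></table></td></tr></table>"
            "<table><tr><td>c</td></tr></table>")
print(parser.tables)
```

Each entry of `parser.tables` can then be written out separately, for example one CSV file per table.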

Answer 5

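I'm not sure whether there is a pre-made library for this, but if you're willing to get your hands dirty with a little Perl, you could likely do something with Text::CSV and {{3}}.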

Answer 6

Using Perl, you can use the HTML::TableExtract module to extract the data from the table, and then use Text::CSV_XS to create a CSV file or Spreadsheet::WriteExcel to create an Excel file.

Answer 7

Assuming you've designed an HTML page containing a table, I would recommend this solution. It worked like a charm for me.

$(document).ready(function() {
$("#btnExport").click(function(e) {
    //getting values of current time for generating the file name
    var dt = new Date();
    var day = dt.getDate();
    var month = dt.getMonth() + 1;
    var year = dt.getFullYear();
    var hour = dt.getHours();
    var mins = dt.getMinutes();
    var postfix = day + "." + month + "." + year + "_" + hour + "." + mins;
    //creating a temporary HTML link element (they support setting file names)
    var a = document.createElement('a');
    //getting data from our div that contains the HTML table
    var data_type = 'data:application/vnd.ms-excel';
    var table_div = document.getElementById('dvData');
    var table_html = table_div.outerHTML.replace(/ /g, '%20');
    a.href = data_type + ', ' + table_html;
    //setting the file name
    a.download = 'exported_table_' + postfix + '.xls';
    //triggering the function
    a.click();
    //just in case, prevent default behaviour
    e.preventDefault();
});
});

Courtesy: http://www.kubilayerdogan.net/?p=218

You can change the file extension to .csv here: a.download = 'exported_table_' + postfix + '.csv';

Answer 8

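Just to add to these answers (as I've recently been attempting a similar thing): if Google Spreadsheets is your spreadsheet program of choice, simply do these two things.

1. Strip everything out of your HTML file around the table opening/closing tags and resave it as another HTML file.

2. Import that HTML file directly into Google Spreadsheets and you'll have your information beautifully imported (top tip: if you used inline styles in your table, they will be imported as well!)

This saved me loads of time and spared me from figuring out different conversions.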
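
Step 1 can also be done with a few lines of script rather than by hand. The sketch below is in Python and is only an illustration: the function name `extract_tables` and the sample page are invented, and the regex assumes the tables are not nested.

```python
import re


def extract_tables(page_html):
    """Return every <table>...</table> span as a string (nested tables are not handled)."""
    return re.findall(r"<table\b[^>]*>.*?</table\s*>", page_html, flags=re.I | re.S)


page = ("<html><body><h1>Report</h1>"
        "<table id='t'><tr><td>1</td></tr></table>"
        "<p>footer</p></body></html>")
tables = extract_tables(page)

# Wrap the tables in a bare document that can be saved as a new HTML file
standalone = "<html><body>\n" + "\n".join(tables) + "\n</body></html>\n"
print(standalone)
```

Writing `standalone` to a new .html file gives the stripped-down file that step 2 imports.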

Answer 9

Here's a simple solution that needs no external library:

https://www.codexworld.com/export-html-table-data-to-csv-using-javascript/

It works for me without any issues.

Answer 10

Based on audiodude's answer, but simplified by using the built-in CSV library

require 'nokogiri'
require 'csv'

doc = Nokogiri::HTML(table_string)
csv = CSV.open("output.csv", 'w')

doc.xpath('//table//tr').each do |row|
    tarray = [] #temporary array
    row.xpath('td').each do |cell|
        tarray << cell.text #Build array of that row of data.
    end
    csv << tarray #Write that row out to csv file
end

csv.close

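I did wonder whether there was any way to take the Nokogiri NodeSet (row.xpath('td')) and write it out to the CSV file as an array in one step. But I could only figure out how to do it by iterating over each cell and building a temporary array of each cell's content.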

Answer 11

Here are a few options

http://groups.google.com/group/ruby-talk-google/browse_thread/thread/cfae0aa4b14e5560?hl=nn

http://ouseful.wordpress.com/2008/10/14/data-scraping-wikipedia-with-google-spreadsheets/

How can I scrape an HTML table to CSV?

https://addons.mozilla.org/en-US/firefox/addon/1852

Answer 12

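This is a very old thread, but maybe someone like me will bump into it. I've made some additions to audiodude's script so that it reads the HTML from a file instead of having it in the code, plus another parameter that controls whether the header rows are printed.

The script should be run like this: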

ruby <script_name> <file_name> [<print_headers>]

The code is:

require 'nokogiri'

print_header_lines = ARGV[1]

File.open(ARGV[0]) do |f|

  table_string=f
  doc = Nokogiri::HTML(table_string)

  doc.xpath('//table//tr').each do |row|
    if print_header_lines
      row.xpath('th').each do |cell|
        # Double embedded quotes, as CSV escaping requires
        print '"', cell.text.gsub("\n", ' ').gsub('"', '""').gsub(/(\s){2,}/m, '\1'), "\", "
      end
    end
    row.xpath('td').each do |cell|
      print '"', cell.text.gsub("\n", ' ').gsub('"', '""').gsub(/(\s){2,}/m, '\1'), "\", "
    end
    print "\n"
  end
end

Answer 13

Here's an example using pQuery and Spreadsheet::WriteExcel:

use strict;
use warnings;

use Spreadsheet::WriteExcel;
use pQuery;

my $workbook = Spreadsheet::WriteExcel->new( 'data.xls' );
my $sheet    = $workbook->add_worksheet;
my $row = 0;

pQuery( 'http://www.blahblah.site' )->find( 'tr' )->each( sub{
    my $col = 0;
    pQuery( $_ )->find( 'td' )->each( sub{
        $sheet->write( $row, $col++, $_->innerHTML );
    });
    $row++;
});

$workbook->close;

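The example simply extracts all the tr tags it finds into an Excel file. You can easily tailor it to pick up a specific table, or even to trigger a new Excel file per table tag.

Further things to consider:

You may want to pick up th tags to create the Excel header(s).
You may have issues with rowspan & colspan.

To see whether rowspan or colspan is being used, you can: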

pQuery( $data )->find( 'td' )->each( sub{ 
    my $number_of_cols_spanned = $_->getAttribute( 'colspan' );
});
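
The same colspan check can be sketched in Python with the standard `html.parser` module. This is an illustration rather than a port of the pQuery code: the class name `SpanAwareTableParser` is invented, padding spanned columns with empty cells is just one possible policy, and rowspan is ignored.

```python
import csv
import sys
from html.parser import HTMLParser


class SpanAwareTableParser(HTMLParser):
    """Collect rows, padding cells that carry a colspan attribute (rowspan is ignored)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.cell = None   # text pieces of the cell being read, or None
        self.span = 1      # colspan of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.cell = []
            value = dict(attrs).get("colspan") or "1"
            self.span = int(value) if value.isdigit() else 1

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            text = " ".join("".join(self.cell).split())
            # The value goes in the first column; the spanned columns stay empty
            self.rows[-1].extend([text] + [""] * (self.span - 1))
            self.cell = None


parser = SpanAwareTableParser()
parser.feed('<table><tr><th colspan="2">Name</th><th>Age</th></tr>'
            '<tr><td>Ada</td><td>Lovelace</td><td>36</td></tr></table>')
csv.writer(sys.stdout, lineterminator="\n").writerows(parser.rows)
```

Padding keeps every row the same width, so the columns still line up when the result is opened in a spreadsheet.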

Answer 14

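OpenOffice.org can view HTML tables. Simply use the open command on the HTML file, or select and copy the table in your browser and then use Paste Special in OpenOffice.org. It will ask you for the file type, one of which should be HTML. Choose that and voila!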

Answer 15

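This is based on atomicules' answer but is more succinct, and it also processes th (header) cells as well as td cells. I also added the strip method to get rid of extra whitespace.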

CSV.open("output.csv", 'w') do |csv|
  doc.xpath('//table//tr').each do |row|
    csv << row.xpath('th|td').map {|cell| cell.text.strip}
  end
end

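Wrapping the code in the CSV block ensures that the file is closed properly.

If you just want the text and don't need to write it to a file, you can use this: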

doc.xpath('//table//tr').inject('') do |result, row|
  result << row.xpath('th|td').map {|cell| cell.text.strip}.to_csv
end

Answer 16

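Here's an updated version of Yuvai's answer that properly handles fields that need quoting (i.e., fields whose data contains commas, double quotes, or spans multiple lines)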

#!/usr/bin/env python3
from html.parser import HTMLParser
import sys
import re

class HTMLTableParser(HTMLParser):
    def __init__(self, row_delim="\n", cell_delim=","):
        HTMLParser.__init__(self)
        self.despace_re = re.compile(r"\s+")
        self.data_interrupt = False
        self.first_row = True
        self.first_cell = True
        self.in_cell = False
        self.row_delim = row_delim
        self.cell_delim = cell_delim
        self.quote_buffer = False
        self.buffer = None

    def handle_starttag(self, tag, attrs):
        self.data_interrupt = True
        if tag == "table":
            self.first_row = True
            self.first_cell = True
        elif tag == "tr":
            if not self.first_row:
                sys.stdout.write(self.row_delim)
            self.first_row = False
            self.first_cell = True
            self.data_interrupt = False
        elif tag == "td" or tag == "th":
            if not self.first_cell:
                sys.stdout.write(self.cell_delim)
            self.first_cell = False
            self.data_interrupt = False
            self.in_cell = True
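        elif tag == "br" and self.in_cell:
            # A <br> inside a cell becomes an embedded newline, which forces quoting
            self.quote_buffer = True
            if self.buffer is None:
                self.buffer = ""
            self.buffer += self.row_delim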

    def handle_endtag(self, tag):
        self.data_interrupt = True
        if tag == "td" or tag == "th":
            self.in_cell = False
            # Flush only when the cell closes, so inline tags such as </b> do not split it
            if self.buffer is not None:
                # Quote if needed...
                if self.quote_buffer or self.cell_delim in self.buffer or "\"" in self.buffer:
                    # Need to quote! First, double every embedded double-quote
                    self.buffer = self.buffer.replace("\"", "\"\"")
                    self.buffer = "\"{0}\"".format(self.buffer)
                sys.stdout.write(self.buffer)
                self.quote_buffer = False
                self.buffer = None

    def handle_data(self, data):
        if self.in_cell:
            #if self.data_interrupt:
            #   sys.stdout.write(" ")
            if self.buffer == None:
                self.buffer = ""
            self.buffer += self.despace_re.sub(" ", data).strip()
            self.data_interrupt = False

parser = HTMLTableParser() 
parser.feed(sys.stdin.read())

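One enhancement to this script would be to add support for specifying a different row delimiter (or automatically working out the correct one for the platform) and a different column delimiter.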
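
As a rough sketch of that enhancement, the delimiters could be passed to Python's `csv.writer`, which also takes care of quoting for whichever delimiter is chosen. This is not wired into the parser above: it assumes the rows have already been collected as lists of strings, and the names `write_rows`, `--cell-delim` and `--row-delim` are hypothetical.

```python
import argparse
import csv
import io
import os


def write_rows(rows, out, cell_delim=",", row_delim=None):
    """Write rows with the csv module, which quotes fields for the chosen delimiter."""
    if row_delim is None:
        # Platform default: "\r\n" on Windows, "\n" elsewhere. Open real files
        # with newline="" so the terminator is not translated a second time.
        row_delim = os.linesep
    writer = csv.writer(out, delimiter=cell_delim, lineterminator=row_delim)
    writer.writerows(rows)


def parse_args(argv=None):
    # Hypothetical flag names, chosen only for this illustration
    p = argparse.ArgumentParser(description="Delimiter options for CSV output")
    p.add_argument("--cell-delim", default=",")
    p.add_argument("--row-delim", default=None)
    return p.parse_args(argv)


args = parse_args(["--cell-delim", ";"])
buf = io.StringIO()
write_rows([["a;b", "c"], ["1", "2"]], buf, args.cell_delim, "\n")
print(buf.getvalue(), end="")
```

Here the field "a;b" is quoted because it contains the chosen column delimiter.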

Answer 17

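Reads an HTML file and outputs to a `.csv` using Ruby's `CSV` and `nokogiri`.

Based on @audiodude's answer but modified in the following ways:

Reads from a file to get the HTML. This is handy for long HTML tables, but can easily be modified to use a static string if your HTML table is small.
Uses Ruby's built-in CSV library to convert an Array into a CSV row.
Outputs to a .csv file instead of just printing to STDOUT.
Gets both the table headers (th) and the table body (td).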

# Convert HTML table to CSV format.

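require "csv"
require "nokogiri"

# Set this to the path of the HTML file to read
html_file_path = ""

html_string = File.read( html_file_path )

doc = Nokogiri::HTML( html_string )

# Rails.root and Time.zone require a Rails environment; the :file time format is assumed to be defined by the app
CSV.open( Rails.root.join( Time.zone.now.to_s( :file ) + ".csv" ), "wb" ) do |csv|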
  doc.xpath( "//table//tr" ).each do |row|
    csv << row.xpath( "th|td" ).collect( &:text ).collect( &:strip )
  end
end
