Question

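I have a directory containing more than 100 HTML files. I only need to extract the content inside the <TITLE></TITLE> and <BODY></BODY> tags, and then format it as:

TITLE, "BODY CONTENT" (one line per file)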

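It would help if the results for every file in the set could be written to one giant text file. I found the following command to format a document as a single line: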

grep '^[^<]' test.txt | tr -d '\n' > test_oneline.txt

No particular programming language is required, but in case I need to modify it further, one of the following would help: Perl, shell (.sh), sed

Answer 1

Here is something in Ruby, using Nokogiri.

require 'rubygems' # This line isn't needed on Ruby 1.9
require 'nokogiri'

ARGV.each do |input_filename|
  doc = Nokogiri::HTML(File.read(input_filename))
  title, body = doc.title, doc.xpath('//body').inner_text
  puts %Q(#{title}, "#{body}")
end

Save this to a .rb file, for example extractor.rb. Then make sure Nokogiri is installed by running gem install nokogiri.

Use the script like this:

ruby extractor.rb /path/to/yourhtmlfiles/*.html > out.txt

Note that I don't handle newlines in this script, but you seem to have that figured out already.
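
The command above relies on the shell expanding *.html. Where that does not happen (for example in cmd.exe on Windows), the file list could be built inside Ruby instead. A minimal sketch, assuming a hypothetical helper named html_files that is not part of the original script; it uses only Dir.glob from the standard library:

```ruby
# Hypothetical helper (not in the original script): returns the .html files
# directly inside +dir+, sorted so the output order is stable between runs.
def html_files(dir)
  Dir.glob(File.join(dir, '*.html')).sort
end
```

With this in place, the loop could iterate over html_files(ARGV.first) instead of ARGV, and the script would be called with a directory rather than a wildcard.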

Update

This version removes newlines and leading/trailing whitespace.

require 'rubygems' # This line isn't needed on Ruby 1.9
require 'nokogiri'

ARGV.each do |input_filename|
  doc = Nokogiri::HTML(File.read(input_filename))
  title, body = doc.title, doc.xpath('//body').inner_text.gsub("\n", '').strip
  puts %Q(#{title}, "#{body}")
end
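
The body text may still contain tabs, carriage returns, or double quotes, any of which would break the one-line TITLE, "BODY" format. One possible refinement, sketched with a hypothetical format_line helper: collapse every run of whitespace into a single space and double any embedded quotes. The quote-doubling rule is borrowed from CSV and is an assumption, since the question does not specify how quotes should be escaped.

```ruby
# Hypothetical helper: builds one output line in the form  TITLE, "BODY".
# Assumption: embedded double quotes are doubled, as in CSV.
def format_line(title, body)
  # to_s guards against a nil title (a page without a <title> element).
  clean = ->(text) { text.to_s.gsub(/\s+/, ' ').strip }
  clean_title = clean.call(title)
  quoted_body = clean.call(body).gsub('"', '""')
  %Q(#{clean_title}, "#{quoted_body}")
end

puts format_line("My page\n", "Hello,\r\n  \"world\"\t!")
# prints: My page, "Hello, ""world"" !"
```

In the script above, the puts line could then become puts format_line(doc.title, doc.xpath('//body').inner_text).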

Answer 2

You could do this with C# and LINQ. A quick example of loading the files:

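    // Needs: using System.Collections.Generic; using System.IO;
    //        using System.Linq; using System.Xml.Linq;
    // XDocument only parses well-formed XML, so the files must be valid XHTML.
    var parsed = new List<KeyValuePair<string, string>>();

    foreach ( string file in Directory.GetFiles( @"your directory here", "*.html" ) )
    {
        XElement html = XDocument.Load( file ).Root;

        // Match on local names so an XHTML namespace declaration does not matter;
        // <title> normally sits inside <head>, so search all descendants.
        string title = html.Descendants().First( e => e.Name.LocalName == "title" ).Value.Trim();
        string body = html.Descendants().First( e => e.Name.LocalName == "body" ).Value;
        // Flatten the body text to a single line.
        body = body.Replace( "\r", "" ).Replace( "\n", " " ).Trim();

        // A list (not a dictionary) so files with identical titles are all kept.
        parsed.Add( new KeyValuePair<string, string>( title, body ) );
    }

    using ( StreamWriter file = new StreamWriter( @"your file path" ) )
    {
        foreach ( KeyValuePair<string, string> pair in parsed )
        {
            file.WriteLine( string.Format( "{0}, \"{1}\"", pair.Key, pair.Value ) );
        }
    }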

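I haven't tested this particular block of code, but it should work. HTH.

Edit: If you have the base directory path, you can use Directory.GetFiles() to retrieve the names of the files in that directory.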

Extracting content from HTML tags
