How to parse data after a specific word

Time: 2016-05-31 10:27:18

Tags: ruby nokogiri mechanize

I have an HTML document:

<div class="info">
  Country:
  <b>UK</b>
  <br>
  City:
  <b>London</b>
  <br>
  Name:
  <b>Jon</b>
  <br>
  Date:
  <b>12.08.2014</b>
  <br>
</div>

For parsing I use:

name = review_meta.search('.info b')[2].text
country = review_meta.search('.info b')[0].text
city = review_meta.search('.info b')[1].text
data = review_meta.search('.info b')[3].text

This code is not good, because the order and number of the elements may vary.

How can I parse the data that follows a specific word?

UPD: In Nokogiri we can use JS selectors. But in my case only the first element gets parsed anyway.

How can I fix this problem?

3 Answers:

Answer 0: (score: 1)

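How about parsing it with a classic regexp:

h = {}
# Use inner_html rather than text so the <b> and <br> tags are still present.
str = review_meta.search('.info')[0].inner_html
# Drop all whitespace, then split into one "Label:<b>value</b>" chunk per <br>.
str.gsub(/\s+/, '').split(/<br\s*\/?>/).reject(&:empty?).each do |item|
  match = item.match(/([a-zA-Z]+):<b>([a-zA-Z0-9.]+)<\/b>/)
  next unless match # skip chunks that do not fit the pattern
  h[match[1].downcase.to_sym] = match[2]
end

p h
# => {:country=>"UK", :city=>"London", :name=>"Jon", :date=>"12.08.2014"}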

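Answer 1: (score: 1)

"...the order and number of the elements may vary..."

If you can't count on the order or structure of the text, then you have to break it down until it is usable.

If I thought about it longer I would probably write something more efficient, but this is where I would start: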

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<div class="info">
  Country:
  <b>UK</b>
  <br>
  City:
  <b>London</b>
  <br>
  Name:
  <b>Jon</b>
  <br>
  Date:
  <b>12.08.2014</b>
  <br>
</div>
EOT

hash = doc.at('.info').text # => "\n  Country:\n  UK\n  \n  City:\n  London\n  \n  Name:\n  Jon\n  \n  Date:\n  12.08.2014\n  \n"
                      .strip # => "Country:\n  UK\n  \n  City:\n  London\n  \n  Name:\n  Jon\n  \n  Date:\n  12.08.2014"
                      .gsub(/\n +/, "\n") # => "Country:\nUK\n\nCity:\nLondon\n\nName:\nJon\n\nDate:\n12.08.2014"
                      .gsub(/:\n/, ':') # => "Country:UK\n\nCity:London\n\nName:Jon\n\nDate:12.08.2014"
                      .gsub(/\n\n/, ' ') # => "Country:UK City:London Name:Jon Date:12.08.2014"
                      .split  # => ["Country:UK", "City:London", "Name:Jon", "Date:12.08.2014"]
                      .map{ |s|
                        a, b = s.split(':')
                        [a.downcase, b]
                      } # => [["country", "UK"], ["city", "London"], ["name", "Jon"], ["date", "12.08.2014"]]
                      .to_h # => {"country"=>"UK", "city"=>"London", "name"=>"Jon", "date"=>"12.08.2014"}

hash['date'] # => "12.08.2014"

It breaks the labels and values down into a hash, at which point you can easily grab individual values.
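
As a shorter variant of the same idea, a single scan over the extracted text might be enough. This is only a sketch, and it assumes that every label is a single word followed by a colon and that values contain no whitespace (a value such as "New York" would be cut short). The `text` string is the output of `doc.at('.info').text` copied from the comment in the code above; `pairs` and `info` are names chosen for illustration.

```ruby
# Text as returned by doc.at('.info').text for the sample document.
text = "\n  Country:\n  UK\n  \n  City:\n  London\n  \n  Name:\n  Jon\n  \n  Date:\n  12.08.2014\n  \n"

# Each match is a [label, value] pair: a word, a colon, optional
# whitespace, then a run of non-whitespace characters.
pairs = text.scan(/(\w+):\s*(\S+)/)

# Lower-case the labels so lookups do not depend on the page's capitalization.
info = pairs.map { |label, value| [label.downcase, value] }.to_h

info['city'] # => "London"
info['date'] # => "12.08.2014"
```

Because the regexp runs on the text Nokogiri has already extracted, not on the markup itself, it does not depend on how the tags are written.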

Answer 2: (score: 1)

You can use XPath, or something like this:

doc.search('.info').children.find{|x| x.text['City:']}.next.text
#=> "London"
doc.search('.info').children.find{|x| x.text['Name:']}.next.text
#=> "Jon"

You want to avoid the other solutions; parsing HTML with regular expressions is a last resort.