How do I parse Gmail chat logs from the web page?

Asked: 2010-06-30 17:39:09

Tags: html xpath gmail chat logging

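What is the best way to parse Gmail chat logs from the web page that displays them? As far as I know, this is still the only way to access the chat logs that Gmail hosts on its servers (through either the desktop or the mobile version of Gmail).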

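Looking at the generated source of a conversation, the markup is nested divs and spans (elsewhere on the page, the divs have randomized two-character ids and classes with no discernible pattern). Here is an excerpt of a line with a timestamp on the left: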

<div>
<span style="display:block;float:left;color:#888">
2:56 PM&nbsp;
</span>

<span style="display:block;padding-left:6em">
<span>

<span style="font-weight:bold">me</span>: i'm trying to think of a good way to parse gmail chat logs

</span>
</span>
</div>

But not every line has a timestamp; lines without one seem to have non-breaking spaces in its place:

<div>
<span style="display:block;float:left;color:#888">
&nbsp;&nbsp;
</span>

<span style="display:block;padding-left:6em">

<span>
and reformat that into something like an xml format
</span>

</span>
</div>

Should I use XPath? Is there anything more efficient?

Edit:

Just as data, this is what it looks like:

12:43 AM John: Something something something.
         Something something something.
         me: Something something something?
12:44 AM Also, something something something.
12:47 AM Something something something.
12:48 AM Something something something
         with something something something.
12:49 AM John: Something.
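
The layout above implies a carry-forward rule: a line without a timestamp or a speaker inherits the most recent one. A minimal sketch of that rule in plain Ruby, working on the text form only. The names `LINE`, `SAMPLE` and `parse_chat` are illustrative, and the sketch assumes a speaker is a single word followed by a colon and a space, so a message that itself starts with something like "note: " would be misread as a speaker; in the HTML, the bold span removes that ambiguity.

```ruby
# Optional timestamp, optional "speaker: " prefix, then the message text.
LINE = /\A(?<time>\d{1,2}:\d{2} [AP]M)?\s*(?:(?<speaker>\w+): )?(?<text>.*)\z/

SAMPLE = <<'EOS'
12:43 AM John: Something something something.
         Something something something.
         me: Something something something?
12:44 AM Also, something something something.
12:47 AM Something something something.
12:48 AM Something something something
         with something something something.
12:49 AM John: Something.
EOS

# Turn the text into an array of hashes, carrying the last seen
# timestamp and speaker forward onto lines that omit them.
def parse_chat(text)
  time = speaker = nil
  text.each_line.map(&:rstrip).reject(&:empty?).map do |line|
    m = LINE.match(line)
    time    = m[:time]    || time
    speaker = m[:speaker] || speaker
    { time: time, speaker: speaker, text: m[:text] }
  end
end

parse_chat(SAMPLE).each do |r|
  puts "#{r[:time]} | #{r[:speaker]} | #{r[:text]}"
end
```

Each record ends up fully populated, which makes a later conversion to XML or any other format a simple loop.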

1 answer:

Answer 0 (score: 1)

  

"Should I use XPath? Is there anything more efficient?"

I would use Ruby with the Nokogiri library, which gives you more flexibility than XPath/XSLT alone:

#!/usr/bin/ruby
require 'rubygems'
require 'nokogiri'

src = <<EOS
<div>
    <span style="display:block;float:left;color:#888">
        2:56 PM&nbsp;
    </span>
    <span style="display:block;padding-left:6em">
        <span>
            <span style="font-weight:bold">me</span>: i'm trying to think of a good way to parse gmail chat logs
        </span>
    </span>
    <span style="display:block;float:left;color:#888">
        &nbsp;&nbsp;
    </span>
    <span style="display:block;padding-left:6em">
        <span>
            and reformat that into something like an xml format
        </span>
    </span>
</div>
EOS

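chatlog = []
last_timestamp = nil
doc = Nokogiri::HTML(src)

# &nbsp; is parsed as U+00A0, which String#strip does not remove,
# so turn it into a plain space before stripping.
clean = lambda { |text| text.tr("\u00A0", ' ').strip }

doc.xpath('//div/span').each do |span|
    style = span['style'].to_s

    if style.include?('color:')
        # Rows without a timestamp hold only &nbsp;; keep the previous one.
        stamp = clean.call(span.content)
        last_timestamp = stamp unless stamp.empty?
    elsif style.include?('padding-left:')
        chatlog << {:timestamp => last_timestamp, :message => clean.call(span.content)}
    end
end

# Build the XML; the block argument is named xml to avoid shadowing doc.
builder = Nokogiri::XML::Builder.new(:encoding => 'UTF-8') do |xml|
    xml.chatlog {
        chatlog.each do |line|
            xml.line {
                xml.time    line[:timestamp]
                xml.message line[:message]
            }
        end
    }
end

puts builder.to_xml

With the sample above, this prints:

<?xml version="1.0" encoding="UTF-8"?>
<chatlog>
  <line>
    <time>2:56 PM</time>
    <message>me: i'm trying to think of a good way to parse gmail chat logs</message>
  </line>
  <line>
    <time>2:56 PM</time>
    <message>and reformat that into something like an xml format</message>
  </line>
</chatlog>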