Ruby & MySQL: How to handle missing elements when parsing an XML file

Date: 2014-02-26 18:00:47

Tags: mysql ruby xml

I am currently trying to parse a large XML file. This is what my XML file looks like:

<post>
  <row Id="22" PostTypeId="2" ParentId="9" CreationDate="2008-08-01T12:07:19.500" Score="7" Body="&lt;p&gt;The best way that I know of because of leap years and everything is:&lt;/p&gt;&#xD;&#xA;&#xD;&#xA;&lt;pre&gt;&lt;code&gt;DateTime birthDate = new DateTime(2000,3,1);&lt;br&gt;int age = (int)Math.Floor((DateTime.Now - birthDate).TotalDays / 365.25D);&lt;br&gt;&lt;/code&gt;&lt;/pre&gt;&#xD;&#xA;&#xD;&#xA;&lt;p&gt;Hope this helps.&lt;/p&gt;" OwnerUserId="17" LastEditorUserId="17" LastEditorDisplayName="Nick" LastEditDate="2008-08-01T15:26:37.087" LastActivityDate="2008-08-01T15:26:37.087" CommentCount="1" CommunityOwnedDate="2011-08-16T19:40:43.080" />

  <row Id="29" PostTypeId="2" ParentId="13" CreationDate="2008-08-01T12:19:17.417" Score="18" Body="&lt;p&gt;There are no HTTP headers that will report the clients timezone so far although it has been suggested to include it in the HTTP specification.&lt;/p&gt;&#xD;&#xA;&#xD;&#xA;&lt;p&gt;If it was me, I would probably try to fetch the timezone using clientside JavaScript and then submit it to the server using Ajax or something.&lt;/p&gt;" OwnerUserId="19" LastActivityDate="2008-08-01T12:19:17.417" CommentCount="0" />

</post>

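The difference between the two records in this XML file is that the second one has no LastEditDate attribute. I believe that is why I get the following error: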

/ruby/1.9.2/ubuntuamd1/lib/ruby/1.9.1/date/format.rb:1031:in `dup': can't dup NilClass (TypeError)
    from /soft/ruby/1.9.2/ubuntuamd1/lib/ruby/1.9.1/date/format.rb:1031:in `_parse'
    from /soft/ruby/1.9.2/ubuntuamd1/lib/ruby/1.9.1/date.rb:1732:in `parse'
    from load.rb:105:in `on_start_element'
    from load.rb:165:in `parse'
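
The failure in this trace can be reproduced in isolation. The sketch below assumes the parser passes `attributes` to the callback as a plain Ruby Hash (the question does not say which SAX library is used), and the hash contents are copied by hand from the second row. Newer Ruby versions word the message differently from the 1.9.2 trace, but the exception class is still TypeError.

```ruby
require 'date'

# Hypothetical attributes hash for the second <row>, which has no LastEditDate.
attributes = {
  'Id' => '29',
  'CreationDate' => '2008-08-01T12:19:17.417',
  'LastActivityDate' => '2008-08-01T12:19:17.417'
}

# Looking up a key that is absent from a Hash returns nil instead of raising.
last_edit = attributes['LastEditDate']

# DateTime.parse expects a String, so handing it nil raises TypeError.
parse_error = begin
  DateTime.parse(last_edit)
  nil
rescue TypeError => e
  e
end

puts "LastEditDate is #{last_edit.inspect}"
puts "DateTime.parse raised #{parse_error.class}"
```

So the XML parsing itself succeeds; the error comes from passing the missing attribute's nil value straight into `DateTime.parse`.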

This is the code segment it refers to:

if element == 'row'
  @post_st.execute(attributes['Id'], attributes['PostTypeId'], attributes['AcceptedAnswerId'], attributes['ParentId'], attributes['Score'], attributes['ViewCount'], 
    attributes['Body'], attributes['OwnerUserId'] == nil ? -1 : attributes['OwnerUserId'], attributes['LastEditorUserId'], attributes['LastEditorDisplayName'], 
    DateTime.parse(attributes['LastEditDate']).to_time.strftime("%F %T"), DateTime.parse(attributes['LastActivityDate']).to_time.strftime("%F %T"), attributes['Title'] == nil ? '' : attributes['Title'], 
    attributes['AnswerCount'] == nil ? 0 : attributes['AnswerCount'], attributes['CommentCount'] == nil ? 0 : attributes['CommentCount'], 
    attributes['FavoriteCount'] == nil ? 0 : attributes['FavoriteCount'], DateTime.parse(attributes['CreationDate']).to_time.strftime("%F %T"))
  post_id = attributes['Id']

I also think this is the line where I look up LastEditDate:
 DateTime.parse(attributes['LastEditDate']).to_time.strftime("%F %T"), DateTime.parse(attributes['LastActivityDate']).to_time.strftime("%F %T"), attributes['Title'] == nil ? '' : attributes['Title']

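I guess I get the error above because the attribute does not exist. I would like to know how to handle this case and fall back to a default value when the attribute is missing, because I insert these records into a MySQL database as I parse them. The table has the following structure: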

+--------------------------+--------------+------+-----+---------------------+-----------------------------+
| Field                    | Type         | Null | Key | Default             | Extra                       |
+--------------------------+--------------+------+-----+---------------------+-----------------------------+
| id                       | int(11)      | NO   | PRI | NULL                |                             |
| post_type_id             | int(11)      | NO   |     | NULL                |                             |
| accepted_answer_id       | int(11)      | YES  |     | NULL                |                             |
| parent_id                | int(11)      | YES  | MUL | NULL                |                             |
| score                    | int(11)      | YES  |     | NULL                |                             |
| view_count               | int(11)      | YES  |     | NULL                |                             |
| body_text                | text         | YES  |     | NULL                |                             |
| owner_id                 | int(11)      | NO   |     | NULL                |                             |
| last_editor_user_id      | int(11)      | YES  |     | NULL                |                             |
| last_editor_display_name | varchar(40)  | YES  |     | NULL                |                             |
| last_edit_date           | timestamp    | NO   |     | CURRENT_TIMESTAMP   | on update CURRENT_TIMESTAMP |
| last_activity_date       | timestamp    | NO   |     | 0000-00-00 00:00:00 |                             |
| title                    | varchar(256) | NO   |     | NULL                |                             |
| answer_count             | int(11)      | NO   |     | NULL                |                             |
| comment_count            | int(11)      | NO   |     | NULL                |                             |
| favorite_count           | int(11)      | NO   |     | NULL                |                             |
| created                  | timestamp    | NO   |     | 0000-00-00 00:00:00 |                             |
+--------------------------+--------------+------+-----+---------------------+-----------------------------+

I have set last_edit_date as a NOT NULL column.

Based on the answer provided I made the following change, but the error remains the same:

  def convert_to_mysql_time(date='1973-01-01T01:01:01.000')
    DateTime.parse(date).to_time.strftime("%F %T")
  end

  def on_start_element(element, attributes)
    if element == 'row'
      @post_st.execute(attributes['Id'], attributes['PostTypeId'], attributes['AcceptedAnswerId'], attributes['ParentId'], attributes['Score'], attributes['ViewCount'],
        attributes['Body'], attributes['OwnerUserId'] == nil ? -1 : attributes['OwnerUserId'], attributes['LastEditorUserId'], attributes['LastEditorDisplayName'],
        convert_to_mysql_time(attributes['LastEditDate']), DateTime.parse(attributes['LastActivityDate']).to_time.strftime("%F %T"), attributes['Title'] == nil ? '' : attributes['Title'],
        attributes['AnswerCount'] == nil ? 0 : attributes['AnswerCount'], attributes['CommentCount'] == nil ? 0 : attributes['CommentCount'],
        attributes['FavoriteCount'] == nil ? 0 : attributes['FavoriteCount'], DateTime.parse(attributes['CreationDate']).to_time.strftime("%F %T"))
      post_id = attributes['Id']

This is the error:

/ruby/1.9.2/ubuntuamd1/lib/ruby/1.9.1/date/format.rb:1031:in `dup': can't dup NilClass (TypeError)
    from /soft/ruby/1.9.2/ubuntuamd1/lib/ruby/1.9.1/date/format.rb:1031:in `_parse'
    from /soft/ruby/1.9.2/ubuntuamd1/lib/ruby/1.9.1/date.rb:1732:in `parse'
    from load.rb:102:in `convert_to_mysql_time'
    from load.rb:109:in `on_start_element'
    from load.rb:169:in `parse'
    from load.rb:169:in `<main>'
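
A likely reason the default argument does not help: Ruby applies a parameter default only when the argument is omitted, not when nil is passed explicitly. `attributes['LastEditDate']` evaluates to nil for the second row, so the method still receives nil. A minimal sketch using the same method as above:

```ruby
require 'date'

def convert_to_mysql_time(date = '1973-01-01T01:01:01.000')
  DateTime.parse(date).to_time.strftime("%F %T")
end

# Argument omitted: the default string is parsed.
no_arg_result = convert_to_mysql_time

# Explicit nil: the default is NOT applied, so DateTime.parse receives nil.
explicit_nil_error = begin
  convert_to_mysql_time(nil)
  nil
rescue TypeError => e
  e
end

puts "without argument: #{no_arg_result}"
puts "with explicit nil: #{explicit_nil_error.class}"
```

The nil check therefore has to happen inside the method body (or at the call site) rather than in the parameter list.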

1 answer:

Answer 0 (score: 2)

I would write a method that converts a date String into a MySQL date and falls back to a default value when nil is passed to it, for example:

def convert_to_my_sql_date(date)
  # Use a default date when an empty string is supplied as the argument.
  # The rescue makes arguments that do not respond to empty? (such as nil)
  # take the default date as well.
  date = '1973-01-01T01:01:01.000' if (date.empty? rescue true)
  DateTime.parse(date).to_time.strftime("%F %T")
end

So when the date is nil, the default value is used. You can then call the method like this:

convert_to_my_sql_date(attributes['LastEditDate'])
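
As a variation on the method above, here is a sketch that replaces the blanket `rescue` with an explicit nil/blank check. The names `DEFAULT_DATE` and `to_mysql_datetime` are hypothetical and not from the original code. It also skips the `to_time` step, which is a deliberate deviation so that the output does not depend on the machine's time zone; keep `to_time` if local-time conversion is what you want.

```ruby
require 'date'

# Placeholder default taken from the answer above; pick whatever suits your data.
DEFAULT_DATE = '1973-01-01T01:01:01.000'

# Convert an XML date attribute to a MySQL DATETIME string.
# nil and blank values fall back to the default instead of raising.
def to_mysql_datetime(value, default = DEFAULT_DATE)
  value = default if value.nil? || value.to_s.strip.empty?
  DateTime.parse(value.to_s).strftime("%F %T")
end

puts to_mysql_datetime('2008-08-01T15:26:37.087') # prints 2008-08-01 15:26:37
puts to_mysql_datetime(nil)                       # prints 1973-01-01 01:01:01
puts to_mysql_datetime('')                        # prints 1973-01-01 01:01:01
```

The explicit check only swallows the missing-value case, so a genuinely malformed date string still raises an error instead of being silently replaced by the default.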