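Why do I get an error when I modify XML with Nokogiri?

Time: 2013-10-03 01:44:33

Tags: ruby jenkins nokogiri net-http

I'm having trouble understanding Net::HTTP and Nokogiri.

I have a large number of jobs on my Jenkins server, and I have to update the branch names of these jobs regularly. Doing that from the UI is tedious, so I decided to update each job's Jenkins config.xml instead.

I use Nokogiri to parse the XML, walk it with XPath, and update the node's value. However, when I try to POST the updated XML back to Jenkins, I get a 500 error saying: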

Caused by: javax.xml.transform.TransformerException: org.xml.sax.SAXParseExceptionpublicId: -//W3C//DTD HTML 4.0 Transitional//EN; systemId: http://www.w3.org/TR/REC-html40/loose.dtd; lineNumber: 31; columnNumber: 3; The declaration for the entity "HTML.Version" must end with '>'.

Here is what I'm doing:

require "net/http"
require "nokogiri"

uri = URI.parse("http://jenkins.my.domain.web:8080")
http = Net::HTTP.new(uri.host, uri.port)

getQueueRequest = Net::HTTP::Get.new("http://jenkins.my.domain.web:8080/my/job/location/config.xml")
getQueue = http.request(getQueueRequest)

xml_doc = Nokogiri::HTML(getQueue.body)

# Get current branch name
branch_name=xml_doc.at_xpath('//hudson.plugins.git.branchspec/name')

# Get new branch name
print "Enter new branch name "
user_input = gets.chomp
new_branch_name = user_input.downcase

# Set branch name and create xml
branch_name.content=new_branch_name
new_config_xml=xml_doc.to_xml

puts "Logging into Jenkins"

update_branch = Net::HTTP::Post.new("http://jenkins.my.domain.web:8080/my/job/location/config.xml")
update_branch.basic_auth 'username', 'password'
update_branch.body = new_config_xml

response = http.request(update_branch)

puts response.body

I know I probably need to do something to the XML that goes into the request body, but I'm not sure how to fix the problem.

Original XML:

<?xml version='1.0' encoding='UTF-8'?>
<maven2-moduleset plugin="maven-plugin@1.504">
  <actions/>
  <description></description>
  <keepDependencies>false</keepDependencies>
  <properties>
    <hudson.plugins.throttleconcurrents.ThrottleJobProperty plugin="throttle-concurrents@1.7.2">
      <maxConcurrentPerNode>0</maxConcurrentPerNode>
      <maxConcurrentTotal>0</maxConcurrentTotal>
      <categories/>
      <throttleEnabled>false</throttleEnabled>
      <throttleOption>project</throttleOption>
      <configVersion>1</configVersion>
    </hudson.plugins.throttleconcurrents.ThrottleJobProperty>
  </properties>
  <scm class="hudson.plugins.git.GitSCM" plugin="git@1.4.0">
    <configVersion>2</configVersion>
    <userRemoteConfigs>
      <hudson.plugins.git.UserRemoteConfig>
        <name></name>
        <refspec></refspec>
        <url>git@github.com:<ORG_NAME>/<REPO_NAME>.git</url>
      </hudson.plugins.git.UserRemoteConfig>
    </userRemoteConfigs>
    <branches>
      <hudson.plugins.git.BranchSpec>
        <name>release</name>
      </hudson.plugins.git.BranchSpec>
    </branches>
    <disableSubmodules>false</disableSubmodules>
    <recursiveSubmodules>false</recursiveSubmodules>
    <doGenerateSubmoduleConfigurations>false</doGenerateSubmoduleConfigurations>
    <authorOrCommitter>false</authorOrCommitter>
    <clean>false</clean>
    <wipeOutWorkspace>false</wipeOutWorkspace>
    <pruneBranches>false</pruneBranches>
    <remotePoll>false</remotePoll>
    <ignoreNotifyCommit>false</ignoreNotifyCommit>
    <useShallowClone>false</useShallowClone>
    <buildChooser class="hudson.plugins.git.util.DefaultBuildChooser"/>
    <gitTool>Default</gitTool>
    <submoduleCfg class="list"/>
    <relativeTargetDir></relativeTargetDir>
    <reference></reference>
    <excludedRegions></excludedRegions>
    <excludedUsers></excludedUsers>
    <gitConfigName></gitConfigName>
    <gitConfigEmail></gitConfigEmail>
    <skipTag>false</skipTag>
    <includedRegions></includedRegions>
    <scmName></scmName>
  </scm>
  <canRoam>true</canRoam>
  <disabled>false</disabled>
  <blockBuildWhenDownstreamBuilding>false</blockBuildWhenDownstreamBuilding>
  <blockBuildWhenUpstreamBuilding>false</blockBuildWhenUpstreamBuilding>
  <triggers class="vector">
    <hudson.triggers.TimerTrigger>
      <spec>0 22 * * 4</spec>
    </hudson.triggers.TimerTrigger>
  </triggers>
  <concurrentBuild>false</concurrentBuild>
  <rootModule>
    <groupId>com.org.project.test</groupId>
    <artifactId>functest</artifactId>
  </rootModule>
  <goals>clean verify -Dtestsuite=<test_suite_name> -Dbrowser=chrome -Dipaddress=http://<IP_ADDRESS>:4444/wd/hub</goals>
  <mavenName>apache-maven-3.0.4</mavenName>
  <aggregatorStyleBuild>true</aggregatorStyleBuild>
  <incrementalBuild>false</incrementalBuild>
  <perModuleEmail>true</perModuleEmail>
  <ignoreUpstremChanges>false</ignoreUpstremChanges>
  <archivingDisabled>false</archivingDisabled>
  <resolveDependencies>false</resolveDependencies>
  <processPlugins>false</processPlugins>
  <mavenValidationLevel>-1</mavenValidationLevel>
  <runHeadless>false</runHeadless>
  <disableTriggerDownstreamProjects>false</disableTriggerDownstreamProjects>
  <settings class="jenkins.mvn.DefaultSettingsProvider"/>
  <globalSettings class="jenkins.mvn.DefaultGlobalSettingsProvider"/>
  <reporters/>
  <publishers/>
  <buildWrappers/>
  <prebuilders/>
  <postbuilders/>
  <runPostStepsIfResult>
    <name>FAILURE</name>
    <ordinal>2</ordinal>
    <color>RED</color>
  </runPostStepsIfResult>
</maven2-moduleset>

After editing and massaging:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<?xml version="1.0" encoding="UTF-8"?>
<html>
  <body>
    <maven2-moduleset plugin="maven-plugin@1.504">
      <actions />
      <description />
      <keepdependencies>false</keepdependencies>
      <properties>
        <hudson.plugins.throttleconcurrents.throttlejobproperty plugin="throttle-concurrents@1.7.2">
          <maxconcurrentpernode>0</maxconcurrentpernode>
          <maxconcurrenttotal>0</maxconcurrenttotal>
          <categories />
          <throttleenabled>false</throttleenabled>
          <throttleoption>project</throttleoption>
          <configversion>1</configversion>
        </hudson.plugins.throttleconcurrents.throttlejobproperty>
      </properties>
      <scm class="hudson.plugins.git.GitSCM" plugin="git@1.4.0">
        <configversion>2</configversion>
        <userremoteconfigs>
          <hudson.plugins.git.userremoteconfig>
            <name />
            <refspec />
            <url>git@github.com:<ORG_NAME>/<REPO_NAME>.git</url>
          </hudson.plugins.git.userremoteconfig>
        </userremoteconfigs>
        <branches>
          <hudson.plugins.git.branchspec>
            <name>master</name>
          </hudson.plugins.git.branchspec>
        </branches>
        <disablesubmodules>false</disablesubmodules>
        <recursivesubmodules>false</recursivesubmodules>
        <dogeneratesubmoduleconfigurations>false</dogeneratesubmoduleconfigurations>
        <authororcommitter>false</authororcommitter>
        <clean>false</clean>
        <wipeoutworkspace>false</wipeoutworkspace>
        <prunebranches>false</prunebranches>
        <remotepoll>false</remotepoll>
        <ignorenotifycommit>false</ignorenotifycommit>
        <useshallowclone>false</useshallowclone>
        <buildchooser class="hudson.plugins.git.util.DefaultBuildChooser" />
        <gittool>Default</gittool>
        <submodulecfg class="list" />
        <relativetargetdir />
        <reference />
        <excludedregions />
        <excludedusers />
        <gitconfigname />
        <gitconfigemail />
        <skiptag>false</skiptag>
        <includedregions />
        <scmname />
      </scm>
      <canroam>true</canroam>
      <disabled>false</disabled>
      <blockbuildwhendownstreambuilding>false</blockbuildwhendownstreambuilding>
      <blockbuildwhenupstreambuilding>false</blockbuildwhenupstreambuilding>
      <triggers class="vector">
        <hudson.triggers.timertrigger>
          <spec>0 22 * * 4</spec>
        </hudson.triggers.timertrigger>
      </triggers>
      <concurrentbuild>false</concurrentbuild>
      <rootmodule>
        <groupid>com.org.project.test</groupid>
        <artifactid>functest</artifactid>
      </rootmodule>
      <goals>clean verify -Dtestsuite=<test_suite_name> -Dbrowser=chrome -Dipaddress=http://<IP_ADDRESS>:4444/wd/hub</goals>
      <mavenname>apache-maven-3.0.4</mavenname>
      <aggregatorstylebuild>true</aggregatorstylebuild>
      <incrementalbuild>false</incrementalbuild>
      <permoduleemail>true</permoduleemail>
      <ignoreupstremchanges>false</ignoreupstremchanges>
      <archivingdisabled>false</archivingdisabled>
      <resolvedependencies>false</resolvedependencies>
      <processplugins>false</processplugins>
      <mavenvalidationlevel>-1</mavenvalidationlevel>
      <runheadless>false</runheadless>
      <disabletriggerdownstreamprojects>false</disabletriggerdownstreamprojects>
      <settings class="jenkins.mvn.DefaultSettingsProvider" />
      <globalsettings class="jenkins.mvn.DefaultGlobalSettingsProvider" />
      <reporters />
      <publishers />
      <buildwrappers />
      <prebuilders />
      <postbuilders />
      <runpoststepsifresult>
        <name>FAILURE</name>
        <ordinal>2</ordinal>
        <color>RED</color>
      </runpoststepsifresult>
    </maven2-moduleset>
  </body>
</html>

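2 Answers:

Answer 0 (score: 2)

When you use Nokogiri::HTML(some_html) or Nokogiri::XML(some_xml), Nokogiri checks whether the content is valid. If it isn't, it applies fix-ups to the content to try to make it valid. For example: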

require 'nokogiri'

html_fragment = "<p>foo bar</p>"
Nokogiri::HTML(html_fragment).to_html 
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>foo bar</p></body></html>\n"

If the document is partially correct, Nokogiri will still add the DOCTYPE declaration:

html = "<html><body><p>foo bar</p></body></html>"
Nokogiri::HTML(html).to_html 
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>foo bar</p></body></html>\n"

If you want Nokogiri to leave the document alone because it's supposed to be a fragment, tell it so:

Nokogiri::HTML::DocumentFragment.parse(html_fragment).to_html 
# => "<p>foo bar</p>"

Or:

xml_fragment = "<x>foo bar</x>"
Nokogiri::XML::DocumentFragment.parse(xml_fragment).to_xml 
# => "<x>foo bar</x>"

Nokogiri is pretty smart about handling XML and HTML. You can try to confuse it, and it will usually do the right thing:

xml_fragment = "<x>foo bar</x>"
Nokogiri::HTML::DocumentFragment.parse(xml_fragment).to_xml 
# => "<x>foo bar</x>"

That parsed XML as an HTML fragment and then told Nokogiri to emit it as XML.

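Now, all that said, it's clear Nokogiri isn't doing anything mysterious, so here's how to fix the problem. First, parse the content as XML so Nokogiri doesn't think it should add the HTML DOCTYPE declaration. Then, if the XML is syntactically correct, tell Nokogiri it's OK to parse it as a complete document: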

require 'nokogiri'

xml = %{<?xml version='1.0' encoding='UTF-8'?>
<maven2-moduleset plugin="maven-plugin@1.504">
  <actions/>
  <description></description>
  <keepDependencies>false</keepDependencies>
  <properties>
    <hudson.plugins.throttleconcurrents.ThrottleJobProperty plugin="throttle-concurrents@1.7.2">
    </hudson.plugins.throttleconcurrents.ThrottleJobProperty>
  </properties>
</maven2-moduleset>
}
puts Nokogiri::XML.parse(xml).to_xml 

# >> <?xml version="1.0" encoding="UTF-8"?>
# >> <maven2-moduleset plugin="maven-plugin@1.504">
# >>   <actions/>
# >>   <description/>
# >>   <keepDependencies>false</keepDependencies>
# >>   <properties>
# >>     <hudson.plugins.throttleconcurrents.ThrottleJobProperty plugin="throttle-concurrents@1.7.2">
# >>     </hudson.plugins.throttleconcurrents.ThrottleJobProperty>
# >>   </properties>
# >> </maven2-moduleset>

Or parse it as a fragment, which, because the document is complete, results in the same thing:

puts Nokogiri::XML::DocumentFragment.parse(xml).to_xml 

# >> <?xml version='1.0' encoding='UTF-8'?>
# >> <maven2-moduleset plugin="maven-plugin@1.504">
# >>   <actions/>
# >>   <description/>
# >>   <keepDependencies>false</keepDependencies>
# >>   <properties>
# >>     <hudson.plugins.throttleconcurrents.ThrottleJobProperty plugin="throttle-concurrents@1.7.2">
# >>     </hudson.plugins.throttleconcurrents.ThrottleJobProperty>
# >>   </properties>
# >> </maven2-moduleset>

Rather than using Net::HTTP, which is the basic building block for HTTP, I'd recommend using something higher-level, like HTTPClient. Here's code that is similar to yours:

require 'httpclient'
require 'nokogiri'

URL = 'http://jenkins.my.domain.web:8080/my/job/location/config.xml'

http_client = HTTPClient.new
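# Parse as XML so Nokogiri neither wraps the document in HTML nor lowercases tag names
xml_doc = Nokogiri::XML(
  http_client.get_content(URL)
)

# Get current branch name; XML tag names are case-sensitive, and the dots
# in the element name would be read as class selectors in CSS, so use XPath:
branch_name = xml_doc.at_xpath('//hudson.plugins.git.BranchSpec/name')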

# Get new branch name
print 'Enter new branch name '
new_branch_name = gets.chomp.downcase

# Set branch name and create xml
branch_name.content = new_branch_name

puts 'Logging into Jenkins'

# The first argument is the URL (prefix) the credentials apply to
http_client.set_auth(URL, 'user', 'password')

response = http_client.post(URL, :body => xml_doc.to_xml)

I can't test it, but it looks close.


  

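I now find myself in another dilemma. I see that the methods that let me move to an element and edit its value, such as at_xpath and at_css, only work with Nokogiri::HTML or Nokogiri::HTML::DocumentFragment. They don't work when I use Nokogiri::XML. Using Nokogiri::HTML changes the case of the tags: mixed-case names become lowercase. Jenkins does accept the XML with the changed tags. The to_html and to_xml methods basically return a string, so I can't use the XPath or CSS methods on their result to navigate the XML tree. Is there a way out?

The at method works with both XML and HTML, and accepts both CSS and XPath selectors; everything inside Nokogiri is XML-based.

Nokogiri folds HTML tag names to lowercase because HTML is case-insensitive, so at needs lowercase values when you're dealing with HTML. XML is case-sensitive, so Nokogiri leaves the tag case alone, and at requires you to use the correct case when using CSS.

This is documented in the Nokogiri docs:

Note that the CSS query string is case-sensitive with regards to your document type. That is, if you're looking for "H1" in an HTML document, you'll never find anything, since HTML tags will match only lowercase CSS queries. However, "H1" might be found in an XML document, where tag names are case-sensitive (e.g., "H1" is distinct from "h1").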

Answer 1 (score: 0)

When you parse the XML you receive from the service, you declare it as HTML:

xml_doc = Nokogiri::HTML(getQueue.body)

This appears to cause Nokogiri to add HTML nodes.

Try parsing it as XML instead:

xml_doc = Nokogiri::XML(getQueue.body)