Question

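I'm trying to write a Ruby script that parses an HTML string and extracts certain values from specific nodes.

Right now I'm stuck just reading the string into a Nokogiri document:

This code: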

#!/usr/bin/ruby

html_doc = Nokogiri::HTML("<html>  <meta content="text/html; charset=UTF-8"/>  <body style='margin:20px'>    <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p>    <ul style='list-style-type:none; margin:25px 15px;'>      <li><b>User name:</b> Test User</li>      <li><b>User email:</b> test@abc.com</li>      <li><b>Identifier:</b> abc123def132afd1213afas</li>      <li><b>Description:</b> Tom's iPad</li>      <li><b>Model:</b> iPad 3</li>      <li><b>Platform:</b> </li>      <li><b>App:</b> Test app name</li>      <li><b>UserID:</b> </li>     </ul>    <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p>            <hr style='height=2px; color:#aaa'/>        <p>We hope you enjoy the app store experience!</p>        <p style='font-size:18px; color:#999'>Powered by App47</p>      <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>")

produces this error:

$ ruby emailParser.rb 
emailParser.rb:3: syntax error, unexpected tIDENTIFIER, expecting ')'
...ML("<html>  <meta content="text/html; charset=UTF-8"/>  <bod...
...                               ^
emailParser.rb:3: syntax error, unexpected tSTRING_BEG, expecting end-of-input
...tent="text/html; charset=UTF-8"/>  <body style='margin:20px'...
...                               ^

Note that I have already tried the solution from this question:

"syntax error, unexpected tIDENTIFIER, expecting $end"

Answer 1

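You have to change the quotes around the HTML string from " to ', and change the quotes inside the HTML to ". The one apostrophe left inside the HTML (in Tom's) then has to be escaped as \'. Something like this should work: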

#!/usr/bin/ruby

require 'nokogiri'

html_doc = Nokogiri::HTML('<html>  <meta content="text/html; charset=UTF-8"/>  <body style="margin:20px">    <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p>    <ul style="list-style-type:none; margin:25px 15px;">      <li><b>User name:</b> Test User</li>      <li><b>User email:</b> test@abc.com</li>      <li><b>Identifier:</b> abc123def132afd1213afas</li>      <li><b>Description:</b> Tom\'s iPad</li>      <li><b>Model:</b> iPad 3</li>      <li><b>Platform:</b> </li>      <li><b>App:</b> Test app name</li>      <li><b>UserID:</b> </li>     </ul>    <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p>            <hr style="height=2px; color:#aaa"/>        <p>We hope you enjoy the app store experience!</p>        <p style="font-size:18px; color:#999">Powered by App47</p>      <img src="https://cirrus.app47.com/notifications/562506219ac25b1033000904/img" alt=""/></body></html>')
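
To make the escaping rule concrete, here is a minimal sketch using a shortened, made-up string rather than the full email HTML. It shows that the single-quoted form with \' and the double-quoted form with \" produce the same string, and that other backslash sequences are not interpreted inside single quotes.

```ruby
# In a single-quoted string only \' and \\ are escape sequences.
single = 'Description: Tom\'s "iPad"'
# In a double-quoted string the double quotes need escaping instead.
double = "Description: Tom's \"iPad\""

puts single            # Description: Tom's "iPad"
puts single == double  # true

# Other backslash sequences stay literal inside single quotes.
puts 'a\nb'.length     # 4 (a, backslash, n, b)
puts "a\nb".length     # 3 (a, newline, b)
```

So switching to single quotes is safe for HTML as long as every apostrophe in the text is escaped.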

Answer 2

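The problem is that your string contains double quotes, which confuses the parser because you are also using double quotes to delimit the string. To illustrate: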

puts "foo"bar"
# => SyntaxError: unexpected tIDENTIFIER, expecting end-of-input
#    puts "foo"bar"
#                 ^

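You probably intended this to print foo"bar, but when the parser reaches the second " (right after foo) it assumes the string has ended, so whatever comes after it causes a syntax error. (Stack Overflow's syntax highlighting even gives you a hint: notice how "foo" on the first line is colored differently from bar"? A good syntax-highlighting text editor will do the same.)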
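
If you want to check this without running the script, one option is to ask Ruby's own parser. The sketch below uses Ripper from the standard library, which is not part of the original answer; it relies on Ripper.sexp returning nil for source that does not parse. The variable names are illustrative.

```ruby
require 'ripper'

# Source snippets held as data (single-quoted, so the backslash is kept as-is).
broken  = 'puts "foo"bar"'    # unescaped inner quote ends the string early
escaped = 'puts "foo\"bar"'   # inner quote escaped with a backslash

# Ripper.sexp returns nil when the source has a syntax error.
broken_ok  = !Ripper.sexp(broken).nil?
escaped_ok = !Ripper.sexp(escaped).nil?

puts "broken parses:  #{broken_ok}"
puts "escaped parses: #{escaped_ok}"
```

Running ruby -c on the file from the command line performs a similar syntax-only check.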

One solution is to use single quotes:

puts 'bar"baz'
# => bar"baz

That solves the problem in this case, but it doesn't actually help you, because your string also contains single quotes!

Another solution is to escape your quotes by putting a \ in front of each one, like this:

puts "foo\"bar"
# => foo"bar

...but for a long string like yours that gets tedious (and sometimes tricky). A better solution is to use a special kind of string called a "heredoc" (short for "here document," for what it's worth):

str = <<-END_OF_HTML
  <html>  <meta content="text/html; charset=UTF-8"/>  <body style='margin:20px'>    <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p>    <ul style='list-style-type:none; margin:25px 15px;'>      <li><b>User name:</b> Test User</li>      <li><b>User email:</b> test@abc.com</li>      <li><b>Identifier:</b> abc123def132afd1213afas</li>      <li><b>Description:</b> Tom's iPad</li>      <li><b>Model:</b> iPad 3</li>      <li><b>Platform:</b> </li>      <li><b>App:</b> Test app name</li>      <li><b>UserID:</b> </li>     </ul>    <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p>            <hr style='height=2px; color:#aaa'/>        <p>We hope you enjoy the app store experience!</p>        <p style='font-size:18px; color:#999'>Powered by App47</p>      <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>
END_OF_HTML

html_doc = Nokogiri::HTML(str)

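The delimiter "END_OF_HTML" is arbitrary. You can use EOF or XYZZY or whatever suits your fancy, though using something meaningful is a good idea. (You'll notice that Stack Overflow's syntax highlighting has some trouble with heredocs, but most code editors handle them fine.)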
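
One thing worth knowing before pasting arbitrary HTML into a heredoc: an unquoted heredoc behaves like a double-quoted string, so #{...} is interpolated. The sketch below, using short made-up fragments rather than the full email, contrasts that with a single-quoted delimiter (no interpolation) and with the <<~ form (Ruby 2.3+), which strips common leading indentation. The variable names are illustrative.

```ruby
name = "Test User"

# Unquoted delimiter: acts like a double-quoted string, so #{name} is filled in.
# A bare # such as in color:#999 is left alone.
interpolated = <<-END_OF_HTML
  <li style="color:#999"><b>User name:</b> #{name}</li>
END_OF_HTML

# Single-quoted delimiter: no interpolation, no escape processing.
raw = <<-'END_OF_HTML'
  <li style="color:#999"><b>User name:</b> #{name}</li>
END_OF_HTML

# <<~ removes the indentation shared by all lines of the body.
squiggly = <<~END_OF_HTML
  <ul>
    <li>Tom's iPad</li>
  </ul>
END_OF_HTML

puts interpolated
puts raw
puts squiggly
```

For HTML that comes from somewhere else and should be taken literally, the single-quoted delimiter is the safer default.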

You can make this more compact, like so:

Nokogiri::HTML <<-END_OF_HTML
  <html>  <meta content="text/html; charset=UTF-8"/>  <body style='margin:20px'>    <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p>    <ul style='list-style-type:none; margin:25px 15px;'>      <li><b>User name:</b> Test User</li>      <li><b>User email:</b> test@abc.com</li>      <li><b>Identifier:</b> abc123def132afd1213afas</li>      <li><b>Description:</b> Tom's iPad</li>      <li><b>Model:</b> iPad 3</li>      <li><b>Platform:</b> </li>      <li><b>App:</b> Test app name</li>      <li><b>UserID:</b> </li>     </ul>    <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p>            <hr style='height=2px; color:#aaa'/>        <p>We hope you enjoy the app store experience!</p>        <p style='font-size:18px; color:#999'>Powered by App47</p>      <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>
END_OF_HTML

Or with parentheses (it looks a little odd, but it works, and is sometimes necessary):

Nokogiri::HTML(<<-END_OF_HTML)
  <html>  <meta content="text/html; charset=UTF-8"/>  <body style='margin:20px'>    <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p>    <ul style='list-style-type:none; margin:25px 15px;'>      <li><b>User name:</b> Test User</li>      <li><b>User email:</b> test@abc.com</li>      <li><b>Identifier:</b> abc123def132afd1213afas</li>      <li><b>Description:</b> Tom's iPad</li>      <li><b>Model:</b> iPad 3</li>      <li><b>Platform:</b> </li>      <li><b>App:</b> Test app name</li>      <li><b>UserID:</b> </li>     </ul>    <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p>            <hr style='height=2px; color:#aaa'/>        <p>We hope you enjoy the app store experience!</p>        <p style='font-size:18px; color:#999'>Powered by App47</p>      <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>
END_OF_HTML

You can read more about heredocs, and other ways of representing strings, in the Literals section of the Ruby documentation.
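
As one example of those other representations, percent literals let you pick a delimiter that does not occur in the text. This is a minimal sketch with shortened, made-up fragments: %q behaves like a single-quoted string and %Q like a double-quoted one, and neither needs the quotes inside to be escaped.

```ruby
# %q{...}: like single quotes, but ' and " can both appear unescaped.
html = %q{<meta content="text/html; charset=UTF-8"/> <li><b>Description:</b> Tom's iPad</li>}

# %Q{...}: like double quotes, so #{...} is interpolated.
model = "iPad 3"
item = %Q{<li class="model"><b>Model:</b> #{model}</li>}

puts html
puts item
```

This works well for one-line strings; for multi-line HTML a heredoc is usually easier to read.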

Parsing an HTML string with Nokogiri
