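I'm using Nokogiri code to extract text between HTML nodes, and I get these errors when I read in a list of files. I did not get the errors with simple embedded HTML. I would like to eliminate or suppress the warnings, but I don't know how. The warnings appear at the end of each block: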
extract.rb:18: warning: already initialized constant EXTRACT_RANGES
extract.rb:25: warning: already initialized constant DELIMITER_TAGS
Here is my code:
#!/usr/bin/env ruby -wKU
require 'rubygems'
require 'nokogiri'
require 'fileutils'
source = File.open('/documents.txt')
source.readlines.each do |line|
line.strip!
if File.exists? line
file = File.open(line)
doc = Nokogiri::HTML(File.read(line))
# suggested by dan healy, stackoverflow
# Specify the range between delimiter tags that you want to extract
# triple dot is used to exclude the end point
# 1...2 means 1 and not 2
EXTRACT_RANGES = [
1...2
]
# Tags which count as delimiters, not to be extracted
DELIMITER_TAGS = [
"h1",
"h2",
"h3"
]
extracted_text = []
i = 0
# Change /"html"/"body" to the correct path of the tag which contains this list
(doc/"html"/"body").children.each do |el|
if (DELIMITER_TAGS.include? el.name)
i += 1
else
extract = false
EXTRACT_RANGES.each do |cur_range|
if (cur_range.include? i)
extract = true
break
end
end
if extract
s = el.inner_text.strip
unless s.empty?
extracted_text << el.inner_text.strip
end
end
end
end
print("\n")
puts line
print(",\n")
# Print out extracted text (each element's inner text is separated by newlines)
puts extracted_text.join("\n\n")
end
end
Answer 0 (score: 2)
If the code were indented correctly, it would be easier to notice that the constants are defined inside the loop.
Compare
source.readlines.each do |line|
# code
if true
# Wrongly indented code
# More
# Wrongly
# Indented
# Code
EXTRACT_RANGES = [
1...2
]
# Several more pages of code
end
end
with
source.readlines.each do |line|
  # code
  if true
    # Correctly indented code
    # What is a constant doing being defined
    # this far indented?
    # Oh no - it's in a loop!
    EXTRACT_RANGES = [
      1...2
    ]
    # Several more pages of code
  end
end
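To see the effect in isolation, here is a minimal sketch that reproduces the warning without Nokogiri. The helper `capture_stderr` and the constants `DEMO_INSIDE` and `DEMO_OUTSIDE` are made-up names for illustration, and the sketch assumes warnings have not been switched off (for example with `-W0`).

```ruby
require 'stringio'

# Run the block with $stderr temporarily swapped for a StringIO and
# return whatever was written to it (Ruby warnings go to $stderr).
def capture_stderr
  original = $stderr
  $stderr = StringIO.new
  yield
  $stderr.string
ensure
  $stderr = original
end

# Constant assigned inside the loop body: the second pass reassigns it,
# which is what triggers "already initialized constant".
inside = capture_stderr do
  2.times do
    DEMO_INSIDE = [1...2]
  end
end

# Constant assigned once, before the loop: the loop only reads it.
DEMO_OUTSIDE = [1...2]
outside = capture_stderr do
  2.times { DEMO_OUTSIDE.first }
end

puts "warned inside the loop:  #{inside.include?('already initialized constant DEMO_INSIDE')}"
puts "warned outside the loop: #{!outside.empty?}"
```

The first block mirrors the question's code, where the assignment runs once per file; the second mirrors the fix of defining the constant before the loop starts.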
Answer 1 (score: 1)
I hadn't noticed that before. Just move the constants out of the each block:
# Defined once, before the loop, so they are never reassigned
EXTRACT_RANGES = [
  1...2
]
# Tags which count as delimiters, not to be extracted
DELIMITER_TAGS = [
  "h1",
  "h2",
  "h3"
]
# source is the file list opened as in the question
source.readlines.each do |line|
  line.strip!
  # File.exist? replaces File.exists?, which was deprecated and later removed from Ruby
  if File.exist? line
    doc = Nokogiri::HTML(File.read(line))
    extracted_text = []
    i = 0
    # Change /"html"/"body" to the correct path of the tag which contains this list
    (doc/"html"/"body").children.each do |el|
      if DELIMITER_TAGS.include? el.name
        # Each delimiter tag starts a new section
        i += 1
      else
        extract = false
        EXTRACT_RANGES.each do |cur_range|
          if cur_range.include? i
            extract = true
            break
          end
        end
        if extract
          s = el.inner_text.strip
          unless s.empty?
            extracted_text << s
          end
        end
      end
    end
    print("\n")
    puts line
    print(",\n")
    # Print out extracted text (each element's inner text is separated by newlines)
    puts extracted_text.join("\n\n")
  end
end
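The section-counting logic can also be checked on its own, without Nokogiri or any files. The sketch below is only an illustration: `extract_sections` is a hypothetical helper, and plain `[tag_name, text]` pairs stand in for the parsed nodes (where the real code reads `el.name` and `el.inner_text`).

```ruby
# Constants defined once at the top level, frozen so they cannot be mutated.
EXTRACT_RANGES = [1...2].freeze
DELIMITER_TAGS = %w[h1 h2 h3].freeze

# elements: array of [tag_name, text] pairs standing in for Nokogiri nodes.
# Returns the stripped, non-empty text of every element whose section
# number falls inside one of EXTRACT_RANGES.
def extract_sections(elements)
  section = 0
  elements.each_with_object([]) do |(name, text), extracted|
    if DELIMITER_TAGS.include?(name)
      # A delimiter tag starts the next section and is not extracted itself.
      section += 1
    elsif EXTRACT_RANGES.any? { |range| range.include?(section) }
      stripped = text.strip
      extracted << stripped unless stripped.empty?
    end
  end
end

sample = [
  ["p",  "intro"],    # section 0: before the first heading, skipped
  ["h1", "Title"],    # delimiter, section becomes 1
  ["p",  " first "],  # section 1: extracted
  ["p",  "   "],      # section 1: blank after strip, skipped
  ["h2", "Next"],     # delimiter, section becomes 2
  ["p",  "second"]    # section 2: outside 1...2, skipped
]

puts extract_sections(sample).inspect
```

Using `any?` in place of the flag-and-break loop is a stylistic choice; the behavior is intended to match the loop in the answer above.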
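Answer 2 (score: 0)
As a programming tip:
Be careful using ... versus .. for range definitions. The three-dot version is not used as often as the two-dot version, and the extra dot is easy to miss, which makes the code harder to maintain. I would need a very good reason to use three dots. Compare these outputs from IRB: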
(1...2).to_a
=> [1]
vs.
(1..1).to_a
=> [1]
See how misleading the first one is.
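For anyone who wants to check the difference outside IRB, here is a small sketch. The variable names `exclusive` and `inclusive` are just illustrative.

```ruby
exclusive = (1...2)  # three dots: the end point is excluded
inclusive = (1..2)   # two dots: the end point is included

puts exclusive.to_a.inspect    # => [1]
puts inclusive.to_a.inspect    # => [1, 2]
puts exclusive.include?(2)     # => false
puts inclusive.include?(2)     # => true
puts exclusive.exclude_end?    # => true
```

If a three-dot range is really what you want, `exclude_end?` at least gives you a way to state that intent explicitly in a check or a test.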