Question

I have an HTML document that contains this data:

<div>
    <p class="someclass">
        <ul>
            <li>Item 1</li>
            <li>Item 2</li>
        </ul>
    </p>
</div>

I parse it with:

div_node.children.each do |child|
  if child.node_name == 'p'
    # store it as an HTML string in the db
    store(child.to_html)
  end
end

When I check the database, I only get the outer <p> tag:

<p class="someclass"> </p>

The content of the inner <ul> tag is not stored or retrieved.

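I know that a <p> tag cannot contain a <ul> tag, but the documents we receive from the client contain this kind of markup, and there are about 1,000 of them, so I can't edit them by hand.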

Answer 1

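Try the Nokogiri::XML parser instead of the Nokogiri::HTML parser. It shouldn't care about tag semantics, but I'm not sure how it will handle the parts of HTML5 that are not valid XML.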

Answer 2

I ended up using the Nokogiri::XML parser to parse the HTML doc.

I had to change my script in quite a few places.

Parsing code

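require 'nokogiri'
# Parse with the XML parser (noblanks drops whitespace-only text nodes), then strip namespaces.
@xml_doc = Nokogiri::XML.parse(file) { |cfg| cfg.noblanks }
@xml_doc.remove_namespaces!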

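Changes made

Changed the attribute method to attr
No need to chain text onto the attr call here
Need to check for invalid HTML5 code
More changes to the parsing logic were needed
node.to_html works like a charm, so I was able to store the HTML in the db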

Parsing HTML with Nokogiri (without following HTML semantics)
