Question

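I've been trying to solve this for a while and still haven't found an answer. The goal is to get some data from an HTML web page. I can handle all the network-related parts, but I have one problem. This is my string: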

class="datastream-graph-value"&gt; 496

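The problem is those quotation marks. Without them my application would be able to get "496", which is the data I care about, but with them there I can't extract it.

What approach would let me get this data? (Note that there is a "\n" after the "&gt;" symbol.)

Thanks, mates!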

Answer 1

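Although I generally wouldn't recommend using regular expressions to read XML, parsing HTML with an XML parser can be a nightmare.

Take the following sample.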

<a class="datastream-graph-value" href="http=blah" > 496</a>
<a class="other"> 496</a>

Use the regular expression below to process it. It matches each class attribute together with its quoted value.

(class=["][^>"]*["])

A good example of how to use regular expressions like this in Java is given at http://www.vogella.com/articles/JavaRegularExpressions/article.html

If you need a code sample, reply and we'll see what we can work out.

Edit:

I was bored, so I thought why not put a sample together:

package temp;


import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTestPatternMatcher {
  public static final String EXAMPLE_TEST = "<a class=\"datastream-graph-value\" href=\"http=blah\" > 496</a> <a class=\"other\"> 496</a>";

  public static void main(String[] args) {
    Pattern pattern = Pattern.compile("(class=[\"][^>\"]*[\"])");
    // To match case-insensitively, compile with the CASE_INSENSITIVE flag instead:
    // Pattern pattern = Pattern.compile("(class=[\"][^>\"]*[\"])", Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(EXAMPLE_TEST);
    // Check all occurrences
    while (matcher.find()) {
      System.out.print("Start index: " + matcher.start());
      System.out.print(" End index: " + matcher.end() + " ");
      String match = matcher.group();
      match = match.replace("class=", "");
      System.out.println(match);
    }
    // Now create a new pattern and matcher to replace whitespace with tabs
    Pattern replace = Pattern.compile("\\s+");
    Matcher matcher2 = replace.matcher(EXAMPLE_TEST);
    System.out.println(matcher2.replaceAll("\t"));
  }
}
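
The sample above prints the class attribute values, whereas the question asks for the number that follows the tag. A minimal sketch of one way to capture it, assuming the value is a run of digits that appears after the closing ">" of the tag, possibly preceded by whitespace or a newline. The class name GraphValueExtractor, the method extractValue, and the exact pattern are illustrative assumptions, not part of the original answer.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GraphValueExtractor {
  // Assumed pattern: the class attribute, any remaining attributes up to '>',
  // optional whitespace (\s also matches newlines), then the digits to capture.
  private static final Pattern VALUE_PATTERN =
      Pattern.compile("class=\"datastream-graph-value\"[^>]*>\\s*(\\d+)");

  // Returns the first captured number, or an empty Optional if nothing matches.
  public static Optional<String> extractValue(String html) {
    Matcher matcher = VALUE_PATTERN.matcher(html);
    return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
  }

  public static void main(String[] args) {
    String html = "<a class=\"datastream-graph-value\" href=\"http=blah\" >\n496</a>"
        + " <a class=\"other\"> 7</a>";
    System.out.println(extractValue(html).orElse("no match"));
  }
}
```

The quotes inside the pattern only need a backslash because they sit inside a Java string literal; they have no special meaning in the regular expression itself. If the page really contains the literal text "&gt;" rather than ">", the pattern would have to be adjusted to match that entity instead.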

Parsing a string with quotes inside
