Question

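I am downloading a JSP file from an HTTP server in C, but the content I receive in my buffer looks like this:

<HTML>
<BODY>
Hello, user
</BODY>
</HTML>

Now I want to capture only "Hello, user" into my buffer. Can anyone help me with C code for this?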

Answer 1

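Use libexpat. It is a stream-oriented XML parser written in C. You can register handlers for the BODY tag and read its content. Keep in mind that it is an XML parser, so this only works when the page is well-formed XML, as the sample above is.

Take a look at this question: "Geting xml data using xml parser expat"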

Answer 2

Basically, you want to scan the buffer and ignore everything between < and >:

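/* Copies src into dst, skipping everything between '<' and '>'.
   dst must hold at least strlen(src) + 1 bytes.
   Returns a pointer to the terminating '\0' written to dst. */
char *get_text (char *dst, const char *src) {
  int html = 0;   /* nonzero while inside a tag */
  char ch;

  while ((ch = *src++) != '\0') {
    if (ch == '<' || ch == '>') {
      html = (ch == '<');
    } else if (!html) {
      *dst++ = ch;
    }
  }

  *dst = '\0';
  return dst;
}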
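
Note that get_text as written trusts dst to be large enough and keeps the line breaks that sit between the tags, so for the sample document the result is "Hello, user" surrounded by newlines. Below is a minimal sketch of a bounded variant that also trims leading and trailing whitespace. The name get_text_trimmed and the dst_size parameter are my own additions for illustration, not part of the code above.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: copy the text outside <...> from src into dst
   (capacity dst_size bytes), dropping leading and trailing whitespace.
   Returns the number of characters written, not counting the '\0'. */
size_t get_text_trimmed(char *dst, size_t dst_size, const char *src)
{
    size_t n = 0;
    int in_tag = 0;
    char ch;

    if (dst_size == 0)
        return 0;

    while ((ch = *src++) != '\0') {
        if (ch == '<') {
            in_tag = 1;
        } else if (ch == '>') {
            in_tag = 0;
        } else if (!in_tag) {
            /* skip whitespace until the first visible character */
            if (n == 0 && isspace((unsigned char)ch))
                continue;
            /* stop when only the slot for '\0' is left */
            if (n + 1 >= dst_size)
                break;
            dst[n++] = ch;
        }
    }

    /* drop trailing whitespace */
    while (n > 0 && isspace((unsigned char)dst[n - 1]))
        n--;

    dst[n] = '\0';
    return n;
}
```

Called with a 64-byte buffer and the document from the question, this should leave just "Hello, user" in the buffer. Output that does not fit is silently truncated, which may or may not be what you want.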

Answer 3

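You can try stripping the HTML, but this may not work properly if there is more content outside the tags (you may need more specific filtering, for example checking the names of the surrounding tags).

Untested, but it should work: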

char *html = ...; // pointer to the document's contents (must be writable, it is modified in place)
int ip = 0; // the input position
int op = 0; // the output position
int in_tag = 0; // are we inside an HTML tag?
char c; // current character
while((c = html[ip++]) != '\0')
{
    if(c == '<')
        in_tag = 1;
    else if(c == '>')
        in_tag = 0;
    else if(c == '\n' || c == '\r') // strip line breaks
        ;
    else if(!in_tag)
        html[op++] = c; // op never gets ahead of ip, so overwriting in place is safe
}
html[op] = '\0';
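
To illustrate the "more specific filtering" mentioned above, here is a rough sketch that only keeps the text between the opening and closing BODY tags, matched case-insensitively. The helper names find_nocase and body_text are hypothetical, and the matching is deliberately naive: it assumes the first "<body" is the real tag and does not handle comments, scripts, or entities.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive search for needle in hay; returns NULL if absent. */
static const char *find_nocase(const char *hay, const char *needle)
{
    size_t len = strlen(needle);
    for (; *hay != '\0'; hay++) {
        size_t i = 0;
        while (i < len && hay[i] != '\0' &&
               tolower((unsigned char)hay[i]) == tolower((unsigned char)needle[i]))
            i++;
        if (i == len)
            return hay;
    }
    return NULL;
}

/* Copy the tag-free text between <body ...> and </body> into dst.
   Returns 0 on success, -1 if no body tag was found or dst_size is 0. */
int body_text(char *dst, size_t dst_size, const char *html)
{
    const char *start, *end;
    size_t n = 0;
    int in_tag = 0;

    if (dst_size == 0)
        return -1;
    dst[0] = '\0';

    start = find_nocase(html, "<body");
    if (start == NULL)
        return -1;
    start = strchr(start, '>');      /* skip any attributes of the tag */
    if (start == NULL)
        return -1;
    start++;

    end = find_nocase(start, "</body");
    if (end == NULL)
        end = start + strlen(start); /* unterminated body: read to the end */

    for (; start < end; start++) {
        char c = *start;
        if (c == '<')
            in_tag = 1;
        else if (c == '>')
            in_tag = 0;
        else if (!in_tag && c != '\n' && c != '\r' && n + 1 < dst_size)
            dst[n++] = c;            /* keep text, strip line breaks */
    }
    dst[n] = '\0';
    return 0;
}
```

Unlike the in-place loop, this writes into a separate buffer and ignores anything outside the body, such as a TITLE in the HEAD. For anything beyond simple pages like the one in the question, a real parser is the safer choice.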

Putting only the specified content into a buffer in C
