Question

ifstream toOpen;
openFile.open("sample.html", ios::in); 

if(toOpen.is_open()){
    while(!toOpen.eof()){
        getline(toOpen,line);
        if(line.find("href=") && !line.find(".pdf")){   
                start_pos = line.find("href"); 
        tempString = line.substr(start_pos+1); // i dont want the quote
            stop_pos = tempString .find("\"");
                string testResult = tempString .substr(start_pos, stop_pos);
        cout << testResult << endl;
        }
    }

    toOpen.close();
}

What I want to do is extract the "href" value, but I can't get it to work.

Edit:

Thanks to Tony's hint, I used this:

if(line.find("href=") != std::string::npos ){   
    // Process
}

It works!!

Answer 1

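I would advise against trying to parse HTML like this. Unless you know a lot about the source and are quite certain how it will be formatted, chances are that anything you do will have problems. HTML is an ugly language with an (almost) self-contradictory specification that (for example) says particular things are not allowed, but then goes on to tell you how you are required to interpret them anyway.

Worse, almost any character can (at least potentially) be encoded in any of at least three or four different ways, so unless you first scan for (and carry out) the right conversions (in the right order), you can end up missing legitimate links and/or including "phantom" links.

You might want to look at the answers to this previous question for suggestions about an HTML parser to use.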

Answer 2

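First, you might want to take a few shortcuts in the way you write the loop to make it clearer. Here is the conventional "read one line at a time" loop using C++ iostreams: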

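#include <cstdlib>   // EXIT_FAILURE
#include <fstream>
#include <iostream>
#include <string>

int main ( int, char ** )
{
    std::ifstream file("sample.html");
    if ( !file.is_open() ) {
        std::cerr << "Failed to open file." << std::endl;
        return (EXIT_FAILURE);
    }
    // getline() returns the stream, which tests false once reading fails,
    // so the loop stops cleanly at end of file.
    for ( std::string line; (std::getline(file,line)); )
    {
        // process line.
    }
}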

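As for the inner part that processes the line, there are several problems.

It doesn't compile. I assume this is what you meant by "I can't get it to work". When asking a question, this is the kind of information you will want to provide in order to get good help.
There is confusion between variable names such as temp and tempString.
string::find() returns a large positive integer (std::string::npos) to indicate an invalid position (size_type is unsigned), so you will always enter the if block unless a match is found starting at character position 0, which is exactly the case where you probably do want to enter it.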

Here is some simple test content for sample.html.

<html>
    <a href="foo.pdf"/>
</html>

Paste the following inside the loop:

if ((line.find("href=") != std::string::npos) &&
    (line.find(".pdf" ) != std::string::npos))
{
    const std::size_t start_pos = line.find("href");
    std::string temp = line.substr(start_pos+6);
    const std::size_t stop_pos = temp.find("\"");
    std::string result = temp.substr(0, stop_pos);
    std::cout << "'" << result << "'" << std::endl;
}

With that, I actually get the output

'foo.pdf'

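However, as Jerry pointed out, you probably don't want to use this in a production environment. If this is a simple homework assignment or an exercise in how to use the <string>, <iostream> and <fstream> libraries, then go ahead with this kind of approach.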

Reading a file and extracting only certain parts
