Question

My format is as follows:

AUTHOR, "TITLE" (PAGES pp.) [CODE STATUS]

For example, I have the string

P.G. Wodehouse, "Heavy Weather" (336 pp.) [PH.409 AVAILABLE FOR LENDING]

and I want to extract

AUTHOR = P.G. Wodehouse
TITLE = Heavy Weather
PAGES = 336
CODE = PH.409
STATUS = AVAILABLE FOR LENDING

I only know how to do this in Python. Is there an efficient way to do the same thing in C++?

Answer 1

Exactly the same way as in Python. C++11 has regular expressions (and for earlier C++, there is Boost regex). As for the read loop:

std::string line;
while ( std::getline( file, line ) ) {
    //  ...
}

is almost exactly the same as:

for line in file:
    #    ...

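The only differences are:

The C++ version does not put the trailing '\n' in the buffer. (In general, the C++ version may be less flexible with regard to end-of-line handling.)
In case of a read error, the C++ version breaks out of the loop; the Python version raises an exception.

Neither should be an issue in your case.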

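EDIT:

It occurs to me that while regular expressions in C++ and Python are very similar, the syntax for using them is not quite the same. So:

In C++, you normally declare an instance of the regular expression before using it. Something like Python's re.match( r'...', line ) is theoretically possible, but not very idiomatic (and it would still involve explicitly constructing a regular expression object in the expression). Also, the match function simply returns a boolean; if you want the captures, you need to define a separate object for them. Typical use would be something like: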

static std::regex const matcher( "the regular expression" );
std::smatch forCaptures;
if ( std::regex_match( line, forCaptures, matcher ) ) {
    std::string firstCapture = forCaptures[1];
    //  ...
}

which corresponds to the Python:

m = re.match( 'the regular expression', line )
if m:
    firstCapture = m.group(1)
    #   ...
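
As a concrete instance of this pattern, the sketch below applies it to the record format from the question. The function name splitRecord and the exact regular expression are assumptions about what counts as a valid line, so adjust them to your data.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: on success, fills 'fields' with
// AUTHOR, TITLE, PAGES, CODE, STATUS (in that order) and returns true.
// On failure, 'fields' is left untouched.
bool splitRecord( std::string const& line, std::vector<std::string>& fields )
{
    // Assumed grammar: AUTHOR, "TITLE" (PAGES pp.) [CODE STATUS]
    static std::regex const matcher(
        R"re(([^,]*),\s*"([^"]*)"\s*\((\d+) pp\.\)\s*\[(\S+)\s+([^\]]*)\])re" );
    std::smatch captures;
    if ( ! std::regex_match( line, captures, matcher ) ) {
        return false;
    }
    fields.clear();
    // captures[0] is the whole match; the groups start at index 1.
    for ( std::size_t i = 1; i < captures.size(); ++ i ) {
        fields.push_back( captures[i].str() );
    }
    return true;
}
```

Called with the sample line from the question, this should leave "P.G. Wodehouse" in fields[0] and "AVAILABLE FOR LENDING" in fields[4].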

EDIT:

Another answer suggests overloading operator>>; I heartily concur. Just out of curiosity, I gave it a go; something like the following works well:

struct Book
{
    std::string author;
    std::string title;
    int         pages;
    std::string code;
    std::string status;
};

std::istream&
operator>>( std::istream& source, Book& dest )
{
    std::string line;
    std::getline( source, line );
    if ( source )
    {
        static std::regex const matcher(
            R"^(([^,]*),\s*"([^"]*)"\s*\((\d+) pp\.\)\s*\[(\S+)\s*([^\]]*)\])^"
            );
        std::smatch capture;
        if ( ! std::regex_match( line, capture, matcher ) ) {
            source.setstate( std::ios_base::failbit );
        } else {
            dest.author = capture[1];
            dest.title  = capture[2];
            dest.pages  = std::stoi( capture[3] );
            dest.code   = capture[4];
            dest.status = capture[5];
        }
    }
    return source;
}

Once you've done this, you can write things like:

std::vector<Book> v( (std::istream_iterator<Book>( inputFile )),
                     (std::istream_iterator<Book>()) );

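to load an entire file in the initialization of a vector.

Note the error handling in operator>>. If a line is malformed, we set failbit; this is the standard convention in C++.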

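EDIT:

Since there has been so much discussion: the above is fine for small, one-off programs, such as a school project or a program that reads the current file, outputs it in a new format, and is then thrown away. In production code, I would insist on support for comments and empty lines, on continuing after an error so that multiple errors get reported (with line numbers), and probably on continuation lines (since titles can get long enough to become unwieldy). It isn't practical to do this with operator>>, if for no other reason than the need to output line numbers, so I'd use a parser along the following lines: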

int
getContinuationLines( std::istream& source, std::string& line )
{
    int results = 0;
    while ( source.peek() == '&' ) {
        std::string more;
        std::getline( source, more );   //  Cannot fail, because of peek
        more[0] = ' ';
        line += more;
        ++ results;
    }
    return results;
}

void
trimComment( std::string& line )
{
    char quoted = '\0';
    std::string::iterator position = line.begin();
    //  Advance to the end of the line or to the first unquoted '#'.
    while ( position != line.end() && (quoted != '\0' || *position != '#') ) {
        if ( *position == '\\' && std::next( position ) != line.end() ) {
            ++ position;    //  Skip the escaped character.
        } else if ( *position == quoted ) {
            quoted = '\0';
        } else if ( quoted == '\0' && (*position == '\"' || *position == '\'') ) {
            quoted = *position;
        }
        ++ position;
    }
    line.erase( position, line.end() );
}

bool
isEmpty( std::string const& line )
{
    return std::all_of(
        line.begin(),
        line.end(),
        []( unsigned char ch ) { return isspace( ch ); } );
}

std::vector<Book>
parseFile( std::istream& source )
{
    std::vector<Book> results;
    int lineNumber = 0;
    std::string line;
    bool errorSeen = false;
    while ( std::getline( source, line ) ) {
        ++ lineNumber;
        int extraLines = getContinuationLines( source, line );
        trimComment( line );
        if ( ! isEmpty( line ) ) {
            static std::regex const matcher(
                R"^(([^,]*),\s*"([^"]*)"\s*\((\d+) pp\.\)\s*\[(\S+)\s*([^\]]*)\])^"
                );
            std::smatch capture;
            if ( ! std::regex_match( line, capture, matcher ) ) {
                std::cerr << "Format error, line " << lineNumber << std::endl;
                errorSeen = true;
            } else {
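                //  Book is an aggregate with no constructor, so build it with braces.
                results.push_back( Book{
                    capture[1].str(),
                    capture[2].str(),
                    std::stoi( capture[3].str() ),
                    capture[4].str(),
                    capture[5].str() } );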
            }
        }
        lineNumber += extraLines;
    }
    if ( errorSeen ) {
        results.clear();    //  Or more likely, throw some sort of exception.
    }
    return results;
}

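The real issue here is how you report errors to the caller. I suspect that in most cases an exception would be appropriate, but depending on the use case, other alternatives may be valid as well. In this example, I just return an empty vector. (The interaction between comments and continuation lines probably needs to be better defined as well, with the code modified according to that definition.)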
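
As a rough sketch of the exception alternative: parseFile could collect the numbers of the offending lines instead of setting a flag, then throw once at the end. FormatError and throwIfErrors are hypothetical names used only for illustration.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical exception type: carries the numbers of all offending lines.
class FormatError : public std::runtime_error
{
public:
    explicit FormatError( std::vector<int> const& badLines )
        : std::runtime_error( describe( badLines ) )
        , myBadLines( badLines )
    {
    }
    std::vector<int> const& badLines() const { return myBadLines; }

private:
    // Builds the what() text, e.g. "Format error in line(s): 2 7".
    static std::string describe( std::vector<int> const& badLines )
    {
        std::ostringstream text;
        text << "Format error in line(s):";
        for ( int n : badLines ) {
            text << ' ' << n;
        }
        return text.str();
    }
    std::vector<int> myBadLines;
};

// What the end of parseFile might call instead of results.clear().
void throwIfErrors( std::vector<int> const& badLines )
{
    if ( ! badLines.empty() ) {
        throw FormatError( badLines );
    }
}
```

The caller can then catch FormatError and decide whether to show every bad line or just abort.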

Answer 2

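Your input string is well delimited, so for speed and ease of use I'd recommend an extraction operator over a regex.

You'd first need to create a struct for your books: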

struct book{
    string author;
    string title;
    int pages;
    string code;
    string status;
};

Then you'd need to write the actual extraction operator:

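// Assumes <istream>, <limits>, <string> and using namespace std.
istream& operator>>(istream& lhs, book& rhs){
    lhs >> ws;                                            // skip leading whitespace, including the previous newline
    getline(lhs, rhs.author, ',');                        // AUTHOR: everything up to the comma
    lhs.ignore(numeric_limits<streamsize>::max(), '"');   // skip to the opening quote
    getline(lhs, rhs.title, '"');                         // TITLE: everything up to the closing quote
    lhs.ignore(numeric_limits<streamsize>::max(), '(');   // skip to the opening parenthesis
    lhs >> rhs.pages;                                     // PAGES: the number before " pp."
    lhs.ignore(numeric_limits<streamsize>::max(), '[');   // skip to the opening bracket
    lhs >> rhs.code >> ws;                                // CODE: first word inside the brackets
    getline(lhs, rhs.status, ']');                        // STATUS: the rest, up to the closing bracket
    return lhs;
}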

This gives you a great deal of power. For example, you can extract all the books in an istream into a vector like this:

istringstream foo("P.G. Wodehouse, \"Heavy Weather\" (336 pp.) [PH.409 AVAILABLE FOR LENDING]\nJohn Bunyan, \"The Pilgrim's Progress\" (336 pp.) [E.1173 CHECKED OUT]");
vector<book> bar{ istream_iterator<book>(foo), istream_iterator<book>() };

Answer 3

Use flex (it generates C or C++ code, which can be used as part of a program or as a complete program)

%%
^[^,]+/,          {printf("Author: %s\n",yytext  );}
\"[^"]+\"         {printf("Title: %s\n",yytext  );}
\([^ ]+/[ ]pp\.   {printf("Pages: %s\n",yytext+1);}
..................
.|\n              {}
%%

(untested)

Answer 4

Here is the code:

#include <iostream>
#include <string>

using namespace std;

string extract (string a)
{
    string str = "AUTHOR = "; //the result string
    int i = 0;
    while (a[i] != ',')
        str += a[i++];
    while (a[i++] != '\"');

    str += "\nTITLE = ";
    while (a[i] != '\"')
        str += a[i++];
    while (a[i++] != '(');

    str += "\nPAGES = ";
    while (a[i] != ' ')
        str += a[i++];
    while (a[i++] != '[');

    str += "\nCODE = ";
    while (a[i] != ' ')
        str += a[i++];
    while (a[i++] == ' ');

    str += "\nSTATUS = ";
    while (a[i] != ']')
        str += a[i++];
    return str;
}

int main ()
{
    string a;
    getline (cin, a);
    cout << extract (a) << endl;
    return 0;
}

Happy coding :)
