How to parse lines with a varying number of fields in C++

Time: 2009-04-13 06:31:43

Tags: c++ parsing stringstream

My data looks like this:

AAA 0.3 1.00 foo chr1,100
AAC 0.1 2.00 bar chr2,33
AAT 3.3 2.11     chr3,45
AAG 1.3 3.11 qux chr1,88
ACA 2.3 1.33     chr8,13
ACT 2.3 7.00 bux chr5,122

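Note that the lines above are tab-separated. Also, a line may sometimes contain 5 fields and sometimes 4.

What I want to do is capture the 4th field in a variable as "" if it contains no value.

I have the following code, but it somehow reads the 5th field as the 4th field when the 4th one is empty.

What is the right way to do this?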

#include <iostream>
#include <vector>
#include <fstream>
#include <sstream>
using namespace std;

int main  ( int arg_count, char *arg_vec[] ) {
    string line;
    ifstream myfile (arg_vec[1]);

    if (myfile.is_open())
    {
        while (getline(myfile,line) )
        {
            stringstream ss(line);    
            string Tag;  
            double Val1;
            double Val2;
            double Field4;
            double Field5;

            ss >> Tag >> Val1 >> Val2 >> Field4 >> Field5;
            cout << Field4 << endl;
            //cout << Tag << "," << Val1 << "," << Val2 << "," << Field4 << "," << Field5 << endl;

        }
        myfile.close();
    }
    else { cout << "Unable to open file"; }
    return 0;
}

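7 answers:

Answer 0 (score: 6)

Tokenize the line into a vector of strings, then convert the tokens to the appropriate data types depending on how many tokens there are.

If you can use Boost.Spirit, this reduces to the simple problem of defining a suitable grammar.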
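The tokenizing approach can be sketched without Boost roughly as follows. This is only a minimal illustration: `Record`, `splitKeepEmpty` and `parseRecord` are hypothetical names that do not appear in the question, and the sketch assumes that an empty 4th column shows up either as two consecutive tabs or as a missing column.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type for illustration; the names are not from the question.
struct Record {
    std::string tag;
    double val1;
    double val2;
    std::string field4;  // "" when the column is empty or missing
    std::string field5;
};

// Split on a single delimiter character, keeping empty tokens
// (so "a\t\tb" yields three tokens, the middle one empty).
std::vector<std::string> splitKeepEmpty(const std::string& line, char delim)
{
    std::vector<std::string> tokens;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type pos = line.find(delim, start);
        if (pos == std::string::npos) {
            tokens.push_back(line.substr(start));
            break;
        }
        tokens.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
    return tokens;
}

// Returns false if the line does not have 4 or 5 columns.
bool parseRecord(const std::string& line, Record& out)
{
    const std::vector<std::string> t = splitKeepEmpty(line, '\t');
    if (t.size() != 4 && t.size() != 5)
        return false;
    out.tag = t[0];
    out.val1 = std::strtod(t[1].c_str(), 0);
    out.val2 = std::strtod(t[2].c_str(), 0);
    out.field4 = (t.size() == 5) ? t[3] : std::string();
    out.field5 = t.back();
    return true;
}
```

Calling `parseRecord(line, rec)` on each line read with `getline` leaves `rec.field4` empty for rows such as the `AAT` and `ACA` ones. The numeric conversion here does no error checking, which a real program would probably want.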

Answer 1 (score: 4)

If you want to try Boost.Spirit, start with this. It does compile, and I have tested it a little. It seems to work fine.

#include <iostream>
#include <vector>
#include <fstream>
#include <sstream>
#include <list>
#include <boost/spirit/core.hpp>
#include <boost/spirit/actor/assign_actor.hpp>

using namespace std;
using namespace boost::spirit;

struct OneLine
{
        string tag;
        double val1;
        double val2;
        string field4;
        string field5;
};

int main  ( int arg_count, char *arg_vec[] ) {
    string line;
    ifstream myfile (arg_vec[1]);
    list<OneLine> myList;

    if (myfile.is_open())
    {
        while (getline(myfile,line) )
        {
                OneLine result;
                rule<> good_p(alnum_p|punct_p);
                parse( line.c_str(),
                    (*good_p)[assign_a(result.tag)] >> ch_p('\t') >>
                    real_p[assign_a(result.val1)] >> ch_p('\t') >>
                    real_p[assign_a(result.val2)] >> ch_p('\t') >>
                    (*good_p)[assign_a(result.field4)] >> ch_p('\t') >>
                    (*good_p)[assign_a(result.field5)],
                    ch_p(";") );

                myList.push_back( result );
        }
        myfile.close();
    }
    else { cout << "Unable to open file"; }
    return 0;
}

Answer 2 (score: 4)

Another C++-only version, which simply relies on the fact that an istream must set failbit if operator>> fails to parse.

// ss is the input stream (for example an ifstream)
string line, tag, v3, v4;
double v1, v2;

while(getline(ss, line))
{
    stringstream sl(line);

    sl >> tag >> v1 >> v2 >> v3 >> v4;

    if(sl.fail()) // failed to parse 5 arguments? (eofbit may be set as well)
    {
        sl.clear();
        sl.seekg(0, ios::beg);
        sl >> tag >> v1 >> v2 >> v4; // do it again with 4
        v3 = "EMPTY"; // just a default value
    }


    cout << "tag: " << tag <<std::endl
        << "v1: " << v1 << std::endl
        << "v2: " << v2 << std::endl
        << "v3: " << v3 << std::endl
        << "v4: " << v4 << std::endl << std::endl;
}
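To see which state a short line actually leaves behind, a small check along these lines may help. `stateAfterFiveFieldRead` is a hypothetical helper written for this illustration. On a line with only four whitespace-separated fields, the stream reaches the end of the string while reading the fourth, so eofbit ends up set alongside failbit. That is why testing `fail()` is more robust than comparing `rdstate()` with `ios::failbit` for equality.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: tries to read five whitespace-separated fields
// from one line and returns the stream state afterwards.
std::ios::iostate stateAfterFiveFieldRead(const std::string& line)
{
    std::istringstream sl(line);
    std::string tag, v3, v4;
    double v1, v2;
    sl >> tag >> v1 >> v2 >> v3 >> v4;
    return sl.rdstate();
}
```

For a line with an empty 4th column the returned state has both failbit and eofbit set, while a full five-field line sets eofbit only.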

Answer 3 (score: 2)

With Boost:

int main()
{
    std::ifstream in("parsefile.in");

    if (!in)
        return 1;

    typedef std::istreambuf_iterator<char> InputIterator;
    typedef boost::char_separator<char> Separator;
    typedef boost::tokenizer< Separator, InputIterator > Tokenizer;

    Tokenizer tokens(InputIterator(in),
                     InputIterator(),
                     Separator(",\t\n", "", boost::keep_empty_tokens));

    const std::size_t columnsCount = 6;
    std::size_t columnNumber = 1;
    for(Tokenizer::iterator it = tokens.begin(); 
        it != tokens.end(); 
        ++it)
    {
        const std::string value = *it;

        if ( 2 == columnNumber )
        {
            const double d = convertToDouble(value);
        }

        std::cout << std::setw(10) << value << "|";

        if ( columnsCount == columnNumber )
        {
            std::cout << std::endl;
            columnNumber = 1;
        }
        else
        {
            ++columnNumber;
        }
    }

    return 0;
}

Without Boost:

int main()
{
    std::ifstream in("parsefile.in");

    if (!in)
        return 1;

    const std::size_t columnNumber = 5;
    while (in)
    {
        std::vector< std::string > columns(columnNumber);

        for (std::size_t i = 0; i < columnNumber - 1; ++i)
            std::getline(in, columns[i], '\t');
        std::getline(in, columns[columnNumber - 1], '\n');

        if (!in)
            break; // a read failed (end of file): no complete row to print

        std::cout << columns[3] << std::endl;
    }

    return 0;
}

To convert a string value to a double, you can use something like the following.

double convertToDouble( const std::string& value )
{
    std::stringstream os;
    os << value;
    double result;
    os >> result;
    return result;
}
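Note that `convertToDouble` gives no indication when the text is not a number (an empty column, for instance). A checked variant could look something like this sketch; `tryConvertToDouble` is a hypothetical name, and rejecting trailing characters is an assumption about what the caller wants.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: returns true and stores the value only if the
// whole string (ignoring surrounding whitespace) parses as a double.
bool tryConvertToDouble(const std::string& value, double& result)
{
    std::istringstream is(value);
    double parsed;
    if (!(is >> parsed))
        return false;   // empty string or no number at the start
    char extra;
    if (is >> extra)
        return false;   // something other than whitespace follows the number
    result = parsed;
    return true;
}
```

With this, a caller can tell an empty or textual column apart from a numeric one instead of silently getting an unspecified value.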

Answer 4 (score: 1)

The simplest way is to use two calls to fscanf, scanf, or sscanf, as follows:

std::string line = /* some line */;
if(sscanf(line.c_str(), "%s %f %f %s", &str1, &float1, &float2, &str2) == 4){
    // 4 parameters
}else if(sscanf(line.c_str(), ...) == 5){
    // 5 parameters
}

Using Boost.Spirit looks like overkill, although this is not the most idiomatic C++ way to do it.
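One thing to watch with the two-call version: a 4-field format also matches the first four fields of a 5-field line, so the 5-field format has to be tried first. A variant that makes a single call and branches on the return value might look like the sketch below. `ScannedLine` and `scanLine` are hypothetical names, and the 64-byte buffers are an arbitrary assumption about the maximum field width.

```cpp
#include <cstdio>
#include <string>

// Hypothetical record type for illustration.
struct ScannedLine {
    std::string tag;
    double val1;
    double val2;
    std::string field4;  // "" when the line has only four fields
    std::string field5;
};

// Returns false unless the line yields exactly 4 or 5 fields.
bool scanLine(const std::string& line, ScannedLine& out)
{
    char tag[64], a[64], b[64];
    double v1, v2;
    // %63s bounds each string to its buffer; %lf reads into a double.
    // sscanf returns how many fields were actually assigned.
    const int n = std::sscanf(line.c_str(), "%63s %lf %lf %63s %63s",
                              tag, &v1, &v2, a, b);
    if (n != 4 && n != 5)
        return false;
    out.tag = tag;
    out.val1 = v1;
    out.val2 = v2;
    if (n == 5) {
        out.field4 = a;
        out.field5 = b;
    } else {
        out.field4.clear();  // the 4th column was empty
        out.field5 = a;      // so the 4th token is really the 5th column
    }
    return true;
}
```

Because `%s` skips any run of whitespace, consecutive tabs collapse, which is exactly why the return value is needed to tell the two cases apart.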

Answer 5 (score: 1)

Yet another version. I think this is the one with the least typing!

#include <iostream>
#include <sstream>
#include <string>
using namespace std;

int main() {

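    string f1, f4, f5;   // fields 4 and 5 hold text in the sample data
    double f2, f3;

    string line;
    istringstream is;

    while( getline( cin, line ) ) {

        is.clear();          // reset eof/fail state left over from the previous line
        is.str( line );

        if ( ! (is >> f1 >> f2 >> f3 >> f4 >> f5) ) {
            is.clear();      // str() does not reset the error flags
            is.str( line);
            f4 = "*";
            is >> f1 >> f2 >> f3 >> f5;
        }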

        cout << f1 << " " << f2 << " " << f3 << " " << f4 << " " << f5 << endl;
    }
}

Answer 6 (score: 1)

A generic solution for reading and processing any text-based table. The solution uses Boost.

typedef boost::function< void (int, int, const std::string&) > RecordHandler;
void readTableFromFile( const std::string& fileName,
                        const std::string& delimiter,
                        RecordHandler handler );

void handler(int row, int col, const std::string& value)
{
    std::cout << "[ " << row << ", " << col << "] " << value;
}

int main()
{
    readTableFromFile("parsefile.in", "\t,", handler);

    return 0;
}

Implementation

std::size_t columnsCountInTheFile( const std::string& fileName,
                                   const std::string& delimiter )
{
    typedef boost::char_separator<char> Separator;
    typedef boost::tokenizer< Separator > Tokenizer;

    std::ifstream in(fileName.c_str());

    std::string line;
    std::getline(in, line);

    Tokenizer t(line,
                Separator(delimiter.c_str(), "", boost::keep_empty_tokens));

    return std::distance(t.begin(), t.end());
}

void readTableFromFile( const std::string& fileName,
                        const std::string& delimiter,
                        RecordHandler handler )
{
    std::ifstream in(fileName.c_str());

    if (!in)
        throw std::runtime_error("can't read from " + fileName);

    typedef std::istreambuf_iterator<char> InputIterator;
    typedef boost::char_separator<char> Separator;
    typedef boost::tokenizer< Separator, InputIterator > Tokenizer;

    Tokenizer tokens(InputIterator(in),
                     InputIterator(),
                     Separator((delimiter + "\n").c_str(), "", boost::keep_empty_tokens));

    const std::size_t columnsCount = columnsCountInTheFile(fileName, delimiter);

    std::size_t columnNumber = 1;
    std::size_t rowNumber = 1;
    for(Tokenizer::iterator it = tokens.begin(); 
        it != tokens.end(); 
        ++it)
    {
        handler(rowNumber, columnNumber, *it);

        if ( columnsCount == columnNumber )
        {
            columnNumber = 1;
            ++rowNumber;
        }
        else
        {
            ++columnNumber;
        }
    }
}