Iterating through a tab-delimited, then pipe-delimited file in C++

Asked: 2017-10-17 19:45:46

Tags: c++11

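I have a TSV file produced by a program, but the problem is that it puts several pieces of information into a single cell, separated by pipe characters.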

XP_017347145.1    GO:0003676|GO:0005524|GO:0006139|GO:0008026|GO:0016818
XP_017347145.1    GO:0003677|GO:0004003|GO:0005524
XP_017347145.1    GO:0005524
XP_017347145.1    GO:0004003|GO:0016818
XP_017347145.1    GO:0003676|GO:0005524|GO:0006139|GO:0008026|GO:0016818
XP_017350967.1    GO:0005515

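I want to convert it into two columns as shown below, but I don't seem to understand how to use the getline() function in C++.

I'm not very experienced, but the output should look like this: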

XP_017347145.1 = GO:0003676
XP_017347145.1 = GO:0005524
XP_017347145.1 = GO:0006139
XP_017347145.1 = GO:0008026
XP_017347145.1 = GO:0016818
XP_017347145.1 = GO:0003677
XP_017347145.1 = GO:0004003
XP_017347145.1 = GO:0005524
XP_017347145.1 = GO:0005524
XP_017347145.1 = GO:0004003
XP_017347145.1 = GO:0016818
XP_017347145.1 = GO:0003676
XP_017347145.1 = GO:0005524
XP_017347145.1 = GO:0006139
XP_017347145.1 = GO:0008026
XP_017347145.1 = GO:0016818
XP_017350967.1 = GO:0005515

My current C++ code fails: in some places the equals sign is missing and a tab appears instead.

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main() {

    using namespace std;
    string stringIn;
    string stringOut;
    string value;
    string value2;

    cout << "Input the name of the file: " << endl;
    getline(cin, stringIn);
    cout << "The output file name is " << endl;
    getline(cin, stringOut);

    ifstream inputFile(stringIn);
    ofstream outputFile(stringOut);

    // Let the user know if the file exists
    if (!inputFile) {
        cout << "Cannot open input file" << endl;
    }

    if (!outputFile) {
        cout << "Can not save output file" << endl;
    }

    // It should iterate through the values using column
    // and column2 delimited by the pipe sign.
    // For example, GO:0005524|GO:0008026 and this could be of unknown length.
    while (getline(inputFile,value,'\t')) {
        while (getline(inputFile,value2,'|')) {
            outputFile << value + " = " + value2 << endl;
        }
    }

    outputFile.close();
    inputFile.close();
    cin.get();

    return 0;
}

For the data above, my current code produces the output shown below. Any suggestions would be greatly appreciated.

XP_017347145.1 = GO:0003676
XP_017347145.1 = GO:0005524
XP_017347145.1 = GO:0006139
XP_017347145.1 = GO:0008026
XP_017347145.1 = GO:0016818
XP_017347145.1    GO:0003677
XP_017347145.1 = GO:0004003
XP_017347145.1 = GO:0005524
XP_017347145.1    GO:0005524
XP_017347145.1    GO:0004003
XP_017347145.1 = GO:0016818
XP_017347145.1    GO:0003676
XP_017347145.1 = GO:0005524
XP_017347145.1 = GO:0006139
XP_017347145.1 = GO:0008026
XP_017347145.1 = GO:0016818
XP_017350967.1    GO:0005515

2 answers:

Answer 0: (score: 2)

You can solve the problem with sregex_token_iterator (from the <regex> header), for example:

    // Split on runs of whitespace or on a pipe; -1 selects the text between matches.
    std::regex re("\\s+|\\|");
    std::sregex_token_iterator reg_end;
    while (getline(inputFile, value)) {
        std::sregex_token_iterator it(value.begin(), value.end(), re, -1);
        std::string p1 = (it++)->str();   // the first token is the ID column
        for (; it != reg_end; ++it) {
            outputFile << p1 << " = " << it->str() << endl;
        }
    }

Answer 1: (score: 1)

The problem occurs because getline(inputFile,value2,'|') stops only at a pipe, so at the end of a line it captures the following:

GO:0016818\nXP_017347145.1\tGO:0003677
           ^
           |
           |
       newline captured

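The next record is then printed without an equals sign, because its ID and the tab that follows it were captured as part of the previous value2.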

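It is better to read the file one line at a time with getline(inputFile,line), which uses the default newline delimiter \n. Then create a std::stringstream ss{line} from each line, read the first column from it with getline(ss,value,'\t'), and finally loop over getline(ss,value2,'|').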

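Alternatively, I have been experimenting with regular expressions, and I think the following may be a more elegant and general solution: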

#include <iostream>
#include <regex>
#include <sstream>
#include <string>
#include <algorithm>
#include <vector>

std::stringstream input{R"(XP_017347145.1  GO:0003676|GO:0005524|GO:0006139|GO:0008026|GO:0016818
XP_017347145.1  GO:0003677|GO:0004003|GO:0005524
XP_017347145.1  GO:0005524
XP_017347145.1  GO:0004003|GO:0016818
XP_017347145.1  GO:0003676|GO:0005524|GO:0006139|GO:0008026|GO:0016818
XP_017350967.1  GO:0005515)"}; 

struct Record{
    std::string xp;
    std::string go;
};

std::ostream& operator<<(std::ostream& os, const Record& r)
{
    return os << "XP_" << r.xp << " = GO:" << r.go << '\n';
}

int main()
{
    // Compile both patterns once, outside the loop.
    const std::regex xp_r{R"(^XP_(\d*\.\d))"}; // match xp
    const std::regex go_r{R"(GO:(\d*)\|?)"};   // match go
    std::vector<Record> records;
    for(std::string line; getline(input, line);) {
        std::smatch m;
        if(std::regex_search(line, m, xp_r)){
            auto xp = m[1].str();
            auto begin = std::sregex_iterator{line.begin(), line.end(), go_r};
            auto end = std::sregex_iterator{};
            // A std::smatch parameter keeps the lambda valid in C++11 (generic lambdas need C++14).
            std::for_each(begin, end, [&records, &xp](const std::smatch& i){ records.emplace_back(Record{xp, i[1].str()}); });
        }
    }
    for(const auto& i : records)
        std::cout << i;
}

Output:

XP_017347145.1 = GO:0003676
XP_017347145.1 = GO:0005524
XP_017347145.1 = GO:0006139
XP_017347145.1 = GO:0008026
XP_017347145.1 = GO:0016818
XP_017347145.1 = GO:0003677
XP_017347145.1 = GO:0004003
XP_017347145.1 = GO:0005524
XP_017347145.1 = GO:0005524
XP_017347145.1 = GO:0004003
XP_017347145.1 = GO:0016818
XP_017347145.1 = GO:0003676
XP_017347145.1 = GO:0005524
XP_017347145.1 = GO:0006139
XP_017347145.1 = GO:0008026
XP_017347145.1 = GO:0016818
XP_017350967.1 = GO:0005515