I have a TSV file produced by a program, but it packs several pieces of information into a single cell, separated by pipe characters.
XP_017347145.1 GO:0003676|GO:0005524|GO:0006139|GO:0008026|GO:0016818
XP_017347145.1 GO:0003677|GO:0004003|GO:0005524
XP_017347145.1 GO:0005524
XP_017347145.1 GO:0004003|GO:0016818
XP_017347145.1 GO:0003676|GO:0005524|GO:0006139|GO:0008026|GO:0016818
XP_017350967.1 GO:0005515
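I want to convert it into two columns as shown below, but I don't seem to understand how to use the getline() function in C++.
I'm not very experienced, but the output should look like this: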
XP_017347145.1 = GO:0003676
XP_017347145.1 = GO:0005524
XP_017347145.1 = GO:0006139
XP_017347145.1 = GO:0008026
XP_017347145.1 = GO:0016818
XP_017347145.1 = GO:0003677
XP_017347145.1 = GO:0004003
XP_017347145.1 = GO:0005524
XP_017347145.1 = GO:0005524
XP_017347145.1 = GO:0004003
XP_017347145.1 = GO:0016818
XP_017347145.1 = GO:0003676
XP_017347145.1 = GO:0005524
XP_017347145.1 = GO:0006139
XP_017347145.1 = GO:0008026
XP_017347145.1 = GO:0016818
XP_017350967.1 = GO:0005515
My current C++ code fails: in some places the equals sign is missing and a tab appears instead.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
int main() {
    using namespace std;
    string stringIn;
    string stringOut;
    string value;
    string value2;
    cout << "Input the name of the file: " << endl;
    getline(cin, stringIn);
    cout << "The output file name is " << endl;
    getline(cin, stringOut);
    ifstream inputFile(stringIn);
    ofstream outputFile(stringOut);
    // Let the user know if the file exists
    if (!inputFile) {
        cout << "Cannot open input file" << endl;
    }
    if (!outputFile) {
        cout << "Can not save output file" << endl;
    }
    // It should iterate through the values using column
    // and column2 delimited by the pipe sign.
    // For example, GO:0005524|GO:0008026 and this could be of unknown length.
    while (getline(inputFile,value,'\t')) {
        while (getline(inputFile,value2,'|')) {
            outputFile << value + " = " + value2 << endl;
        }
    }
    outputFile.close();
    inputFile.close();
    cin.get();
    return 0;
}
My current code produces the output shown below. Any suggestions would be greatly appreciated.
XP_017347145.1 = GO:0003676
XP_017347145.1 = GO:0005524
XP_017347145.1 = GO:0006139
XP_017347145.1 = GO:0008026
XP_017347145.1 = GO:0016818
XP_017347145.1 GO:0003677
XP_017347145.1 = GO:0004003
XP_017347145.1 = GO:0005524
XP_017347145.1 GO:0005524
XP_017347145.1 GO:0004003
XP_017347145.1 = GO:0016818
XP_017347145.1 GO:0003676
XP_017347145.1 = GO:0005524
XP_017347145.1 = GO:0006139
XP_017347145.1 = GO:0008026
XP_017347145.1 = GO:0016818
XP_017350967.1 GO:0005515
Answer 0 (score: 2)
You can solve this with sregex_token_iterator, for example:
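// Requires #include <regex>. Delimiters: a run of whitespace or a single '|'.
std::regex re("\\s+|\\|");
std::sregex_token_iterator reg_end;
while (std::getline(inputFile, value)) {
    // Submatch index -1 yields the text between delimiter matches.
    std::sregex_token_iterator it(value.begin(), value.end(), re, -1);
    std::string p1 = (it++)->str();  // first field: the protein ID
    for (; it != reg_end; ++it) {
        outputFile << p1 << " = " << it->str() << '\n';
    }
}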
Answer 1 (score: 1)
The problem occurs because getline(inputFile, value2, '|') only stops at a pipe, so across the end of a line it captures the following:
GO:0016818\nXP_017347145.1\tGO:0003677
          ^
          |
          |
   newline captured
It then prints the whole next record without an equals sign, because that record is part of the previously captured value2.
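It is better to read each line with the default getline(inputFile, line), which uses the newline \n as its delimiter. Then create a std::stringstream ss{line} from line, and finally run getline(ss, value2, '|') on it.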
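The line-by-line approach described above could look roughly like the following sketch. The function name splitRecords is hypothetical, and the code assumes each record has the form ID, a tab, then pipe-separated terms, as in the question's data.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper (not from the original post): reads records of the
// assumed form "ID<TAB>TERM|TERM|..." from `in` and writes one
// "ID = TERM" line per term to `out`.
void splitRecords(std::istream& in, std::ostream& out) {
    std::string line;
    while (std::getline(in, line)) {           // default delimiter '\n': one record per call
        std::stringstream ss{line};
        std::string id;
        if (!std::getline(ss, id, '\t')) {     // first column: text before the tab
            continue;                          // nothing to read on an empty line
        }
        std::string term;
        while (std::getline(ss, term, '|')) {  // second column: pipe-separated terms
            out << id << " = " << term << '\n';
        }
    }
}
```

With the question's streams this would be called as splitRecords(inputFile, outputFile). Because the inner getline() reads from a stream that holds only one line, it can never run past the end of the record.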
Also, I have been working with regular expressions, and I think the following may be a more elegant and general solution:
#include <iostream>
#include <regex>
#include <sstream>
#include <string>
#include <algorithm>
#include <vector>
std::stringstream input{R"(XP_017347145.1 GO:0003676|GO:0005524|GO:0006139|GO:0008026|GO:0016818
XP_017347145.1 GO:0003677|GO:0004003|GO:0005524
XP_017347145.1 GO:0005524
XP_017347145.1 GO:0004003|GO:0016818
XP_017347145.1 GO:0003676|GO:0005524|GO:0006139|GO:0008026|GO:0016818
XP_017350967.1 GO:0005515)"};
struct Record {
    std::string xp;
    std::string go;
};
std::ostream& operator<<(std::ostream& os, const Record& r)
{
    return os << "XP_" << r.xp << " = GO:" << r.go << '\n';
}
int main()
{
    // Compile both patterns once instead of on every line.
    const std::regex xp_r{R"(^XP_(\d*\.\d))"};  // match xp
    const std::regex go_r{R"(GO:(\d*)\|?)"};    // match go
    std::vector<Record> records;
    for (std::string line; std::getline(input, line);) {
        std::smatch m;
        if (std::regex_search(line, m, xp_r)) {
            auto xp = m[1].str();
            auto begin = std::sregex_iterator{line.begin(), line.end(), go_r};
            auto end = std::sregex_iterator{};
            std::for_each(begin, end, [&records, &xp](const auto& i) {
                records.emplace_back(Record{xp, i[1].str()});
            });
        }
    }
    for (const auto& i : records)
        std::cout << i;
}
Output:
XP_017347145.1 = GO:0003676
XP_017347145.1 = GO:0005524
XP_017347145.1 = GO:0006139
XP_017347145.1 = GO:0008026
XP_017347145.1 = GO:0016818
XP_017347145.1 = GO:0003677
XP_017347145.1 = GO:0004003
XP_017347145.1 = GO:0005524
XP_017347145.1 = GO:0005524
XP_017347145.1 = GO:0004003
XP_017347145.1 = GO:0016818
XP_017347145.1 = GO:0003676
XP_017347145.1 = GO:0005524
XP_017347145.1 = GO:0006139
XP_017347145.1 = GO:0008026
XP_017347145.1 = GO:0016818
XP_017350967.1 = GO:0005515