How to detect repeated patterns in a text file

Time: 2019-08-06 10:32:51

Tags: parsing text-files

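To simplify a function with many terms, I would like a program that searches a file for recurring patterns and puts them in a ranked list. I imagine this is a complex process, but I am sure someone has already built something like it. Example text: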

sin(t1)*cos(t1)*t1+t1-sin(t1)*sin(t1-pi)

This should give me output like the following (patterns of at least 2 characters):

 6x: "t1" 
 4x: "(t1"
 3x: "n(t1"
 3x: "sin"
 3x: "sin("
 2x: "sin(t1)"

 etc.

Does this problem have a name (I do not know one)? Is there a known algorithm that solves it?

1 answer:

Answer 0 (score: 0)

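I wrote a small program with Qt that does the job. The approach is brute force: it simply tries every substring. For my real problem this could take days, because the text file is large. With the following text as input ("text.txt"):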

sin(t1)*cos(t1)*t1+t1-sin(t1)*sin(t1-pi)

and the following parameters: length 2-5, minimum number of occurrences: 3

I get the following result:

t1 6
(t 4
(t1 4
si 3
in 3
n( 3
1) 3
)* 3
sin 3
in( 3
n(t 3
t1) 3
1)* 3
sin( 3
in(t 3
n(t1 3
(t1) 3
t1)* 3
sin(t 3
in(t1 3
(t1)* 3

Code (Qt):

#include <QCoreApplication>
#include <QDebug>
#include <QFile>
#include <QString>
#include <QStringList>
#include <QTextStream>
#include <QVector>

int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);

    const int minchar = 2;    // shortest pattern length to consider
    const int maxchar = 5;    // longest pattern length to consider
    const int min_occur = 3;  // report patterns that occur at least this often

    // Read the whole file into one string. readLine() drops the line breaks,
    // so a pattern may span the end of one line and the start of the next.
    QFile file("text.txt");
    if (!file.open(QIODevice::ReadOnly)) {
        qDebug() << "error reading file";
        return 1;
    }
    QString wholefile;
    QTextStream in(&file);
    while (!in.atEnd()) {
        wholefile.append(in.readLine());
    }
    file.close();

    // Collect every distinct substring of length minchar..maxchar.
    // contains() is a linear search, so this step is slow for large inputs.
    const int len = static_cast<int>(wholefile.length());
    QStringList allpatterns;
    for (int i = minchar; i <= maxchar; i++) {
        for (int pos = 0; pos + i <= len; pos++) {  // includes the last window
            const QString pattern = wholefile.mid(pos, i);
            if (!allpatterns.contains(pattern)) {
                allpatterns.append(pattern);
            }
        }
    }

    // Count each candidate and keep those that occur often enough.
    // QString::count() also counts overlapping occurrences.
    QStringList interestingpatterns;
    QVector<int> strcnt;  // strcnt[i] belongs to interestingpatterns[i]
    int maximum_cnt = 0;
    for (const QString &str : allpatterns) {
        const int cnt = static_cast<int>(wholefile.count(str));
        if (cnt >= min_occur) {
            if (cnt > maximum_cnt) {
                maximum_cnt = cnt;
            }
            interestingpatterns.append(str);
            strcnt.append(cnt);
        }
    }

    // Write the result, most frequent patterns first.
    QFile file2("out.txt");
    if (!file2.open(QIODevice::WriteOnly | QIODevice::Text)) {
        qDebug() << "error writing file";
        return 1;
    }
    QTextStream out(&file2);
    for (int current_max = maximum_cnt; current_max >= min_occur; current_max--) {
        for (int i = 0; i < interestingpatterns.size(); i++) {
            if (strcnt[i] == current_max) {
                qDebug() << interestingpatterns.at(i) << strcnt[i];
                out << interestingpatterns.at(i) << " " << strcnt[i] << "\n";
            }
        }
    }
    file2.close();

    return 0;  // no event loop is needed, so a.exec() is not called
}