Converting unstructured data to structured data

Time: 2017-04-28 13:29:22

Tags: vba excel-vba access-vba text-parsing excel

Good morning,

I'm trying to convert some data that looks like the report below.

---------------------Page 1---------------------

Class Sessions Detail Report

Course Number: CRS0001290                       Trainer:                                      Location:
Course Version: 1                               Begin:    1/1/2017 12:59 PM                   Capacity:     250
Document Version:                               End:      1/1/2017 12:59 PM                   Total Enrolled:    225

lastname, 1st name             PSN0001004                                Academy                                  Enrolled

lastname, 1st name                  PSN0001005                                Academy                                  Enrolled


Page        1/83                                                                                              Wednesday, April 26, 2017
---------------------Page 2---------------------

Class Sessions Detail Report

Course Number: CRS0001290                        Trainer:                                       Location:
Course Version: 1                                Begin:     1/1/2017 12:59 PM                   Capacity:     250
Document Version:                                End:       1/1/2017 12:59 PM                   Total Enrolled:    225

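After trainee number 225, the list of trainees for the next class begins. This pattern repeats throughout the file.

Ideally, I'd like the output broken out into the columns COURSE, NAME, ID and STATUS. The department is not needed. I have a little Visual Basic experience, so that may be the best language to try this in.

In the end, the result should look like this:

(link opens a .csv) https://drive.google.com/file/d/0Bzvy0h4-5229ZFY5Qk5BRm1WX1E/view?usp=sharing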

-Al

1 answer:

Answer 0: (score: 0)

My Visual Basic is too rusty to give you working code, but here is some pseudocode to give you a starting point:

for each $line in the file:
    if $line is blank
        or $line starts with "---------------------Page"
        or $line starts with "Class Sessions Detail Report"
        or $line starts with "Page        "
    then:
        # ignore that line

    else if $line starts with "Course Number: " then:
        $course = the string of non-blank characters following "Course Number: "

    else if $line starts with "Course Version:" then:
        $start = the string of characters after "Begin:"

    else if $line starts with "Document Version:" then:
        $end = the string of characters after "End:"

    else:
        # It's a line that has information about a trainee
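        Split $line into $fields.
        # e.g., if the fields are tab-delimited, then split on tab characters;
        # if they are padded with spaces (as the sample appears to be), split on
        # runs of two or more spaces so "lastname, 1st name" stays one field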

        # Extract the fields you're interested in:
        $name   = $fields[1]
        $id     = $fields[2]
        $status = $fields[4]

        # And then output the fields you want:
        print $course, $name, $id, $start, $end, $status

    end if
end for
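
For something you can actually run, here is a rough sketch of the same loop in Python rather than Visual Basic; the logic should port to VBA line by line. It makes a few assumptions that are not confirmed by the question: the columns are padded with two or more spaces (not tabs), the status is always the last column on a trainee line, and only the four requested columns (COURSE, NAME, ID, STATUS) are wanted, so the Begin and End values are skipped. The names parse_report, to_csv and SAMPLE are made up for this example.

```python
import csv
import io
import re

# Lines that carry no trainee data: page separators, the report title,
# and the "Page 1/83 ... date" footer.
NOISE = re.compile(r"^(-{5,}Page|Class Sessions Detail Report|Page\s+\d+/\d+)")
# Header lines whose values are not needed for the four output columns.
HEADER = re.compile(r"^(Course Version:|Document Version:)")


def parse_report(lines):
    """Return a list of (course, name, trainee_id, status) tuples."""
    course = ""
    rows = []
    for raw in lines:
        line = raw.strip()
        if not line or NOISE.match(line) or HEADER.match(line):
            continue
        if line.startswith("Course Number:"):
            # Take the first run of non-blank characters after the label.
            rest = line[len("Course Number:"):].split()
            course = rest[0] if rest else ""
            continue
        # Assumption: columns are separated by two or more spaces, so the
        # single spaces inside "lastname, 1st name" stay in one field.
        fields = re.split(r"\s{2,}", line)
        if len(fields) >= 3:
            rows.append((course, fields[0], fields[1], fields[-1]))
    return rows


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["COURSE", "NAME", "ID", "STATUS"])
    writer.writerows(rows)
    return out.getvalue()


SAMPLE = """\
---------------------Page 1---------------------

Class Sessions Detail Report

Course Number: CRS0001290                       Trainer:                                      Location:
Course Version: 1                               Begin:    1/1/2017 12:59 PM                   Capacity:     250
Document Version:                               End:      1/1/2017 12:59 PM                   Total Enrolled:    225

lastname, 1st name             PSN0001004                                Academy                                  Enrolled

Page        1/83                                                                                              Wednesday, April 26, 2017
"""

if __name__ == "__main__":
    print(to_csv(parse_report(SAMPLE.splitlines())), end="")
```

Because the name itself contains a comma, the csv module quotes that field in the output. If your real file turns out to be tab-delimited, swapping the regular expression in re.split for a plain split on "\t" should be the only change needed.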