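Using a bash script to extract data from files to populate a database

Asked: 2017-04-08 09:30:10

Tags: database bash awk sed sqlite

I have a dataset made up of many files. Each file contains many reviews of the following form, separated by blank lines: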

<Author>bigBob
<Content>definitely above average! we had a really nice stay there last year when I and...USUALLY OVER MANY LINES
<Date>Jan 2, 2009
<img src="http://cdn.tripadvisor.com/img2/new.gif" alt="New"/>
<No. Reader>-1
<No. Helpful>-1
<Overall>4
<Value>4
<Rooms>4
<Location>4
<Cleanliness>5
<Check in / front desk>4
<Service>3
<Business service>4

<Author>rickMN... next review goes on

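For each review, I need to extract the data after each tag and put it into something like the following (I plan to write it to a .sql file, so that running ".read" on it populates my database):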

INSERT INTO [HotelReviews] ([Author], [Content], [Date], [Image], [No_Reader], [No_Helpful], [Overall], [Value], [Rooms], [Location], [Cleanliness], [Check_In], [Service], [Business_Service]) VALUES ('bigBob', 'definitely above...', ...)

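My question is: how do I extract the data after each tag and put it into an INSERT statement using bash?

EDIT: The text after the <Content> tag is usually a paragraph that spans multiple lines.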

2 answers:

Answer 0: (score: 1)

Example:

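#!/bin/bash

while IFS= read -r line; do
  # the text after the <Author> / <Content> tag is captured in BASH_REMATCH[1]
  [[ $line =~ ^\<Author\>(.*) ]] && Author="${BASH_REMATCH[1]}"
  [[ $line =~ ^\<Content\>(.*) ]] && Content="${BASH_REMATCH[1]}"

  # capture lines not starting with < and append them to Content,
  # separated by a space so words on adjacent lines do not run together
  [[ $line =~ ^[^\<] ]] && Content+=" $line"

  # an empty line ends the review: print what was collected
  [[ $line =~ ^$ ]] && echo "${Author}, ${Content}"
done < file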

Output with your file:

bigBob, definitely above average! we had a really nice stay there last year when I and ...
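
Note that the loop above prints only when it reaches a blank line and never clears its variables, so a final review with no trailing blank line would not be printed. The following is a minimal sketch of one way to handle that, assuming the same tag layout as in the question. The names `emit` and `print_reviews` are hypothetical helpers introduced here, not part of the original answer:

```bash
#!/usr/bin/env bash
# Sketch only: emit and print_reviews are hypothetical helper names.

# Print the current record (if there is one) and clear it for the next review.
emit() {
  [[ -n $Author ]] && printf '%s, %s\n' "$Author" "$Content"
  Author='' Content=''
}

# Read reviews from stdin; blank lines separate the records.
print_reviews() {
  local line
  local author_re='^<Author>(.*)'
  local content_re='^<Content>(.*)'
  Author='' Content=''
  # "|| [[ -n $line ]]" also processes a last line that lacks a newline
  while IFS= read -r line || [[ -n $line ]]; do
    if [[ -z $line ]]; then
      emit
    elif [[ $line =~ $author_re ]]; then
      Author=${BASH_REMATCH[1]}
    elif [[ $line =~ $content_re ]]; then
      Content=${BASH_REMATCH[1]}
    elif [[ ${line:0:1} != '<' ]]; then
      Content+=" $line"   # continuation of a multi-line <Content>
    fi
  done
  emit                    # flush the last record at end of input
}
```

It would be called as `print_reviews < file`. Keeping the regexes in variables avoids having to backslash-escape `<` and `>` inside `[[ ... ]]`.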
  

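=~: matches against a regular expression (string on the left, unquoted regex on the right)

^: matches the start of the line

\<\>: match the literal characters < and >

.*: here, matches the rest of the line

(.*): captures the rest of the line into BASH_REMATCH[1], the first capture group in the array BASH_REMATCH (index 0 holds the whole match)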

See: The Stack Overflow Regular Expressions FAQ
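
Building on the operators explained above, a single regex with two capture groups could handle every tag instead of one test per tag. This is only a sketch: `parse_tag`, `tag_name` and `tag_value` are hypothetical names, and the name clean-up (dropping "." and anything after " /", then turning spaces into underscores) is an assumption based on the column names in the question:

```bash
#!/usr/bin/env bash
# Sketch only: parse_tag, tag_name and tag_value are hypothetical names.

# Split one "<Tag>value" line into the globals tag_name and tag_value.
# Returns non-zero if the line does not start with a tag.
parse_tag() {
  local line=$1
  local img_re='^<img[[:space:]]+src="([^"]+)"'
  local tag_re='^<([^>]+)>(.*)$'

  if [[ $line =~ $img_re ]]; then
    tag_name=Image
    tag_value=${BASH_REMATCH[1]}    # the URL inside src="..."
  elif [[ $line =~ $tag_re ]]; then
    tag_name=${BASH_REMATCH[1]}     # first group: text between < and >
    tag_value=${BASH_REMATCH[2]}    # second group: rest of the line
    tag_name=${tag_name%% /*}       # "Check in / front desk" -> "Check in"
    tag_name=${tag_name//./}        # "No. Reader" -> "No Reader"
    tag_name=${tag_name// /_}       # "No Reader" -> "No_Reader"
  else
    return 1
  fi
}
```

For example, `parse_tag '<No. Helpful>-1'` sets `tag_name` to `No_Helpful` and `tag_value` to `-1`.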

Answer 1: (score: 1)

Here is the right way to do what you are trying to do:

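$ cat tst.awk
# Non-blank line: record one name/value pair for the current review.
NF {
    if ( match($0,/^<img\s+src="([^"]+)/,a) ) {
        name="Image"
        value=a[1]
    }
    else if ( match($0,/^<([^>"]+)>(.*)/,a) )  {
        name=a[1]
        value=a[2]
        sub(/ \/.*|\./,"",name)    # drop " / front desk" or the "." in "No."
        gsub(/ /,"_",name)
    }
    else {
        # continuation of a multi-line value such as <Content>
        values[numNames] = values[numNames] " " $0
        next
    }

    names[++numNames] = name
    values[numNames] = value
    next
}

# Blank line or end of input: the review is complete, so print it.
{ prt() }
END { prt() }

function prt() {
    if ( numNames == 0 ) {
        return    # nothing collected, e.g. consecutive blank lines
    }

    printf "INSERT INTO [HotelReviews] ("

    for (nameNr=1; nameNr<=numNames; nameNr++) {
        printf " [%s]", names[nameNr]
    }

    printf ") VALUES ("

    for (nameNr=1; nameNr<=numNames; nameNr++) {
        printf " \047%s\047", values[nameNr]
    }

    print ""

    numNames = 0
    delete names
    delete values
}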

$ awk -f tst.awk file
INSERT INTO [HotelReviews] ( [Author] [Content] [Date] [Image] [No_Reader] [No_Helpful] [Overall] [Value] [Rooms] [Location] [Cleanliness] [Check_in] [Service] [Business_service]) VALUES ( 'bigBob' 'definitely above average! we had a really nice stay there last year when I and...USUALLY OVER MANY LINES' 'Jan 2, 2009' 'http://cdn.tripadvisor.com/img2/new.gif' '-1' '-1' '4' '4' '4' '4' '5' '4' '3' '4'
INSERT INTO [HotelReviews] ( [Author]) VALUES ( 'rickMN... next review goes on'

The above uses GNU awk for the 3rd arg to match(). Massage to get the precise format/output you want.
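
As one possible way to do that massaging, here is a bash sketch that adds the commas, the closing parenthesis and a terminating semicolon, and doubles any single quote inside a value (the standard SQL way to escape a quote in a string literal). The function name `make_insert` and its `NAME=value` argument convention are assumptions made for illustration, not part of either answer:

```bash
#!/usr/bin/env bash
# Sketch only: make_insert and its NAME=value arguments are hypothetical.

# Usage: make_insert TABLE NAME=value [NAME=value ...]
# Prints one INSERT statement with bracketed column names and quoted values.
make_insert() {
  local table=$1
  shift
  local cols='' vals='' pair name value
  local q="'"
  for pair in "$@"; do
    name=${pair%%=*}              # text before the first "="
    value=${pair#*=}              # text after the first "="
    value=${value//$q/$q$q}       # escape ' as '' for SQL
    cols+="${cols:+, }[$name]"    # comma-separate after the first column
    vals+="${vals:+, }'$value'"
  done
  printf 'INSERT INTO [%s] (%s) VALUES (%s);\n' "$table" "$cols" "$vals"
}
```

For instance, `make_insert HotelReviews "Author=bigBob" "Overall=4"` produces a complete statement. Redirecting the output of many such calls into a .sql file would give something that the `.read` command mentioned in the question can load.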