Question

I have a data set made up of many files. Each file contains many reviews of the following form, separated by blank lines:

<Author>bigBob
<Content>definitely above average! we had a really nice stay there last year when I and...USUALLY OVER MANY LINES
<Date>Jan 2, 2009
<img src="http://cdn.tripadvisor.com/img2/new.gif" alt="New"/>
<No. Reader>-1
<No. Helpful>-1
<Overall>4
<Value>4
<Rooms>4
<Location>4
<Cleanliness>5
<Check in / front desk>4
<Service>3
<Business service>4

<Author>rickMN... next review goes on

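For each review, I need to extract the data after each tag and put it into something like the following (I plan to write it to a .sql file, so that when I run ".read" it populates my database):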

INSERT INTO [HotelReviews] ([Author], [Content], [Date], [Image], [No_Reader], [No_Helpful], [Overall], [Value], [Rooms], [Location], [Cleanliness], [Check_In], [Service], [Business_Service]) VALUES ('bigBob', 'definitely above...', ...)

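My question is: how do I extract the data after each tag and put it into an INSERT statement using bash?

Edit: the text after the <Content> tag is usually a paragraph spanning multiple lines.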

Answer 1

Example:

#!/bin/bash

while IFS= read -r line; do
  [[ $line =~ ^\<Author\>(.*) ]] && Author="${BASH_REMATCH[1]}"
  [[ $line =~ ^\<Content\>(.*) ]] && Content="${BASH_REMATCH[1]}"

  # a line not starting with < continues Content; join it with a space
  [[ $line =~ ^[^\<] ]] && Content+=" $line"

  # match an empty line
  [[ $line =~ ^$ ]] && echo "${Author}, ${Content}"
done < file

Output with your file:

bigBob, definitely above average! we had a really nice stay there last year when I and ...

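=~: matches a regular expression (string on the left, unquoted regex on the right)

^: matches the start of the line

\< or \>: matches a literal < or >

.*: here, matches the rest of the line

(.*): captures the rest of the line into BASH_REMATCH[1] (element 0 of the array holds the whole match)

See: The Stack Overflow Regular Expressions FAQ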
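
To illustrate how the capture groups land in BASH_REMATCH, here is a minimal sketch that generalizes the idea to any tag. The helper name split_tag and the variables tag_name and tag_value are made up for this illustration and are not part of the script above. The pattern is stored in a variable so that < and > need no escaping.

```shell
#!/bin/bash

# split_tag is a hypothetical helper, not part of the answer above.
# It splits a line such as "<Overall>4" into a tag name and a value
# using two capture groups.
split_tag() {
  local re='^<([^>]+)>(.*)$'
  if [[ $1 =~ $re ]]; then
    tag_name=${BASH_REMATCH[1]}    # text between < and >
    tag_value=${BASH_REMATCH[2]}   # rest of the line
    return 0
  fi
  return 1                         # not a tag line, e.g. a Content continuation
}

split_tag '<Date>Jan 2, 2009' && echo "$tag_name = $tag_value"
split_tag 'second line of a review' || echo "no tag"
```

This should print "Date = Jan 2, 2009" followed by "no tag". Note that the <img ...> line would also match this pattern (with an empty value), so it would still need its own rule.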

Answer 2

Here is the right approach for what you are trying to do:

$ cat tst.awk
NF {
    if ( match($0,/^<img\s+src="([^"]+)/,a) ) {
        name="Image"
        value=a[1]
    }
    else if ( match($0,/^<([^>"]+)>(.*)/,a) )  {
        name=a[1]
        value=a[2]
        sub(/ \/.*|\./,"",name)
        gsub(/ /,"_",name)
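    }
    else {
        # untagged line: continuation of a multi-line value (e.g. Content)
        values[numNames] = values[numNames] " " $0
        next
    }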

    names[++numNames] = name
    values[numNames] = value
    next
}

{ prt() }
END { prt() }

function prt() {
    printf "INSERT INTO [HotelReviews] ("

    for (nameNr=1; nameNr<=numNames; nameNr++) {
        printf " [%s]", names[nameNr]
    }

    printf ") VALUES ("

    for (nameNr=1; nameNr<=numNames; nameNr++) {
        printf " \047%s\047", values[nameNr]
    }

    print ""

    numNames = 0
    delete names
    delete values
}

$ awk -f tst.awk file
INSERT INTO [HotelReviews] ( [Author] [Content] [Date] [Image] [No_Reader] [No_Helpful] [Overall] [Value] [Rooms] [Location] [Cleanliness] [Check_in] [Service] [Business_service]) VALUES ( 'bigBob' 'definitely above average! we had a really nice stay there last year when I and...USUALLY OVER MANY LINES' 'Jan 2, 2009' 'http://cdn.tripadvisor.com/img2/new.gif' '-1' '-1' '4' '4' '4' '4' '5' '4' '3' '4'
INSERT INTO [HotelReviews] ( [Author]) VALUES ( 'rickMN... next review goes on'

The above uses GNU awk for the third argument to match(). Massage to get the precise format/output you want.
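
Since the question asks for bash, here is a rough sketch of that final formatting step in bash: joining the column names and values with commas, closing the parentheses, and doubling any single quote inside a value so the statement stays valid SQL. The helpers sql_quote and join_by and the sample values are invented for this illustration; all values are quoted, as in the awk output above.

```shell
#!/bin/bash

# sql_quote and join_by are hypothetical helpers for illustration.

# Wrap a value in single quotes, doubling any embedded single quote.
sql_quote() {
  local q="'"
  printf "'%s'" "${1//$q/$q$q}"
}

# Join all arguments after the first, using the first as separator.
join_by() {
  local sep=$1 first=1 out='' item
  shift
  for item in "$@"; do
    if (( first )); then
      out=$item
      first=0
    else
      out+="$sep$item"
    fi
  done
  printf '%s' "$out"
}

# Made-up sample record; in practice these arrays would be filled
# while parsing one review.
names=(Author Content Overall)
values=("bigBob" "didn't disappoint" 4)

cols=()
vals=()
for i in "${!names[@]}"; do
  cols+=("[${names[i]}]")
  vals+=("$(sql_quote "${values[i]}")")
done

printf 'INSERT INTO [HotelReviews] (%s) VALUES (%s);\n' \
  "$(join_by ', ' "${cols[@]}")" "$(join_by ', ' "${vals[@]}")"
```

Escaping matters here because review text routinely contains apostrophes, and an unescaped one would end the SQL string early.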
