I have a data set made up of many files. Each file contains many reviews of the following form, separated by blank lines:
<Author>bigBob
<Content>definitely above average! we had a really nice stay there last year when I and...USUALLY OVER MANY LINES
<Date>Jan 2, 2009
<img src="http://cdn.tripadvisor.com/img2/new.gif" alt="New"/>
<No. Reader>-1
<No. Helpful>-1
<Overall>4
<Value>4
<Rooms>4
<Location>4
<Cleanliness>5
<Check in / front desk>4
<Service>3
<Business service>4
<Author>rickMN... next review goes on
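For each review, I need to extract the data after each tag and put it into something like the following (I plan to write it to a .sql file, so that running ".read" on it populates my database):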
INSERT INTO [HotelReviews] ([Author], [Content], [Date], [Image], [No_Reader], [No_Helpful], [Overall], [Value], [Rooms], [Location], [Cleanliness], [Check_In], [Service], [Business_Service]) VALUES ('bigBob', 'definitely above...', ...)
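My question is: how can I extract the data after each tag and put it into an INSERT statement using bash?
Edit
The text after the <Content> tag is usually a paragraph spanning multiple lines.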
Answer 0 (score: 1)
Example:
#!/bin/bash
while IFS= read -r line; do
    [[ $line =~ ^\<Author\>(.*) ]] && Author="${BASH_REMATCH[1]}"
    [[ $line =~ ^\<Content\>(.*) ]] && Content="${BASH_REMATCH[1]}"
    # capture lines not starting with < and append them to Content,
    # separated by a space so words on adjacent lines do not run together
    [[ $line =~ ^[^\<] ]] && Content+=" $line"
    # an empty line ends the current review
    [[ $line =~ ^$ ]] && echo "${Author}, ${Content}"
done < file
Output with your file:
bigBob, definitely above average! we had a really nice stay there last year when I and ...
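=~ : matches a regular expression (string on the left, unquoted regex on the right)
^ : matches the start of the line
\< or \> : matches a literal < or >
.* : here, matches the rest of the line
(.*) : captures the rest of the line into the first capture element of the array BASH_REMATCH, i.e. BASH_REMATCH[1]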
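To get from the Author/Content pair above to a full INSERT statement, the same =~ / BASH_REMATCH idea can be applied to every tag generically. The following is only a sketch under some assumptions: reviews are separated by blank lines, any line that does not start with a tag continues the previous field, and column names are derived by dropping everything from " /" onwards, removing dots, and turning spaces into underscores. The function names sql_quote, emit_insert and reviews_to_sql are made up for this illustration.

```shell
#!/bin/bash
# Sketch: read tagged review blocks on stdin, print one INSERT per review.

sql_quote() {
  # Wrap a value in single quotes, doubling any embedded single quote
  # (the standard SQL escape for string literals).
  local s=$1 q="'"
  printf "'%s'" "${s//$q/$q$q}"
}

emit_insert() {
  # Print the INSERT for the fields collected so far in the global
  # arrays names[] and values[], then reset them. No-op when empty.
  local cols="" vals="" i
  (( ${#names[@]} )) || return 0
  for i in "${!names[@]}"; do
    cols+="${cols:+, }[${names[i]}]"
    vals+="${vals:+, }$(sql_quote "${values[i]}")"
  done
  printf 'INSERT INTO [HotelReviews] (%s) VALUES (%s);\n' "$cols" "$vals"
  names=() values=()
}

reviews_to_sql() {
  local line name value
  # Regexes are kept in variables so they can be used unquoted after =~
  local img_re='^<img[[:space:]]+src="([^"]+)'
  local tag_re='^<([^>]+)>(.*)$'
  names=() values=()
  while IFS= read -r line || [[ -n $line ]]; do
    if [[ -z $line ]]; then
      emit_insert                          # blank line ends a review
    elif [[ $line =~ $img_re ]]; then
      names+=("Image"); values+=("${BASH_REMATCH[1]}")
    elif [[ $line =~ $tag_re ]]; then
      name=${BASH_REMATCH[1]} value=${BASH_REMATCH[2]}
      name=${name%% /*}                    # "Check in / front desk" -> "Check in"
      name=${name//./}                     # "No. Reader" -> "No Reader"
      name=${name// /_}                    # spaces -> underscores
      names+=("$name"); values+=("$value")
    elif (( ${#values[@]} )); then
      # untagged line: continuation of the previous field (e.g. Content)
      values[${#values[@]}-1]+=" $line"
    fi
  done
  emit_insert                              # last review may lack a trailing blank line
}
```

Usage would be something like reviews_to_sql < file > reviews.sql. Note that this yields [Check_in] rather than the [Check_In] shown in the question, so the column name would need adjusting on one side or the other.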
Answer 1 (score: 1)
Here is the right way to do what you are trying to do:
$ cat tst.awk
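NF {
    if ( match($0,/^<img\s+src="([^"]+)/,a) ) {
        # the image line carries its value in the src attribute
        name="Image"
        value=a[1]
    }
    else if ( match($0,/^<([^>"]+)>(.*)/,a) ) {
        name=a[1]
        value=a[2]
        # "Check in / front desk" -> "Check in", "No. Reader" -> "No Reader"
        sub(/ \/.*|\./,"",name)
        gsub(/ /,"_",name)
    }
    else {
        # untagged line: treat it as a continuation of the previous field
        # (multi-line Content) instead of repeating the last name/value
        if ( numNames ) values[numNames] = values[numNames] " " $0
        next
    }
    names[++numNames] = name
    values[numNames] = value
    next
}
# an empty line ends the current review
{ prt() }
END { prt() }
function prt() {
    printf "INSERT INTO [HotelReviews] ("
    for (nameNr=1; nameNr<=numNames; nameNr++) {
        printf " [%s]", names[nameNr]
    }
    printf ") VALUES ("
    for (nameNr=1; nameNr<=numNames; nameNr++) {
        printf " \047%s\047", values[nameNr]
    }
    print ""
    numNames = 0
    delete names
    delete values
}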
$ awk -f tst.awk file
INSERT INTO [HotelReviews] ( [Author] [Content] [Date] [Image] [No_Reader] [No_Helpful] [Overall] [Value] [Rooms] [Location] [Cleanliness] [Check_in] [Service] [Business_service]) VALUES ( 'bigBob' 'definitely above average! we had a really nice stay there last year when I and...USUALLY OVER MANY LINES' 'Jan 2, 2009' 'http://cdn.tripadvisor.com/img2/new.gif' '-1' '-1' '4' '4' '4' '4' '5' '4' '3' '4'
INSERT INTO [HotelReviews] ( [Author]) VALUES ( 'rickMN... next review goes on'
The above uses GNU awk for the 3rd arg to match(). Massage to get the precise format/output you want.
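As one possible way to do that massaging, here is a sketch of a prt() variant that separates columns and values with commas, closes the VALUES list, and doubles embedded single quotes. To keep it short and runnable on its own it does not repeat the tag parsing: it assumes, purely for illustration, that each input line is already a name and a value separated by a tab, with blank lines between reviews. The wrapper name join_insert is hypothetical, and the script avoids the gawk-only 3-argument match(), so any POSIX-style awk should do.

```shell
#!/bin/bash
# Sketch: comma-joined, quote-escaped INSERT output from name<TAB>value lines.
join_insert() {
  awk -F'\t' '
    function prt(   i, v, cols, vals) {
      if (numNames == 0) return            # nothing collected, e.g. repeated blank lines
      for (i = 1; i <= numNames; i++) {
        v = values[i]
        gsub(/\047/, "\047\047", v)        # \047 is a single quote: escape it by doubling
        cols = cols (i > 1 ? ", " : "") "[" names[i] "]"
        vals = vals (i > 1 ? ", " : "") "\047" v "\047"
      }
      printf "INSERT INTO [HotelReviews] (%s) VALUES (%s);\n", cols, vals
      numNames = 0
    }
    NF { names[++numNames] = $1; values[numNames] = $2; next }
    { prt() }                              # blank line ends a review
    END { prt() }
  '
}
```

In tst.awk itself the same loop body could replace the two printf loops in prt(), leaving the parsing rules untouched.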