Question

I have an HTML file that contains a list of dependencies for the project I'm working on. It has the following format:

- Some HTML

    <p><strong>Module Name:</strong> spring-web</p>
    <p><strong>Module Group:</strong> org.springframework</p>
    <p><strong>Module Version:</strong> 4.2.1.RELEASE</p>

- More HTML

    <p><strong>Module Name:</strong> google-http-client</p>
    <p><strong>Module Group:</strong> com.google.http-client</p>
    <p><strong>Module Version:</strong> 1.19.0</p>

And so on.

I would like to create a CSV file from this HTML file. Each record in the CSV file would have the format:

Module Name,Module Group,Module Version

e.g. google-http-client,com.google.http-client,1.19.0

Any idea how to do this with a script?

Answer 1

Give this a try!

#!/bin/bash
inFile=$1
outFile=$2

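# Join the elements of the array named by $2 with the delimiter $1.
join () {
 local IFS="$1"        # local, so the caller's IFS is left untouched
 local ref="$2[*]"     # indirect reference to every element of the named array
 echo "${!ref}"        # "${array[*]}" joins on the first character of IFS
}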

declare -a CSV=('"Module Name","Module Group","Module Version"')
declare -a keysAccepted=('Name' 'Group' 'Version')

declare -i nMandatoryKeys=${#keysAccepted[@]}
declare -A KeyFilled
rxKeysAccepted='('$(join '|' keysAccepted)')'
while IFS= read -r line; do
        [[ $line =~ \<strong\>Module\ $rxKeysAccepted:\</strong\>[[:space:]]*([^<]+)\</p\> ]] || continue
        key=${BASH_REMATCH[1]}
        val=${BASH_REMATCH[2]}
        KeyFilled[$key]=$val
        if (( ${#KeyFilled[@]} == nMandatoryKeys )); then
                unset csvLine
                for k in "${keysAccepted[@]}"; do
                        csvLine+=${csvLine:+,}${KeyFilled[$k]}
                done
                KeyFilled=()
                CSV+=("$csvLine")   # quoted so a value containing spaces stays one record
        fi
done <"$inFile"

(( ${#CSV[@]} > 1 )) || exit 1

join $'\x0a' CSV >"$outFile"
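
The script calls its join helper twice: once with '|' to build the regex alternation, and once with a newline to write the output. Both calls rely on the shell joining a list on the first character of IFS. A minimal sketch of that behavior in isolation, assuming the items are passed as arguments rather than by array name (join_by is a hypothetical name, not part of the script above):

```bash
# join_by is a hypothetical name used only for this sketch.
# Usage: join_by DELIMITER ITEM...
join_by () {
    local IFS="$1"          # "$*" joins on the first character of IFS
    shift
    printf '%s\n' "$*"
}

# With a bash array, the call would be: join_by '|' "${keysAccepted[@]}"
rx="($(join_by '|' Name Group Version))"
printf '%s\n' "$rx"
```

This prints (Name|Group|Version), the same alternation the script stores in rxKeysAccepted.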

Answer 2

If your source file is consistent (all three fields are present, in the same order), you can try this:

$ sed -nr 's_\s*<p><strong>Module (Name|Group|Version):</strong> (.*)</p>_\2_p' file\
  | awk 'ORS=NR%3?",":RS'
spring-web,org.springframework,4.2.1.RELEASE
google-http-client,com.google.http-client,1.19.0
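
The awk part is terse, so here is how it seems to work: the pattern is itself an assignment. ORS (the output record separator) becomes "," unless the line number is a multiple of 3, in which case it becomes RS (a newline). The assigned string is never empty, so the pattern is always true and awk's default action prints the line followed by the new ORS. A more explicit sketch of the same grouping, assuming a hypothetical helper named group_lines that takes the number of fields per record:

```bash
# group_lines is a hypothetical helper: it joins every N input lines into one
# comma-separated record, which is what the awk one-liner does for N = 3.
group_lines () {
    awk -v n="$1" '{ printf "%s%s", $0, (NR % n ? "," : "\n") }'
}

# Six values in, two three-field records out.
printf '%s\n' spring-web org.springframework 4.2.1.RELEASE \
              google-http-client com.google.http-client 1.19.0 |
    group_lines 3
```

Like the original pipeline, this only produces correct records when every module lists its fields in the same order and none is missing.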
