Question

The XML file:

<head>
  <head2>
    <dict type="abc" file="/path/to/file1"></dict>
    <dict type="xyz" file="/path/to/file2"></dict>
  </head2>
</head>

I need to extract the list of files from it, so the output should be

/path/to/file1
/path/to/file2

So far, I have managed to do it with the following.

grep "<dict*file=" /path/to/xml.file | awk '{print $3}' | awk -F= '{print $NF}'

Answer 1

Quick and dirty, based on your sample rather than on everything XML allows.

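That said, I would not recommend this kind of extraction on XML unless you really know the format and where the content comes from. Extra fields, escaped quotes, and string content that looks like markup are all common causes of failures and unexpected results. Use it only when no more suitable tool is available.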
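To make that warning concrete, here is a minimal sketch of one way a line-based regex can go wrong. The XML below is a hypothetical variant of the question's sample with a commented-out entry added (the /path/to/stale value is an assumption for illustration). The regex has no notion of XML comments, so it reports the stale path as well.

```shell
# Hypothetical variant of the sample: one real entry, one commented-out entry
xml='<head>
  <head2>
    <dict type="abc" file="/path/to/file1"></dict>
    <!-- <dict type="old" file="/path/to/stale"></dict> -->
  </head2>
</head>'

# Regex-only extraction: prints every file="..." value it sees on a line,
# including the one inside the XML comment
found=$(printf '%s\n' "$xml" | sed -n 's/.*[[:blank:]]file="\([^"]*\)".*/\1/p')

printf '%s\n' "$found"
```

A real XML parser would skip the comment, which is the main argument for preferring one when the input is not fully under your control.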

Now, using a script of your own:

# sed, a bit safer: only looks at lines between <head> and </head>
sed -e '/<head>/,/<\/head>/!d' -e '/.*[[:blank:]]file="\([^"]*\)".*/!d' -e 's//\1/' YourFile

# sed, brute force: prints every file="..." value found on any line
sed -n 's/.*[[:blank:]]file="\([^"]*\)".*/\1/p' YourFile



# awk, quick and unsafe, based on your sample
awk -F 'file="|">' '/<head>/{h=1} /<\/head>/{h=0} h && /[[:blank:]]file/ { print $2 }' YourFile

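There is no need for grep in front of awk; use a pattern filter at the start of the awk script instead:

# grep "<dict*file=" /path/to/xml.file | awk '{print $3}' | awk -F= '{print $NF}'
awk '! /<dict.*file=/ {next} {$0=$3;FS="\"";$0=$0;print $2;FS=OFS}' YourFile

The work of the second awk, which uses a different field separator (FS), can be done inside the same script by changing FS. Because a new FS only takes effect at the next evaluation (by default, the next line), you can force the current content to be re-evaluated with $0=$0, as is done here.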
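The re-splitting behaviour is easier to see on a single line. The sketch below feeds one line from the question's sample to awk twice, once without and once with the $0=$0 assignment; the variable names are only for illustration.

```shell
# One line copied from the sample XML in the question
line='    <dict type="abc" file="/path/to/file1"></dict>'

# Without re-splitting: FS changes, but the record was already split on
# whitespace, so $2 is empty
without=$(printf '%s\n' "$line" | awk '{ $0 = $3; FS = "\""; print $2 }')

# With re-splitting: $0 = $0 forces awk to split the record again using
# the new FS, so $2 becomes the text between the first pair of quotes
with=$(printf '%s\n' "$line" | awk '{ $0 = $3; FS = "\""; $0 = $0; print $2 }')

printf 'without: [%s]\nwith:    [%s]\n' "$without" "$with"
```

Only the second form yields /path/to/file1, which is why the one-liner above needs the extra assignment.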

Answer 2

An xmllint solution, using the XPath expression //head/head2/dict/@file:

xmllint --xpath "//head/head2/dict/@file" input-xml | awk 'BEGIN{FS="file="}{printf "%s\n%s\n", gensub(/"/,"","g",$2), gensub(/"/,"","g",$3)}'
/path/to/file1
/path/to/file2

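Unfortunately, I could not provide a pure xmllint solution, because I thought that

xmllint --xpath "string(//head/head2/dict/@file)" input-xml

would return the file attribute from both nodes, but it returns only the first instance.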

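So I added my own logic with GNU Awk to extract the required values. Running

xmllint --xpath "//head/head2/dict/@file" input-xml

returns the values as

file="/path/to/file1" file="/path/to/file2"

On that output, setting the field separator to file= and removing the double quotes with the gensub() function meets the requirement.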
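The same idea can be sketched without GNU-specific functions and without hard-coding two fields. The block below is an illustration only: it uses a literal string as a stand-in for the xmllint output (an assumption, so that it runs without xmllint installed), loops over every field after the first file=, and uses the POSIX gsub() instead of gensub(). It assumes no path contains a double quote or the text file=.

```shell
# Stand-in for the xmllint output shown above (not produced by xmllint here)
attrs=' file="/path/to/file1" file="/path/to/file2"'

files=$(printf '%s\n' "$attrs" | awk -F'file=' '{
    for (i = 2; i <= NF; i++) {      # field 1 is whatever precedes the first file=
        gsub(/"[ \t]*/, "", $i)      # drop the quotes and any blanks after them
        print $i
    }
}')

printf '%s\n' "$files"
```

Because the loop runs per input line, the same script also copes if the attributes arrive one per line rather than all on one line.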

Answer 3

The PE [perl everywhere :)] solution:

perl -MXML::LibXML -E 'say $_->to_literal for XML::LibXML->load_xml(location=>q{file.xml})->findnodes(q{/head/head2/dict/@file})'

prints

/path/to/file1
/path/to/file2

For the above, you need the XML::LibXML module installed.

Answer 4

With xmlstarlet, it would be:

xmlstarlet sel -t -v "//head/head2/dict/@file" -n input.xml

Answer 5

This command:

awk -F'[=" ">]' '{print $12}' file

will produce:

/path/to/file1
/path/to/file2
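Note that the field number 12 depends on the exact indentation in the sample, since every leading space counts as a separator. As a hedged sketch of a less position-dependent variant (not part of the original answer), the block below looks for the field named file and prints the value two fields after it; the unindented test line is an assumption for illustration.

```shell
# Same element as in the sample, but without the four leading spaces
line='<dict type="abc" file="/path/to/file1"></dict>'

# Fixed position: $12 no longer exists on this line, so nothing useful is printed
by_pos=$(printf '%s\n' "$line" | awk -F'[=" >]' '{ print $12 }')

# By name: find the field "file"; the value sits two fields later because
# the = and the opening quote are both separators
by_name=$(printf '%s\n' "$line" | awk -F'[=" >]' '{
    for (i = 1; i < NF - 1; i++)
        if ($i == "file") print $(i + 2)
}')

printf 'by position: [%s]\nby name:     [%s]\n' "$by_pos" "$by_name"
```

This still treats XML as plain text, so the earlier caveats about quoting and layout apply.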

Extract a field from an XML file

5 answers: