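Using XML with namespaces

Time: 2017-04-21 23:53:27

Tags: r xml xpath xml-parsing parsexml

Below is the XML response I get from SharePoint. I am trying to parse the data and extract the details in the following format.

Desired output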

title port space    datecreat               id
test  8080 100.000 2017-04-21 17:29:23      1
apple  8700 108.000 2017-04-21 18:29:23     2

Input received

<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <soap:Body>
        <GetListItemsResponse xmlns="http://schemas.microsoft.com/sharepoint/soap/">
            <GetListItemsResult>
                <listitems xmlns:s='uuid:SBDSHDSH-DSJHD' xmlns:dt='uuid:CSDSJHA-DGGD' xmlns:rs='urn:schemas-microsoft-com:rowset' xmlns:z='#RowsetSchema'
                    <rs:data ItemCount="2">
                        <z:row title="test" port="8080" space='100.000' datecreat='2017-04-21 17:29:23' id='1' />
                        <z:row title="apple" port="8700" space='108.000' datecreat='2017-04-21 17:29:23' id='2' />
                    </rs:data>
                </listitems>
            </GetListItemsResult>
        </GetListItemsResponse>
    </soap:Body>
</soap:Envelope>

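I am new to R and have tried a few things, but none of them worked. The namespaces and z:row are not detected.

3 answers:

Answer 0 (score: 1)

Assuming the text is in Lines, one approach is simply to grep the z:row lines, replace the equals signs with spaces, and read the result with read.table. The first line of code reads the rows, including some junk columns; the second drops the junk columns and sets the column names. Note that this works even though the XML is invalid. No packages are used.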

DF <- read.table(text = gsub("=", " ", grep("z:row", Lines, value = TRUE)))
setNames(DF[seq(3, ncol(DF), 2)], unlist(DF[1, seq(2, ncol(DF)-2, 2)]))

which gives:

  title port space           datecreat id
1  test 8080   100 2017-04-21 17:29:23  1
2 apple 8700   108 2017-04-21 17:29:23  2

Note: the input is assumed to be:

Lines <- c(" <?xml version=\"1.0\" encoding=\"utf-8\"?>", "        <soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\">", 
"            <soap:Body>", "                <GetListItemsResponse xmlns=\"http://schemas.microsoft.com/sharepoint/soap/\">", 
"                    <GetListItemsResult>", "                            <listitems xmlns:s='uuid:SBDSHDSH-DSJHD' xmlns:dt='uuid:CSDSJHA-DGGD' xmlns:rs='urn:schemas-microsoft-com:rowset' xmlns:z='#RowsetSchema'", 
"                                <rs:data ItemCount=\"2\">", 
"                                    <z:row title=\"test\" port=\"8080\" space='100.000' datecreat='2017-04-21 17:29:23' id='1' />", 
"                                    <z:row title=\"apple\" port=\"8700\" space='108.000' datecreat='2017-04-21 17:29:23' id='2' />", 
"                            </rs:data>", "                        </listitems>", 
"                    </GetListItemsResult>", "                </GetListItemsResponse>", 
"            </soap:Body>", "        </soap:Envelope>")

If your input is instead one long newline-separated string named Lines_n, run this first:

Lines <- readLines(textConnection(Lines_n))
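
The read.table one-liner above relies on every z:row line having its attributes in the same order and position. As a hedged alternative (not part of the original answer), the sketch below pulls out the name=value pairs with a regular expression, so the column names come from the attribute names themselves. The helper parse_attrs and the variable names are hypothetical, and the two sample rows are copied from the question; in practice they would come from grep("z:row", Lines, value = TRUE).

```r
# Two z:row lines copied from the question's input
rows <- c(
  "<z:row title=\"test\" port=\"8080\" space='100.000' datecreat='2017-04-21 17:29:23' id='1' />",
  "<z:row title=\"apple\" port=\"8700\" space='108.000' datecreat='2017-04-21 17:29:23' id='2' />"
)

# Hypothetical helper: extract name="value" / name='value' pairs from one line
parse_attrs <- function(line) {
  pattern <- "[A-Za-z_][A-Za-z0-9_.-]*=(\"[^\"]*\"|'[^']*')"
  pairs <- regmatches(line, gregexpr(pattern, line))[[1]]
  keys <- sub("=.*$", "", pairs)            # text before the first "="
  vals <- sub("^[^=]*=", "", pairs)         # text after the first "="
  vals <- substr(vals, 2, nchar(vals) - 1)  # strip the surrounding quotes
  setNames(as.list(vals), keys)
}

# One single-row data frame per line, stacked into one data frame
DF2 <- do.call(
  rbind,
  lapply(rows, function(l) as.data.frame(parse_attrs(l), stringsAsFactors = FALSE))
)
```

All columns in DF2 are character; like the read.table approach, this treats the input as plain text and does not require the XML to be valid.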

Answer 1 (score: 1)

Consider registering the z namespace prefix and using the XML package's internal function, accessed with the triple-colon operator:

XML:::xmlAttrsToDataFrame
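
The code that accompanied this answer is not preserved here, so the following is only a sketch of what the description suggests, not the answerer's exact code. It assumes the XML package is installed, uses a trimmed, valid copy of the response (with the closing ">" on the listitems tag), and maps the prefix z to the "#RowsetSchema" URI declared in the document. The variable names are made up for illustration, and because xmlAttrsToDataFrame is an internal, unexported function, its behaviour may change between package versions.

```r
library(XML)

# Trimmed, valid copy of the SharePoint response (assumption: only the
# namespaces needed for this example are kept)
xml_txt <- '<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetListItemsResponse xmlns="http://schemas.microsoft.com/sharepoint/soap/">
      <GetListItemsResult>
        <listitems xmlns:rs="urn:schemas-microsoft-com:rowset" xmlns:z="#RowsetSchema">
          <rs:data ItemCount="2">
            <z:row title="test" port="8080" space="100.000" datecreat="2017-04-21 17:29:23" id="1" />
            <z:row title="apple" port="8700" space="108.000" datecreat="2017-04-21 17:29:23" id="2" />
          </rs:data>
        </listitems>
      </GetListItemsResult>
    </GetListItemsResponse>
  </soap:Body>
</soap:Envelope>'

doc <- xmlParse(xml_txt, asText = TRUE)

# Register the z prefix against the URI used in the document
nmsp <- c(z = "#RowsetSchema")
row_nodes <- getNodeSet(doc, "//z:row", namespaces = nmsp)

# Internal (unexported) helper, hence the triple colon
df <- XML:::xmlAttrsToDataFrame(row_nodes)
```

The parser may print a note that the "#RowsetSchema" namespace URI is not absolute; that comes from the document itself and does not stop the rows from being found.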

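Answer 2 (score: 0)

This is not valid XML: the opening <listitems ...> tag is missing its closing ">". While I'm the first to complain about SharePoint, it does not produce broken output on its own. It's entirely possible that a colleague hacking on your SharePoint server broke something, but it is hard to break it this way.

Anyway, here is a valid version of the XML: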

<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <soap:Body>
        <GetListItemsResponse xmlns="http://schemas.microsoft.com/sharepoint/soap/">
            <GetListItemsResult>
                <listitems xmlns:s='uuid:SBDSHDSH-DSJHD' xmlns:dt='uuid:CSDSJHA-DGGD' xmlns:rs='urn:schemas-microsoft-com:rowset' xmlns:z='#RowsetSchema'>
                    <rs:data ItemCount="2">
                        <z:row title="test" port="8080" space='100.000' datecreat='2017-04-21 17:29:23' id='1' />
                        <z:row title="apple" port="8700" space='108.000' datecreat='2017-04-21 17:29:23' id='2' />
                    </rs:data>
                </listitems>
            </GetListItemsResult>
        </GetListItemsResponse>
    </soap:Body>
</soap:Envelope>

And it parses and extracts fine:

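library(xml2)
library(purrr)  # provides map() and map_df(), and re-exports %>%

doc <- read_xml("test.xml")

# Not used below: renames the default-namespace prefix d1 to "a",
# for cases where you pass ns = ns to xml_find_all()
ns <- xml_ns_rename(xml_ns(doc), d1 = "a")

# The z prefix is declared in the document, so it can be used directly
xml_find_all(doc, ".//z:row") %>% 
  map(xml_attrs) %>%   # named character vector of attributes per row
  map_df(as.list)      # bind into one tibble, one row per z:row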

## # A tibble: 2 × 5
##   title  port   space           datecreat    id
##   <chr> <chr>   <chr>               <chr> <chr>
## 1  test  8080 100.000 2017-04-21 17:29:23     1
## 2 apple  8700 108.000 2017-04-21 17:29:23     2
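
Every column in the result above is <chr>, because XML attributes are plain strings. If numeric and date-time columns are wanted, a minimal base R sketch of the conversion could look like the following. The data frame res is a hand-built stand-in for the tibble above, and the UTC time zone is an assumption, since the response does not say which zone the timestamps are in.

```r
# Stand-in for the parsed result: all columns are character
res <- data.frame(
  title     = c("test", "apple"),
  port      = c("8080", "8700"),
  space     = c("100.000", "108.000"),
  datecreat = c("2017-04-21 17:29:23", "2017-04-21 17:29:23"),
  id        = c("1", "2"),
  stringsAsFactors = FALSE
)

# Let R guess integer/numeric columns; as.is = TRUE keeps text as character
res[] <- lapply(res, type.convert, as.is = TRUE)

# Parse the timestamp column explicitly (time zone is an assumption)
res$datecreat <- as.POSIXct(res$datecreat, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

str(res)
```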