Using R

Time: 2016-07-01 12:48:00

Tags: r xml

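I have the following XML file, from which I want to extract data using R. Normally I use the read_xml function from the xml2 package together with the %>% operator. But for some reason this does not work here: it does not even seem to read the XML.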

invoices <- read_xml(doclist[i]) %>% xml_nodes("page")
invoices
{xml_nodeset (0)}

The only data I want to extract is the text directly after each <variantText> child element, stored in a data frame. So in this example:

Klantbetaalnummer
10450320
Contactgegevens

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml" version="1.0" producer="FineReader 10.0" pagesCount="2" languages="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml">
    <page width="2479" height="3508" resolution="300">
        <block blockType="Text" blockName="" l="292" t="108" r="590" b="194"><region><rect l="292" t="108" r="590" b="194"/></region>
            <text>
                <par align="Justified" lineSpacing="1200">
                    <line baseline="138" l="298" t="114" r="584" b="138"><formatting lang="EnglishUnitedStates" ff="Arial" fs="8.">
                            <wordRecVariants>
                                <wordRecVariant wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" wordPenalty="0" meanStrokeWidth="31"><variantText>Klantbetaalnummer<charParams l="0" t="0" r="0" b="0">K</charParams><charParams l="0" t="0" r="0" b="0">l</charParams><charParams l="0" t="0" r="0" b="0">a</charParams><charParams l="0" t="0" r="0" b="0">n</charParams><charParams l="0" t="0" r="0" b="0">t</charParams><charParams l="0" t="0" r="0" b="0">b</charParams><charParams l="0" t="0" r="0" b="0">e</charParams><charParams l="0" t="0" r="0" b="0">t</charParams><charParams l="0" t="0" r="0" b="0">a</charParams><charParams l="0" t="0" r="0" b="0">a</charParams><charParams l="0" t="0" r="0" b="0">l</charParams><charParams l="0" t="0" r="0" b="0">n</charParams><charParams l="0" t="0" r="0" b="0">u</charParams><charParams l="0" t="0" r="0" b="0">m</charParams><charParams l="0" t="0" r="0" b="0">m</charParams><charParams l="0" t="0" r="0" b="0">e</charParams><charParams l="0" t="0" r="0" b="0">r</charParams>
                                    </variantText>
                                </wordRecVariant>
                            </wordRecVariants>
                            <charParams l="298" t="114" r="318" b="138" wordStart="1" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="6" wordPenalty="0" meanStrokeWidth="31">K</charParams>
                            <charParams l="319" t="114" r="322" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="255" wordPenalty="0" meanStrokeWidth="31">l</charParams>
                            <charParams l="326" t="120" r="341" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="16" serifProbability="0" wordPenalty="0" meanStrokeWidth="31">a</charParams>
                            <charParams l="345" t="120" r="359" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="0" wordPenalty="0" meanStrokeWidth="31">n</charParams>
                            <charParams l="362" t="114" r="370" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="28" wordPenalty="0" meanStrokeWidth="31">t</charParams>
                            <charParams l="373" t="114" r="388" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="0" wordPenalty="0" meanStrokeWidth="31">b</charParams>
                            <charParams l="391" t="120" r="406" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="40" wordPenalty="0" meanStrokeWidth="31">e</charParams>
                            <charParams l="408" t="114" r="416" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="28" wordPenalty="0" meanStrokeWidth="31">t</charParams>
                            <charParams l="419" t="120" r="434" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="16" serifProbability="0" wordPenalty="0" meanStrokeWidth="31">a</charParams>
                            <charParams l="437" t="120" r="452" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="16" serifProbability="0" wordPenalty="0" meanStrokeWidth="31">a</charParams>
                            <charParams l="457" t="114" r="460" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="255" wordPenalty="0" meanStrokeWidth="31">l</charParams>
                            <charParams l="464" t="120" r="478" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="0" wordPenalty="0" meanStrokeWidth="31">n</charParams>
                            <charParams l="483" t="120" r="497" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="29" serifProbability="0" wordPenalty="0" meanStrokeWidth="31">u</charParams>
                            <charParams l="501" t="120" r="524" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="3" wordPenalty="0" meanStrokeWidth="31">m</charParams>
                            <charParams l="529" t="120" r="552" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="3" wordPenalty="0" meanStrokeWidth="31">m</charParams>
                            <charParams l="556" t="120" r="571" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="40" wordPenalty="0" meanStrokeWidth="31">e</charParams>
                            <charParams l="575" t="120" r="584" b="138" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="16" serifProbability="4" wordPenalty="0" meanStrokeWidth="31">r</charParams></formatting><formatting lang="EnglishUnitedStates" ff="Times New Roman" fs="10."></formatting></line>
                    <line baseline="188" l="298" t="164" r="441" b="188"><formatting lang="EnglishUnitedStates" ff="Arial" fs="8." bold="1">
                            <wordRecVariants>
                                <wordRecVariant wordFromDictionary="0" wordNormal="0" wordNumeric="1" wordIdentifier="0" wordPenalty="0" meanStrokeWidth="50"><variantText>10450320<charParams l="0" t="0" r="0" b="0">1</charParams><charParams l="0" t="0" r="0" b="0">0</charParams><charParams l="0" t="0" r="0" b="0">4</charParams><charParams l="0" t="0" r="0" b="0">5</charParams><charParams l="0" t="0" r="0" b="0">0</charParams><charParams l="0" t="0" r="0" b="0">3</charParams><charParams l="0" t="0" r="0" b="0">2</charParams><charParams l="0" t="0" r="0" b="0">0</charParams>
                                    </variantText>
                                </wordRecVariant>
                            </wordRecVariants>
                            <charParams l="298" t="164" r="309" b="188" wordStart="1" wordFromDictionary="0" wordNormal="0" wordNumeric="1" wordIdentifier="0" charConfidence="46" serifProbability="67" wordPenalty="0" meanStrokeWidth="50">1</charParams>
                            <charParams l="315" t="164" r="330" b="188" wordStart="0" wordFromDictionary="0" wordNormal="0" wordNumeric="1" wordIdentifier="0" charConfidence="100" serifProbability="255" wordPenalty="0" meanStrokeWidth="50">0</charParams>
                            <charParams l="332" t="164" r="349" b="188" wordStart="0" wordFromDictionary="0" wordNormal="0" wordNumeric="1" wordIdentifier="0" charConfidence="100" serifProbability="255" wordPenalty="0" meanStrokeWidth="50">4</charParams>
                            <charParams l="352" t="164" r="367" b="188" wordStart="0" wordFromDictionary="0" wordNormal="0" wordNumeric="1" wordIdentifier="0" charConfidence="100" serifProbability="44" wordPenalty="0" meanStrokeWidth="50">5</charParams>
                            <charParams l="370" t="164" r="385" b="188" wordStart="0" wordFromDictionary="0" wordNormal="0" wordNumeric="1" wordIdentifier="0" charConfidence="100" serifProbability="255" wordPenalty="0" meanStrokeWidth="50">0</charParams>
                            <charParams l="389" t="164" r="404" b="188" wordStart="0" wordFromDictionary="0" wordNormal="0" wordNumeric="1" wordIdentifier="0" charConfidence="89" serifProbability="255" wordPenalty="0" meanStrokeWidth="50">3</charParams>
                            <charParams l="407" t="164" r="422" b="188" wordStart="0" wordFromDictionary="0" wordNormal="0" wordNumeric="1" wordIdentifier="0" charConfidence="100" serifProbability="255" wordPenalty="0" meanStrokeWidth="50">2</charParams>
                            <charParams l="426" t="164" r="441" b="188" wordStart="0" wordFromDictionary="0" wordNormal="0" wordNumeric="1" wordIdentifier="0" charConfidence="100" serifProbability="255" wordPenalty="0" meanStrokeWidth="50">0</charParams></formatting></line></par>
            </text>
        </block>
        <block blockType="Text" blockName="" l="1826" t="383" r="2113" b="426"><region><rect l="1826" t="383" r="2113" b="426"/></region>
            <text>
                <par align="Justified">
                    <line baseline="413" l="1832" t="389" r="2107" b="420"><formatting lang="EnglishUnitedStates" ff="Arial" fs="8." bold="1">
                            <wordRecVariants>
                                <wordRecVariant wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" wordPenalty="0" meanStrokeWidth="50"><variantText>Contactgegevens<charParams l="0" t="0" r="0" b="0">C</charParams><charParams l="0" t="0" r="0" b="0">o</charParams><charParams l="0" t="0" r="0" b="0">n</charParams><charParams l="0" t="0" r="0" b="0">t</charParams><charParams l="0" t="0" r="0" b="0">a</charParams><charParams l="0" t="0" r="0" b="0">c</charParams><charParams l="0" t="0" r="0" b="0">t</charParams><charParams l="0" t="0" r="0" b="0">g</charParams><charParams l="0" t="0" r="0" b="0">e</charParams><charParams l="0" t="0" r="0" b="0">g</charParams><charParams l="0" t="0" r="0" b="0">e</charParams><charParams l="0" t="0" r="0" b="0">v</charParams><charParams l="0" t="0" r="0" b="0">e</charParams><charParams l="0" t="0" r="0" b="0">n</charParams><charParams l="0" t="0" r="0" b="0">s</charParams>
                                    </variantText>
                                </wordRecVariant>
                            </wordRecVariants>
                            <charParams l="1832" t="389" r="1853" b="413" wordStart="1" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="51" wordPenalty="0" meanStrokeWidth="50">C</charParams>
                            <charParams l="1856" t="395" r="1874" b="413" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="255" wordPenalty="0" meanStrokeWidth="50">o</charParams>
                            <charParams l="1877" t="395" r="1893" b="413" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="0" wordPenalty="0" meanStrokeWidth="50">n</charParams>
                            <charParams l="1895" t="389" r="1905" b="413" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="33" serifProbability="44" wordPenalty="0" meanStrokeWidth="50">t</charParams>
                            <charParams l="1908" t="395" r="1924" b="413" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="0" wordPenalty="0" meanStrokeWidth="50">a</charParams>
                            <charParams l="1926" t="395" r="1942" b="413" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="51" wordPenalty="0" meanStrokeWidth="50">c</charParams>
                            <charParams l="1944" t="389" r="1954" b="413" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="33" serifProbability="44" wordPenalty="0" meanStrokeWidth="50">t</charParams>
                            <charParams l="1956" t="395" r="1973" b="420" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="12" wordPenalty="0" meanStrokeWidth="50">g</charParams>
                            <charParams l="1976" t="395" r="1992" b="413" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="39" wordPenalty="0" meanStrokeWidth="50">e</charParams>
                            <charParams l="1995" t="395" r="2012" b="420" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="12" wordPenalty="0" meanStrokeWidth="50">g</charParams>
                            <charParams l="2015" t="395" r="2031" b="413" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="39" wordPenalty="0" meanStrokeWidth="50">e</charParams>
                            <charParams l="2033" t="395" r="2050" b="413" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="5" wordPenalty="0" meanStrokeWidth="50">v</charParams>
                            <charParams l="2052" t="395" r="2068" b="413" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="39" wordPenalty="0" meanStrokeWidth="50">e</charParams>
                            <charParams l="2072" t="395" r="2088" b="413" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="0" wordPenalty="0" meanStrokeWidth="50">n</charParams>
                            <charParams l="2091" t="395" r="2107" b="413" wordStart="0" wordFromDictionary="0" wordNormal="1" wordNumeric="0" wordIdentifier="0" charConfidence="100" serifProbability="57" wordPenalty="0" meanStrokeWidth="50">s</charParams></formatting></line></par>
            </text>
        </block>
    </page>
</document>

2 Answers:

Answer 0 (score: 0)

I have not looked into why your XML is not being read, but an alternative solution is to use a regular expression.

library(stringr)

str_match(doclist, "<variantText>(.*)</variantText>")
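
One caveat with the pattern above: in the sample file the closing </variantText> tag sits on the line after the text, and "." does not match a newline by default, so "(.*)" may find nothing. The pattern also has to be applied to the XML text itself rather than to a file name. A minimal sketch of a more tolerant variant follows; xml_txt is a hypothetical, trimmed stand-in for the contents of one file (in practice it might come from something like paste(readLines(path), collapse = "\n")), and the names m, words and words_df are illustrative only.

```r
library(stringr)

# Hypothetical stand-in for the contents of one XML file (assumption:
# the real text would be read from disk into a single string)
xml_txt <- paste(
  '<wordRecVariant><variantText>Klantbetaalnummer<charParams>K</charParams>',
  '    </variantText>',
  '</wordRecVariant>',
  '<wordRecVariant><variantText>10450320<charParams>1</charParams>',
  '    </variantText>',
  '</wordRecVariant>',
  sep = "\n"
)

# Capture everything between <variantText> and the next "<",
# i.e. only the text that directly follows the opening tag
m <- str_match_all(xml_txt, "<variantText>([^<]*)")[[1]]

# Column 2 of the match matrix holds the first capture group
words <- trimws(m[, 2])

# Store the result in a one-column data frame
words_df <- data.frame(variantText = words, stringsAsFactors = FALSE)
print(words_df)
```

Stopping at the next "<" avoids having to match across lines or through the nested <charParams> elements. Regular expressions remain fragile on XML in general, so this is best treated as a quick workaround.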

Answer 1 (score: 0)

Your document has a namespace associated with it, so you need to specify the namespace in the path.
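
The namespace handling described in this answer could look roughly like the following with xml2. This is a sketch, not the answerer's own code: xml_string is a heavily trimmed, hypothetical stand-in for the FineReader file in the question, and the example relies on xml2 assigning the prefix d1 to the default namespace. The text()[1] step selects only the text directly after <variantText> and skips the nested <charParams> characters.

```r
library(xml2)

# Hypothetical, trimmed stand-in for the file in the question; it keeps
# the default namespace declaration, which is what makes "page" unfindable
xml_string <- '<?xml version="1.0" encoding="UTF-8"?>
<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml" version="1.0">
  <page>
    <variantText>Klantbetaalnummer<charParams>K</charParams><charParams>l</charParams>
    </variantText>
    <variantText>10450320<charParams>1</charParams><charParams>0</charParams>
    </variantText>
  </page>
</document>'

doc <- read_xml(xml_string)

# xml_ns() lists the namespaces; the default one is given the prefix d1
ns <- xml_ns(doc)

# An unprefixed path looks for elements in no namespace and finds nothing
no_ns_pages <- xml_find_all(doc, "//page")

# With the prefix, the page elements are found
pages <- xml_find_all(doc, "//d1:page", ns)

# First text node of each variantText, without the charParams children
nodes <- xml_find_all(doc, "//d1:variantText/text()[1]", ns)
variant_text <- trimws(xml_text(nodes))

# Store the result in a one-column data frame
result <- data.frame(variantText = variant_text, stringsAsFactors = FALSE)
print(result)
```

In the question's loop, doclist[i] would take the place of xml_string. Recent versions of xml2 also provide xml_ns_strip(), which removes the default namespace so that unprefixed paths such as //variantText work.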