Extracting XML from a CLOB column in R: sqlQuery problem

Time: 2020-10-14 19:14:54

Tags: sql r xml oracle xml-parsing

I have an Oracle database table named CRS.CRS_FILES with a column named FILE_DATA, a CLOB column that holds a large XML string.

FILE_DATA   FILE_CREATION_DATE
<?xml version="1.0" encoding="utf-8"?><REPORT   1/1/2020
<?xml version="1.0" encoding="utf-8"?><REPORT   1/5/2020
<?xml version="1.0" encoding="utf-8"?><REPORT   1/6/2019
<?xml version="1.0" encoding="utf-8"?><REPORT   1/1/2020
<?xml version="1.0" encoding="utf-8"?><REPORT   1/5/2020

Here are the first few lines of one of these XML documents:

<?xml version="1.0" encoding="utf-8" ?>
<REPORT xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/CrashReport.DataLayer.v20170201">
<CRSREPORTTIMESTAMP>2020-10-08T06:49:31.813812</CRSREPORTTIMESTAMP>
<AGENCYIDENTIFIER>MILWAUKEE</AGENCYIDENTIFIER>
<AGENCYNAME>Milwaukee Police Department</AGENCYNAME>

This is the XPath I want to query:

//REPORT/AGENCYIDENTIFIER

query_string2 <- "SELECT
XMLTYPE(t.FILE_DATA).EXTRACT('//REPORT/AGENCYNAME/text()').getClobVal()
FROM CRS.CRS_FILES t"
idtable <- sqlQuery(ch,query_string2, max=10)

query_string2 <- "SELECT
XMLTYPE(t.FILE_DATA).EXTRACT('//REPORT/AGENCYNAME/text()').getStringVal()
FROM CRS.CRS_FILES t"
idtable <- sqlQuery(ch,query_string2, max=10)

I'm not sure what I'm doing wrong. I know sqlQuery has some minor formatting quirks when it is passed a SQL query, but whatever I try, my results look like this:

XMLTYPE(T.FILE_DATA).EXTRACT('//REPORT/AGENCYNAME/TEXT()').GETCLOBVAL()
1   NA
2   NA
3   NA
4   NA
5   NA
6   NA
7   NA
8   NA
9   NA
10  NA

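What am I doing wrong? I just want to extract the value Milwaukee Police Department, as shown below. (Of course, I will rename the column to something like AGENCYNAME.)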

XMLTYPE(T.FILE_DATA).EXTRACT('//REPORT/AGENCYNAME/TEXT()').GETCLOBVAL()
1   Milwaukee Police Department
2   Milwaukee Police Department
3   Milwaukee Police Department
4   Milwaukee Police Department
5   Milwaukee Police Department
6   Milwaukee Police Department
7   Milwaukee Police Department
8   Milwaukee Police Department
9   Milwaukee Police Department
10  Milwaukee Police Department

2 answers:

Answer 0 (score: 2)

The EXTRACT(xml) function is deprecated. Use XMLTABLE instead:

SELECT x.agencyname
FROM   CRS.CRS_FILES c
       CROSS JOIN XMLTABLE(
         XMLNAMESPACES(
           'http://www.w3.org/2001/XMLSchema-instance' AS "i",
           DEFAULT 'http://schemas.datacontract.org/2004/07/CrashReport.DataLayer.v20170201'
         ),
         '/REPORT'
         PASSING XMLTYPE( c.file_data )
         COLUMNS
           crsreporttimestamp TIMESTAMP     PATH 'CRSREPORTTIMESTAMP',
           agencyidentifier   VARCHAR2(50)  PATH 'AGENCYIDENTIFIER',
           agencyname         VARCHAR2(100) PATH 'AGENCYNAME'
       ) x

Or, in R, the same query with the double quotes escaped:

query_string2 <- "SELECT x.agencyname
FROM   CRS.CRS_FILES c
       CROSS JOIN XMLTABLE(
         XMLNAMESPACES(
           'http://www.w3.org/2001/XMLSchema-instance' AS \"i\",
           DEFAULT 'http://schemas.datacontract.org/2004/07/CrashReport.DataLayer.v20170201'
         ),
         '/REPORT'
         PASSING XMLTYPE( c.file_data )
         COLUMNS
           crsreporttimestamp TIMESTAMP     PATH 'CRSREPORTTIMESTAMP',
           agencyidentifier   VARCHAR2(50)  PATH 'AGENCYIDENTIFIER',
           agencyname         VARCHAR2(100) PATH 'AGENCYNAME'
       ) x"

idtable <- sqlQuery(ch,query_string2, max=10)

Which, for your test data:

CREATE TABLE CRS.CRS_FILES ( FILE_DATA CLOB );

INSERT INTO CRS.crs_files VALUES (
'<?xml version="1.0" encoding="utf-8" ?>
<REPORT xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/CrashReport.DataLayer.v20170201">
  <CRSREPORTTIMESTAMP>2020-10-08T06:49:31.813812</CRSREPORTTIMESTAMP>-
  <AGENCYIDENTIFIER>MILWAUKEE</AGENCYIDENTIFIER>-
  <AGENCYNAME>Milwaukee Police Department</AGENCYNAME>
</REPORT>'
);

Outputs:

| AGENCYNAME                  |
| :-------------------------- |
| Milwaukee Police Department |

If you really do want to use EXTRACT, you need to specify the XML namespace:

SELECT XMLTYPE(t.FILE_DATA).EXTRACT(
         '//REPORT/AGENCYNAME/text()',
         'xmlns="http://schemas.datacontract.org/2004/07/CrashReport.DataLayer.v20170201"'
       ).getStringVal() AS agencyname
FROM   CRS.CRS_FILES t

Outputs:

| AGENCYNAME                  |
| :-------------------------- |
| Milwaukee Police Department |

db<>fiddle here

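Answer 1 (score: 1)

The problem is your Oracle query, not the RODBC::sqlQuery method. In short, your XPath does not account for the default namespace declared on the root node, so unprefixed names such as REPORT and AGENCYNAME match nothing. However, the XMLType extract() function lets you define a temporary prefix to use in the XPath: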

extract(XMLType_instance IN XMLType, 
        XPath_string IN VARCHAR2, 
        namespace_string IN VARCHAR2 := NULL) RETURN XMLType;

So once a prefix such as doc is defined, you can apply it in the XPath:

query_string2 <- "SELECT XMLTYPE(t.FILE_DATA).EXTRACT('//doc:REPORT/doc:AGENCYNAME/text()',
                           'xmlns:doc=\"http://schemas.datacontract.org/2004/07/CrashReport.DataLayer.v20170201\"').getStringVal()
                  FROM CRS.CRS_FILES t"

idtable <- sqlQuery(ch,query_string2, max=10)
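The effect of the default namespace can be reproduced outside Oracle. The sketch below uses Python's standard `xml.etree.ElementTree` rather than Oracle's XMLType, so it only illustrates the general XPath behaviour and is not the answer's own method. The helper `agency_name` and the prefix `doc` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

NS_URI = "http://schemas.datacontract.org/2004/07/CrashReport.DataLayer.v20170201"

SAMPLE = (
    '<REPORT xmlns:i="http://www.w3.org/2001/XMLSchema-instance" '
    f'xmlns="{NS_URI}">'
    "<AGENCYIDENTIFIER>MILWAUKEE</AGENCYIDENTIFIER>"
    "<AGENCYNAME>Milwaukee Police Department</AGENCYNAME>"
    "</REPORT>"
)


def agency_name(xml_text, namespaces=None):
    """Look up AGENCYNAME, with or without a namespace prefix mapping."""
    root = ET.fromstring(xml_text)
    # Without a mapping the path uses the bare name; with one it uses doc:.
    path = ".//doc:AGENCYNAME" if namespaces else ".//AGENCYNAME"
    node = root.find(path, namespaces)
    return None if node is None else node.text


# The bare name does not match, because the element actually lives in NS_URI.
unqualified = agency_name(SAMPLE)
# Binding a prefix to the same URI makes the path match.
qualified = agency_name(SAMPLE, {"doc": NS_URI})

print(unqualified)
print(qualified)
```

The unqualified lookup comes back empty in the same way the original query returned NA for every row, while the prefixed lookup returns the agency name.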

Online Demo (works for both getClobVal and getStringVal)