Question

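I have some text data scraped from an online resource using CasperJS and PhantomJS. The data is reasonably well structured but contains a lot of unwanted content, for example: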

bla bla bla text
more bla bla bla
John Smith
article
April 25 at 5:00pm · 
5 tests, 6 thumbs up, bla bla bla text

John Smith
another good article
April 25 at 6:00pm · 
3 tests, 4 thumbs-up, some more bla bla bla text
John Smith
another article
April 25 at 9:00pm · 
7 tests, 8 thumbs-up, and even more bla bla bla text

lots of bla bla bla text

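What I need to extract is only the following fields:

Name, e.g. John Smith
Article, e.g. another article
Date, e.g. April 25 at 9:00pm
Number of tests, e.g. 7 tests
Number of thumbs-up, e.g. 8 thumbs-up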

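As you can see, they repeat in a fixed pattern, where John Smith is a person with many articles, and each article has its own attributes (article content, date, number of tests, and number of thumbs-up). I need to pull this data out of all the surrounding junk and put it into XML according to the following XSD: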

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="author">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element ref="article" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="article">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="content" type="xs:string"/>
        <xs:element name="issueDate" type="xs:dateTime"/>
        <xs:element name="tests" type="xs:integer"/>
        <xs:element name="thumbsup" type="xs:integer"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

Is there any way to achieve this using bash, an XML mapper, or any other utility?
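To make the intended mapping concrete, here is a minimal Python sketch rather than a definitive solution. It assumes each record is exactly four consecutive non-empty lines (name, article, a date line ending in "·", then a counts line), that month names are in English, and that the year is supplied by the caller, since the scraped dates carry none. The names `extract_authors` and `build_author_elements` and the `year` parameter are hypothetical and chosen only for illustration.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime

# Date line such as "April 25 at 9:00pm ·" (the trailing dot is optional here).
DATE_RE = re.compile(r"^([A-Z][a-z]+ \d{1,2}) at (\d{1,2}:\d{2}[ap]m)\s*·?\s*$")
# Counts line such as "7 tests, 8 thumbs-up, ..." or "5 tests, 6 thumbs up, ...".
COUNTS_RE = re.compile(r"^(\d+) tests?, (\d+) thumbs[- ]up")


def extract_authors(text, year):
    """Return {author name: [article dict, ...]} in order of appearance.

    `year` is an assumption supplied by the caller; the source text has no year.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    authors = {}
    for i, line in enumerate(lines):
        date_match = DATE_RE.match(line)
        # A record needs two lines before the date line and one after it.
        if not date_match or i < 2 or i + 1 >= len(lines):
            continue
        counts_match = COUNTS_RE.match(lines[i + 1])
        if not counts_match:
            continue
        issued = datetime.strptime(
            f"{date_match.group(1)} {year} {date_match.group(2)}",
            "%B %d %Y %I:%M%p",
        )
        authors.setdefault(lines[i - 2], []).append({
            "content": lines[i - 1],
            "issueDate": issued.isoformat(),  # xs:dateTime lexical form
            "tests": int(counts_match.group(1)),
            "thumbsup": int(counts_match.group(2)),
        })
    return authors


def build_author_elements(authors):
    """Build one <author> element per name, children in the XSD's sequence order."""
    elements = []
    for name, articles in authors.items():
        author = ET.Element("author")
        ET.SubElement(author, "name").text = name
        for art in articles:
            node = ET.SubElement(author, "article")
            for field in ("content", "issueDate", "tests", "thumbsup"):
                ET.SubElement(node, field).text = str(art[field])
        elements.append(author)
    return elements
```

Each element returned by `build_author_elements` can be serialized with `ET.tostring(element, encoding="unicode")`. The XSD declares `author` as a root element, so several authors would mean either one document per author or a wrapper element that the schema does not define.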

Thanks a lot.

Extracting data from a reasonably structured text file and mapping it to XML

0 answers: