Question

As the title says, I need to extract the contents of certain fields from a long text.

My text looks like this:

Name: David Jones
Office Address: 148 Hulala Street Date: 24/11/2013
Agent No: 1234,
Address: 259 Yolo Road Start Date: 22/11/2013 Due Date: 29/11/2013
Type: Human Properties: None Ago: 29

The text contains these labels for specific fields:

Name, Office Address, Date, Agent No, Address, Type, Properties, Age

The result I want is:

Name: 'David Jones',
Office Address: '148 Hulala Street',
Date: '24/11/2013',
Agent No: '1234',
Address: '259 Yolo Road',
Type: 'Human',
Properties: 'None',
Age: ''

Each field's content should be fully parsed out. An important thing to note here is that the original text may contain typos (e.g., Ago instead of Age) and extra fields that do not exist in the list of labels (e.g., Start Date and Due Date are not in the label list). The code should therefore ignore any text that does not match and try to find only the matching results.

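I tried to solve this by looping over each line, checking whether the line contains a field, and then checking whether the same line contains more fields.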

This is the code I have so far.

// Labels to look for; a label is removed from this list once it has been matched.
var structure = ['Name', 'Office Address', 'Date', 'Agent No', 'Address', 'Type', 'Properties', 'Age'];
var obj = {};
var i, j, matchingFields;

// textLines is assumed to be the input text split into an array of lines.
for (i = 0; i < textLines.length; i++) {
  matchingFields = [];

  // Collect the labels that appear on this line, in order of appearance.
  for (j = 0; j < structure.length; j++) {
    if (textLines[i].indexOf(structure[j] + ':') !== -1) {
      if (matchingFields.length === 0 && textLines[i].indexOf(structure[j] + ':') === 0) {
        matchingFields.push(structure[j]);
        structure.splice(structure.indexOf(structure[j--]), 1);
      } else if (textLines[i].indexOf(structure[j] + ':') > textLines[i].indexOf(matchingFields[matchingFields.length - 1])) {
        matchingFields.push(structure[j]);
        structure.splice(structure.indexOf(structure[j--]), 1);
      }
    }
  }

  // Slice the value of each matched label out of the line.
  for (j = 0; j < matchingFields.length; j++) {
    if (j !== matchingFields.length - 1) {
      // The value runs up to the start of the next matched label.
      obj[matchingFields[j]] = textLines[i].slice(textLines[i].indexOf(matchingFields[j]) + matchingFields[j].length, textLines[i].indexOf(matchingFields[j + 1]));
    } else {
      // Last label on the line: the value runs to the end of the line.
      obj[matchingFields[j]] = textLines[i].slice(textLines[i].indexOf(matchingFields[j]) + matchingFields[j].length);
    }

    // Strip the colon and a single leading or trailing space.
    obj[matchingFields[j]] = obj[matchingFields[j]].replace(':', '');
    if (obj[matchingFields[j]].indexOf(' ') === 0) {
      obj[matchingFields[j]] = obj[matchingFields[j]].replace(' ', '');
    }
    if (obj[matchingFields[j]].charAt(obj[matchingFields[j]].length - 1) === ' ') {
      obj[matchingFields[j]] = obj[matchingFields[j]].slice(0, obj[matchingFields[j]].length - 1);
    }
  }
}

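It works in some cases, but because both 'Office Address: ' and 'Address: ' exist, the value for 'Address: ' ends up under 'Office Address:'. The code also looks messy and ugly, and it feels like brute forcing.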
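A minimal illustration of that overlap, using one line of the sample text (the variable names are only for this example): a plain substring search for 'Address:' also succeeds inside 'Office Address:'.

```javascript
var line = 'Office Address: 148 Hulala Street Date: 24/11/2013';

var officeAt = line.indexOf('Office Address:'); // 0
var addressAt = line.indexOf('Address:');       // 7, found inside "Office Address:"

// True when the shorter label was found within the span of the longer one.
var overlaps = addressAt >= officeAt &&
               addressAt < officeAt + 'Office Address:'.length;

console.log(officeAt, addressAt, overlaps);
```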

I think there should be a better way, for example using regular expressions or something similar, but without external libraries.

If you have any ideas, I would appreciate you sharing them.

Answer 1

This might help:

> a.slice(a.indexOf("Name"), a.indexOf("Office Address")).split(":")
["Name", " David Jones "]
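The same idea can be wrapped in a small helper that slices the text between one label and the next. This is only a sketch: `extractBetween` is a hypothetical name, and it assumes you already know which label follows which.

```javascript
// Hypothetical helper: returns the trimmed text between "startLabel:" and
// the next "endLabel:" (or the end of the text if endLabel is not found).
function extractBetween(text, startLabel, endLabel) {
  var start = text.indexOf(startLabel + ':');
  if (start === -1) return '';
  start += startLabel.length + 1; // skip past "Label:"
  var end = endLabel ? text.indexOf(endLabel + ':', start) : -1;
  return (end === -1 ? text.slice(start) : text.slice(start, end)).trim();
}

var a = 'Name: David Jones\nOffice Address: 148 Hulala Street Date: 24/11/2013';
var name = extractBetween(a, 'Name', 'Office Address');
var office = extractBetween(a, 'Office Address', 'Date');
console.log(name, '|', office);
```

Because the caller has to supply the neighbouring label, this does not by itself cope with typos or unknown fields in the text.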

Answer 2

Assuming the properties are separated by newlines, you can build an object that maps each property to its value like this:

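var str = "Name: David Jones\nOffice Address: 148 Hulala Street\nDate: 24/11/2013\nAgent No: 1234,\nAddress: 259 Yolo Road\nType: Human Properties: None Age: 29";
var output = {};

// Each line is expected to look like "Label: value".
str.split(/\n/).forEach(function (item) {
    var match = item.match(/([A-Za-z\s]*):\s([A-Za-z0-9\s\/]*)/);
    // Skip lines without a "Label: value" pair instead of throwing on null.
    if (match) {
        // Only the first "Label: value" pair on each line is captured.
        output[match[1]] = match[2];
    }
});

console.log(output);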
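For lines that hold several fields, one possible variation is to build the regular expression from the label list itself. The sketch below is not a definitive solution and rests on two assumptions that are not in the question: any other single word followed by a colon (such as "Ago:") is treated as an unknown label that ends the previous value, and multi-word labels to skip (such as "Start Date") are passed in a hypothetical `ignored` list. The names `parseFields` and `escapeRegExp` are also made up for this example.

```javascript
var labels = ['Name', 'Office Address', 'Date', 'Agent No', 'Address', 'Type', 'Properties', 'Age'];
// Assumption: multi-word labels that may occur in the text but are not wanted.
var ignored = ['Start Date', 'Due Date'];

// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function parseFields(text, labels, ignored) {
  var known = labels.concat(ignored || []);
  // The engine reports the leftmost match, so "Office Address:" is found
  // before the "Address:" inside it. The [A-Za-z]+ fallback treats any other
  // single word followed by ":" as an unknown label (for example "Ago").
  var re = new RegExp('\\b(' + known.map(escapeRegExp).join('|') + '|[A-Za-z]+):', 'g');

  var result = {};
  labels.forEach(function (label) { result[label] = ''; });

  text.split(/\r?\n/).forEach(function (line) {
    var tokens = [];
    var m;
    while ((m = re.exec(line)) !== null) {
      tokens.push({ label: m[1], start: m.index, end: re.lastIndex });
    }
    tokens.forEach(function (t, i) {
      // A value runs from the end of its label to the start of the next one.
      var stop = i + 1 < tokens.length ? tokens[i + 1].start : line.length;
      var value = line.slice(t.end, stop).trim().replace(/,$/, '');
      // Keep wanted labels only; the first occurrence of a label wins.
      if (labels.indexOf(t.label) !== -1 && result[t.label] === '') {
        result[t.label] = value;
      }
    });
  });
  return result;
}

var text = [
  'Name: David Jones',
  'Office Address: 148 Hulala Street Date: 24/11/2013',
  'Agent No: 1234,',
  'Address: 259 Yolo Road Start Date: 22/11/2013 Due Date: 29/11/2013',
  'Type: Human Properties: None Ago: 29'
].join('\n');

var parsed = parseFields(text, labels, ignored);
console.log(parsed);
```

Without the `ignored` list, "Start Date:" would be read as the value "Start" followed by the known label "Date:", so 'Address' would come out as '259 Yolo Road Start'. Values that themselves contain a word followed by a colon would also be cut short.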

How to parse & format text content into an object
