Regex - Can't get HTML table rows from a string created from a file, but can from a string in code

Time: 2017-03-19 18:39:17

Tags: html regex parsing

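I am trying to extract all table rows from an HTML file. I read the HTML file into a string and then parse it. When I parse that string, it never finds any table rows, but when I run the exact same regular expression against a hard-coded string with the same content as the file, it works.

I have attached a playground that illustrates the problem. I don't know why it works for my hard-coded string (which I copied from the file) but not for the string created by reading the file.

Note: the contents of the file are exactly the same as the string I use in the code. If someone can tell me how to attach a file to a question, I will attach the whole playground file.

Any ideas?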

import Foundation

extension String
{
    func captureGroups(withRegex pattern: String, withStartPos startPos: inout Int) -> [String]
    {
        var results = [String]()

        var regex: NSRegularExpression

        // NSRegularExpression's initializer throws if the pattern is invalid, so catch that here
        do {
            regex = try NSRegularExpression(pattern: pattern, options: [])
        }
        catch {
            return results
        }

        // NSRange is measured in UTF-16 code units, so use utf16.count rather than characters.count
        let matches = regex.matches(in: self, options: [],
                                    range: NSRange(location: startPos, length: self.utf16.count - startPos))

        // Reset the string position to the end of the currently matched expression.
        // This allows me to find the next thing in the string from where I left off.
        if let posFound = matches.first?.range.location
        {
            startPos = posFound + matches.first!.range.length   // Start at end of last
        }

        guard let match = matches.first
        else { return results }

        let lastRangeIndex = match.numberOfRanges - 1
        guard lastRangeIndex >= 1
        else { return results }

        for i in 1...lastRangeIndex {
            let capturedGroupIndex = match.rangeAt(i)
            let matchedString = (self as NSString).substring(with: capturedGroupIndex)
            results.append(matchedString)
        }

        return results
    }
} // extension String

var contents = ""
let path = Bundle.main.path(forResource: "testTR", ofType: "html")!

do {
    contents = try String(contentsOfFile: path)
    print("CONTENTS: \(contents)")
}
catch {
    print("file not found")
}

var myStartPos: Int = 0
var foundMatch: [String]

foundMatch = contents.captureGroups(withRegex: "<tr>(.*)</tr>", withStartPos: &myStartPos)
if foundMatch.isEmpty{
    print("Didnt find any rows ???")
}

myStartPos = 0
foundMatch = "<tr><td><strong>Total</strong></td><td><strong>1.2 mi</strong></td><td><strong>22:12</strong></td><td><strong>22:12</strong></td><td><strong>1:08/100m</strong></td><td><strong>1</strong></td><td><strong>2</strong></td><td><strong>4</strong></td></tr>".captureGroups(withRegex: "<tr>(.*)</tr>", withStartPos: &myStartPos)

Here are the contents of the file I am using:

        <tr>
            <td><strong>Total</strong></td>
            <td><strong>1.2 mi</strong></td>
          <td><strong>22:12</strong></td>
            <td><strong>22:12</strong></td>
            <td><strong>1:08/100m</strong></td>
          <td><strong>1</strong></td>
          <td><strong>2</strong></td>
          <td><strong>4</strong></td>
        </tr>

1 Answer:

Answer 0: (score: 1)

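Try using

<tr>((.|\n)*)</tr>

or

<tr>((.|\n|\r)*)</tr>

By default, '.' does not match line-break characters, so (.*) can only match within a single line. In the file, <tr> and </tr> sit on different lines, whereas the hard-coded string is all on one line, which is why only the hard-coded string matched.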
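
As an alternative to spelling out the newline alternation, NSRegularExpression has a .dotMatchesLineSeparators option that lets '.' match line breaks. The following is a minimal sketch rather than the asker's code: it assumes Swift 4 or later (for NSRange(_:in:) and range(at:)), and firstCapture is a hypothetical helper introduced only for illustration.

```swift
import Foundation

// Hypothetical helper (not from the question): returns the first capture group
// of the first match, or nil when the pattern does not match.
func firstCapture(in text: String, pattern: String,
                  options: NSRegularExpression.Options = []) -> String? {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: options) else {
        return nil
    }
    // NSRange counts UTF-16 code units, so derive it from the String's own range.
    let fullRange = NSRange(text.startIndex..<text.endIndex, in: text)
    guard let match = regex.firstMatch(in: text, options: [], range: fullRange),
          match.numberOfRanges > 1,
          let captured = Range(match.range(at: 1), in: text) else {
        return nil
    }
    return String(text[captured])
}

// A shortened stand-in for the file contents (multi-line) and the hard-coded string (single line).
let multiLine = "<tr>\n    <td>Total</td>\n</tr>"
let singleLine = "<tr><td>Total</td></tr>"

// Default options: '.' stops at line breaks, so the multi-line input does not match.
let plain = firstCapture(in: multiLine, pattern: "<tr>(.*)</tr>")
let plainSingle = firstCapture(in: singleLine, pattern: "<tr>(.*)</tr>")

// With .dotMatchesLineSeparators, '.' also matches "\n"; ".*?" keeps the match to one row.
let dotAll = firstCapture(in: multiLine, pattern: "<tr>(.*?)</tr>",
                          options: [.dotMatchesLineSeparators])

print(plain as Any, plainSingle as Any, dotAll as Any)
```

The sketch uses the non-greedy (.*?) because, once '.' can cross line breaks, a greedy (.*) would run from the first <tr> to the last </tr> in a file that contains several rows.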