Qt RegExp: extract <p> tags from an HTML string

Time: 2016-08-02 23:47:38

Tags: html regex qt extract qregexp

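I have rich text: the HTML source of a QTextEdit, stored in a string. What I want to do is extract all of the lines one by one (I have 4-6 lines). The string looks like this: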

//html opening stuff
<p style = attributes...><span style = attributes...>My Text</span></p>
//more lines like this
//html closing stuff

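So I need each whole line, from the opening p tag to the closing p tag (tags included). I have checked and tried everything I found here and on other sites, but still with no result.

Here is my code ("htmlStyle" is the input string):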

QStringList list;
QRegExp rx("(<p[^>]*>.*?</p>)");
int pos = 0;

while ((pos = rx.indexIn(htmlStyle, pos)) != -1) {
    list << rx.cap(1);
    pos += rx.matchedLength();
}

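Or is there another way to do this without a regex?

3 answers:

Answer 0: (score: 2)

HTML/XML is not a regular grammar, so you cannot parse it with regular expressions. See, for example, this question. Parsing HTML is non-trivial.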

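You can use QTextDocument, QTextBlock, QTextCursor and so on to iterate over the paragraphs of a rich-text document. All of the HTML parsing is done for you. This is exactly the HTML subset that QTextEdit supports: it uses QTextDocument as its internal representation. You can get the document directly from the widget with QTextEdit::document(). E.g.: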

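void iterate(QTextEdit * edit) {
   auto const & doc = *edit->document();
   // QTextBlock::next() returns the following block without modifying the
   // current one, so its result has to be assigned back to the loop variable.
   for (auto block = doc.begin(); block != doc.end(); block = block.next()) {
      // do something with the text block, e.g. iterate its fragments
      for (auto fragment = block.begin(); fragment != block.end(); ++fragment) {
         // do something with the text fragment
      }
   }
}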

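Instead of parsing the HTML by hand, and getting it wrong, you should explore the structure of the QTextDocument and use it as needed.

Answer 1: (score: 1)

Below is a pure Java approach; I hope it helps: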

Answer 2: (score: 0)

For those who need a complete Qt solution, I worked one out based on @Aditya Poorna's answer. Thanks for the hint!

Here is the code:

int startIndex = htmlStyle.indexOf("<p");
int endIndex = htmlStyle.indexOf("</p>");

// Stop as soon as either tag is missing, so the length is never negative.
while (startIndex >= 0 && endIndex >= 0) {
    endIndex = endIndex + 4;  // include the 4 characters of "</p>"
    QStringRef subString(&htmlStyle, startIndex, endIndex - startIndex);
    qDebug() << subString;
    startIndex = htmlStyle.indexOf("<p", startIndex + 1);
    endIndex = htmlStyle.indexOf("</p>", endIndex + 1);
}

"QStringRef subString" refers to the part of "htmlStyle" that starts at "startIndex" and has a length of "endIndex - startIndex".
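For anyone who wants to try the same scan outside of Qt, here is a minimal sketch using only the C++ standard library. It is not the code from any of the answers: `extractParagraphs` is a hypothetical helper name, std::string stands in for QString, and it assumes lowercase tag names and no nested <p> elements, as in the sample from the question. Unlike the loop above, it also skips tags that merely start with "<p" (such as <pre>) and stops at an unterminated paragraph.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper (not part of Qt): returns every "<p ...>...</p>" span,
// tags included, in document order. Assumes lowercase tag names and that
// <p> elements are not nested.
std::vector<std::string> extractParagraphs(const std::string &html) {
    static const std::string closeTag = "</p>";
    std::vector<std::string> paragraphs;
    std::size_t start = html.find("<p");
    while (start != std::string::npos) {
        // Skip tags that merely begin with "<p", such as <pre> or <param>.
        const std::size_t next = start + 2;
        const bool isParagraph = next < html.size() &&
            (html[next] == '>' ||
             std::isspace(static_cast<unsigned char>(html[next])));
        if (!isParagraph) {
            start = html.find("<p", next);
            continue;
        }
        const std::size_t end = html.find(closeTag, next);
        if (end == std::string::npos)
            break;  // unterminated paragraph: stop rather than guess
        const std::size_t stop = end + closeTag.size();
        paragraphs.push_back(html.substr(start, stop - start));
        start = html.find("<p", stop);
    }
    return paragraphs;
}
```

Calling `extractParagraphs` on the HTML string gives one vector entry per paragraph, each running from its opening <p ...> to its closing </p>. This is still plain string scanning, so it inherits the limits mentioned in the first answer: it works for simple, generated markup like the sample in the question, not for arbitrary HTML.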