Question

I am trying to parse an HTML page with the HTMLParser library in Python 3. The method HTMLParser.getpos() returns the line number and offset of the last tag parsed.

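For example, I know that the "string" I want starts at line 10, offset 5, and ends at line 30, offset 10. How do I get the substring from line 10, offset 5 to line 30, offset 10?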

Thanks.

html = 'this holds the entire html code'
MyParser.feed(html) #now the parser works its magic
start = (10,5) #this is returned from HTMLParser.getpos(), 10 is the line number and 5 is the offset of that line
end = (30,10) #same here
#I want to do something like this (I know this is invalid python code)
substring = html.substring(start,end) #return the html code as a string from line 10 offset 5 to line 30 offset 10

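A better explanation:

I am trying to get a substring from a string.

As I understand it, in Python 3 this is called a slice: string[a:b]. So if I want the substring 'jonny' from the string 'Hello jonny smith', I would do: substring = 'Hello jonny smith'[6:11]. The problem is that HTMLParser.getpos() returns a tuple (line number, offset within that line), so I cannot do: substring = multy_line_string[line number:offset]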

Answer 1

Assuming you are interested in HTML parsing, try lxml -> http://docs.python-guide.org/en/latest/scenarios/scrape/
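If you would rather stay with HTMLParser, one way to answer the slicing question directly is to turn each (line, offset) tuple into an absolute index into the string and then slice as usual. Below is a minimal sketch, assuming that getpos() reports a 1-based line number and a 0-based offset, that lines are separated by "\n", and that the whole document is fed to the parser in one call. The names pos_to_index and PositionRecorder are made up for this illustration and are not part of the library.

```python
from html.parser import HTMLParser


def pos_to_index(text, pos):
    """Convert a (line, offset) tuple into an absolute index into text.

    Assumes line is 1-based and offset is 0-based, as getpos() appears
    to report them, and that lines are separated by "\n".
    """
    line, offset = pos
    index = 0
    # Skip past (line - 1) newline characters to reach the start of the line.
    for _ in range(line - 1):
        index = text.index("\n", index) + 1  # ValueError if the line does not exist
    return index + offset


class PositionRecorder(HTMLParser):
    """Hypothetical parser that remembers where the first <p> element starts and ends."""

    def __init__(self):
        super().__init__()
        self.start = None
        self.end = None

    def handle_starttag(self, tag, attrs):
        if tag == "p" and self.start is None:
            self.start = self.getpos()  # position of the "<" of <p>

    def handle_endtag(self, tag):
        if tag == "p" and self.end is None:
            self.end = self.getpos()  # position of the "<" of </p>


html = "<html>\n<body>\n  <p>Hello\n  jonny</p>\n</body>\n</html>"
parser = PositionRecorder()
parser.feed(html)

# Slice from the opening tag up to, but not including, the closing tag.
substring = html[pos_to_index(html, parser.start):pos_to_index(html, parser.end)]
print(repr(substring))
```

With the positions from the question, the same helper would be used as html[pos_to_index(html, (10, 5)):pos_to_index(html, (30, 10))]. Newlines are located with str.index rather than str.splitlines because splitlines also breaks on other separators, which could shift the line count away from what the parser reported.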

Substring in Python 3 given line number and offset
