Here is an example of the input file:
<html xml:lang="en" lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
</head>
<body>
HERE IS A LOT OF TEXT, THAT IS NOT INTERESTING
<br>
<div id="text"><div id="text-interesting1">11/222-AA</div>
<h2>This is the title</h2>
<P>Here is some multiline desc-<br>
cription about what is <br><br>
going on here
</div>
<div id="text2"><div id="text-interesting2">IV-VI</div>
<br>
<h1> Some really interesting text</h1>
</body>
</html>
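Now I want to grep several blocks out of this file, for example the text between <div id="text-interesting1"> and </div>, then between <P> and </div>, then between <div id="text-interesting2"> and </div>, and more. The point is that I want to retrieve multiple values.
I would like to write these values to a file, for example separated by commas. How can I do that?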
Based on the example Luke provided, I wrote the following:
import os, re
path = 'C:/Temp/Folder1/allTexts'
listing = os.listdir(path)
for infile in listing:
    text = open(path + '/' + infile).read()
    match = re.search('<div id="text-interesting1">', text)
    if match is None:
        continue
    start = match.end()
    end = re.search('</div>', text).start()
    print(text[start:end])
    match = re.search('<h2>', text)
    if match is None:
        continue
    start = match.end()
    end = re.search('</h2>', text).start()
    print(text[start:end])
    match = re.search('<P>', text)
    if match is None:
        continue
    start = match.end()
    end = re.search('</div>', text).start()
    print(text[start:end])
    match = re.search('<div id="text-interesting2">', text)
    if match is None:
        continue
    start = match.end()
    end = re.search('</div>', text).start()
    print(text[start:end])
    match = re.search('<h1>', text)
    if match is None:
        continue
    start = match.end()
    end = re.search('</h1>', text).start()
    print(text[start:end])
    print('--------------------------------------')
Why do parts of it not work?
Answer 0 (score: 1)
Here is a start:
import os, re
path = 'C:/Temp/Folder1/allTexts'
listing = os.listdir(path)
for infile in listing:
    text = open(path + '/' + infile).read()
    # Find the opening tag of the first block; skip files that lack it.
    match = re.search('<div id="text-interesting1">', text)
    if match is None:
        continue
    start = match.start()
    # Slice up to the point where the second block begins.
    end = re.search('<div id="text-interesting2">', text).start()
    print(text[start:end])
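In the question's code, re.search('</div>', text) always scans from the beginning of the text, so it keeps returning the first </div>. For the <P> and text-interesting2 blocks that position lies before the start position, and the slice comes out empty. The sketch below is one possible way around this, not the approach from either answer: it looks for each closing marker only after its opening marker, using plain str.find instead of re because the markers are literal strings. The names SAMPLE, MARKERS, extract_blocks and to_csv_line are made up for illustration, and Python 3 is assumed.

```python
import csv
import io

# Sample text mirroring the question's input (shortened).
SAMPLE = '''<div id="text"><div id="text-interesting1">11/222-AA</div>
<h2>This is the title</h2>
<P>Here is some multiline desc-<br>
cription about what is <br><br>
going on here
</div>
<div id="text2"><div id="text-interesting2">IV-VI</div>
<br>
<h1> Some really interesting text</h1>
'''

# (opening marker, closing marker) pairs, in the order they are extracted.
MARKERS = [
    ('<div id="text-interesting1">', '</div>'),
    ('<h2>', '</h2>'),
    ('<P>', '</div>'),
    ('<div id="text-interesting2">', '</div>'),
    ('<h1>', '</h1>'),
]

def extract_blocks(text, markers=MARKERS):
    """Return the text between each marker pair, or '' if a pair is missing."""
    values = []
    for opening, closing in markers:
        start = text.find(opening)
        if start == -1:
            values.append('')
            continue
        start += len(opening)
        # Search for the closing marker only after the opening one.
        end = text.find(closing, start)
        values.append(text[start:end].strip() if end != -1 else '')
    return values

def to_csv_line(values):
    """Format the values as one comma-separated line, quoting where needed."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator='').writerow(values)
    return buf.getvalue()

row = extract_blocks(SAMPLE)
print(to_csv_line(row))
```

To process a whole folder, the same two functions could be called once per file inside the os.listdir loop, appending each line to an output file. The csv module is used here so that values containing commas or line breaks (such as the multiline description) are quoted rather than breaking the column layout.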
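Answer 1 (score: 0)
Another strategy is to parse the file as XML. You will need to tidy the file first, because strict XML requires matching tags, consistent case, and so on. Here is an example: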
from xml.etree import ElementTree
from io import StringIO
import sys
# Read the tidied document from standard input and parse it as XML.
tree = ElementTree.ElementTree()
tree.parse(StringIO(sys.stdin.read()))
print("All tags:")
for e in tree.iter():
    print(e.tag)
    print(e.text)
print("Only div:")
# Tag names carry the XHTML namespace declared on the root element.
for i in tree.find("{http://www.w3.org/1999/xhtml}body").findall("{http://www.w3.org/1999/xhtml}div"):
    print(i.text)
With the file slightly modified:
<html xml:lang="en" lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
</head>
<body>
HERE IS A LOT OF TEXT, THAT IS NOT INTERESTING
<br></br>
<div id="text"><div id="text-interesting1">11/222-AA</div>
<h2>This is the title</h2>
<p>Here is some multiline desc-<br></br>
cription about what is <br></br><br></br>
going on here</p>
</div>
<div id="text-interesting2">IV-VI</div>
<br></br>
<h1> Some really interesting text</h1>
</body>
</html>
Sample output:
> cat file.xml | ./tb.py
All tags:
{http://www.w3.org/1999/xhtml}html
{http://www.w3.org/1999/xhtml}head
{http://www.w3.org/1999/xhtml}body
HERE IS A LOT OF TEXT, THAT IS NOT INTERESTING
{http://www.w3.org/1999/xhtml}br
None
{http://www.w3.org/1999/xhtml}div
None
{http://www.w3.org/1999/xhtml}div
11/222-AA
{http://www.w3.org/1999/xhtml}h2
This is the title
{http://www.w3.org/1999/xhtml}p
Here is some multiline desc-
{http://www.w3.org/1999/xhtml}br
None
{http://www.w3.org/1999/xhtml}br
None
{http://www.w3.org/1999/xhtml}br
None
{http://www.w3.org/1999/xhtml}div
IV-VI
{http://www.w3.org/1999/xhtml}br
None
{http://www.w3.org/1999/xhtml}h1
Some really interesting text
Only div:
None
IV-VI
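To get from this listing to the comma-separated values the question asks for, one option is to pick the div elements by their id attribute instead of printing every tag. The following is only a sketch under that assumption: values_by_id and TIDIED are illustrative names, the embedded document is a shortened copy of the tidied file, and Python 3 is assumed.

```python
from xml.etree import ElementTree

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# A shortened copy of the tidied file shown above.
TIDIED = """<html xml:lang="en" lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
</head>
<body>
<div id="text"><div id="text-interesting1">11/222-AA</div>
<h2>This is the title</h2>
</div>
<div id="text-interesting2">IV-VI</div>
<h1> Some really interesting text</h1>
</body>
</html>"""

def values_by_id(root, wanted_ids):
    """Return the stripped text of the div with each wanted id, '' if absent."""
    found = {}
    # iter() walks the whole tree, so nested divs are reached as well.
    for div in root.iter(XHTML_NS + "div"):
        div_id = div.get("id")
        if div_id in wanted_ids:
            found[div_id] = (div.text or "").strip()
    return [found.get(i, "") for i in wanted_ids]

root = ElementTree.fromstring(TIDIED)
values = values_by_id(root, ["text-interesting1", "text-interesting2"])
print(",".join(values))
```

Unlike the "Only div" loop, which looks only at direct children of body, iter() also reaches the nested text-interesting1 div.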
However, a lot of HTML is hard to parse as strict XML, so this may be difficult to apply in your case.
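If tidying the files is not practical, a more tolerant parser may be worth trying. The sketch below swaps in Python 3's standard html.parser module, which does not require matching tags; it is a different technique from the ElementTree approach above, and the class name TextById and the sample string UNTIDY are assumptions made for illustration. It collects only the text that directly follows a start tag carrying one of the wanted ids.

```python
from html.parser import HTMLParser

class TextById(HTMLParser):
    """Collect the text that directly follows a start tag with a wanted id."""

    def __init__(self, wanted_ids):
        super().__init__()
        self.wanted_ids = set(wanted_ids)
        self.current_id = None  # id whose text is currently being collected
        self.values = {}

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        # Any new start tag ends the previous collection.
        self.current_id = element_id if element_id in self.wanted_ids else None

    def handle_endtag(self, tag):
        self.current_id = None

    def handle_data(self, data):
        if self.current_id is not None:
            previous = self.values.get(self.current_id, "")
            self.values[self.current_id] = previous + data

# Shortened, untidied input in the style of the question (unclosed <P>, <br>, <div>).
UNTIDY = '''<body>
<br>
<div id="text"><div id="text-interesting1">11/222-AA</div>
<P>Here is some multiline desc-<br>
going on here
</div>
<div id="text2"><div id="text-interesting2">IV-VI</div>
<br>
</body>'''

parser = TextById(["text-interesting1", "text-interesting2"])
parser.feed(UNTIDY)
parser.close()
print(parser.values)
```

Because the parser never checks that tags are balanced, the unclosed elements in the input do not raise an error; the trade-off is that the handler has to track by itself which element the text belongs to.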