Python - Extract text from multiple strings in multiple files

Time: 2017-06-17 14:28:39

Tags: python python-2.7

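Python gurus, I need to extract all the text from List up to URL; an example of the pattern is shown below. I also want the script to loop over all the files in a folder.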

 .....
 .....
 <List>Product Line</List>
 <URL>http://teamspace.abb.com/sites/Product</URL>
 ...
 ...
 <List>Contact Number</List>
 <URL>https://teamspace.abb.com/sites/Contact</URL>
 ....
 ....

Expected output

<List>Product Line</List>
<URL>http://teamspace.abb.com/sites/Product</URL>
<List>Contact Number</List>
<URL>https://teamspace.abb.com/sites/Contact</URL>

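I have written a script that loops over all the files in a folder and extracts every line containing the List keyword, but I can't get it to include the URL. Any help is much appreciated.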

import os

# defining location of parent folder (raw strings keep the backslashes literal)
BASE_DIRECTORY = r'C:\D_Drive\Projects\Test'
output_file = open(r'C:\D_Drive\Projects\Test\Output.txt', 'w')
output = {}
file_list = []

# scanning through sub folders
for (dirpath, dirnames, filenames) in os.walk(BASE_DIRECTORY):
    for f in filenames:
        if 'xml' in str(f):
            e = os.path.join(str(dirpath), str(f))
            file_list.append(e)

# collecting the <List> lines of each file (the <URL> lines are not captured yet)
for f in file_list:
    print f
    txtfile = open(f, 'r')
    output[f] = []
    for line in txtfile:
        if '<List>' in line:
            output[f].append(line)

tabs = []
for tab in output:
    tabs.append(tab)

# writing the results, one section per file, sorted by path
tabs.sort()
for tab in tabs:
    output_file.write(tab + '\n')
    output_file.write('\n')
    for row in output[tab]:
        output_file.write(row + '')
    output_file.write('\n')
    output_file.write('----------------------------------------------------------\n')

raw_input()

Sample file

3 Answers:

Answer 0: (score: 2)

Try xml.etree.ElementTree:

import xml.etree.ElementTree as ET
tree = ET.parse('Product_Workflow.xml')
root = tree.getroot()
with open('Output.txt','w') as opfile:
    for l,u in zip(root.iter('List'),root.iter('URL')):
        opfile.write(ET.tostring(l).strip())
        opfile.write('\n')
        opfile.write(ET.tostring(u).strip())
        opfile.write('\n')

Output.txt will then contain:

<List>Emove</List>
<URL>http://teamspace.abb.com/sites/Product</URL>
<List>Asset_KWT</List>
<URL>https://teamspace.slb.com/sites/Contact</URL>
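
The snippet above parses a single file. A minimal sketch of extending it to every XML file under a folder, as the question asks, might look like the following. It is written for Python 3, where `ET.tostring` needs `encoding='unicode'` to return text rather than bytes. The function name `extract_pairs` is hypothetical, and the sketch assumes each file is well-formed XML in which the `List` and `URL` elements appear in matching order.

```python
import os
import xml.etree.ElementTree as ET


def extract_pairs(base_directory):
    """Map each .xml file path to its serialized <List>/<URL> elements.

    Hypothetical helper for illustration; not part of the original answer.
    """
    results = {}
    for dirpath, dirnames, filenames in os.walk(base_directory):
        for name in filenames:
            if not name.lower().endswith('.xml'):
                continue
            path = os.path.join(dirpath, name)
            root = ET.parse(path).getroot()
            lines = []
            # Pair the i-th <List> with the i-th <URL>, as in the answer above.
            for lst, url in zip(root.iter('List'), root.iter('URL')):
                lines.append(ET.tostring(lst, encoding='unicode').strip())
                lines.append(ET.tostring(url, encoding='unicode').strip())
            results[path] = lines
    return results
```

Calling `extract_pairs(BASE_DIRECTORY)` returns a dictionary that can be written out per file in the same way the question's script writes its `output` dictionary. A file that is not valid XML will make `ET.parse` raise `ParseError`, so a real script may want to catch that.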

Answer 1: (score: 1)

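Your code is mostly correct; you just need to create an iterator for the file so that you can pull the next line yourself. You could use ElementTree or Beautiful Soup, but understanding this kind of iteration also helps when the file is not XML or HTML.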

txtfile = iter(open(f, 'r'))  # change here
output[f] = []
for line in txtfile:
    if '<List>' in line:
        output[f].append(line)
        output[f].append(next(txtfile))  # and here
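
The same idea can be tried without any files on disk. The sketch below walks an iterator of lines and, on each `<List>` line, also takes the line that follows. The name `list_with_next` is hypothetical, and passing a default to `next()` is an added safeguard (not in the answer) so that a `<List>` on the very last line does not raise `StopIteration`. It assumes the `<URL>` line always directly follows its `<List>` line.

```python
def list_with_next(lines):
    """Collect each line containing '<List>' plus the line right after it."""
    it = iter(lines)  # one shared iterator, so next() also advances the for loop
    collected = []
    for line in it:
        if '<List>' in line:
            collected.append(line)
            following = next(it, None)  # None if '<List>' was the last line
            if following is not None:
                collected.append(following)
    return collected


sample = [
    ' .....\n',
    ' <List>Product Line</List>\n',
    ' <URL>http://teamspace.abb.com/sites/Product</URL>\n',
    ' ...\n',
    ' <List>Contact Number</List>\n',
    ' <URL>https://teamspace.abb.com/sites/Contact</URL>\n',
]
print(''.join(list_with_next(sample)))
```

Because the `for` loop and `next()` share one iterator, the line consumed by `next()` is not visited again by the loop.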

Answer 2: (score: 1)

You can use filter, or an equivalent list comprehension:

tgt=('URL', 'List')
with open('file') as f:  
    print filter(lambda line: any(e in line for e in tgt), (line for line in f))  

Or:

with open('/tmp/file') as f:  
    print [line for line in f if any(e in line for e in tgt)]

Prints:

[' <List>Product Line</List>\n', ' <URL>http://teamspace.abb.com/sites/Product</URL>\n', ' <List>Contact Number</List>\n', ' <URL>https://teamspace.abb.com/sites/Contact</URL>\n']
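
Both variants print the list's representation rather than the lines themselves. A small sketch of the same substring filter that returns stripped lines, so they can be written one per line as in the expected output, is shown here. The name `keep_tagged_lines` is hypothetical, the in-memory `sample` list stands in for an open file, and matching on `'<List>'` and `'<URL>'` (rather than the bare words) is an assumption made to avoid keeping lines that merely mention those words.

```python
TARGETS = ('<List>', '<URL>')  # assumed tag markers, slightly stricter than the answer's


def keep_tagged_lines(lines, targets=TARGETS):
    """Return the stripped lines that contain any of the target substrings."""
    return [line.strip() for line in lines if any(t in line for t in targets)]


sample = [
    ' .....\n',
    ' <List>Product Line</List>\n',
    ' <URL>http://teamspace.abb.com/sites/Product</URL>\n',
    ' ...\n',
]
print('\n'.join(keep_tagged_lines(sample)))
```

Any iterable of lines works as input, including a file object opened with `with open(...) as f`.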