I have a list like this:
['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']
From it, I want to build sub-lists like these:
id = ["32a45", "32a47", "32a48"]
date=["2017-01-01", "2017-01-05", "2017-01-07"]
How can I do this?
Thanks.
Edit: here is the original question. It is a broken XML file with messed-up tags, so I can't use xmltree. That is why I'm trying something else.
Answer 0 (score: 5)
A simple solution using the re.search() function:
import re
l = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']
ids, dates = [], []
for i in l:
    ids.append(re.search(r'id="([^"]+)"', i).group(1))
    dates.append(re.search(r'date="([^"]+)"', i).group(1))
print(ids) # ['32a45', '32a47', '32a48']
print(dates) # ['2017-01-01', '2017-01-05', '2017-01-07']
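If more attributes are needed later, the same regex idea can be generalized to collect every attribute at once. The following is only a minimal sketch, not part of the original answer: it assumes every attribute has the form name="value" with double quotes and no escaped quotes, and the names ATTR_RE and parse_attrs are made up for illustration.

```python
import re

# Assumption: attributes always look like name="value" (double quotes, no escaping).
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_attrs(tag):
    """Return a dict of all attributes found in one tag string (hypothetical helper)."""
    return dict(ATTR_RE.findall(tag))


l = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
     '<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
     '<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

# One dict per tag, then pick out the columns of interest.
rows = [parse_attrs(tag) for tag in l]
ids = [row["id"] for row in rows]
dates = [row["date"] for row in rows]
print(ids)
print(dates)
```

Building one dict per tag means adding another column later is a single list comprehension rather than another regex.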
Answer 1 (score: 1)
Parsing with ET:
import xml.etree.ElementTree as ET
strings = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']
id_ = []
date = []
for string in strings:
    tree = ET.fromstring(string+"</text>")  # corrects wrong format
    id_.append(tree.get("id"))
    date.append(tree.get("date"))
print(id_) # ['32a45', '32a47', '32a48']
print(date) # ['2017-01-01', '2017-01-05', '2017-01-07']
Update, a complete compact example. It is based on the original question you describe here: How can I build an sqlite table from this xml/txt file using python?
import xml.etree.ElementTree as ET
import pandas as pd
strings = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']
cols = ["id","language","date","time","timezone"]
data = [[ET.fromstring(string+"</text>").get(col) for col in cols] for string in strings]
df = pd.DataFrame(data,columns=cols)
      id language        date   time timezone
0  32a45      ENG  2017-01-01  11:00  Eastern
1  32a47      ENG  2017-01-05   1:00  Central
2  32a48      ENG  2017-01-07   3:00  Pacific
Now you can use: df.to_sql()
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html
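For readers who would rather not depend on pandas, the same rows can be written to SQLite with the standard-library sqlite3 module. This is a sketch of an alternative technique, not the df.to_sql() route above; the in-memory database and the table name texts are assumptions chosen purely for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

strings = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
           '<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
           '<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']
cols = ["id", "language", "date", "time", "timezone"]

# Append the missing closing tag so each fragment parses, then read each attribute.
data = [[ET.fromstring(s + "</text>").get(col) for col in cols] for s in strings]

# Assumption: in-memory database and a table called "texts" (both illustrative).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE texts ("id" TEXT, "language" TEXT, "date" TEXT, '
             '"time" TEXT, "timezone" TEXT)')
# Parameterized inserts: one row per parsed tag.
conn.executemany("INSERT INTO texts VALUES (?, ?, ?, ?, ?)", data)
conn.commit()

rows = conn.execute('SELECT "id", "date" FROM texts ORDER BY "id"').fetchall()
print(rows)
conn.close()
```

Replacing ":memory:" with a file path would persist the table to disk.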
Answer 2 (score: 0)
# `list` is the question's list of tag strings (note that the name shadows the built-in).
id = [i.split(' ')[1].split('=')[1].strip('"') for i in list]
date = [i.split(' ')[3].split('=')[1].strip('"') for i in list]
The file looks odd, though. If the original file is HTML or XML, there are better ways to get the data.
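One such approach, sketched here under assumptions, is the standard-library html.parser module, which reports start tags even when they are never closed. The class name TextTagCollector is made up for illustration, and the sketch assumes the tags of interest are all named text.

```python
from html.parser import HTMLParser


class TextTagCollector(HTMLParser):
    """Collects the id and date attributes of every <text> start tag (illustrative class)."""

    def __init__(self):
        super().__init__()
        self.ids = []
        self.dates = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs with the quotes removed.
        if tag == "text":
            attr_map = dict(attrs)
            self.ids.append(attr_map.get("id"))
            self.dates.append(attr_map.get("date"))


tags = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
        '<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
        '<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

parser = TextTagCollector()
for t in tags:
    parser.feed(t)  # missing closing tags do not raise an error
parser.close()
print(parser.ids)
print(parser.dates)
```

Because the parser is lenient, the whole damaged file could probably be fed in one call instead of line by line, though that has not been checked against the real file.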
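Answer 3 (score: 0)
Since the data you provided looks like broken or partial XML fragments, I would personally try to repair the XML and extract the data with the xml.etree module. If you had well-formed XML, using the xml.etree module would be even easier.
An example solution using xml.etree: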
from xml.etree import ElementTree as ET
data = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']
ids = []
dates = []
for element in data:
    # Append the missing closing tag to repair the fragment into
    # well-formed XML.
    root = ET.fromstring('{}</text>'.format(element))
    # Because we closed the tag ourselves, the parsed root is always
    # the desired <text> element.
    ids.append(root.attrib['id'])
    dates.append(root.attrib['date'])
print(ids) # ['32a45', '32a47', '32a48']
print(dates) # ['2017-01-01', '2017-01-05', '2017-01-07']
Answer 4 (score: 0)
Alongside the other, better answers, you can also parse the data by hand (simpler):
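for line in lines:
    # Drop everything up to and including the first double quote.
    id = line[line.index('"')+1:]
    line = id
    # Keep the remainder after the closing quote for the next attribute.
    line = id[line.index('"')+1:]
    # The value is whatever precedes the closing quote.
    id = id[:id.index('"')]
    print('id: ' + id)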
Then push it onto a new list and repeat the same process for the values that follow, changing only the variable names.
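The repetition can be avoided by wrapping the slicing in a small function. This is a sketch rather than the answer's own code: the helper name attr_value is hypothetical, it uses str.partition instead of index arithmetic, and it assumes each attribute is preceded by a space and written as name="value".

```python
def attr_value(line, name):
    """Return the value of name="..." in line, or None if absent (hypothetical helper)."""
    # The leading space keeps "id" from matching inside a longer attribute name.
    _, found, rest = line.partition(' ' + name + '="')
    if not found:
        return None
    # Everything before the next double quote is the attribute value.
    value, _, _ = rest.partition('"')
    return value


lines = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
         '<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
         '<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

ids = [attr_value(line, "id") for line in lines]
dates = [attr_value(line, "date") for line in lines]
print(ids)
print(dates)
```

Including the ="  suffix in the search string also means that asking for time does not accidentally pick up timezone.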
Answer 5 (score: 0)
Not as elegant as @RomanPerekhrest's solution using re, but here it is:
def extract(lst, kwd):
    out = []
    for t in lst:
        index1 = t.index(kwd) + len(kwd) + 1
        index2 = index1 + t[index1:].index('"') + 1
        index3 = index2 + t[index2:].index('"')
        out.append(t[index2:index3])
    return out
Then:
>>> extract(lst, kwd='id')
['32a45', '32a47', '32a48']
Answer 6 (score: 0)
It is easier to understand with the re module.
Here is the code:
l = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']
import re
id =[]
dates= []
for i in l:
    id.append(re.search(r'id="(.+?)"', i, re.M|re.I).group(1))
    dates.append(re.search(r'date="(.+?)"', i, re.M|re.I).group(1))
Output:
print(id)  # ['32a45', '32a47', '32a48']
print(dates)  # ['2017-01-01', '2017-01-05', '2017-01-07']