Question

我有一个这样的清单：

['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

由此我想制作如下的子列表：

id = ["32a45", "32a47", "32a48"]
date=["2017-01-01", "2017-01-05", "2017-01-07"]

我该怎么做？

感谢。

修改：这是original question 这是一个破碎的xml文件，标签搞砸了，因此我不能使用xmltree。所以我正在尝试别的东西。

Answer 1

使用re.search()函数的简单解决方案：

import re

l = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

ids, dates = [], []
for i in l:
    ids.append(re.search(r'id="([^"]+)"', i).group(1))
    dates.append(re.search(r'date="([^"]+)"', i).group(1))

print(ids)    # ['32a45', '32a47', '32a48']
print(dates)  # ['2017-01-01', '2017-01-05', '2017-01-07']

Answer 2

用ET解析：

import xml.etree.ElementTree as ET
strings = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

id_ = []
date = []
for string in strings:
    tree = ET.fromstring(string+"</text>") #corrects wrong format
    id_.append(tree.get("id"))
    date.append(tree.get("date"))

print(id_) #  ['32a45', '32a47', '32a48']
print(date) # ['2017-01-01', '2017-01-05', '2017-01-07']

更新，完整的紧凑示例：根据您在此处描述的原始问题：How can I build an sqlite table from this xml/txt file using python?

import xml.etree.ElementTree as ET
import pandas as pd

strings = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

cols = ["id","language","date","time","timezone"]
data = [[ET.fromstring(string+"</text>").get(col) for col in cols] for string in strings]    
df = pd.DataFrame(data,columns=cols)

    id  language    date    time    timezone
0   32a45   ENG     2017-01-01  11:00   Eastern
1   32a47   ENG     2017-01-05  1:00    Central
2   32a48   ENG     2017-01-07  3:00    Pacific

现在您可以使用： df.to_sql（）

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html

Answer 3

id = [i.split(' ')[1].split('=')[1].strip('"') for i in list]
date = [i.split(' ')[3].split('=')[1].strip('"') for i in list]

但是文件看起来很奇怪，如果原始文件是html或xml，有更好的方法来获取数据。

Answer 4

由于您提供的数据似乎已损坏/部分xml片段，我个人会尝试修复xml并使用xml.etree模块提取数据。但是，如果你有正确的xml，那么你可以更容易地使用xml.etree模块。

使用xml.etree的示例解决方案：

from xml.etree import ElementTree as ET

data = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">',
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
'<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

ids = []
dates = []
for element in data:
    #This wraps the element in a root tag and gives it a closing tag to
    #  repair the xml to a valid format.
    root = ET.fromstring('{}</text>'.format(element))

    #As we have formatted the xml ourselves we can guarantee that it's first
    #  child will always be the desired element.
    ids.append(root.attrib['id'])
    dates.append(root.attrib['date'])

print(ids)    # ['32a45', '32a47', '32a48']
print(dates)  # ['2017-01-01', '2017-01-05', '2017-01-07']

Answer 5

与其他更好的答案一起，您可以手动解析数据（更简单）：

for line in lines:
    id = line[line.index('"')+1:]
    line = id
    line = id[line.index('"')+1:]
    id = id[:id.index('"')]
    print('id: ' + id)

然后您只需将其推入新列表，对下面的其他值重复相同的过程，只需更改变量名称即可。

Answer 6

不像使用re的@RomanPerekhrest解决方案一样优雅，但在这里：

def extract(lst, kwd):
   out = []
   for t in lst:
       index1 = t.index(kwd) + len(kwd) + 1
       index2 = index1 + t[index1:].index('"') + 1
       index3 = index2 + t[index2:].index('"')
       out.append(t[index2:index3])
   return out

然后

>>> extract(lst, kwd='id')
['32a45', '32a47', '32a48']

Answer 7

使用re模块更容易理解：这是代码：

l = ['<text id="32a45" language="ENG" date="2017-01-01" time="11:00" timezone="Eastern">', 
'<text id="32a47" language="ENG" date="2017-01-05" time="1:00" timezone="Central">',
 '<text id="32a48" language="ENG" date="2017-01-07" time="3:00" timezone="Pacific">']

import re
id =[]
dates= []
for i in l:
    id.append(re.search(r'id="(.+?)"',i, re.M|re.I).group(1))
    dates.append(re.search(r'date="(.+?)"',item, re.M|re.I).group(1))

输出：

print id     #id= ['32a45', '32a47', '32a48']
print dates  #dates= ['2017-01-07', '2017-01-07', '2017-01-07']

如何根据python中的字符串从列表中创建子列表？

7 个答案: