I have a large number of XML files (~3000). Each xml file contains one user's tweets, and the filename is the user id. I want to create a pandas dataframe with 3000 rows and two columns: one column is user_id, the other is the user's tweets.
I was able to extract the contents of one sample XML file and save them in a list:
# parse the xml file
import xml.etree.ElementTree as ET

mytree = ET.parse('1a6446f74e20c558a2cef325394499.xml')
myroot = mytree.getroot()
tweet_list = [element.text for element in myroot.iter('document')]
My sample xml:
<author lang="en">
<documents>
<document><![CDATA[I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive... ]]></document>
<document><![CDATA[Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal.]]></document>
<document><![CDATA[happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M ]]></document>
</documents>
</author>
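For reference, the same iter('document') extraction can be sketched against an inline copy of this sample, using ET.fromstring instead of ET.parse since the xml is given as a string here (the tweet texts are shortened stand-ins):

```python
import xml.etree.ElementTree as ET

# a shortened inline stand-in for the sample author file above
sample = """<author lang="en">
<documents>
<document><![CDATA[first tweet]]></document>
<document><![CDATA[second tweet]]></document>
</documents>
</author>"""

# CDATA sections are unwrapped by ElementTree, so .text is the raw tweet
myroot = ET.fromstring(sample)
tweet_list = [element.text for element in myroot.iter('document')]
```

tweet_list is then a plain list of tweet strings, one per document element.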
I want to apply this code to all the xml files located in one directory, and then turn the lists into rows of a dataframe.
I tried the following code to read the contents of every file, but I can't get past myroot:
import os

path = './data'
for filename in os.listdir(path):
    if not filename.endswith('.xml'):
        continue
    fullname = os.path.join(path, filename)
    # print(fullname)
    mytree = ET.parse(fullname)
    myroot = mytree.getroot()
Any tips would help.
Answer 0 (score: 1)
Use Path.rglob from the pathlib module to find all the files. Tested with three files, named test_00.xml, test_01.xml and test_02.xml, in a directory named xml.

Single dataframe of tweets - one row per tweet per user_id file:
2.94 s ± 20.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) for 4640 files with 3 tweets each.

from pathlib import Path
import xml.etree.ElementTree as ET
import pandas as pd
# path to top directory
p = Path('xml')
# find all files
files = p.rglob('*.xml')
# create dataframe
df_list = list()
for file in files:
    mytree = ET.parse(file)
    myroot = mytree.getroot()
    tweet_list = [element.text for element in myroot.iter('document')]
    df_list.append(pd.DataFrame({'user_id': file.stem, 'tweets': tweet_list}))

df = pd.concat(df_list).reset_index(drop=True)
Output 1:

user_id tweets
test_00 I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive...
test_00 Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal.
test_00 happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M
test_01 I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive...
test_01 Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal.
test_01 happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M
test_02 I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive...
test_02 Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal.
test_02 happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M
Single dataframe of tweets - one row per user_id:

p = Path('xml')
files = p.rglob('*.xml')
df_list = list()
for file in files:
    mytree = ET.parse(file)
    myroot = mytree.getroot()
    tweet_list = [[element.text for element in myroot.iter('document')]]
    df_list.append(pd.DataFrame({'user_id': file.stem, 'tweets': tweet_list}))

df = pd.concat(df_list).reset_index(drop=True)
Output 2:

user_id tweets
test_00 [I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive... , Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal., happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M ]
test_01 [I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive... , Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal., happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M ]
test_02 [I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive... , Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal., happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M ]
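If each user's tweets are wanted as a single string rather than a list, the list column in this one-row-per-user shape can be joined with Series.str.join; a minimal sketch with hypothetical data:

```python
import pandas as pd

# hypothetical rows in the one-row-per-user shape shown above
df = pd.DataFrame({'user_id': ['test_00', 'test_01'],
                   'tweets': [['first tweet', 'second tweet'], ['third tweet']]})

# join each user's list of tweets into one space-separated string
df['tweets'] = df['tweets'].str.join(' ')
```

This keeps the frame at exactly one row per user, matching the 3000 rows × two columns asked for in the question.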
Using collections.defaultdict - one row per tweet; df is the same as Output 1:
806 ms ± 14.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) for 4640 files with 3 tweets each.

from collections import defaultdict
from pathlib import Path
import xml.etree.ElementTree as ET
import pandas as pd
# path to top directory
p = Path('xml')
# find all files
files = p.rglob('*.xml')
box = defaultdict(list)
for file in files:
    root = ET.parse(file).getroot()
    for element in root.iter("document"):
        box[file.stem].append(element.text)

# get the final data into a dataframe
# use T (transpose) and stack
df = pd.DataFrame(pd.DataFrame(box).T.stack()).reset_index(level=0).reset_index(drop=True).rename(columns={'level_0': 'user_id', 0: 'tweets'})
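The chained T/stack reshaping can be seen on a tiny in-memory example, assuming two hypothetical users with two tweets each: the dict's keys become columns, T turns them into the index, and stack flattens to one row per tweet.

```python
from collections import defaultdict

import pandas as pd

# toy stand-in for the box built from the xml files
box = defaultdict(list)
box['user_a'] += ['tweet 1', 'tweet 2']
box['user_b'] += ['tweet 3', 'tweet 4']

# columns are users; T makes users the index, stack flattens to one row per tweet
df = (pd.DataFrame(pd.DataFrame(box).T.stack())
        .reset_index(level=0)
        .reset_index(drop=True)
        .rename(columns={'level_0': 'user_id', 0: 'tweets'}))
```

Note this relies on every user having the same number of tweets when pd.DataFrame(box) is built; with ragged lists the intermediate frame would fail to construct.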
Using collections.defaultdict - one row per user_id; df is the same as Output 2:
p = Path('xml')
files = p.rglob('*.xml')
box = defaultdict(list)
for file in files:
    root = ET.parse(file).getroot()
    box[file.stem].append([element.text for element in root.iter('document')])

df = pd.DataFrame(pd.DataFrame(box).T.stack()).reset_index(level=0).reset_index(drop=True).rename(columns={'level_0': 'user_id', 0: 'tweets'})
Answer 1 (score: 0)
I created a package because I had a similar use case.
pip install pandas_read_xml
Here is how you might use it. Say all the xml files are in an authors.zip file.
import pandas_read_xml as pdx
df = pdx.read_xml('authors.zip')
This xml format is not quite what I had anticipated, though, so you may want to check what it does with it.