Question

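I have a large number of XML files (~3000). Each XML file contains the tweets of a single user, and the file name is the user ID. I want to create a pandas dataframe with 3000 rows and two columns: one column is user_id and the other is the user's tweets.

I was able to extract the contents of one sample XML file and save them in a list.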

import xml.etree.ElementTree as ET

# parse the XML file
mytree = ET.parse('1a6446f74e20c558a2cef325394499.xml')
myroot = mytree.getroot()

# collect the text of every <document> element
tweet_list = [element.text for element in myroot.iter('document')]

My XML sample:

<author lang="en">
    <documents>
        <document><![CDATA[I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive... ]]></document>
        <document><![CDATA[Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal.]]></document>
        <document><![CDATA[happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M ]]></document>
    </documents>
</author>

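I want to apply this code to all the XML files located in one directory, and then turn the lists into rows of a dataframe.

I tried the following code to get the contents of the files, but I could not get past myroot: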

import os
import xml.etree.ElementTree as ET

path = './data'

for filename in os.listdir(path):
    if not filename.endswith('.xml'):
        continue
    fullname = os.path.join(path, filename)
    # print(fullname)
    mytree = ET.parse(fullname)
    myroot = mytree.getroot()

Any tips would be helpful.

Answer 1

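The following code uses Path.rglob from the pathlib module to find all the files.
It creates a single dataframe containing the tweets from all the user_id files.
As an example, your sample data is in three files named test_00.xml, test_01.xml, and test_02.xml, in a directory named xml.
Timing: 2.94 s ± 20.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) for 4640 files, each containing 3 tweets.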
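To reproduce the test layout described above, something like the following sketch could be used. It writes three copies of a small XML document into a directory named xml. The helper name make_sample_dir and the placeholder tweet text are assumptions made for illustration; the demo runs in a temporary directory so nothing is left behind.

```python
from pathlib import Path
import tempfile

# Placeholder document with the same structure as the sample in the question
SAMPLE_XML = """<author lang="en">
    <documents>
        <document><![CDATA[first tweet]]></document>
        <document><![CDATA[second tweet]]></document>
        <document><![CDATA[third tweet]]></document>
    </documents>
</author>
"""

def make_sample_dir(root, n_files=3):
    """Write n_files copies of SAMPLE_XML as test_00.xml, test_01.xml, ... under root/xml."""
    xml_dir = Path(root) / "xml"
    xml_dir.mkdir(parents=True, exist_ok=True)
    for i in range(n_files):
        (xml_dir / f"test_{i:02d}.xml").write_text(SAMPLE_XML, encoding="utf-8")
    return xml_dir

# Demo in a throwaway directory; pass your own folder as root to keep the files
with tempfile.TemporaryDirectory() as tmp:
    xml_dir = make_sample_dir(tmp)
    created = sorted(f.name for f in xml_dir.rglob("*.xml"))

print(created)
```

Because rglob searches recursively, the same pattern also picks up XML files in subdirectories of xml.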

Option 1: 1 tweet per row

from pathlib import Path
import xml.etree.ElementTree as ET
import pandas as pd

# path to top directory
p = Path('xml')

# find all files
files = p.rglob('*.xml')

# create dataframe
df_list = list()
for file in files:
    mytree=ET.parse(file)
    myroot=mytree.getroot()
    tweet_list=[element.text for element in myroot.iter('document')]
    df_list.append(pd.DataFrame({'user_id': file.stem, 'tweets': tweet_list}))

df = pd.concat(df_list).reset_index(drop=True)

Output 1

 user_id                                                                                                                 tweets
 test_00   I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive... 
 test_00  Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal.
 test_00            happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M 
 test_01   I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive... 
 test_01  Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal.
 test_01            happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M 
 test_02   I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive... 
 test_02  Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal.
 test_02            happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M

Option 2: 1 row per `user_id`

p = Path('xml')
files = p.rglob('*.xml')

df_list = list()
for file in files:
    mytree=ET.parse(file)
    myroot=mytree.getroot()
    tweet_list = [[element.text for element in myroot.iter('document')]]
    df_list.append(pd.DataFrame({'user_id': file.stem, 'tweets': tweet_list }))

df = pd.concat(df_list).reset_index(drop=True)

Output 2

 user_id                                                                                                                                                                                                                                                                                                                                                      tweets
 test_00  [I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive... , Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal., happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M ]
 test_01  [I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive... , Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal., happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M ]
 test_02  [I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive... , Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal., happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M ]
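The two output shapes can also be converted into each other after the fact, so it may not matter much which one is built first. The sketch below is not part of the original answer: it uses placeholder values rather than real tweets, `groupby(...).agg(list)` to go from one row per tweet to one row per user, and `DataFrame.explode` (available since pandas 0.25) to go back.

```python
import pandas as pd

# Long format, shaped like Output 1 (placeholder values, not real tweets)
long_df = pd.DataFrame({
    "user_id": ["test_00", "test_00", "test_01"],
    "tweets": ["a", "b", "c"],
})

# Long -> wide: collect each user's tweets into a single list per user_id
wide_df = long_df.groupby("user_id", sort=False)["tweets"].agg(list).reset_index()

# Wide -> long: expand every list element back into its own row
back_df = wide_df.explode("tweets").reset_index(drop=True)

print(wide_df)
print(back_df)
```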

Option 3: Using `collections.defaultdict` - 1 row per tweet

Contributed by sammywemmy.
The output df is the same as Output 1.
Timing: 806 ms ± 14.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) for 4640 files, each containing 3 tweets.

from collections import defaultdict
from pathlib import Path
import xml.etree.ElementTree as ET
import pandas as pd

# path to top directory
p = Path('xml')

# find all files
files = p.rglob('*.xml')

box = defaultdict(list)
for file in files:
    root = ET.parse(file).getroot()
    for element in root.iter("document"):
        box[file.stem].append(element.text)

# get the final data into a dataframe
# use T (transpose) and stack
df = pd.DataFrame(pd.DataFrame(box).T.stack()).reset_index(level=0).reset_index(drop=True).rename(columns={'level_0': 'user_id', 0: 'tweets'})
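One caveat with building the frame from a dict of lists: `pd.DataFrame(box)` raises a ValueError when the lists have different lengths. The test data has exactly 3 tweets per file, but real users probably will not. A minimal sketch that avoids this by collecting (user_id, tweet) pairs directly is shown below; the function name tweets_long and the demo users are hypothetical, and its timing was not measured.

```python
from pathlib import Path
import tempfile
import xml.etree.ElementTree as ET
import pandas as pd

def tweets_long(xml_dir):
    """Return one row per tweet for every *.xml file under xml_dir."""
    rows = []
    for file in sorted(Path(xml_dir).rglob("*.xml")):
        root = ET.parse(file).getroot()
        for element in root.iter("document"):
            rows.append((file.stem, element.text))
    # A list of tuples has no equal-length requirement across users
    return pd.DataFrame(rows, columns=["user_id", "tweets"])

# Demo with two hypothetical users holding different numbers of tweets
with tempfile.TemporaryDirectory() as tmp:
    template = "<author><documents>{}</documents></author>"
    for name, n in [("user_a", 1), ("user_b", 3)]:
        docs = "".join(f"<document><![CDATA[tweet {i}]]></document>" for i in range(n))
        (Path(tmp) / f"{name}.xml").write_text(template.format(docs), encoding="utf-8")
    df = tweets_long(tmp)

print(df)
```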

Option 4: Using `collections.defaultdict` - 1 row per `user_id`

The output df is the same as Output 2.

p = Path('xml')
files = p.rglob('*.xml')

box = defaultdict(list)
for file in files:
    root = ET.parse(file).getroot()
    box[file.stem].append([element.text for element in root.iter('document')])

df = pd.DataFrame(pd.DataFrame(box).T.stack()).reset_index(level=0).reset_index(drop=True).rename(columns={'level_0': 'user_id', 0: 'tweets'})

Answer 2

I created a package because I had a similar use case.

pip install pandas_read_xml

Here is how you might use it. Say all the XML files are in a file named authors.zip.

import pandas_read_xml as pdx

df = pdx.read_xml('authors.zip')

That said, this XML format is not quite what I had in mind, so you may need to check what it actually produces.
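If the files really are zipped, they can also be read with the standard library alone. The sketch below does not use pandas_read_xml. It assumes the archive members are named after the user ID and follow the `<document>` layout from the question; the function name tweets_from_zip and the in-memory demo archive are made up for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath
import pandas as pd

def tweets_from_zip(zip_source):
    """Return one row per tweet from every .xml member of a zip archive."""
    rows = []
    with zipfile.ZipFile(zip_source) as archive:
        for name in sorted(archive.namelist()):
            if not name.endswith(".xml"):
                continue
            # Parse straight from the archive without extracting to disk
            with archive.open(name) as fh:
                root = ET.parse(fh).getroot()
            user_id = PurePosixPath(name).stem
            rows.extend((user_id, el.text) for el in root.iter("document"))
    return pd.DataFrame(rows, columns=["user_id", "tweets"])

# Demo: a small in-memory archive standing in for authors.zip
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "user_a.xml",
        "<author><documents><document><![CDATA[hello]]></document></documents></author>",
    )
    archive.writestr(
        "user_b.xml",
        "<author><documents><document>one</document><document>two</document></documents></author>",
    )
buffer.seek(0)
zip_df = tweets_from_zip(buffer)

print(zip_df)
```

In practice the argument would be a path such as 'authors.zip' instead of the in-memory buffer.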
