Question

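I am trying to fetch data from a web service and then use some of it. Sorry for not pasting it here, but it is a very long XML document. So far I have tried to fetch the data like this: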

from urllib.request import urlopen
url = "http://degra.wi.pb.edu.pl/rozklady/webservices.php?"
s = urlopen(url)
content = s.read()

The output of print(content) looks fine. Now I want to extract data from it, for example from elements like this one:

<tabela_rozklad data-aktualizacji="1480583567">
<DZIEN>2</DZIEN>
<GODZ>3</GODZ>
<ILOSC>2</ILOSC>
<TYG>0</TYG>
<ID_NAUCZ>66</ID_NAUCZ>
<ID_SALA>79</ID_SALA>
<ID_PRZ>104</ID_PRZ>
<RODZ>W</RODZ>
<GRUPA>1</GRUPA>
<ID_ST>13</ID_ST>
<SEM>1</SEM>
<ID_SPEC>0</ID_SPEC>
</tabela_rozklad>

How can I process this data so that it is easy to work with?

Answer 1

You can use Beautiful Soup and pick out the tags you want. The code below should help you get started!

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "http://degra.wi.pb.edu.pl/rozklady/webservices.php?"

# fetch the raw response body
response = requests.get(url).content
# html.parser lowercases tag names, hence the lowercase tags below
soup = BeautifulSoup(response, "html.parser")

# find each tabela_rozklad
tables = soup.find_all('tabela_rozklad')

# each tabela_rozklad appears to contain these 12 nested tags
tags = ['dzien', 'godz', 'ilosc', 'tyg', 'id_naucz', 'id_sala',
    'id_prz', 'rodz', 'grupa', 'id_st', 'sem', 'id_spec']

# extract the text of every tag from each tabela_rozklad, one row per element
rows = [[table.find(tag).text for tag in tags] for table in tables]

# build the dataframe once, using the tags as column names
df = pd.DataFrame(rows, columns=tags)

# display first 5 rows of table
df.head()

# and the shape of the data
df.shape # 665 rows, 12 columns

# and now you can get to the information using traditional pandas functionality

# for instance, count observations by rodz
df.groupby('rodz').count()

# or subset only observations where rodz = J
J = df[df.rodz == 'J']
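
As an alternative to the approach above, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree, assuming the response is well-formed XML. It is not part of the original answer. The sample record is copied from the question; the enclosing <root> element and the function name parse_schedule are hypothetical, since the full structure of the real response is not shown. Note that ElementTree keeps the original uppercase tag names.

```python
import xml.etree.ElementTree as ET

# One record copied from the question. The <root> wrapper is an assumption
# about how records are enclosed in the real response.
SAMPLE_XML = """<root>
<tabela_rozklad data-aktualizacji="1480583567">
<DZIEN>2</DZIEN>
<GODZ>3</GODZ>
<ILOSC>2</ILOSC>
<TYG>0</TYG>
<ID_NAUCZ>66</ID_NAUCZ>
<ID_SALA>79</ID_SALA>
<ID_PRZ>104</ID_PRZ>
<RODZ>W</RODZ>
<GRUPA>1</GRUPA>
<ID_ST>13</ID_ST>
<SEM>1</SEM>
<ID_SPEC>0</ID_SPEC>
</tabela_rozklad>
</root>"""


def parse_schedule(xml_text):
    """Return one dict per tabela_rozklad element, keyed by child tag name."""
    root = ET.fromstring(xml_text)
    records = []
    for entry in root.iter("tabela_rozklad"):
        # Map each child tag (e.g. "DZIEN") to its text content.
        record = {child.tag: child.text for child in entry}
        # Keep the element's attribute alongside the child values.
        record["data-aktualizacji"] = entry.get("data-aktualizacji")
        records.append(record)
    return records


records = parse_schedule(SAMPLE_XML)
print(records[0]["RODZ"])
```

With the real service, the bytes returned by urlopen(url).read() could be passed to parse_schedule in place of SAMPLE_XML, and the resulting list of dicts could be handed to pd.DataFrame(records) if a dataframe is preferred. All values are strings at this point, so numeric columns would still need converting.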

Get XML from a web service?
