我找到了parse
html
与pandas
的绝佳方式。我的数据是一种奇怪的格式(附在下面)。我想将这些数据拆分为2个单独的dataframes
。
注意每个cell
如何用,
分隔...... 是否有任何真正有效的方法来拆分所有这些单元格并创建2个数据框架,一个用于标签,一个用于标签括号中的( value )
NumPy
包含所有ufuncs
,我可以在string
dtypes
上使用它们,因为它们可以np.array
转换为DF.as_matrix()
1}?我试图避开for loops
,我可以遍历所有索引并填充一个空数组,但这非常野蛮。
我正在使用Beaker Notebook
btw,这真的很酷(强烈推荐)
#Set URL Destination
url = "http://www.reef.org/print/db/stats"
#Process raw table
DF_raw = pd.pandas.read_html(url)[0]
#Get start/end indices of table
start_label = "10 Most Frequent Species"; start_idx = (DF_raw.iloc[:,0] == start_label).argmax()
end_label = "Top 10 Sites for Species Richness"; end_idx = (DF_raw.iloc[:,0] == end_label).argmax()
#Process table
DF_freqSpecies = pd.DataFrame(
DF_raw.as_matrix()[(start_idx + 1):end_idx,:],
columns = DF_raw.iloc[0,:]
)
DF_freqSpecies
#Split these into 2 separate DataFrames
这是我天真的做法:
import re
DF_species = pd.DataFrame(np.zeros_like(DF_freqSpecies),columns=DF_freqSpecies.columns)
DF_freq = pd.DataFrame(np.zeros_like(DF_freqSpecies).astype(str),columns=DF_freqSpecies.columns)
dims = DF_freqSpecies.shape
for i in range(dims[0]):
for j in range(dims[1]):
#Parse current dataframe
species, freq = re.split("\s\(\d",DF_freqSpecies.iloc[i,j])
freq = float(freq[:-1])
#Populate split DataFrames
DF_species.iloc[i,j] = species
DF_freq.iloc[i,j] = freq
我想要这两个数据帧作为我的输出:
答案 0 :(得分:2)
你可以这样做:
DF1:
In [182]: df1 = DF_freqSpecies.replace(r'\s*\(\d+\.*\d*\)', '', regex=True)
In [183]: df1.head()
Out[183]:
0 Tropical Western Atlantic California, Pacific Northwest and Alaska \
0 Bluehead Copper Rockfish
1 Blue Tang Lingcod
2 Stoplight Parrotfish Painted Greenling
3 Bicolor Damselfish Sunflower Star
4 French Grunt Plumose Anemone
0 Hawaii Tropical Eastern Pacific \
0 Saddle Wrasse King Angelfish
1 Hawaiian Whitespotted Toby Mexican Hogfish
2 Raccoon Butterflyfish Barberfish
3 Manybar Goatfish Flag Cabrilla
4 Moorish Idol Panamic Sergeant Major
0 South Pacific Northeast US and Eastern Canada \
0 Regal Angelfish Cunner
1 Bluestreak Cleaner Wrasse Winter Flounder
2 Manybar Goatfish Rock Gunnel
3 Brushtail Tang Pollock
4 Two-spined Angelfish Grubby Sculpin
0 South Atlantic States Central Indo-Pacific
0 Slippery Dick Moorish Idol
1 Belted Sandfish Three-spot Dascyllus
2 Black Sea Bass Bluestreak Cleaner Wrasse
3 Tomtate Blacklip Butterflyfish
4 Cubbyu Clark's Anemonefish
和DF2
In [193]: df2 = DF_freqSpecies.replace(r'.*\((\d+\.*\d*)\).*', r'\1', regex=True)
In [194]: df2.head()
Out[194]:
0 Tropical Western Atlantic California, Pacific Northwest and Alaska Hawaii \
0 85 54.6 92
1 84.8 53.2 85.8
2 81 50.8 85.7
3 79.9 50.2 85.7
4 74.8 49.7 82.9
0 Tropical Eastern Pacific South Pacific Northeast US and Eastern Canada \
0 85.7 79 67.4
1 82.5 77.3 46.6
2 75.2 73.9 26.2
3 68.9 73.3 25.2
4 67.9 72.8 23.7
0 South Atlantic States Central Indo-Pacific
0 79.7 80.1
1 78.5 75.6
2 78.5 73.5
3 72.7 71.4
4 65.7 70.2
RegEx debugging and explanation:
我们基本上想删除所有内容,但括号中的数字除外:
(\d+\.*\d*)
- group(1) - 这是我们的号码
\((\d+\.*\d*)\)
- 括号中的数字
.*\((\d+\.*\d*)\).*
- 整个事情 - 在'(','(',我们的数字,')'之前的任何事情,直到细胞结束的任何事情
它将被替换为组(1) - 我们的号码