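I'm working with a df that represents US regions, with the states included in the same column. Each state has [edit] next to it. All regions listed between two states belong to the state above them. I thought this should work, but for some reason it doesn't change the values in the df... Do you know what is happening here? How would you do it?
Here is the df: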
0 Alabama[edit]
1 Auburn
2 Florence
3 Jacksonville
4 Livingston
5 Montevallo
6 Troy
7 Tuscaloosa
8 Tuskegee
9 Alaska[edit]
10 Fairbanks
11 Arizona[edit]
12 Flagstaff
13 Tempe
14 Tucson
15 Arkansas[edit]
16 Arkadelphia
17 Conway
18 Fayetteville
19 Jonesboro
20 Magnolia
21 Monticello
22 Russellville
23 Searcy
24 California[edit]
25 Angwin
26 Arcata
27 Berkeley
28 Chico
29 Claremont
Here is my solution, which doesn't change the df:
df['state'] = 'replace this'
edit = '\[edit\]'
for index, row in df.iterrows():
    if edit in row['RegionName']:
        st = df.loc[index, ['RegionName']]
        df.loc[index, ['RegionName']] = None
        df.iloc[index:, 1] = st
Answer 0: (score: 1)
Assuming your column is named region (as in the code below), you can use str.extract:
df.assign(
    state=df.region.str.extract(r'(.*?)\[edit\]', expand=False).ffill()
).mask(df.region.str.endswith('[edit]')).dropna()
   region state
1 Auburn Alabama
2 Florence Alabama
3 Jacksonville Alabama
4 Livingston Alabama
5 Montevallo Alabama
6 Troy Alabama
7 Tuscaloosa Alabama
8 Tuskegee Alabama
10 Fairbanks Alaska
12 Flagstaff Arizona
13 Tempe Arizona
14 Tucson Arizona
16 Arkadelphia Arkansas
17 Conway Arkansas
18 Fayetteville Arkansas
19 Jonesboro Arkansas
20 Magnolia Arkansas
21 Monticello Arkansas
22 Russellville Arkansas
23 Searcy Arkansas
25 Angwin California
26 Arcata California
27 Berkeley California
28 Chico California
29 Claremont California
If you want to keep the state rows in the region column, just remove the mask call (and the dropna that follows it).
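To make the two variants above concrete, here is a minimal sketch on a small made-up frame. The sample rows and the variable names (`labelled`, `towns_only`, `is_header`) are assumptions for illustration, and plain boolean indexing is swapped in for `.mask(...).dropna()`, which should give the same rows here.

```python
import pandas as pd

# Hypothetical sample data laid out like the frame in the question.
df = pd.DataFrame({'region': ['Alabama[edit]', 'Auburn', 'Florence',
                              'Alaska[edit]', 'Fairbanks']})

# Capture the text before "[edit]"; rows without the flag yield NaN,
# which ffill() then replaces with the most recent state name.
state = df['region'].str.extract(r'(.*?)\[edit\]', expand=False).ffill()

# Variant 1: keep the state header rows in the region column.
labelled = df.assign(state=state)

# Variant 2: drop the header rows so that only towns remain.
is_header = df['region'].str.endswith('[edit]')
towns_only = labelled[~is_header]

print(labelled)
print(towns_only)
```

Neither variant modifies `df` itself, since `assign` returns a new frame.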
Answer 1: (score: 0)
If I understand you correctly, here is a solution that avoids an explicit loop.
# Create a new column of state names with NaN in any
# row that did not contain a state name flagged with "edit"
df['state'] = df[df['RegionName'].str.contains('edit')]['RegionName']
# Forward-fill the NaNs in the state column
df = df.ffill()
# Delete rows where RegionName == state and
# reset index to default integers
df = df[df.iloc[:, 0] != df.iloc[:, 1]].reset_index(drop=True)
# Delete "[edit]" flag from strings
df['state'] = df['state'].str.replace('[edit]', '', regex=False)
# Result:
df
RegionName state
0 Auburn Alabama
1 Florence Alabama
2 Jacksonville Alabama
3 Livingston Alabama
4 Montevallo Alabama
5 Troy Alabama
6 Tuscaloosa Alabama
7 Tuskegee Alabama
8 Fairbanks Alaska
9 Flagstaff Arizona
10 Tempe Arizona
11 Tucson Arizona
12 Arkadelphia Arkansas
13 Conway Arkansas
14 Fayetteville Arkansas
15 Jonesboro Arkansas
16 Magnolia Arkansas
17 Monticello Arkansas
18 Russellville Arkansas
19 Searcy Arkansas
20 Angwin California
21 Arcata California
22 Berkeley California
23 Chico California
24 Claremont California
Answer 2: (score: 0)
Try the following code:
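import pandas as pd
import numpy as np
df['State'] = df['RegionName']
# regex=False matches '[edit]' literally instead of as a character class
df.loc[~df['RegionName'].str.contains('[edit]', regex=False), 'State'] = np.nan
# Strip the flag, then forward-fill the state name down to its regions
df['State'] = df['State'].str.replace('[edit]', '', regex=False).ffill()
print(df)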
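A note on escaping, since it is easy to trip over in this problem: the sketch below contrasts literal and regex matching of `[edit]`. The three sample strings are made up for illustration. It also suggests why the loop in the question likely never enters its `if` branch: Python's `in` operator does a literal substring test, so a pattern containing backslashes does not match.

```python
import pandas as pd

s = pd.Series(['Alabama[edit]', 'Florence', 'Auburn'])

# Plain Python `in` is a literal substring test, so the backslashes count.
escaped_in = r'\[edit\]' in 'Alabama[edit]'
literal_in = '[edit]' in 'Alabama[edit]'

# str.contains treats the pattern as a regex by default, so '[edit]' is a
# character class that matches any single one of the letters e, d, i, t.
as_regex = s.str.contains('[edit]')

# Two ways to match the flag itself: disable regex, or escape the brackets.
as_literal = s.str.contains('[edit]', regex=False)
escaped = s.str.contains(r'\[edit\]')

# Removing the flag literally leaves the other characters untouched.
cleaned = s.str.replace('[edit]', '', regex=False)

print(escaped_in, literal_in)
print(pd.DataFrame({'as_regex': as_regex, 'as_literal': as_literal,
                    'escaped': escaped, 'cleaned': cleaned}))
```

Passing `regex=False` explicitly avoids depending on the default of `str.replace`, which has differed between pandas versions.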
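Answer 3: (score: 0)
Create a mask that identifies the state rows. Use it to create a new column holding the state names, forward-fill that column, and then select only the rows the mask excludes.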
mask = df.region.str.endswith('[edit]')
df.loc[mask, 'state'] = df.region[mask].str.replace('[edit]', '', regex=False)
df.state = df.state.ffill()
df[~mask]
# outputs:
region state
1 Auburn Alabama
2 Florence Alabama
3 Jacksonville Alabama
4 Livingston Alabama
5 Montevallo Alabama
6 Troy Alabama
7 Tuscaloosa Alabama
8 Tuskegee Alabama
10 Fairbanks Alaska
12 Flagstaff Arizona
13 Tempe Arizona
14 Tucson Arizona
16 Arkadelphia Arkansas
17 Conway Arkansas
18 Fayetteville Arkansas
19 Jonesboro Arkansas
20 Magnolia Arkansas
21 Monticello Arkansas
22 Russellville Arkansas
23 Searcy Arkansas
25 Angwin California
26 Arcata California
27 Berkeley California
28 Chico California
29 Claremont California
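If this is needed more than once, the mask approach could be wrapped in a small function. The following is only a sketch: the helper name `attach_states`, its parameters, and the sample rows are all hypothetical, and it returns a new frame rather than modifying the input.

```python
import pandas as pd

def attach_states(df, column='region', flag='[edit]'):
    """Hypothetical helper: label each region with the state header above it."""
    # True on rows that are state headers such as "Arizona[edit]".
    is_state = df[column].str.endswith(flag)
    # Keep names on header rows only (NaN elsewhere), strip the flag
    # literally, then forward-fill down to the regions below each header.
    states = (df[column].where(is_state)
                        .str.replace(flag, '', regex=False)
                        .ffill())
    # Drop the header rows; assign aligns `states` on the remaining index.
    return df.loc[~is_state].assign(state=states)

# Made-up sample data in the same layout as the question.
df = pd.DataFrame({'region': ['Arizona[edit]', 'Flagstaff', 'Tempe',
                              'Arkansas[edit]', 'Conway']})
out = attach_states(df)
print(out)
```

For the frame in the question, the call would presumably be `attach_states(df, column='RegionName')`.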