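Hi, I want to iterate over the rows of a DataFrame column to find specific strings in each cell, and flag them when they are found.
DataFrame (there are thousands more rows):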
ID PRICE TEXT AFRICA ASIA EUROPE BRITAIN / IRELAND / SCOTLAND NOTHING FROM TO STAIRES_COUNT MAX_CLEARANCE_M 3 seater couch 4 seater+ couch < 60 inches TV 60 inches+ TV Washer - front loader Box / bag / misc
AAA 1000 travel to africa x 2\n\nAirport pick up included.
from A to B. there are 2 flights of staires.
Furnitures 3 seater couch x 1
4 seater+ couch x 1
< 60 inches TV x 1
60 inches+ TV x 1
Washer - front loader x 1
Box / bag / misc x 1
The maximum clearance is 1.5m.
BBB 1000 trip to asia x 2\n\nWorthwhile experience.
from C to D
The maximum clearance is 1.5m.
CCC 1000 holiday in europe x2\n\nLocal experience.
from C to D
DDD 1000 holiday in Fiji x2\n\nLocal experience.
from A to D. there are 2 flights of staires.
continents = ['africa', 'asia', 'europe', '3 seater couch','4 seater+ couch','<60 inches TV','60 inches+ TV','Washer - front loader','Box / bag / misc']
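(The list is long; this is only part of it.)
I want to find the strings from continents in the TEXT column and record how many people are going to each continent.
The other strings I want to find and record are from '...' to '...', There are '...' flights of stairs. and The maximum clearance is '...'m., where '...' marks the value I want to extract.
Desired output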
ID PRICE TEXT AFRICA ASIA EUROPE BRITAIN / IRELAND / SCOTLAND NOTHING FROM TO STAIRES_COUNT MAX_CLEARANCE_M 3 seater couch 4 seater+ couch < 60 inches TV 60 inches+ TV Washer - front loader Box / bag / misc
AAA 1000 travel to africa x 2\nasia x 2\neurope x 2\n\nAirport pick up included. 2 2 2 0 0 A B 2 1.5. 1 1 1 1 1 1
from A to B. there are 2 flights of staires. Furnitures 3 seater couch x 1
4 seater+ couch x 1
< 60 inches TV x 1
60 inches+ TV x 1
Washer - front loader x 1
Box / bag / misc x 1
The maximum clearance is 1.5m.
BBB 1000 trip to asia x 2\nBRITAIN / IRELAND / SCOTLAND x 2 \n\nWorthwhile experience.0 2 0 2 0 C D 0 0
from C to D
The maximum clearance is 1.5m.
CCC 1000 holiday in europe x2\n\nLocal experience. 0 0 2 0 0 C D 0 0
from C to D
DDD 1000 holiday in Fiji x2\n\nLocal experience. 0 0 0 0 2 A D 2 0
from A to D. there are 2 flights of stairs.
I am thinking of using two for loops to iterate over the rows:
for i in df.iterrows():
    for j in continents:
        if j in i:
But I don't know how to mark them in the DataFrame.
Answer 0 (score: 2)
Solution:
import re
import pandas as pd

continents = ['BRITAIN / IRELAND / SCOTLAND', 'africa', 'asia', 'europe',
              '3 seater couch','4 seater+ couch','60 inches TV','60 inches+ TV',
              'Washer - front loader','Box / bag / misc']
#print (df)
#https://stackoverflow.com/a/56355175/2901002
# one row per "<label> x <count>" occurrence in TEXT
df1 = df['TEXT'].str.extractall(r'(.*?)\s+x\s*(\d+)')
# keep only the labels listed in continents; anything else becomes NOTHING
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in continents)
df1[0] = df1[0].str.extract('('+ pat + ')', expand=False).fillna('NOTHING').str.upper()
df1[1] = df1[1].astype(int)
# DataFrame.sum(level=...) was removed in pandas 2.0, so group on the index level instead
df1 = df1.set_index(0, append=True)[1].unstack(fill_value=0).groupby(level=0).sum()
print (df1)
0 3 SEATER COUCH 4 SEATER+ COUCH 60 INCHES TV 60 INCHES+ TV AFRICA ASIA \
0 1 1 1 1 2 2
1 0 0 0 0 0 2
2 0 0 0 0 0 0
3 0 0 0 0 0 0
0 BOX / BAG / MISC BRITAIN / IRELAND / SCOTLAND EUROPE NOTHING \
0 1 0 2 0
1 0 2 0 0
2 0 0 2 0
3 0 0 0 2
0 WASHER - FRONT LOADER
0 1
1 0
2 0
3 0
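# add the counts to the original dataframe
df = df[['ID','PRICE','TEXT']].join(df1)
# extract the other values (raw strings avoid invalid-escape warnings)
df[['FROM','TO']] = df['TEXT'].str.extract(r'from ([A-Za-z]+)\s+to\s+([A-Za-z]+)')
# the sample data spells it both "staires" and "stairs"
df['STAIRES_COUNT'] = (df['TEXT'].str.extract(r'(\d+)\s+flights of staire?s', expand=False)
                                 .fillna('0').astype(int))
# the trailing dot is escaped so it only matches a literal "m."
df['MAX_CLEARANCE_M'] = (df['TEXT'].str.extract(r'(\d+\.\d+|\d+)m\.', expand=False)
                                   .fillna('0').astype(float))
print (df)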
ID PRICE TEXT \
0 AAA 1000 travel to africa x 2\ asia x 2\ europe x 2\ Ai...
1 BBB 1000 trip to asia x 2\ BRITAIN / IRELAND / SCOTLAND...
2 CCC 1000 holiday in europe x2\Local experience. from ...
3 DDD 1000 holiday in Fiji x2\ Local experience. from A t...
3 SEATER COUCH 4 SEATER+ COUCH 60 INCHES TV 60 INCHES+ TV AFRICA ASIA \
0 1 1 1 1 2 2
1 0 0 0 0 0 2
2 0 0 0 0 0 0
3 0 0 0 0 0 0
BOX / BAG / MISC BRITAIN / IRELAND / SCOTLAND EUROPE NOTHING \
0 1 0 2 0
1 0 2 0 0
2 0 0 2 0
3 0 0 0 2
WASHER - FRONT LOADER FROM TO STAIRES_COUNT MAX_CLEARANCE_M
0 1 A B 2 1.5
1 0 C D 0 1.5
2 0 C D 0 0.0
3 0 A D 2 0.0
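For readers who want to try the approach above without the full dataset, here is a minimal self-contained sketch. The two-row sample, the shortened keywords list and the names pairs, counts and out are assumptions made for illustration; they are not part of the original answer. It also reindexes the count columns so that a keyword that never occurs (EUROPE here) still gets a column of zeros, which the original code does not do.

```python
import re

import pandas as pd

# Hypothetical two-row sample modelled on the question's TEXT column.
df = pd.DataFrame({
    "ID": ["AAA", "DDD"],
    "TEXT": [
        "travel to africa x 2\nasia x 2\n\n"
        "from A to B. there are 2 flights of staires.\n"
        "3 seater couch x 1\nThe maximum clearance is 1.5m.",
        "holiday in Fiji x2\n\nLocal experience.\nfrom A to D.",
    ],
})
keywords = ["africa", "asia", "europe", "3 seater couch"]  # shortened list

# One row per "<label> x <count>" occurrence; "." never crosses a newline.
pairs = df["TEXT"].str.extractall(r"(.*?)\s+x\s*(\d+)")
pairs.columns = ["label", "count"]

# Reduce each label to the keyword it contains, or NOTHING if there is none.
pat = "|".join(r"\b{}\b".format(re.escape(k)) for k in keywords)
pairs["label"] = (pairs["label"].str.extract("(" + pat + ")", expand=False)
                                .fillna("NOTHING").str.upper())
pairs["count"] = pairs["count"].astype(int)

# Pivot labels into columns, then sum the matches of each original row.
counts = (pairs.set_index("label", append=True)["count"]
               .unstack(fill_value=0)
               .groupby(level=0).sum())
# Guarantee one column per keyword, even if it never appears in the data.
counts = counts.reindex(columns=[k.upper() for k in keywords] + ["NOTHING"],
                        fill_value=0)

out = df.join(counts)
out[["FROM", "TO"]] = out["TEXT"].str.extract(r"from ([A-Za-z]+)\s+to\s+([A-Za-z]+)")
out["STAIRES_COUNT"] = (out["TEXT"].str.extract(r"(\d+)\s+flights of staire?s", expand=False)
                                   .fillna("0").astype(int))
out["MAX_CLEARANCE_M"] = (out["TEXT"].str.extract(r"(\d+(?:\.\d+)?)m\.", expand=False)
                                     .fillna("0").astype(float))
print(out.drop(columns="TEXT"))
```

Note that a row whose TEXT contains no "x <number>" at all would be missing from counts and would show NaN after the join; calling fillna(0) on the count columns is one way to handle that.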
Answer 1 (score: 0)
Data
ID PRICE TEXT AFRICA \
0 AAA 1000 travel to africa x 2 asia x 5 Airport pick up ... 0
1 BBB 1000 trip to asia x 1 Worthwhile experience. 0
2 CCC 1000 holiday in europe x2 Local experience. 0
ASIA EUROPE NOTHING
0 0 0 0
1 0 0 0
2 0 0 0
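import re
cols = df.columns[3:6]  # AFRICA, ASIA, EUROPE
# for each row, find the number that follows "<continent> x"
data = df.TEXT.apply(lambda x: [re.findall(r'(?<=' + a + r'\sx)\s?\d+', x) for a in cols.str.lower()])
for i in df.index:
    df.loc[i, cols] = [int(j[0]) if j else 0 for j in data.loc[i]]
Output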
ID PRICE TEXT AFRICA \
0 AAA 1000 travel to africa x 2 asia x 5 Airport pick up ... 2
1 BBB 1000 trip to asia x 1 Worthwhile experience. 0
2 CCC 1000 holiday in europe x2 Local experience. 0
ASIA EUROPE NOTHING
0 5 0 0
1 1 0 0
2 0 2 0
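The same counts can probably be obtained without the explicit row loop by running one vectorised str.extract per column. The sketch below is an alternative under stated assumptions, not the answerer's method: the three sample rows are hypothetical and only modelled on the data shown above, and only the first occurrence of each continent in a row is counted.

```python
import re

import pandas as pd

# Hypothetical sample rows modelled on the answer's data.
df = pd.DataFrame({
    "ID": ["AAA", "BBB", "CCC"],
    "TEXT": [
        "travel to africa x 2 asia x 5 Airport pick up",
        "trip to asia x 1 Worthwhile experience.",
        "holiday in europe x2 Local experience.",
    ],
})

for col in ["AFRICA", "ASIA", "EUROPE"]:
    # Capture the number that follows "<continent> x"; rows without it become 0.
    pattern = rf"\b{re.escape(col.lower())}\s+x\s*(\d+)"
    df[col] = df["TEXT"].str.extract(pattern, expand=False).fillna("0").astype(int)

print(df.drop(columns="TEXT"))
```

Each new column is already an integer column, so no later conversion is needed before summing or comparing the counts.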