I have a dataframe df, for example:
df['user_location'].value_counts()
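India 3741
United States 2455
New Delhi, India 1721
Mumbai, India 1401
Washington, DC 1354
...
SpaceCoast,Florida 1
stuck in a book. 1
Beirut , Lebanon 1
Royston Vasey - Tralfamadore 1
Langham, Colchester 1
Name: user_location, Length: 26920, dtype: int64
I want to find the frequency of specific countries, such as USA and India, in the user_location column. Then I want to plot the frequencies as USA, India, and Others.
So I want to apply some operation to that column so that the output of value_counts() contains only those three labels: USA, India, and Others.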
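It seems I should merge the frequencies of the rows that contain the same country name and lump the rest together. But that looks complicated once city names, state names, and so on are involved. What is the most efficient way to do it?
Answer 0: (score: 1)
Adding to @Trenton_McKinney's answer in the comments: if you need to map the states/provinces of different countries to the country name, you will have to do some work to build those associations. For example, for India and the USA, you can take their lists of states from Wikipedia and map them onto your own data to relabel each state with its country name, as follows: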
import pandas as pd
# Get states of India and USA (the table indices depend on the layout of the Wikipedia pages)
in_url = 'https://en.wikipedia.org/wiki/States_and_union_territories_of_India#States_and_Union_territories'
in_states = pd.read_html(in_url)[3].iloc[:, 0].tolist()
us_url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
us_states = pd.read_html(us_url)[0].iloc[:, 0].tolist()
states = in_states + us_states
# Make a sample dataframe
df = pd.DataFrame({'Country': states})
Country
0 Andhra Pradesh
1 Arunachal Pradesh
2 Assam
3 Bihar
4 Chhattisgarh
... ...
73 Virginia[E]
74 Washington
75 West Virginia
76 Wisconsin
77 Wyoming
Map the state names to the country name:
# Map state names to country name
states_dict = {state: 'India' for state in in_states}
states_dict.update({state: 'USA' for state in us_states})
df['Country'] = df['Country'].map(states_dict)
Country
0 India
1 India
2 India
3 India
4 India
... ...
73 USA
74 USA
75 USA
76 USA
77 USA
However, judging from your data sample, you will also need to handle a lot of edge cases.
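A few of those edge cases can be sketched as follows. This is a minimal illustration, not a complete solution: the short in_states and us_states lists are hard-coded stand-ins for the scraped tables, and clean_name() and to_country() are hypothetical helpers introduced here. The sketch strips Wikipedia footnote markers such as the [E] in Virginia[E], looks up each comma-separated part of a location, and labels everything unmatched as Others.

```python
import re
import pandas as pd

# Assumption: small hard-coded stand-ins for the scraped state lists
in_states = ['Assam', 'Bihar', 'Maharashtra']
us_states = ['Florida', 'Virginia[E]', 'Washington']

def clean_name(name):
    """Drop footnote markers such as '[E]' and normalise case and spacing."""
    return re.sub(r'\[.*?\]', '', name).strip().lower()

# Lookup table from cleaned place name to country label
lookup = {clean_name(s): 'India' for s in in_states}
lookup.update({clean_name(s): 'USA' for s in us_states})
lookup.update({'india': 'India', 'usa': 'USA', 'united states': 'USA'})

def to_country(location):
    """Return the country of the first comma-separated part found in the lookup."""
    for part in location.split(','):
        country = lookup.get(clean_name(part))
        if country is not None:
            return country
    return 'Others'

sample = pd.Series(['Mumbai, India', 'Virginia', 'SpaceCoast,Florida',
                    'Beirut , Lebanon', 'United States', 'stuck in a book.'])
counts = sample.apply(to_country).value_counts()
print(counts)
```

Exact lookups like this avoid accidental substring matches, at the cost of missing any spelling that is not in the table.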
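Answer 1: (score: 0)
First, using the idea from the previous answer, I tried to collect all the locations, including cities, union territories, states, districts, and territories. Then I wrote a function, checkl(), that checks whether a location belongs to India or the USA and converts it to the country name. Finally, the function is applied to the dataframe column df['user_location']: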
# Try to collect all the locations of the USA and India
import pandas as pd
us_url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
us_tables = pd.read_html(us_url)  # download and parse the page once
us_states = us_tables[0].iloc[:, 0].tolist()
us_cities = us_tables[0].iloc[:, 1].tolist() + us_tables[0].iloc[:, 2].tolist() + us_tables[0].iloc[:, 3].tolist()
us_Federal_district = us_tables[1].iloc[:, 0].tolist()
us_Inhabited_territories = us_tables[2].iloc[:, 0].tolist()
us_Uninhabited_territories = us_tables[3].iloc[:, 0].tolist()
us_Disputed_territories = us_tables[4].iloc[:, 0].tolist()
us = us_states + us_cities + us_Federal_district + us_Inhabited_territories + us_Uninhabited_territories + us_Disputed_territories
in_url = 'https://en.wikipedia.org/wiki/States_and_union_territories_of_India#States_and_Union_territories'
in_tables = pd.read_html(in_url)  # download and parse the page once
in_states = in_tables[3].iloc[:, 0].tolist() + in_tables[3].iloc[:, 4].tolist() + in_tables[3].iloc[:, 5].tolist()
in_unions = in_tables[4].iloc[:, 0].tolist()
ind = in_states + in_unions
# Space-joined strings of all known place names, used for substring checks
usToStr = ' '.join([str(elem) for elem in us])
indToStr = ' '.join([str(elem) for elem in ind])
# Country name checker function
def checkl(T):
    # Lower-cased pieces of the location, split on whitespace and on commas
    TSplit_space = [x.lower().strip() for x in T.split()]
    TSplit_comma = [x.lower().strip() for x in T.split(',')]
    TSplit = list(set().union(TSplit_space, TSplit_comma))
    # Known place names that occur inside the location (non-string cells such as NaN are skipped)
    res_ind = [ele for ele in ind if isinstance(ele, str) and ele in T]
    res_us = [ele for ele in us if isinstance(ele, str) and ele in T]
    if 'india' in TSplit or 'hindustan' in TSplit or 'bharat' in TSplit or T.lower() in indToStr.lower() or res_ind:
        T = 'India'
    elif 'US' in T or 'USA' in T or 'United States' in T or 'usa' in TSplit or 'united state' in TSplit or T.lower() in usToStr.lower() or res_us:
        T = 'USA'
    elif len(T.split(',')) > 1:
        if T.split(',')[0] in indToStr or T.split(',')[1] in indToStr:
            T = 'India'
        elif T.split(',')[0] in usToStr or T.split(',')[1] in usToStr:
            T = 'USA'
        else:
            T = "Others"
    else:
        T = "Others"
    return T
# Apply the function to the dataframe column
print(df['user_location'].dropna().apply(checkl).value_counts())
Others 74206
USA 47840
India 20291
Name: user_location, dtype: int64
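I am quite new to Python. I think this code could be written in a better, more compact form. As mentioned in the previous answer, there are still many edge cases to handle, so I have also posted it on Code Review Stack Exchange. Any criticism or suggestions for improving the efficiency and readability of my code would be greatly appreciated.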
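One possible more compact form of this idea is sketched below. It is an illustration under assumptions, not a drop-in replacement for checkl(): the keyword lists are short hard-coded examples rather than the scraped Wikipedia tables, and make_pattern() and classify() are hypothetical helpers. It matches whole words with a regular expression, so that US does not match inside Russia or Australia, and it uses the vectorized Series.str.contains together with numpy.select instead of a row-by-row apply.

```python
import re
import numpy as np
import pandas as pd

# Assumption: short illustrative keyword lists, not the full scraped tables
india_terms = ['India', 'Hindustan', 'Bharat', 'New Delhi', 'Mumbai']
usa_terms = ['USA', 'US', 'United States', 'Florida', 'Washington']

def make_pattern(terms):
    """Build a regex that matches any of the terms as a whole word or phrase."""
    return r'\b(?:' + '|'.join(re.escape(t) for t in terms) + r')\b'

def classify(locations):
    """Label each non-missing location as India, USA, or Others."""
    locations = locations.dropna()
    is_india = locations.str.contains(make_pattern(india_terms), case=False)
    is_usa = locations.str.contains(make_pattern(usa_terms), case=False)
    # India is checked first, mirroring the order of the if/elif chain in checkl()
    labels = np.select(
        [is_india.to_numpy(dtype=bool), is_usa.to_numpy(dtype=bool)],
        ['India', 'USA'],
        default='Others',
    )
    return pd.Series(labels, index=locations.index, name=locations.name)

sample = pd.Series(['New Delhi, India', 'Washington, DC', 'Russia',
                    'SpaceCoast,Florida', None, 'stuck in a book.'],
                   name='user_location')
counts = classify(sample).value_counts()
print(counts)
```

Because the match is case-insensitive, the term US also matches the ordinary word "us" in free-text locations, so that term may need a case-sensitive pattern of its own.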