I have a data frame with a column data1 that contains information, and I want to add a column data2 that contains only the name from data1:
                                          data1               data2
0           info name: Michael Jackson New York     Michael Jackson
1  info 12 name: Michael Jordan III Los Angeles  Michael Jordan III
Do you know how I can do this?
Answer 0 (score: 0)
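Without clear delimiters this is not trivial: the names contain spaces, they vary in length (two words, three words), and the trailing field may also consist of several words separated by spaces.
Splitting the string gets you a partial solution: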
df['data2'] = df['data1'].str.split(': ').str[-1]
>>> print(df)
                                          data1                           data2
0           info name: Michael Jackson New York        Michael Jackson New York
1  info 12 name: Michael Jordan III Los Angeles  Michael Jordan III Los Angeles
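To see what the split does to a single string, here is a minimal sketch in plain Python. The helper name `after_last_delimiter` is hypothetical and not part of the answer; it assumes the name always follows the last ': ' in the string.

```python
def after_last_delimiter(text, delimiter=": "):
    """Return everything after the last occurrence of the delimiter.

    Hypothetical helper for illustration; if the delimiter is absent,
    the whole string is returned unchanged.
    """
    # rpartition splits on the LAST occurrence and returns a 3-tuple:
    # (text before, delimiter, text after).
    return text.rpartition(delimiter)[-1]


rows = [
    "info name: Michael Jackson New York",
    "info 12 name: Michael Jordan III Los Angeles",
]
tails = [after_last_delimiter(row) for row in rows]
print(tails)
# ['Michael Jackson New York', 'Michael Jordan III Los Angeles']
```

The same function could be applied to the column with `df['data1'].map(after_last_delimiter)`. As the output shows, the city is still attached, which is why this is only a partial solution.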
If you have a list of "cities", a complete solution is possible:
import re


def replace(string, substitutions):
    """Replaces multiple substrings in a string."""
    # Try longer substrings first so they win over shorter overlapping ones
    substrings = sorted(substitutions, key=len, reverse=True)
    regex = re.compile('|'.join(map(re.escape, substrings)))
    return regex.sub(lambda match: substitutions[match.group(0)], string)

# List of cities to remove from strings
cities = ['New York', 'Los Angeles']
# Dictionary matching each city with the empty string
substitutions = {city:'' for city in cities}
# Splitting to create new column as above
df['data2'] = df['data1'].str.split(': ').str[-1]
# Applying replacements to new column
df['data2'] = df['data2'].map(lambda x: replace(x, substitutions).strip())
>>> print(df)
                                          data1               data2
0           info name: Michael Jackson New York     Michael Jackson
1  info 12 name: Michael Jordan III Los Angeles  Michael Jordan III
This uses carlsmith's replace function.
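One caveat with replacing city names anywhere in the string is that a name which happens to contain a city would be altered too. A possible variation, sketched here under the assumption that the city always comes last, anchors the pattern to the end of the string. The names `build_city_pattern` and `extract_name` are hypothetical and not from the answer.

```python
import re


def build_city_pattern(cities):
    """Compile a regex matching any listed city at the end of a string."""
    # Longest first so a longer city wins over a shorter one it contains.
    ordered = sorted(cities, key=len, reverse=True)
    alternatives = "|".join(map(re.escape, ordered))
    # Require whitespace before the city and allow only whitespace after it.
    return re.compile(r"\s+(?:" + alternatives + r")\s*$")


def extract_name(text, city_pattern, delimiter=": "):
    """Return the text after the last delimiter, minus a trailing city."""
    tail = text.rpartition(delimiter)[-1]
    return city_pattern.sub("", tail).strip()


cities = ["New York", "Los Angeles"]  # same sample list as in the answer
pattern = build_city_pattern(cities)

rows = [
    "info name: Michael Jackson New York",
    "info 12 name: Michael Jordan III Los Angeles",
]
names = [extract_name(row, pattern) for row in rows]
print(names)
# ['Michael Jackson', 'Michael Jordan III']
```

On a data frame this could be applied as `df['data1'].map(lambda s: extract_name(s, pattern))`. It still depends on having a complete list of cities, so it shares the main limitation of the approach above.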