I have rows of data like the ones below (though of course far more than in this example). The data can appear in a different order.
df = pd.DataFrame({'SmVariant': ['1xFBBC', float('nan'), '2xFBBA', '5xABIA', \
'2xFBBC, 1xFBBA', '1xFBBA', '4xABIA', \
'1xFBBA, 1xFBBC', float('nan'), '1xFBBA', \
'3xFBBA, 1xFBBC']})
I want to split it into numeric columns like this (to be summed in the end):
    FBBA  FBBC  ABIA
0            1
1
2      2
3                  5
4      1     2
5      1
6                  4
7      1     1
8
9      1
10     3     1
Answer 0: (score: 0)
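I assume you mean a pandas DataFrame. I also assume that you know the distinct element types in advance and can put them in a dictionary like this (mapping each element to its final column position):
cols = {'AAAA': 0, 'BBBB': 1, 'CCCC': 2}
Next, write a function that converts a single cell into multiple columns:
def expand_element(el):
    res = [0] * len(cols)                      # one slot per known element type
    for item in el.split(','):
        q, name = item.strip().split('x')      # '2xBBBB' -> ('2', 'BBBB')
        res[cols[name]] = int(q)
    return res
Finally, apply the function to every row of the DataFrame, as follows:
df.apply(lambda x: expand_element(x.iloc[0]), axis=1, result_type='expand')  # x.iloc[0] is the row's first (only) cell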
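The dictionary in this answer uses the element names from an earlier version of the question, and the function assumes that every cell is a string. As a minimal sketch (not the answerer's own code), the same idea adapted to the current sample data could look like this. It assumes the names FBBA, FBBC, and ABIA are known in advance and that a missing cell should produce all zeros:

```python
import pandas as pd

# Assumption: the element names are known up front (taken from the question's sample).
cols = {'FBBA': 0, 'FBBC': 1, 'ABIA': 2}

def expand_element(el):
    """Turn a cell such as '2xFBBC, 1xFBBA' into a list of per-column counts."""
    res = [0] * len(cols)
    if pd.isna(el):                       # assumption: a missing cell counts as all zeros
        return res
    for item in el.split(','):
        q, name = item.strip().split('x', 1)
        res[cols[name]] = int(q)
    return res

df = pd.DataFrame({'SmVariant': ['1xFBBC', float('nan'), '2xFBBA', '5xABIA',
                                 '2xFBBC, 1xFBBA', '1xFBBA', '4xABIA',
                                 '1xFBBA, 1xFBBC', float('nan'), '1xFBBA',
                                 '3xFBBA, 1xFBBC']})

# Build the numeric columns from the expanded lists, keeping the original index.
counts = pd.DataFrame(df['SmVariant'].map(expand_element).tolist(),
                      columns=list(cols), index=df.index)
print(counts)
```

An unknown element name raises a KeyError here, which may or may not be what you want for real data.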
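Answer 1: (score: 0)
You can do this in a single line with regex and pandas method chaining, as shown below. I have split it across multiple lines for readability. See Section C below for more details.
Note: Sections A and B use the data from Section D, which the OP shared earlier. The data in the question was changed later, and Section C gives the solution for that use case.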
# without alphabetically ordering the columns
# (COLUMN_NAME and pat are defined in Section C)
(df[COLUMN_NAME]                  ## access the "data"-column
 .fillna('0xUNKN')                ## replace nan values with 0xUNKN
 .str.findall(pat)                ## use regex to extract (count, name) pairs
 .apply(lambda x: {k: int(v) for v, k in x if int(v) != 0})  ## row-wise create dict to construct final {column: count} structure
 .apply(pd.Series)                ## use dict to create columns
 .fillna(0)                       ## replace NaN values with 0
 .astype(int)                     ## keep the counts as integers so they can be summed
)
Regex examples
To explain how the regex pattern works, see the examples in the sections below.
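Section A
Here I explain what each operation does, and at the end I reorder the columns alphabetically.
Regex explanation: example-1
See in detail how the regex (\d+)x(\w+)\s*,\s*(\d+)x(\w+) extracts the expected parts from the input text: example-1.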
# NOTE: I am using the dataframe that I created in
#       the Dummy Data section "below"
df2 = (df.data                                               # access the "data"-column
       .str.findall(r'(\d+)x(\w+)\s*,\s*(\d+)x(\w+)')        # use regex to extract patterns (raw string keeps the backslashes intact)
       .explode()                                            # unpack each row's one-element list into its 4-tuple
       .apply(lambda x: {x[1]: int(x[0]), x[3]: int(x[2])})  # row-wise create dict to construct final {column: count} structure
       .apply(pd.Series)                                     # expand each dict into columns
       .fillna(0)                                            # replace NaN values with 0
       .astype(int)                                          # keep the counts as integers so they can be summed
      )
df2 = df2.reindex(sorted(df2.columns), axis=1)               # alphabetically reorder columns
print(df2)
Output:
   AAAA  BBBB  CCCC
0     1     1     0
1     1     2     0
2     1     0     1
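To see concretely what the two-pair pattern from example-1 captures, here is a minimal sketch using only Python's standard re module. The sample strings come from the dummy data, and the variable names are illustrative:

```python
import re

# Pattern from example-1: exactly two "<count>x<name>" pairs separated by a comma.
pat = r'(\d+)x(\w+)\s*,\s*(\d+)x(\w+)'

matches = re.findall(pat, '1xAAAA,2xBBBB')
print(matches)  # [('1', 'AAAA', '2', 'BBBB')]

# Build the {name: count} dict the same way the lambda in the chain does.
count1, name1, count2, name2 = matches[0]
row = {name1: int(count1), name2: int(count2)}
print(row)      # {'AAAA': 1, 'BBBB': 2}

# Limitations of this pattern: a single pair does not match at all,
# and a third pair is silently dropped.
single = re.findall(pat, '1xAAAA')
three = re.findall(pat, '1xAAAA,2xBBBB,3xDDDD')
print(single)   # []
print(three)    # [('1', 'AAAA', '2', 'BBBB')]
```

The last two calls show why this pattern only suits rows with exactly two types.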
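Section B
If a row has more than two types (e.g. AAAA, BBBB, CCCC), the following solution applies in that case as well.
Regex explanation: example-2
See in detail how the regex (?:\s*(\d+)x(\w+)\s*)+ extracts the expected parts from the input text: example-2.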
import pandas as pd
## Dummy Data
data = [
'1xAAAA,2xBBBB,3xDDDD',
'1xBBBB,1xAAAA,6xEEEE',
'1xAAAA,1xCCCC,3xDDDD',
]
df = pd.DataFrame(data, columns=['data'])
print('\n Input:')
print(df)
## Output:
#                    data
# 0  1xAAAA,2xBBBB,3xDDDD
# 1  1xBBBB,1xAAAA,6xEEEE
# 2  1xAAAA,1xCCCC,3xDDDD
## Process DataFrame
# define regex pattern (raw string keeps the backslashes intact)
pat = r'(?:\s*(\d+)x(\w+)\s*)+'  # regex search pattern
# create dataframe in the expected format
df2 = (df.data                                       ## access the "data"-column
       .str.findall(pat)                             ## use regex to extract (count, name) pairs
       .apply(lambda x: {k: int(v) for v, k in x})   ## row-wise create dict to construct final {column: count} structure
       .apply(pd.Series)                             ## use dict to create columns
       .fillna(0)                                    ## replace NaN values with 0
       .astype(int)                                  ## keep the counts as integers so they can be summed
      )
df2 = df2.reindex(sorted(df2.columns), axis=1)       ## alphabetically reorder columns
print('\n Output:')
print(df2)
## Output:
#    AAAA  BBBB  CCCC  DDDD  EEEE
# 0     1     2     0     3     0
# 1     1     1     0     0     6
# 2     1     0     1     3     0
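One detail of the repeated-group pattern is worth checking by hand. When a capturing group sits inside a repetition, re.findall reports only the last repetition of each match. With comma-separated input that does not matter, because the comma ends every match after one pair. The sketch below compares the pattern with a simpler alternative, `(\d+)x(\w+)`, which is my own suggestion and not part of the answer:

```python
import re

pat_repeated = r'(?:\s*(\d+)x(\w+)\s*)+'  # pattern used in this section
pat_simple = r'(\d+)x(\w+)'               # hypothetical simpler alternative

# Comma-separated input: both patterns return one (count, name) tuple per pair.
text = '1xAAAA,2xBBBB,3xDDDD'
repeated_commas = re.findall(pat_repeated, text)
simple_commas = re.findall(pat_simple, text)
print(repeated_commas)  # [('1', 'AAAA'), ('2', 'BBBB'), ('3', 'DDDD')]
print(simple_commas)    # [('1', 'AAAA'), ('2', 'BBBB'), ('3', 'DDDD')]

# Whitespace-separated input: the repeated group swallows both pairs in one
# match and keeps only the last one, while the simple pattern keeps both.
spaced = '1xAAAA 2xBBBB'
repeated_spaced = re.findall(pat_repeated, spaced)
simple_spaced = re.findall(pat_simple, spaced)
print(repeated_spaced)  # [('2', 'BBBB')]
print(simple_spaced)    # [('1', 'AAAA'), ('2', 'BBBB')]
```

For the data in this question (pairs separated by a comma, optionally followed by a space) both patterns behave the same, so the difference only matters if the separator can be plain whitespace.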
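Section C
This is an example for the specific sample data shared by the OP. In this use case the DataFrame contains NaN values. One way to reuse the previously suggested solution, with a small modification, is to replace those NaN values with the string 0xUNKN and then keep only the results whose count is not 0.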
import pandas as pd
COLUMN_NAME = 'SmVariant'
## Dummy Data
data = [
'1xFBBC', float('nan'),
'2xFBBA', '5xABIA',
'2xFBBC, 1xFBBA',
'1xFBBA', '4xABIA',
'1xFBBA, 1xFBBC',
float('nan'), '1xFBBA',
'3xFBBA, 1xFBBC',
]
df = pd.DataFrame({COLUMN_NAME: data})
print('\n Input:')
print(df)
## Output:
#          SmVariant
# 0           1xFBBC
# 1              NaN
# 2           2xFBBA
# 3           5xABIA
# 4   2xFBBC, 1xFBBA
# 5           1xFBBA
# 6           4xABIA
# 7   1xFBBA, 1xFBBC
# 8              NaN
# 9           1xFBBA
# 10  3xFBBA, 1xFBBC
## Process DataFrame
# define regex pattern (raw string keeps the backslashes intact)
pat = r'(?:\s*(\d+)x(\w+)\s*)+'  # regex search pattern
# create dataframe in the expected format
df2 = (df[COLUMN_NAME]                                             ## access the "data"-column
       .fillna('0xUNKN')                                           ## replace nan values with 0xUNKN
       .str.findall(pat)                                           ## use regex to extract (count, name) pairs
       .apply(lambda x: {k: int(v) for v, k in x if int(v) != 0})  ## row-wise create dict to construct final {column: count} structure
       .apply(pd.Series)                                           ## use dict to create columns
       .fillna(0)                                                  ## replace NaN values with 0
       .astype(int)                                                ## keep the counts as integers so they can be summed
      )
df2 = df2.reindex(sorted(df2.columns), axis=1)                     ## alphabetically reorder columns
print('\n Output:')
print(df2)
## Output:
#     ABIA  FBBA  FBBC
# 0      0     0     1
# 1      0     0     0
# 2      0     2     0
# 3      5     0     0
# 4      0     1     2
# 5      0     1     0
# 6      4     0     0
# 7      0     1     1
# 8      0     0     0
# 9      0     1     0
# 10     0     3     1
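The question mentions that the columns will eventually be summed, which needs integer counts rather than the strings that the regex captures. As a hedged alternative sketch (not the answerer's method), pandas' `Series.str.extractall` can produce the same table without the 0xUNKN placeholder, because NaN cells simply yield no matches. The group names `count` and `name` are my own choice:

```python
import pandas as pd

s = pd.Series(['1xFBBC', float('nan'), '2xFBBA', '5xABIA', '2xFBBC, 1xFBBA',
               '1xFBBA', '4xABIA', '1xFBBA, 1xFBBC', float('nan'), '1xFBBA',
               '3xFBBA, 1xFBBC'], name='SmVariant')

# One row per "<count>x<name>" pair, indexed by (original row, match number).
pairs = s.str.extractall(r'(?P<count>\d+)x(?P<name>\w+)')
pairs['count'] = pairs['count'].astype(int)

# Sum the counts per (original row, name), spread the names into columns,
# then restore the rows that had no pairs (the NaN cells) as all zeros.
counts = (pairs.groupby([pairs.index.get_level_values(0), 'name'])['count']
               .sum()
               .unstack(fill_value=0)
               .reindex(s.index, fill_value=0))

totals = counts.sum()  # per-column totals: ABIA 9, FBBA 9, FBBC 5
print(counts)
print(totals)
```

Grouping before unstacking also handles a cell that lists the same name twice, since the counts are added together instead of one overwriting the other.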
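Section D: Dummy Data
import pandas as pd
# use a list rather than a set: sets are unordered, so the row order would not be guaranteed
data = [
    '1xBBBB,1xAAAA',
    '1xAAAA,2xBBBB',
    '1xAAAA,1xCCCC',
]
df = pd.DataFrame(data, columns=['data'])
print(df)
## Output:
#             data
# 0  1xBBBB,1xAAAA
# 1  1xAAAA,2xBBBB
# 2  1xAAAA,1xCCCC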