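Counting consonants and vowels in a split string

Time: 2019-07-02 02:23:08

Tags: python pandas dataframe

I am new to Pandas and have been exploring .csv files I found online. The code below builds a DataFrame that counts the vowels and consonants in each string of the "Description" column. This works well, but I want to split Description into 8 columns and count the consonants and vowels in each of them. The second part of my code splits Description into 8 columns. How can I count the vowels and consonants in all 8 columns that Description is split into?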

import pandas as pd
import re

def anti_vowel(s):
    result = re.sub(r'[AEIOU]', '', s, flags=re.IGNORECASE)
    return result

data = pd.read_csv('http://core.secure.ehc.com/src/util/detail-price-list/TristarDivision_SummitMedicalCenter_CM.csv')

data.dropna(inplace = True)

data['Vowels'] = data['Description'].str.count(r'[aeiou]', flags=re.I)
data['Consonant'] = data['Description'].str.count(r'[bcdfghjklmnpqrstvwxzy]', flags=re.I)

print (data)

This is the code I use to split the "Description" column into 8 columns.

import pandas as pd

data = pd.read_csv('http://core.secure.ehc.com/src/util/detail-price-list/TristarDivision_SummitMedicalCenter_CM.csv')

data.dropna(inplace = True)

# n = 8 allows up to 8 splits, so this yields up to 9 columns (0 through 8)
data = data["Description"].str.split(" ", n = 8, expand = True)

print (data)

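How do I now put these together?

To read each of the 8 columns and count its consonants, I know I can use the following code, replacing the 0 with each of 0-7 in turn: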

testconsonant = data[0].str.count(r'[bcdfghjklmnpqrstvwxzy]', flags=re.I)
testvowel = data[0].str.count(r'[aeiou]', flags=re.I)

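The desired output is:

Description[0], vowel count, consonant count, Description[1], vowel count, consonant count, Description[2], vowel count, consonant count, Description[3], vowel count, consonant count, Description[4], vowel count, consonant count, and so on up to Description[7]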

1 answer:

Answer 0: (score: 3)

stack, then unstack

stacked = data.stack()
pd.concat({
    'Vowels': stacked.str.count('[aeiou]', flags=re.I),
    'Consonant': stacked.str.count('[bcdfghjklmnpqrstvwxzy]', flags=re.I)
}, axis=1).unstack()

      Consonant                                         Vowels                                        
              0    1    2    3    4    5    6    7    8      0    1    2    3    4    5    6    7    8
0           3.0  5.0  5.0  1.0  2.0  NaN  NaN  NaN  NaN    1.0  0.0  0.0  0.0  0.0  NaN  NaN  NaN  NaN
1           8.0  5.0  1.0  0.0  0.0  0.0  0.0  0.0  NaN    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  NaN
2           8.0  5.0  1.0  0.0  0.0  0.0  0.0  0.0  NaN    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  NaN
3           8.0  5.0  1.0  0.0  0.0  0.0  0.0  0.0  NaN    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  NaN
4           3.0  5.0  3.0  1.0  0.0  0.0  0.0  0.0  NaN    0.0  0.0  2.0  0.0  0.0  0.0  0.0  0.0  NaN
5           3.0  5.0  3.0  1.0  0.0  0.0  0.0  0.0  NaN    0.0  0.0  2.0  0.0  0.0  0.0  0.0  0.0  NaN
6           3.0  4.0  0.0  1.0  0.0  0.0  0.0  NaN  NaN    3.0  1.0  0.0  0.0  0.0  0.0  0.0  NaN  NaN
7           3.0  3.0  0.0  1.0  0.0  0.0  0.0  NaN  NaN    3.0  1.0  0.0  1.0  0.0  0.0  0.0  NaN  NaN
8           3.0  3.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0    3.0  1.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0
9           3.0  3.0  0.0  1.0  0.0  0.0  0.0  NaN  NaN    3.0  1.0  0.0  1.0  0.0  0.0  0.0  NaN  NaN
10          3.0  3.0  0.0  1.0  0.0  0.0  0.0  0.0  NaN    3.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  NaN
11          3.0  3.0  0.0  2.0  2.0  NaN  NaN  NaN  NaN    3.0  0.0  0.0  0.0  0.0  NaN  NaN  NaN  NaN
12          3.0  3.0  0.0  1.0  0.0  0.0  0.0  0.0  NaN    3.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  NaN
13          3.0  3.0  0.0  2.0  2.0  NaN  NaN  NaN  NaN    3.0  1.0  0.0  0.0  0.0  NaN  NaN  NaN  NaN
14          3.0  5.0  0.0  2.0  0.0  0.0  0.0  0.0  0.0    3.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
15          3.0  3.0  0.0  3.0  1.0  NaN  NaN  NaN  NaN    3.0  0.0  0.0  0.0  1.0  NaN  NaN  NaN  NaN
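
The same stack, count, unstack sequence can be tried without downloading the CSV. The following is a minimal sketch under stated assumptions: the three sample descriptions and the names `df`, `parts`, and `counts` are hypothetical and not part of the original data.

```python
import re

import pandas as pd

# Hypothetical sample data standing in for the "Description" column.
df = pd.DataFrame({"Description": ["HB IV INFUSION", "XR CHEST", "CT"]})

# One column per word; rows with fewer words are padded with missing values.
parts = df["Description"].str.split(" ", expand=True)

# stack() turns the wide frame into one Series indexed by (row, word position),
# so each regex count only has to be written once.
stacked = parts.stack()

counts = pd.concat(
    {
        "Vowels": stacked.str.count(r"[aeiou]", flags=re.I),
        "Consonant": stacked.str.count(r"[bcdfghjklmnpqrstvwxzy]", flags=re.I),
    },
    axis=1,
).unstack()  # move the word position back into the columns

print(counts)
```

Cells for word positions that a row does not have come back as missing values, which is why the output above contains NaN.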

If you want to combine this with the data DataFrame, you can do the following:

stacked = data.stack()
pd.concat({
    'Data': stacked,
    'Vowels': stacked.str.count('[aeiou]', flags=re.I),
    'Consonant': stacked.str.count('[bcdfghjklmnpqrstvwxzy]', flags=re.I)
}, axis=1).unstack()
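
The result groups the columns by field (all words, then all vowel counts, then all consonant counts), whereas the question asks for each word followed by its own two counts. One possible way to get that layout, sketched here on hypothetical sample data, is to swap the two column levels and reindex. The names `df`, `wide`, `fields`, and `positions` are assumptions made for illustration.

```python
import re

import pandas as pd

# Hypothetical sample data; the question reads "Description" from a CSV instead.
df = pd.DataFrame({"Description": ["HB IV INFUSION", "XR CHEST"]})
stacked = df["Description"].str.split(" ", expand=True).stack()

wide = pd.concat(
    {
        "Data": stacked,
        "Vowels": stacked.str.count(r"[aeiou]", flags=re.I),
        "Consonant": stacked.str.count(r"[bcdfghjklmnpqrstvwxzy]", flags=re.I),
    },
    axis=1,
).unstack()

# Make the word position the outer column level, then order the columns so
# that each position carries its word, vowel count, and consonant count.
wide = wide.swaplevel(axis=1)
fields = ["Data", "Vowels", "Consonant"]
positions = wide.columns.get_level_values(0).unique()
wide = wide.reindex(columns=pd.MultiIndex.from_product([positions, fields]))

print(wide)
```

Whether to keep the two-level column header or flatten it into single names is a matter of taste; the two-level form keeps selecting one word position (for example `wide[0]`) simple.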