Question

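I'm a beginner with Python / Pandas for data analysis. I'm trying to import (scrape) a table from the Wikipedia article on letter frequency, clean it, and convert it to a data frame.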

Here is the code I used to turn the table into a data frame called letter_freq_all:

import pandas as pd
import numpy as np

letter_freq_all = pd.read_html('http://en.wikipedia.org/wiki/Letter_frequency', header=0)[4]
letter_freq_all

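I want to clean up the data and format it correctly for data analysis:

I want to remove the square brackets with numbers from the column names and make sure there is no space padding on either side.
I also want to remove the percent signs and any asterisks from each column, so that I can convert each column to float type.
So far, I have tried, without success, to remove the % signs from all the columns.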

This is the code I tried:

letter_freq_all2 = [str.replace(i,'%','') for i in letter_freq_all]

Instead of getting a new data frame without any % signs, I got a list of all the column names in letter_freq_all:

['Letter','French [14]','German [15]','Spanish [16]','Portuguese [17]','Esperanto  [18]','Italian[19]','Turkish[20]','Swedish[21]','Polish[22]','Dutch [23]','Danish[24]','Icelandic[25]','Finnish[26]','Czech']

Then I tried removing the % signs from a single column:

letter_freq_all3 = [str.replace(i,'%','') for i in letter_freq_all['Italian[19]']]

When I did this, the str.replace method sort of worked: I got a list without any % signs (I was expecting a Series).

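So, how do I get rid of the % signs in all columns of the letter_freq_all data frame? Also, how do I get rid of all the brackets and the extra space padding in the column names? I'm guessing I may have to use the .split() method.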

Answer 1

For data analysis it makes sense to work with float entries rather than strings. So you can write a function that tries to convert each entry:

def f(s):
    """ convert string to float if possible """
    if not isinstance(s, str):  # leave NaN and other non-string entries alone
        return s
    s = s.strip()  # remove spaces at beginning and end of string
    if s.endswith('%'):  # remove %, if exists
        s = s[:-1]
    try:
        return float(s)
    except ValueError: # converting did not work
        return s  # return original string

# convert all entries (applymap was renamed to DataFrame.map in pandas 2.1)
lf2 = letter_freq_all.applymap(f)
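
The function above only strips a trailing %, while the question also mentions asterisks. Here is a rough sketch of a variant that handles both, run on a small made-up frame rather than the real scraped table. The name clean_cell and the demo data are my own assumptions, not part of the answer.

```python
import pandas as pd

def clean_cell(value):
    """Return a float if the cell looks like '7.636%' or '7.636%*', else the cell unchanged."""
    if not isinstance(value, str):  # NaN or already numeric: nothing to clean
        return value
    stripped = value.strip().rstrip('%*').strip()  # drop padding, then trailing % and *
    try:
        return float(stripped)
    except ValueError:  # e.g. the letters in the first column
        return value

# Hypothetical stand-in for the scraped table
demo = pd.DataFrame({
    'Letter': ['a', 'b'],
    'French [14]': ['7.636%', '0.901%'],
    'Italian[19]': ['11.745%*', ' 0.927% '],
})

# DataFrame.map exists from pandas 2.1 on; older versions only have applymap
elementwise = demo.map if hasattr(pd.DataFrame, 'map') else demo.applymap
demo_converted = elementwise(clean_cell)
print(demo_converted)
```

The values stay in percent units here (7.636, not 0.07636); divide the numeric columns by 100 afterwards if you want fractions.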

Answer 2

The most concise way to do this is to use the str.replace() method with a regular expression:

1) Rename the columns:

letter_freq_all.columns = pd.Series(letter_freq_all.columns).str.replace(r'\[\d+\]', '', regex=True).str.strip()

2) Remove the asterisks and percent signs and convert to decimals:

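# Skip the first (Letter) column; apply() works column by column by default (axis=0)
letter_freq_all.iloc[:, 1:].apply(lambda x: x.str.replace(r'[%*]', '', regex=True).astype(float) / 100)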

In this case, apply() runs the str.replace() method on each column.

You can read more about regular expression metacharacters here:

https://www.hscripts.com/tutorials/regular-expression/metacharacter-list.php
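
To see the two steps working together, here is a minimal sketch on a small made-up frame. The demo data and the value_cols name are assumptions for illustration. Note that recent pandas versions treat the pattern as a literal string unless you pass regex=True.

```python
import pandas as pd

# Hypothetical stand-in for the scraped table, with the same kinds of messy headers
demo = pd.DataFrame({
    'Letter': ['a', 'b'],
    'French [14]': ['7.636%', '0.901%'],
    'Esperanto  [18]': ['12.117%', '0.980%'],
    'Italian[19]': ['11.745%*', '0.927%'],
})

# 1) Drop the bracketed reference numbers, then trim leftover padding
demo.columns = demo.columns.str.replace(r'\[\d+\]', '', regex=True).str.strip()

# 2) Strip % and * from every column except the letters, then convert to fractions
value_cols = demo.columns[1:]
demo[value_cols] = demo[value_cols].apply(
    lambda col: col.str.replace(r'[%*]', '', regex=True).astype(float) / 100
)

print(demo)
```

Assigning the result back is what actually changes the frame; apply() on its own returns a new object and leaves the original untouched.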

Answer 3

I think this works. I've used pandas' broadcasting capabilities to replace the values across several columns at once.

# Ignore first col with letters in it.
cols = letter_freq_all.columns[1:]

# Replace the columns `cols` in the DF
letter_freq_all[cols] = (
    letter_freq_all[cols]
    # Replace things that aren't numbers and change any empty entries to nan
    # (to allow type conversion)
    .replace({r'[^0-9\.]': '', '': np.nan}, regex=True)
    # Change to float and convert from %s
    .astype(np.float64) / 100
)

letter_freq_all.head()


 Letter  French [14]  German [15]  Spanish [16]  Portuguese [17]  ...
0      a      0.07636      0.06516       0.11525          0.14634   
1      b      0.00901      0.01886       0.02215          0.01043   
2      c      0.03260      0.02732       0.04019          0.03882   
3      d      0.03669      0.05076       0.05510          0.04992   
4      e      0.14715      0.16396       0.12681          0.11570
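
If you would rather not spell out the empty-string case yourself, one possible alternative (a different technique from the replace/astype chain above, not something the answer uses) is pd.to_numeric with errors='coerce', which turns anything it cannot parse into NaN. A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical stand-in for the scraped table, including an empty cell
demo = pd.DataFrame({
    'Letter': ['a', 'b', 'c'],
    'French [14]': ['7.636%', '0.901%', '~0%'],
    'German [15]': ['6.516%', '1.886%', ''],
})

# Ignore first col with letters in it.
cols = demo.columns[1:]

demo[cols] = demo[cols].apply(
    lambda col: pd.to_numeric(
        # Keep only digits and the decimal point
        col.str.replace(r'[^0-9.]', '', regex=True),
        # Unparseable leftovers (such as '') become NaN instead of raising
        errors='coerce',
    ) / 100
)

print(demo)
```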

How do I use the `str.replace()` method on all columns of a scraped Pandas data frame?
