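Python - Append a str value to certain rows in a DataFrame

Time: 2017-06-27 18:52:48

Tags: python pandas email dataframe beautifulsoup

I have a value stored in a string. I want to append that value only to rows that meet a certain condition, and not to any other rows.

The image below shows the tables I need to parse. I can easily parse the file with BeautifulSoup and convert it to a Pandas DataFrame, but for the two tables below I am struggling to capture the Package price and attach it to the whole DataFrame. Ideally, the price value would appear alongside every fish/weight pair, i.e. a single column holding the same Price value in every row.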

enter image description here

Here is the code I use to parse the tables:

import email

import pandas as pd
from bs4 import BeautifulSoup

with open(file_path) as in_f:
    msg = email.message_from_file(in_f)  # type: <class 'email.message.Message'>

html_msg = msg.get_payload(1)  # type: <class 'email.message.Message'>

body = html_msg.get_payload(decode=True)  # type: <class 'bytes'> or type: 'int'

html = body.decode()  # type: <class 'str'>

# Name the parser explicitly so the result does not depend on which parsers are installed
tablez = BeautifulSoup(html, 'html.parser').find_all("table")  # type: <class 'bs4.element.ResultSet'>
data = []
for table in tablez:
    for row in table.find_all("tr"):
        data.append([cell.text.strip() for cell in row.find_all("td")])

fish_frame = pd.DataFrame(data)

This is data:

data: [['Species', 'Price', 'Weight'], ['GBW Cod', '.55', '8,059'], ['GBE Haddock', '.03', '14,628'], ['GBW Haddock', '.02', '87,451'], ['GB YT', '1.50', '1,818'], ['Witch', '1.25', '1,414'], ['GB Winter', '.40', '23,757'], ['Redfish', '.02', '123'], ['White Hake', '.40', '934'], ['Pollock', '.02', '7,900'], ['Package Price:', '', '$21,151.67'], ['Species', 'Weight'], ['GBE Cod', '820'], ['GBW Cod', '15,279'], ['GBE Haddock', '32,250'], ['GBW Haddock', '192,793'], ['GB YT', '6,239'], ['SNE YT', '2,018'], ['GOM YT', '1,511'], ['Plaice', '2,944'], ['Witch', '1,100'], ['GB Winter', '158,608'], ['White Hake', '31'], ['Pollock', '1,983'], ['SNE Winter', '7,257'], ['Price', '$58,500.00'], ['Species', 'Weight'], ['GBE Cod', '792'], ['GBW Cod', '14,767'], ['GBE Haddock', '29,199'], ['GBW Haddock', '174,556'], ['GB YT', '5,268'], ['SNE YT', '544'], ['GOM YT', '1,957'], ['Plaice', '2,452'], ['Witch', '896'], ['GB Winter', '163,980'], ['White Hake', '8'], ['Pollock', '1,743'], ['SNE Winter', '3,709'], ['Price', '$57,750.00']]

I then use this code to capture the Package price:

stew = BeautifulSoup(html, 'html.parser')
chunks = stew.find_all('p', {'class' : "MsoNormal"})        
for line in chunks:
    if 'Package' in line.text:
        package_price = line.text
        print("package_price:", package_price)

But I am now struggling to add the price value to the DataFrame as its own column. Running a command such as fish_frame = pd.DataFrame(package_price) results in:

Traceback (most recent call last):
  File "Z:/Code/NEFS_stock_then_weight_attempt3.py", line 236, in <module>
    fish_frame = pd.DataFrame(package_price)
  File "C:\Users\stephen.mahala\AppData\Local\Programs\Python\Python35-32\lib\site-packages\pandas\core\frame.py", line 345, in __init__
    raise PandasError('DataFrame constructor not properly called!')
pandas.core.common.PandasError: DataFrame constructor not properly called!

for reasons I do not understand. Converting the string to a list, however, breaks it apart so that each character becomes its own list element and therefore ends up in its own cell of the DataFrame.

Is there a Pandas or BeautifulSoup method I am not aware of that would simplify adding this single value to my DataFrame?
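The two behaviours described above can be reproduced in isolation. The snippet below is a minimal sketch, assuming a made-up package_price string in place of line.text; the exact exception class raised by the constructor differs between pandas versions, so it catches Exception broadly.

```python
import pandas as pd

# Hypothetical stand-in for line.text from the question
package_price = "Package Price: $21,151.67"

# A bare string is not accepted as the data argument of the DataFrame constructor.
try:
    pd.DataFrame(package_price)
    constructor_failed = False
except Exception as exc:  # the exception class differs between pandas versions
    constructor_failed = True
    print("constructor error:", exc)

# list() on a string yields one element per character,
# so the resulting frame has one row per character.
per_char = pd.DataFrame(list(package_price))

# Wrapping the string in a list keeps it whole: one row, one cell.
whole = pd.DataFrame([package_price], columns=["package_price"])
print(per_char.shape, whole.shape)
```

Wrapping the string in a one-element list is what keeps it intact, because pandas then treats it as a single value rather than as a sequence of characters.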

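1 Answer:

Answer 0: (score: 1)

When I create fish_frame from pd.DataFrame(data), I get the following, which contains the rows of all the tables (three in this data) stacked on top of each other: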

                 0           1           2
0          Species       Price      Weight
1          GBW Cod         .55       8,059
2      GBE Haddock         .03      14,628
3      GBW Haddock         .02      87,451
4            GB YT        1.50       1,818
5            Witch        1.25       1,414
6        GB Winter         .40      23,757
7          Redfish         .02         123
8       White Hake         .40         934
9          Pollock         .02       7,900
10  Package Price:              $21,151.67
11         Species      Weight        None
12         GBE Cod         820        None
13         GBW Cod      15,279        None
14     GBE Haddock      32,250        None
15     GBW Haddock     192,793        None
16           GB YT       6,239        None
17          SNE YT       2,018        None
18          GOM YT       1,511        None
19          Plaice       2,944        None
20           Witch       1,100        None
21       GB Winter     158,608        None
22      White Hake          31        None
23         Pollock       1,983        None
24      SNE Winter       7,257        None
25           Price  $58,500.00        None
26         Species      Weight        None
27         GBE Cod         792        None
28         GBW Cod      14,767        None
29     GBE Haddock      29,199        None
30     GBW Haddock     174,556        None
31           GB YT       5,268        None
32          SNE YT         544        None
33          GOM YT       1,957        None
34          Plaice       2,452        None
35           Witch         896        None
36       GB Winter     163,980        None
37      White Hake           8        None
38         Pollock       1,743        None
39      SNE Winter       3,709        None
40           Price  $57,750.00        None
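Because the rows of every table end up stacked in one frame, one possible way to get the price next to each species/weight pair is to split the rows back into tables before building any DataFrame. The following is only a sketch under assumptions: the helper split_tables and the column name "Package Price" are illustrative names, the data list is a shortened copy of the rows above, and a row whose first cell is 'Species' is taken to start a new table.

```python
import pandas as pd

# A shortened copy of the stacked rows shown above
data = [
    ["Species", "Price", "Weight"],
    ["GBW Cod", ".55", "8,059"],
    ["Pollock", ".02", "7,900"],
    ["Package Price:", "", "$21,151.67"],
    ["Species", "Weight"],
    ["GBE Cod", "820"],
    ["SNE Winter", "7,257"],
    ["Price", "$58,500.00"],
]


def split_tables(rows):
    """Group rows into tables; a row starting with 'Species' opens a new table."""
    tables = []
    for row in rows:
        if not row:
            continue  # skip rows that contained no <td> cells
        if row[0] == "Species":
            tables.append({"header": row, "rows": [], "price": None})
        elif not tables:
            continue  # ignore anything before the first header row
        elif row[0] in ("Package Price:", "Price"):
            tables[-1]["price"] = row[-1]  # the price is the last cell of that row
        else:
            tables[-1]["rows"].append(row)
    return tables


frames = []
for table in split_tables(data):
    frame = pd.DataFrame(table["rows"], columns=table["header"])
    frame["Package Price"] = table["price"]  # a scalar is repeated on every row
    frames.append(frame)

for frame in frames:
    print(frame, end="\n\n")
```

Each resulting frame keeps its own header row as column names, and its price row becomes a constant column instead of a data row.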

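If you get rid of the outer loop for table in tablez: and instead just loop over the rows of the first table with for row in tablez[0].find_all("tr"), I think you will end up with: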

data = [['Species', 'Price', 'Weight'], ['GBW Cod', '.55', '8,059'],
        ['GBE Haddock', '.03', '14,628'], ['GBW Haddock', '.02', '87,451'], 
        ['GB YT', '1.50', '1,818'], ['Witch', '1.25', '1,414'], 
        ['GB Winter', '.40', '23,757'], ['Redfish', '.02', '123'], 
        ['White Hake', '.40', '934'], ['Pollock', '.02', '7,900'], 
        ['Package Price:', '', '$21,151.67']]

Then fish_frame = pd.DataFrame(data) will result in:

                 0      1           2
0          Species  Price      Weight
1          GBW Cod    .55       8,059
2      GBE Haddock    .03      14,628
3      GBW Haddock    .02      87,451
4            GB YT   1.50       1,818
5            Witch   1.25       1,414
6        GB Winter    .40      23,757
7          Redfish    .02         123
8       White Hake    .40         934
9          Pollock    .02       7,900
10  Package Price:         $21,151.67

Whether or not you make that change, the following adds a column to fish_frame:

srs = pd.Series([package_price]*len(fish_frame))
fish_frame[3] = pd.Series(srs,index=fish_frame.index)

You should then end up with:

                 0      1           2    3
0          Species  Price      Weight    #891-2: Package for $21,151.67 but willing to sell species individually
1          GBW Cod    .55       8,059    #891-2: Package for $21,151.67 but willing to sell species individually
2      GBE Haddock    .03      14,628    #891-2: Package for $21,151.67 but willing to sell species individually
3      GBW Haddock    .02      87,451    #891-2: Package for $21,151.67 but willing to sell species individually
4            GB YT   1.50       1,818    #891-2: Package for $21,151.67 but willing to sell species individually
...
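As a hedged variation on the column assignment above: pandas repeats a scalar on every row when it is assigned to a column, so the intermediate Series is optional. To fill only the rows that meet a condition, as the question title asks, a boolean mask with .loc can be used. In this sketch the small frame, the package_price value, the extra column 4, and the condition itself (skip the header row and the price row) are all illustrative assumptions.

```python
import pandas as pd

# A shortened version of the single-table frame shown above
fish_frame = pd.DataFrame([
    ["Species", "Price", "Weight"],
    ["GBW Cod", ".55", "8,059"],
    ["Pollock", ".02", "7,900"],
    ["Package Price:", "", "$21,151.67"],
])

# Hypothetical value; in the question it comes from line.text
package_price = "$21,151.67"

# Assigning a scalar repeats it on every row of the new column.
fish_frame[3] = package_price

# To fill only rows that meet a condition, build a boolean mask
# and assign through .loc. The condition here is only an example.
is_species_row = ~fish_frame[0].isin(["Species", "Package Price:"])
fish_frame[4] = ""  # start with an empty column
fish_frame.loc[is_species_row, 4] = package_price

print(fish_frame)
```

Rows where the mask is False keep the empty string, so only the species rows carry the price in column 4.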