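I have a value stored in a string. I want to append that value only to the rows that meet a specific condition, and not to any other rows.
The image below shows the tables I need to parse. I can easily parse the file with BeautifulSoup and convert it into a Pandas DataFrame, but for the two tables below I am struggling to capture the Package Price and attach it to the whole DataFrame. Ideally, the price value would appear alongside every fish/weight pair, that is, as a single column holding the same Price value.
Here is the code I use to parse the tables: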
import email

import pandas as pd
from bs4 import BeautifulSoup

with open(file_path) as in_f:
    msg = email.message_from_file(in_f)  # type: <class 'email.message.Message'>

html_msg = msg.get_payload(1)  # type: <class 'email.message.Message'>
body = html_msg.get_payload(decode=True)  # type: <class 'bytes'> or type: 'int'
html = body.decode()  # type: <class 'str'>
# Name the parser explicitly so the result does not depend on what is installed.
tablez = BeautifulSoup(html, 'html.parser').find_all("table")  # type: <class 'bs4.element.ResultSet'>
data = []
for table in tablez:
    for row in table.find_all("tr"):
        # One list of stripped cell strings per table row.
        data.append([cell.text.strip() for cell in row.find_all("td")])
fish_frame = pd.DataFrame(data)
This is data:
data: [['Species', 'Price', 'Weight'], ['GBW Cod', '.55', '8,059'], ['GBE Haddock', '.03', '14,628'], ['GBW Haddock', '.02', '87,451'], ['GB YT', '1.50', '1,818'], ['Witch', '1.25', '1,414'], ['GB Winter', '.40', '23,757'], ['Redfish', '.02', '123'], ['White Hake', '.40', '934'], ['Pollock', '.02', '7,900'], ['Package Price:', '', '$21,151.67'], ['Species', 'Weight'], ['GBE Cod', '820'], ['GBW Cod', '15,279'], ['GBE Haddock', '32,250'], ['GBW Haddock', '192,793'], ['GB YT', '6,239'], ['SNE YT', '2,018'], ['GOM YT', '1,511'], ['Plaice', '2,944'], ['Witch', '1,100'], ['GB Winter', '158,608'], ['White Hake', '31'], ['Pollock', '1,983'], ['SNE Winter', '7,257'], ['Price', '$58,500.00'], ['Species', 'Weight'], ['GBE Cod', '792'], ['GBW Cod', '14,767'], ['GBE Haddock', '29,199'], ['GBW Haddock', '174,556'], ['GB YT', '5,268'], ['SNE YT', '544'], ['GOM YT', '1,957'], ['Plaice', '2,452'], ['Witch', '896'], ['GB Winter', '163,980'], ['White Hake', '8'], ['Pollock', '1,743'], ['SNE Winter', '3,709'], ['Price', '$57,750.00']]
I then use this code to capture the Package Price:
stew = BeautifulSoup(html, 'html.parser')
chunks = stew.find_all('p', {'class': "MsoNormal"})
for line in chunks:
    if 'Package' in line.text:
        package_price = line.text
        print("package_price:", package_price)
But I am now struggling to add the price value to the DataFrame as its own column. Running a command such as fish_frame = pd.DataFrame(package_price) results in:
Traceback (most recent call last):
  File "Z:/Code/NEFS_stock_then_weight_attempt3.py", line 236, in <module>
    fish_frame = pd.DataFrame(package_price)
  File "C:\Users\stephen.mahala\AppData\Local\Programs\Python\Python35-32\lib\site-packages\pandas\core\frame.py", line 345, in __init__
    raise PandasError('DataFrame constructor not properly called!')
pandas.core.common.PandasError: DataFrame constructor not properly called!
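I do not know why this fails. However, converting the string to a list breaks it apart: every character becomes its own list, and therefore its own cell in the DataFrame.
Is there a Pandas or BeautifulSoup method that I am not aware of that would simplify adding this single value to my DataFrame?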
Answer 0 (score: 1)
When I create fish_frame from pd.DataFrame(data), I get the following, which stacks the rows of all the tables on top of one another:
0 1 2
0 Species Price Weight
1 GBW Cod .55 8,059
2 GBE Haddock .03 14,628
3 GBW Haddock .02 87,451
4 GB YT 1.50 1,818
5 Witch 1.25 1,414
6 GB Winter .40 23,757
7 Redfish .02 123
8 White Hake .40 934
9 Pollock .02 7,900
10 Package Price: $21,151.67
11 Species Weight None
12 GBE Cod 820 None
13 GBW Cod 15,279 None
14 GBE Haddock 32,250 None
15 GBW Haddock 192,793 None
16 GB YT 6,239 None
17 SNE YT 2,018 None
18 GOM YT 1,511 None
19 Plaice 2,944 None
20 Witch 1,100 None
21 GB Winter 158,608 None
22 White Hake 31 None
23 Pollock 1,983 None
24 SNE Winter 7,257 None
25 Price $58,500.00 None
26 Species Weight None
27 GBE Cod 792 None
28 GBW Cod 14,767 None
29 GBE Haddock 29,199 None
30 GBW Haddock 174,556 None
31 GB YT 5,268 None
32 SNE YT 544 None
33 GOM YT 1,957 None
34 Plaice 2,452 None
35 Witch 896 None
36 GB Winter 163,980 None
37 White Hake 8 None
38 Pollock 1,743 None
39 SNE Winter 3,709 None
40 Price $57,750.00 None
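The None entries in column 2 come from stacking rows of different lengths. A minimal sketch of that behavior, using two made-up rows in the same shape as the three-column and two-column tables (the names rows and frame are illustrative only):

```python
import pandas as pd

# Two rows of different lengths, like a three-column and a two-column table.
rows = [['GBW Cod', '.55', '8,059'], ['GBE Cod', '820']]
frame = pd.DataFrame(rows)

# The widest row decides how many columns the frame gets.
print(frame.shape)

# The shorter row is padded with a missing value in the last column.
print(pd.isna(frame.iloc[1, 2]))

# Its weight sits in column 1, which holds prices for the longer row.
print(frame.iloc[1, 1])
```

This also means that columns 1 and 2 do not mean the same thing across the stacked tables, which is worth keeping in mind before doing arithmetic on them.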
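If you get rid of the outer loop for table in tablez: and instead loop over the rows of the first table only with for row in tablez[0].find_all("tr"): I think you will end up with: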
data = [['Species', 'Price', 'Weight'], ['GBW Cod', '.55', '8,059'],
['GBE Haddock', '.03', '14,628'], ['GBW Haddock', '.02', '87,451'],
['GB YT', '1.50', '1,818'], ['Witch', '1.25', '1,414'],
['GB Winter', '.40', '23,757'], ['Redfish', '.02', '123'],
['White Hake', '.40', '934'], ['Pollock', '.02', '7,900'],
['Package Price:', '', '$21,151.67']]
Then fish_frame = pd.DataFrame(data) will give:
0 1 2
0 Species Price Weight
1 GBW Cod .55 8,059
2 GBE Haddock .03 14,628
3 GBW Haddock .02 87,451
4 GB YT 1.50 1,818
5 Witch 1.25 1,414
6 GB Winter .40 23,757
7 Redfish .02 123
8 White Hake .40 934
9 Pollock .02 7,900
10 Package Price: $21,151.67
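Keeping only tablez[0] drops the other tables. If all of them are needed, one possible approach is to split the flat data list back into its tables and attach each table's own price to its rows. The following is a minimal sketch under two assumptions taken from the data shown in the question: every table starts with a row whose first cell is 'Species' and ends with a row whose first cell contains 'Price'. The helper name split_tables and the column name 'Package Price' are invented for illustration, and data is shortened to a few rows.

```python
import pandas as pd

# A shortened stand-in for the flat `data` list built by the parsing loop.
data = [
    ['Species', 'Price', 'Weight'],
    ['GBW Cod', '.55', '8,059'],
    ['Pollock', '.02', '7,900'],
    ['Package Price:', '', '$21,151.67'],
    ['Species', 'Weight'],
    ['GBE Cod', '820'],
    ['SNE Winter', '7,257'],
    ['Price', '$58,500.00'],
]


def split_tables(rows):
    """Return one DataFrame per table (hypothetical helper)."""
    frames = []
    header, body = None, []
    for row in rows:
        if row[0] == 'Species':
            # A header row starts a new table.
            header, body = row, []
        elif 'Price' in row[0]:
            # The price row closes the current table; its last cell is the price.
            frame = pd.DataFrame(body, columns=header)
            frame['Package Price'] = row[-1]  # a scalar fills every row
            frames.append(frame)
        else:
            body.append(row)
    return frames


frames = split_tables(data)
fish_frame = pd.concat(frames, ignore_index=True)
print(fish_frame)
```

Tables without a per-species Price column get missing values in that column after the concatenation, while every species row carries the price of the table it came from.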
Whether or not you restrict the parse to the first table, the following adds a column to fish_frame:
# Repeat the string once per row, then align it with the frame's index.
srs = pd.Series([package_price] * len(fish_frame))
fish_frame[3] = pd.Series(srs, index=fish_frame.index)
You should then end up with:
0 1 2 3
0 Species Price Weight #891-2: Package for $21,151.67 but willing to sell species individually
1 GBW Cod .55 8,059 #891-2: Package for $21,151.67 but willing to sell species individually
2 GBE Haddock .03 14,628 #891-2: Package for $21,151.67 but willing to sell species individually
3 GBW Haddock .02 87,451 #891-2: Package for $21,151.67 but willing to sell species individually
4 GB YT 1.50 1,818 #891-2: Package for $21,151.67 but willing to sell species individually
...
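The new column above holds the whole sentence and is filled on every row, including the header and the price row. If only the dollar amount is wanted, and only on the rows that hold a species/weight pair, something along these lines may work. It is a sketch under stated assumptions: the regular expression assumes the first dollar figure in the text is the package price, the row filter assumes header and price rows can be recognized by 'Species' or 'Price' in column 0, and the names amount, is_fish_row and price_column are illustrative. A small stand-in frame replaces the real fish_frame.

```python
import re

import pandas as pd

# Stand-ins for the objects built earlier: a few rows of fish_frame and the
# paragraph text captured in package_price.
fish_frame = pd.DataFrame([
    ['Species', 'Price', 'Weight'],
    ['GBW Cod', '.55', '8,059'],
    ['Pollock', '.02', '7,900'],
    ['Package Price:', '', '$21,151.67'],
])
package_price = ('#891-2: Package for $21,151.67 but willing to sell '
                 'species individually')

# Pull the first dollar amount out of the sentence.
match = re.search(r'\$[\d,]+(?:\.\d+)?', package_price)
amount = match.group(0) if match else None

# Rows that hold a species/weight pair: neither a header nor a price row.
is_fish_row = ~fish_frame[0].str.contains('Species|Price')

# Build a full column of the amount, then blank out the rows that fail the test.
price_column = pd.Series(amount, index=fish_frame.index)
fish_frame[3] = price_column.where(is_fish_row)
print(fish_frame)
```

A plain string can be assigned to a column because pandas repeats a scalar for every row, whereas the DataFrame constructor expects a collection of rows or columns and rejects a bare string.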