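I have two different for loops that run the same number of times, and each one produces a string on every iteration (I am scraping an HTML file). I want the string from the first loop to be merged/concatenated/appended with the string from the second loop FOR EACH ITERATION (this is the tricky part). Here is my code: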
from bs4 import BeautifulSoup

bsObj = BeautifulSoup(open("samfull.html"), "html.parser")
tableList = bsObj.find_all("table", {"class":"width100 menu_header_top_emr"})
tdList = bsObj.find_all("td", {"class":"menu_header width100"})

for table in tableList:
    first_part_of_row_string = ''
    item = table.find_all("span", {"class":"results_body_text"})
    for i in range(len(item)):
        first_part_of_row_string += (item[i].get_text().strip() + ", ")

for td in tdList:
    second_part_of_row_string = ''
    items = td.find_all("span", {"class":"results_body_text"})
    for i in range(len(items)):
        second_part_of_row_string += (items[i].get_text().strip() + ", ")
For example, sample output from the loop for table in tableList is:
a,b,
1,2,
father, mother,
and from the loop for td in tdList it is:
c, d, e,
3, 4, 5,
son, daughter, twin,
I want to combine first_part_of_row_string from each iteration with second_part_of_row_string from the same iteration, so that it prints:
a, b, c, d, e,
1, 2, 3, 4, 5,
father, mother, son, daughter, twin,
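In other words, for each iteration of the two loops the result is first_part_of_row_string + second_part_of_row_string.
tableList and tdList have the same length, so the two loops always produce the same number of rows. If the td were inside the same table that tableList refers to, I could do this in a single loop; unfortunately it is not. In the HTML, each table with the class given in the tableList definition is followed by another table that has no class but always contains a td with the class given in tdList. A sample of this HTML is included below. The full page is several thousand lines long, so I have posted it separately: link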
<table cellspacing="0" cellpadding="0"
style="margin-left: auto; margin-right: auto;" class="width100 menu_header_top_emr">
<tbody>
<tr>
<td style="width:80px;">
<div style="width:70px;background-color:#B2EE98; border:1px solid grey; padding:2px 5px 2px 5px; text-align:center;">Entity</div>
</td>
<td style="padding-left:5px;">
<span class="results_body_text"><h5 style="vertical-align: middle;">Rascal X-Press, Inc.</h5></span>
</td>
<td style="width:130px;">
<div class="right">
<span class="results_title_text">Status:</span>
<span class="results_body_text">
Submitted
</span>
</div>
</td>
<td style="width:22px;">
<a href="" class="more_duns_link_emr right" style="display: inline;"><img
id="more_duns_link_emr"
src="/SAMSearch/styles/img/expand-small-blue.png" style="padding:8px 8px 8px 2px;"
alt="Expand Search Result for Rascal X-Press, Inc."></a>
<a href="" class="hide_duns_link_emr off right" style="display: none;"><img
id="hide_duns_link_emr"
src="/SAMSearch/styles/img/collapse-small-blue.png" style="padding:8px 8px 8px 2px;"
alt="Collapse Search Result for Rascal X-Press, Inc."></a>
</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<td class="menu_header width100">
<table>
<tr>
<td style="width:25%;">
<span class="results_title_text">DUNS:</span> <span class="results_body_text"> 012361296</span>
</td>
<td style="width:25%;">
</td>
<!-- label as CAGE when US Territory is listed as Country -->
<td style="width:27%;">
<span class="results_title_text">CAGE Code:</span> <span class="results_body_text"></span>
</td>
<td style="width:15%" rowspan="2">
<input type="button" value="View Details" title="View Details for Rascal X-Press, Inc." class="center" style="height:25px; width:90px; vertical-align:middle; margin:7px 3px 7px 3px;" onClick="viewEntry('4420848', '1472652382619')" />
</td>
</tr>
<tr>
<td colspan="2">
<span class="results_title_text">Has Active Exclusion?: </span>
<span class="results_body_text">
No
</span>
</td>
<td>
<span class="results_title_text">DoDAAC:</span> <span class="results_body_text"></span>
</td>
</tr>
<tr>
<td colspan="2">
<span class="results_title_text">Expiration Date:</span>
<span class="results_body_text">
</span>
</td>
<td colspan="2"><span class="results_title_text">Delinquent Federal Debt?</span>
<span class="results_body_text">
No
</span>
</td>
</tr>
<tr>
<td colspan="2"><span class="results_title_text">Purpose of Registration:</span>
<span class="results_body_text">
Federal Assistance Awards Only
</span>
</td>
</tr>
</table>
<div class="off_duns_emr" style="display: none;">
<table class="resultbox1 menu_header width100"
style="margin-left: auto; margin-right: auto;" cellpadding="2">
<tbody>
<tr>
<td colspan="3"><span class="results_title_text">Address:</span>
<span class="results_body_text">1372 State Hwy 37</span></td>
</tr>
<tr>
<td style="width:212px;"><span class="results_title_text">City:</span>
<span class="results_body_text">West Frankfort</span></td>
<td style="width:200px;"><span class="results_title_text">State/Province:</span>
<span class="results_body_text">IL</span></td>
</tr>
<tr>
<td style="width:130px;"><span class="results_title_text">ZIP Code:</span>
<span class="results_body_text">62896-5007</span></td>
<td style="width:200px;"><span class="results_title_text">Country:</span>
<span class="results_body_text">UNITED STATES</span></td>
</tr>
</tbody>
</table>
</div>
</td>
</tr>
</tbody>
</table></td>
</tr>
</tbody>
</table> </li>
</td>
</tr>
</table>
Answer 0 (score: 0)
There are many ways to do what you are asking; here is a very simple one:
tableList = [
    ["a", "b"],
    ["1", "2"],
    ["father", "mother"]
]
tdList = [
    ["c", "d", "e"],
    ["3", "4", "5"],
    ["son", "daughter", "twin"]
]
# Use the shorter length so indexing never runs past the end of either list
len_list = min(len(tableList), len(tdList))
for i in range(len_list):
    print(", ".join(tableList[i] + tdList[i]))
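The loop above walks both lists by index. As a minimal sketch of an alternative for lists of unequal length, assuming plain lists of strings as in the example, itertools.zip_longest pairs the rows and substitutes an empty value when one side runs out. The names merge_rows, first_parts and second_parts are hypothetical and used only for illustration.

```python
from itertools import zip_longest


def merge_rows(first_parts, second_parts):
    """Join the i-th row of each list into one comma-separated string."""
    merged = []
    # fillvalue=() keeps the concatenation valid when one list is shorter
    for first, second in zip_longest(first_parts, second_parts, fillvalue=()):
        merged.append(", ".join(list(first) + list(second)))
    return merged


first_parts = [["a", "b"], ["1", "2"], ["father", "mother"]]
second_parts = [["c", "d", "e"], ["3", "4", "5"]]  # deliberately one row short

for line in merge_rows(first_parts, second_parts):
    print(line)
```

If the two lists are guaranteed to have the same length, as the question states, plain zip is enough and the fill value is never used.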
Answer 1 (score: 0)
Use zip, and use join instead of appending the commas by hand:
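for table, td in zip(tableList, tdList):
    # join() needs strings, so extract the text of each span first
    a = ', '.join(s.get_text().strip() for s in table.find_all("span", {"class":"results_body_text"}))
    b = ', '.join(s.get_text().strip() for s in td.find_all("span", {"class":"results_body_text"}))
    print(a, b, sep=', ')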
If you are using Python 3.5 or later, you can use the more powerful unpacking syntax:
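for table, td in zip(tableList, tdList):
    # Convert the spans to text so print() shows the values, not the HTML tags
    a = [s.get_text().strip() for s in table.find_all("span", {"class":"results_body_text"})]
    b = [s.get_text().strip() for s in td.find_all("span", {"class":"results_body_text"})]
    print(*a, *b, sep=', ')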
If you are using Python 2, put the line from __future__ import print_function at the top of your code and use the Python 3 print function syntax, or simply join everything by hand:
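for table, td in zip(tableList, tdList):
    # join() needs strings, so extract the text of each span first
    a = [s.get_text().strip() for s in table.find_all("span", {"class":"results_body_text"})]
    b = [s.get_text().strip() for s in td.find_all("span", {"class":"results_body_text"})]
    print ', '.join(a + b)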
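One caveat with joining on ", ": some values in the sample HTML, such as Rascal X-Press, Inc., contain a comma themselves, so the joined line becomes ambiguous if it is later parsed as comma-separated data. A minimal sketch of one way around that, assuming the span texts have already been extracted into lists of strings, is to let the standard csv module do the quoting. The rows_to_csv helper and the table_texts and td_texts data are hypothetical and only modelled on the sample.

```python
import csv
import io


def rows_to_csv(table_texts, td_texts):
    """Pair the i-th entries of both lists and return them as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for first, second in zip(table_texts, td_texts):
        # csv.writer quotes any field that contains a comma
        writer.writerow(first + second)
    return buffer.getvalue()


# Hypothetical extracted texts, modelled on the sample HTML
table_texts = [["Rascal X-Press, Inc.", "Submitted"]]
td_texts = [["012361296", "", "No"]]

print(rows_to_csv(table_texts, td_texts), end="")
```

Writing to an in-memory buffer keeps the sketch self-contained; passing an open file object to csv.writer instead would write the rows straight to disk.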