Question

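I am trying to parse a table with election results from here. Basically, I am interested in the results in this table. I have found many examples of parsing HTML tables, but in all of them the data of interest is organized column-wise. My goal, however, is to extract the data from rows (for example, I want to get the first row, i.e. the names of the election districts, the second row, i.e. the number of registered voters, and so on). At the moment, I am able to extract the first column: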

from urllib2 import urlopen
from bs4 import BeautifulSoup

sub_url = "http://www.krasnodar.vybory.izbirkom.ru/region/region/krasnodar?action=show&tvd=2232000821586&vrn=2232000821581&region=23&global=&sub_region=23&prver=2&pronetvd=1&vibid=2232000821616&type=381"
page = urlopen(sub_url)
soup = BeautifulSoup(page.read())
table = soup.find("table", style = "width:100%;overflow:scroll")

for row in table.find_all("tr"):
cells = row.find_all("td")
if len(cells) == 42:
    first_column = cells[0]
    print first_column

The structure of the HTML does not allow simply swapping "tr" and "td". How can I extract the data row-wise?

P.S. I would like to end up with something like uiks = [УИК 101, УИК 102, УИК 103, ...]

Answer 1

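You are already iterating row by row: each "tr" is one row, and cells holds every "td" in that row. Note that the code after for row in... must be indented.

How about this?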

rows = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) == 42:        # keep only the full data rows
        rows.append(cells)      # one list of cells per table row
        print cells

Or you might want to build a nested dict of the entries, using the "uiks" as the first-level keys. That depends heavily on your use case.
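The nested dict idea could look roughly like the sketch below. It is written in Python 3 and works on plain cell texts rather than on the live page. The sample values, the row labels, the helper name rows_to_nested_dict, and the assumption that the first cell of each row is a label are all made up for illustration; the real table may be laid out differently.

```python
# Hypothetical cell texts, one inner list per table row.
# These are placeholder values, not real election data.
rows = [
    ["", "УИК №101", "УИК №102", "УИК №103"],
    ["registered voters", "10", "20", "30"],
    ["ballots issued", "1", "2", "3"],
]


def rows_to_nested_dict(rows):
    """Map each name in the first row to a dict of {row label: value}."""
    header, *body = rows
    names = header[1:]                      # skip the (assumed) label column
    result = {name: {} for name in names}
    for label, *values in body:
        for name, value in zip(names, values):
            result[name][label] = value
    return result


by_uik = rows_to_nested_dict(rows)
print(by_uik["УИК №102"]["registered voters"])  # prints 20
```

With this shape, everything known about one УИК is a single lookup, at the cost of losing the original row order if you later need to write the table back out.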

Answer 2

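You only want to print the first row of the table, so you have to iterate over all of its cells and stop after that first row. What you actually want is the text of each cell's first child node, because the content of a <td> looks like <td align="center" style="color:black"><nobr>УИК №101</nobr></td>.

So possible code is: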

for row in table.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) == 42:
        for cell in cells: # iterate all td cells
            print cell.findChild().text,   # print all in same line
        print ''                           # only one newline
        break                              # only first row
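To get a list like the uiks = [...] from the question instead of printed output, the same loop can collect the cell texts. The following is a minimal Python 3 sketch that runs against a small inline HTML snippet rather than the live page. The snippet and its second row of numbers are hypothetical placeholders, and the len(cells) == 42 filter is left out because the stand-in table has only three columns.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: two rows of three cells each.
html = """
<table style="width:100%;overflow:scroll">
  <tr>
    <td align="center" style="color:black"><nobr>УИК №101</nobr></td>
    <td align="center" style="color:black"><nobr>УИК №102</nobr></td>
    <td align="center" style="color:black"><nobr>УИК №103</nobr></td>
  </tr>
  <tr><td><nobr>10</nobr></td><td><nobr>20</nobr></td><td><nobr>30</nobr></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", style="width:100%;overflow:scroll")

# One list of cell texts per <tr>; get_text(strip=True) returns the text
# inside the <nobr> wrapper without surrounding whitespace.
rows = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in table.find_all("tr")
]

uiks = rows[0]  # first row: the УИК names
print(uiks)
```

Using get_text() on the cell instead of findChild().text also covers cells that contain bare text with no child tag, where findChild() would return None.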

Parsing an HTML table row-wise in Python using BeautifulSoup
