Python web scraping to a CSV file

Time: 2021-03-05 04:23:22

Tags: python beautifulsoup python-requests export-to-csv

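This is my web scraping code, which fetches the page content and exports it to a CSV file. Why is there a blank line between every row in the CSV file? Can this be fixed? Thanks!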

Python code

import requests
from bs4 import BeautifulSoup
import csv

session = requests.session()

payload = {"i0023":"XXXXXX", 
          "i0025":"XXXXXX"
         }
         
session.post("http://192.168.XXX.XXX/checkLogin.cgi",data = payload)

s = session.get("http://192.168.XXX.XXX/m_departmentid.html")

soup = BeautifulSoup(s.text, "html.parser")

table = soup.find('div', attrs={ "class" : "ItemListComponent"})
tbody = table.find_all('tbody')

rows = []

for row in table.find_all('tr'):
    rows.append([val.text for val in row.find_all('td')[0:6]])

with open('test.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerows(row for row in rows if row)

HTML source

<div class="ItemListComponent">
<table>
<thead>
<tr><th rowspan="3" scope="col">Department ID</th><th colspan="5" scope="col">Page Total/Page Restriction</th><th rowspan="3" scope="col"></th></tr>
<tr><th colspan="3" scope="col">Total Prints</th><th colspan="1" scope="col">Color</th><th colspan="1" scope="col">Black & White</th></tr>
<tr><th colspan="1" scope="col">Total</th><th colspan="1" scope="col">Color</th><th colspan="1" scope="col">Black & White</th><th colspan="1" scope="col">Print</th><th colspan="1" scope="col">Print</th></tr>

</thead>
<tbody>
<tr><td>7654321</td><td>11</td><td>0</td><td>11</td><td>0</td><td>11</td><td></td></tr>
<tr><td><a href="/m_departmentid_edit.html?id=100">0000100</a></td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td><input class="ButtonEnable" type="button" value="Delete" title="Delete" onclick="departmentIdDelete(100)"/><input class="ButtonEnable" type="button" value="Clear Count" onclick="departmentIdClear(100)" />
</td></tr>
<tr><td><a href="/m_departmentid_edit.html?id=101">0000101</a></td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td><input class="ButtonEnable" type="button" value="Delete" title="Delete" onclick="departmentIdDelete(101)"/><input class="ButtonEnable" type="button" value="Clear Count" onclick="departmentIdClear(101)" />
</td></tr>
<tr><td><a href="/m_departmentid_edit.html?id=102">0000102</a></td><td>18</td><td>5</td><td>13</td><td>5</td><td>13</td><td><input class="ButtonEnable" type="button" value="Delete" title="Delete" onclick="departmentIdDelete(102)"/><input class="ButtonEnable" type="button" value="Clear Count" onclick="departmentIdClear(102)" />
</td></tr>
<tr><td><a href="/m_departmentid_edit.html?id=103">0000103</a></td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td><input class="ButtonEnable" type="button" value="Delete" title="Delete" onclick="departmentIdDelete(103)"/><input class="ButtonEnable" type="button" value="Clear Count" onclick="departmentIdClear(103)" />
</td></tr>
<tr><td><a href="/m_departmentid_edit.html?id=104">0000104</a></td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td><input class="ButtonEnable" type="button" value="Delete" title="Delete" onclick="departmentIdDelete(104)"/><input class="ButtonEnable" type="button" value="Clear Count" onclick="departmentIdClear(104)" />
</td></tr>


3 answers:

Answer 0 (score: 1)

You are opening the file as "wb", which means write bytes. Open it as "w" instead.
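To illustrate the difference between the two modes, here is a minimal sketch that uses in-memory buffers in place of real files. The buffers and the sample row are assumptions made for illustration and are not part of the original code. In Python 3, csv.writer passes str to write(), so a binary target is rejected while a text target works.

```python
import csv
import io

sample_row = ["0000102", "18", "5"]  # hypothetical sample data

# A binary buffer stands in for a file opened with "wb".
binary_target = io.BytesIO()
try:
    csv.writer(binary_target).writerow(sample_row)
    binary_error = None
except TypeError as exc:
    # csv.writer hands str to write(), which a binary stream rejects.
    binary_error = exc

# A text buffer stands in for a file opened with "w" and newline="".
text_target = io.StringIO(newline="")
csv.writer(text_target).writerow(sample_row)
text_output = text_target.getvalue()

print(type(binary_error).__name__)
print(repr(text_output))
```

The text buffer ends up holding one CSV line terminated by "\r\n", which is the csv module's default line terminator.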

Answer 1 (score: 0)

You need to encode the strings to convert them into bytes objects.

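for row in soup.select(".ItemListComponent tbody tr")[1:215]:
    # Encode each cell's text to bytes (UTF-8 by default).
    row_text = [x.text.encode() for x in row.find_all("td")]
    # The cells are bytes, so the separator must be bytes as well.
    print(b",".join(row_text))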
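For comparison, here is a rough sketch of the same selector run against a shortened inline copy of the HTML from the question, keeping the cell values as str, which is what Python 3's csv module expects. The trimmed SAMPLE_HTML string and the variable names are assumptions for illustration. It needs the third-party beautifulsoup4 package.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (two body rows only).
SAMPLE_HTML = """
<div class="ItemListComponent">
<table>
<thead><tr><th>Department ID</th><th>Total</th></tr></thead>
<tbody>
<tr><td>7654321</td><td>11</td><td>0</td><td>11</td><td>0</td><td>11</td><td></td></tr>
<tr><td><a href="/m_departmentid_edit.html?id=102">0000102</a></td><td>18</td><td>5</td><td>13</td><td>5</td><td>13</td><td><input class="ButtonEnable" type="button" value="Delete" />
</td></tr>
</tbody>
</table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

rows = []
for tr in soup.select(".ItemListComponent tbody tr"):
    # Keep the first six cells; the seventh holds only the buttons.
    cells = tr.find_all("td")[:6]
    rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)
```

Selecting only the tr elements under tbody skips the header rows, so no empty lists have to be filtered out afterwards.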

Answer 2 (score: 0)

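Thanks, everyone. In the end I found the fix: the newline argument was missing when I opened the file for the csv writer. Without newline='', on platforms that use \r\n line endings an extra \r is written after each row, which shows up as a blank line.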

Code

import csv

import requests
from bs4 import BeautifulSoup

session = requests.session()

payload = {"i0023":"XXXXX", 
          "i0025":"XXXXX"
         }
         
session.post("http://192.168.XXX.XXX/checkLogin.cgi",data = payload)

s = session.get("http://192.168.XXX.XXX/m_departmentid.html")

soup = BeautifulSoup(s.text, "html.parser")

table = soup.find('div', attrs={ "class" : "ItemListComponent"})
table_tbody = table.find('tbody')

rows = []
 
for row in table.find_all('tr'):
    rows.append([val.text for val in row.find_all('td')])   


# Raw string: in a normal string literal, "\t" would be read as a tab character.
with open(r"\test.csv", 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(row for row in rows if row)
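The effect of the newline argument can be reproduced on any platform with a small sketch. Passing newline="\r\n" to open() is used here only to imitate the translation that Windows text mode applies by default. The helper write_rows, the file names, and the sample rows are assumptions for illustration.

```python
import csv
import os
import tempfile

rows = [["7654321", "11"], ["0000100", "0"]]  # hypothetical sample data


def write_rows(path, newline):
    """Write rows as CSV with the given newline mode and return the raw bytes."""
    with open(path, "w", newline=newline) as f:
        csv.writer(f).writerows(rows)
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    # newline="\r\n" turns every "\n" into "\r\n", as Windows text mode does.
    translated = write_rows(os.path.join(tmp, "translated.csv"), "\r\n")
    # newline="" writes exactly what csv.writer produces.
    untouched = write_rows(os.path.join(tmp, "untouched.csv"), "")

print(translated)
print(untouched)
```

csv.writer already ends each row with "\r\n", so the translated file contains "\r\r\n" after every row, which spreadsheet programs tend to display as an empty row. With newline="" the row terminator stays "\r\n".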