I am trying to scrape data from a website that lists all Indian politicians across multiple numbered pages.
url: http://www.myneta.info/ls2014/comparisonchart.php?constituency_id=1
I want to export the data from several of these pages into a CSV file.
Here is a sample row from the table I am working with:
<tr>
<td class=chartcell><a href='http://myneta.info/ls2014/candidate.php?candidate_id=7678' target=_blank>Banka Sahadev</a></td>
<td class=chartcell align=center>53</td>
<td class=chartcell align=center>M</td>
<td class=chartcell align=center>IND</td>
<td class=chartcell align=center><span style='font-size:150%;color:red'><b>Yes</b></span></td>
<td class=chartcell align=center><span style='font-size:160%;'><b>1</b></span></td>
<td class=chartcell align=center>1</td>
<td class=chartcell align=left> <b><span style='color:red'> criminal intimidation(506)</span></b>, <b><span style='color:red'> public nuisance in cases not otherwise provided for(290)</span></b>, <b><span style='color:red'> voluntarily causing hurt(323)</span></b>, </td>
<td class=chartcell align=center>Graduate</td>
<td class=chartcell align=center>19,000<br><span style='font-size:70%;color:brown'>~ 19 Thou+</span></td>
<td class=chartcell align=center>3,74,000<br><span style='font-size:70%;color:brown'>~ 3 Lacs+</span></td>
<td class=chartcell align=center>3,93,000<br><span style='font-size:70%;color:brown'>~ 3 Lacs+</span></td>
<td class=chartcell align=center>0<br><span style='font-size:70%;color:brown'>~ </span></td>
<td class=chartcell align=center>N</td>
<!--<td class=chartcell align=center>0<br><span style='font-size:70%;color:brown'>~ </span></td>
<td class=chartcell align=center>0<br><span style='font-size:70%;color:brown'>~ </span></td>
<td class=chartcell align=center>2,00,000<br><span style='font-size:70%;color:brown'>~ 2 Lacs+</span></td> -->
</tr>
I have used BeautifulSoup to extract the data, but when I open the CSV the cells somehow run together and look very clumsy.
Here is my code:
import time

import requests
from bs4 import BeautifulSoup

num = 1
url = 'http://www.myneta.info/ls2014/comparisonchart.php?constituency_id={}'.format(num)
headers = {'User-Agent': 'Mozilla/5.0'}
with open('newstats.csv', 'w') as r:
    r.write('POLITICIANS ALL\n')
while num < 3:
    url = 'http://www.myneta.info/ls2014/comparisonchart.php?constituency_id={}'.format(num)
    time.sleep(1)
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        tablenew = soup.find_all('table', id="table1")
        if len(tablenew) < 2:
            tablenew = tablenew[0]
            with open('newstats.csv', 'a') as r:
                for row in tablenew.find_all('tr'):
                    for cell in row.find_all('td'):
                        r.write(cell.text.ljust(250))
                    r.write('\n')
        else:
            print('Too many tables')
    else:
        print('No response')
    print(num)
    num += 1
Also, how do I omit the data in a specific td? In my case, I do not want the IPC Details column from the table.
I am still very new to coding and Python.
Answer 0 (score: 0)
Since the "IPC Details" column is always at index 7 (the eighth cell), you can slice it out:
import csv
import time

import requests
from bs4 import BeautifulSoup

num = 1
url = 'http://www.myneta.info/ls2014/comparisonchart.php?constituency_id={}'.format(num)
headers = {'User-Agent': 'Mozilla/5.0'}
with open('newstats.csv', 'w') as r:
    r.write('POLITICIANS ALL\n')
while num < 3:
    url = 'http://www.myneta.info/ls2014/comparisonchart.php?constituency_id={}'.format(num)
    time.sleep(1)
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        tablenew = soup.find_all('table', id="table1")
        if len(tablenew) < 2:
            tablenew = tablenew[0]
            # newline='' prevents blank rows on Windows when using csv.writer
            with open('newstats.csv', 'a', newline='') as r:
                writer = csv.writer(r, delimiter='\t')
                for row in tablenew.find_all('tr'):
                    cells = [cell.text for cell in row.find_all('td')]
                    # Drop the cell at index 7, i.e. the IPC Details column
                    cells = cells[:7] + cells[8:]
                    writer.writerow(cells)
        else:
            print('Too many tables')
    else:
        print('No response')
    print(num)
    num += 1
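A side note on why `csv.writer` is safer here than joining cells by hand: many cells in this table, such as the asset values like "3,74,000", themselves contain commas. With the default comma delimiter, `csv.writer` quotes such fields automatically so the column count stays intact. A small sketch of that behaviour (the sample values are taken from the table above):

```python
import csv
import io

# Write one row containing a comma-bearing field into an in-memory buffer.
buf = io.StringIO()
writer = csv.writer(buf)  # default comma delimiter, minimal quoting
writer.writerow(['Banka Sahadev', '3,74,000', 'Graduate'])

# The field containing commas is quoted, so the row still parses as 3 columns.
print(buf.getvalue())  # Banka Sahadev,"3,74,000",Graduate
```

The answer above sidesteps the quoting by using a tab delimiter instead, which also works as long as no cell contains a tab.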
Answer 1 (score: 0)
I think the "merged data" problem comes from the fact that you are not actually separating the cells with any delimiter. Open your generated csv file in a plain text editor and you will see this.

A simple solution is to use the join method to build a single delimiter-separated string from the list of cells, and write that to the file. For example:

    content = [cell.text for cell in row.find_all('td')]
    r.write(';'.join(content)+'\n')

In the first line I used a so-called "list comprehension", which is a very useful thing for you to learn. It lets you iterate over all the elements of a list in a single line of code instead of writing a "for" loop. In the second line I call the string method join on ';': this turns the list content into one string in which all the elements are connected by ';'. Finally, I append a newline character.
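To make the join step concrete, here is a standalone example (the cell values are copied from the first row of the sample output further down):

```python
# A list of cell texts, as produced by the list comprehension above.
content = ['Godam Nagesh', '49', 'M', 'TRS']

# join concatenates the elements with ';' between them; we add the newline.
line = ';'.join(content) + '\n'
print(line)  # Godam Nagesh;49;M;TRS
```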
If you want to omit elements based on their index (for example, skip column 7), we can make the list comprehension a little more complex:

    # Write on this array the indices of the columns you want
    # to exclude
    ommit_columns = [7]
    content = [cell.text
               for (index, cell) in enumerate(row.find_all('td'))
               if index not in ommit_columns]
    r.write(';'.join(content)+'\n')

In ommit_columns you can list several indices. In the list comprehension below it we use enumerate to get every index and element from row.find_all('td'), and then filter them by checking that index is not in the ommit_columns array.
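As a standalone illustration of this enumerate-based filtering (the sample values here are invented, shortened from the real row for readability):

```python
# Simulated cell texts for one table row.
cells = ['Banka Sahadev', '53', 'M', 'IND', 'criminal intimidation(506)', 'Graduate']

# Drop the offence details, which sit at index 4 in this shortened example.
ommit_columns = [4]

content = [cell for (index, cell) in enumerate(cells)
           if index not in ommit_columns]

print(';'.join(content))  # Banka Sahadev;53;M;IND;Graduate
```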
The full code should be:
import time

import requests
from bs4 import BeautifulSoup

num = 1
url = 'http://www.myneta.info/ls2014/comparisonchart.php?constituency_id={}'.format(num)
headers = {'User-Agent': 'Mozilla/5.0'}
with open('newstats.csv', 'w') as r:
    r.write('POLITICIANS ALL\n')
while num < 3:
    url = 'http://www.myneta.info/ls2014/comparisonchart.php?constituency_id={}'.format(num)
    time.sleep(1)
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        tablenew = soup.find_all('table', id="table1")
        if len(tablenew) < 2:
            tablenew = tablenew[0]
            with open('newstats.csv', 'a') as r:
                for row in tablenew.find_all('tr'):
                    # content = [cell.text for cell in row.find_all('td')]
                    # r.write(';'.join(content)+'\n')

                    # Write on this array the indices of the columns you want
                    # to exclude
                    ommit_columns = [7]
                    content = [cell.text
                               for (index, cell) in enumerate(row.find_all('td'))
                               if index not in ommit_columns]
                    r.write(';'.join(content) + '\n')
        else:
            print('Too many tables')
    else:
        print('No response')
    print(num)
    num += 1
and the output is:
POLITICIANS ALL
Banka Sahadev;53;M;IND;Yes;1;1;Graduate;19,000~ 19 Thou+;3,74,000~ 3 Lacs+;3,93,000~ 3 Lacs+;0~ ;N
Godam Nagesh;49;M;TRS;No;0;0;Post Graduate;31,39,857~ 31 Lacs+;72,39,000~ 72 Lacs+;1,03,78,857~ 1 Crore+;1,48,784~ 1 Lacs+;Y
Mosali Chinnaiah;40;M;IND;No;0;0;12th Pass;1,67,000~ 1 Lacs+;30,00,000~ 30 Lacs+;31,67,000~ 31 Lacs+;40,000~ 40 Thou+;Y
Naresh;37;M;INC;No;0;0;Doctorate;12,00,000~ 12 Lacs+;6,00,000~ 6 Lacs+;18,00,000~ 18 Lacs+;0~ ;Y
Nethawath Ramdas;44;M;IND;No;0;0;Illiterate;0~ ;0~ ;0~ ;0~ ;N
Pawar Krishna;33;M;IND;Yes;1;1;Post Graduate;0~ ;0~ ;0~ ;0~ ;N
Ramesh Rathod;48;M;TDP;Yes;3;1;12th Pass;54,07,000~ 54 Lacs+;1,37,33,000~ 1 Crore+;1,91,40,000~ 1 Crore+;4,18,32,000~ 4 Crore+;Y
Rathod Sadashiv;55;M;BSP;No;0;0;Graduate;80,000~ 80 Thou+;13,25,000~ 13 Lacs+;14,05,000~ 14 Lacs+;0~ ;Y