我正在尝试从网页上抓取数据,并且也能够抓取。 使用下面的脚本获取所有div类数据后,但是我很困惑如何在CSV文件中写入数据。
名字列中的名字数据 姓氏列中的姓氏数据 。
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = 'http://rerait.telangana.gov.in/PrintPreview/PrintPreview/UHJvamVjdElEPTQmRGl2aXNpb249MSZVc2VySUQ9MjAyODcmUm9sZUlEPTEmQXBwSUQ9NSZBY3Rpb249U0VBUkNIJkNoYXJhY3RlckQ9MjImRXh0QXBwSUQ9'
page = urlopen(html)
data = BeautifulSoup(page, 'html.parser')
name_box = data.findAll('div', attrs={'class': 'col-md-3 col-sm-3'}) #edited companyName_99a4824b -> companyName__99a4824b
for i in range(len(name_box)):
data = name_box[i].text.strip()
数据:
Information Type
Individual
First Name
KACHAM
Middle Name
Last Name
RAJESHWAR
Father Full Name
RAMAIAH
Do you have any Past Experience ?
No
Do you have any registration in other State than registred State?
No
House Number
8-2-293/82/A/446/1
Building Name
SAI KRUPA
Street Name
ROAD NO 20
Locality
JUBILEE HILLS
Landmark
JUBILEE HILLS
State
Telangana
Division
Division 1
District
Hyderabad
Mandal
Shaikpet
Village/City/Town
Pin Code
500033
Office Number
04040151614
Fax Number
Website URL
Authority Name
Plan Approval Number
1/18B/06558/2018
Project Name
SKV S ANANDA VILAS
Project Status
New Project
Proposed Date of Completion
17/04/2024
Litigations related to the project ?
No
Project Type
Residential
Are there any Promoter(Land Owner/ Investor) (as defined by Telangana RERA Order) in the project ?
Yes
Sy.No/TS No.
00
Plot No./House No.
10-2-327
Total Area(In sqmts)
526.74
Area affected in Road widening/FTL of Tanks/Nala Widening(In sqmts)
58.51
Net Area(In sqmts)
1
Total Building Units (as per approved plan)
1
Proposed Building Units(as per agreement)
1
Boundaries East
PLOT NO 213
Boundaries West
PLOT NO 215
Boundaries North
PLOT NO 199
Boundaries South
ROAD NO 8
Approved Built up Area (In Sqmts)
1313.55
Mortgage Area (In Sqmts)
144.28
State
Telangana
District
Hyderabad
Mandal
Maredpally
Village/City/Town
Street
ROAD NO 8
Locality
SECUNDERABAD COURT
Pin Code
500026
上面是在代码上方运行后得到的数据。
修改
for i in range(len(name_box)):
data = name_box[i].text.strip()
print (data)
fname = 'out.csv'
with open(fname) as f:
next(f)
for line in f:
head = []
value = []
for row in line:
head.append(row)
print (row)
期望
Information Type | First | Middle Name | Last Name | ......
Individual | KACHAM | | RAJESHWAR | .....
我有200个网址,但所有网址数据都不相同,这意味着其中一些缺失。如果数据不可用,我想用这种方式写,然后写空的注解。
请提出建议。预先谢谢你
答案 0 :(得分:1)
要写入csv,您需要知道head和body应该是什么值,在这种情况下head值应该是html元素包含<label
from urllib2 import urlopen
from bs4 import BeautifulSoup
html = 'http://rerait.telangana.gov.in/PrintPreview/PrintPreview/UHJvamVjdElEPTQmRGl2aXNpb249MSZVc2VySUQ9MjAyODcmUm9sZUlEPTEmQXBwSUQ9NSZBY3Rpb249U0VBUkNIJkNoYXJhY3RlckQ9MjImRXh0QXBwSUQ9'
page = urlopen(html)
data = BeautifulSoup(page, 'html.parser')
name_box = data.findAll('div', attrs={'class': 'col-md-3 col-sm-3'}) #edited companyName_99a4824b -> companyName__99a4824b
heads = []
values = []
for i in range(len(name_box)):
data = name_box[i].text.strip()
dataHTML = str(name_box[i])
if 'PInfoType' in dataHTML:
# <div class="col-md-3 col-sm-3" id="PInfoType">
# empty value, maybe additional data for "Information Type"
continue
if 'for="2"' in dataHTML:
# <label for="2">No</label>
# it should be head but actually value
values.append(data)
elif '<label' in dataHTML:
# <label for="PersonalInfoModel_InfoTypeValue">Information Type</label>
# head or top row
heads.append(data)
else:
# <div class="col-md-3 col-sm-3">Individual</div>
# value for second row
values.append(data)
csvData = ', '.join(heads) + '\n' + ', '.join(values)
with open("results.csv", 'w') as f:
f.write(csvData)
print "finish."
答案 1 :(得分:0)
问题:如何从抓取的数据中写入csv文件
将Data
读入dict
并使用csv.DictWriter(...
写入CSV文件。
有关以下文件:
csv.DictWriter
while
next
break
Mapping Types — dict
Data
行
key = next(data)
value = next(data)
dict[key] = value
dict
写入CSV文件输出:
{'Individual': '', 'Father Full Name': 'RAMAIAH', 'First Name': 'KACHAM', 'Middle Name': '', 'Last Name': 'RAJESHWAR',... (omitted for brevity)