I'm a newbie working on a web scraping project. I need to get these election results into a dataframe (or Excel) in order to analyze them.
What's been most tricky is that it is a .htm file with all the data as one big text block in between "Preformatted Text" (PRE) tags, and no individual tags on the data itself. I am only interested in the parts of the data that are set up like tables:
https://www.stlouisco.com/portals/8/docs/document%20library/elections/eresults/el140805/EXEC.htm
I have been attempting it in Python with BeautifulSoup. However, if you view the source code at the URL, you can see why BeautifulSoup isn't getting me very far -- because the data isn't structured using tags. The structure looks like this, basically:
<html>
<pre>
COUNTY EXECUTIVE PRIMARY ELECTION OFFICIAL FINAL RESULTS
ST. LOUIS COUNTY, MISSOURI
RUN DATE:08/18/14 01:20 PM TUESDAY, AUGUST 5, 2014
STATISTICS
WITH 681 OF 681 PRECINCTS REPORTING
TOTAL PERCENT TOTAL PERCENT
01 = REGISTERED VOTERS - TOTAL 661,393 05 = BALLOTS CAST - LIBERTARIAN 1,121 .58
02 = BALLOTS CAST - TOTAL 192,495 06 = BALLOTS CAST - CONSTITUTION 314 .16
03 = BALLOTS CAST - DEMOCRATIC 129,918 67.49 07 = BALLOTS CAST - NONPARTISAN 6,225 3.23
04 = BALLOTS CAST - REPUBLICAN 54,917 28.53 08 = VOTER TURNOUT - TOTAL 29.10
- - - - - - - - - - - - - - - - - - - - - - - -
01 02 03 04 05 06 07 08
- - - - - - - - - - - - - - - - - - - - - - - -
0101 AP1,2,7,43 1317 . 298 . 214 . 69 . . 3 . . 1 . 11 22.63
0103 AP3,27 NRW2,8,15,29 1453 . 186 . 179 . . 5 . . 1 . . 0 . . 1 12.80
0104 AP4 231 . 51 . 34 . . 4 . . 0 . . 0 . 13 22.08
0105 AP5,18,21,39 1289 . 268 . 198 . 47 . . 4 . . 1 . 18 20.79
0106 AP6 2 . . 1 . . 0 . . 0 . . 0 . . 0 . . 1 50.00
0108 AP8,20 586 . 142 . 86 . 44 . . 4 . . 0 . . 8 24.23
0109 AP9,25 533 . 119 . 85 . 29 . . 2 . . 3 . . 0 22.33
0110 AP10 1044 . 158 . 114 . 34 . . 2 . . 0 . . 8 15.13
...
2832 WH32,38,44 296 . 51 . 23 . 28 . . 0 . . 0 . . 0 17.23
2834 WH34,43 2043 . 609 . 267 . 321 . . 1 . . 0 . 20 29.81
2835 WH35 543 . 173 . 60 . 110 . . 0 . . 0 . . 3 31.86
====================================================================================================================================
(DEMOCRATIC) WITH 681 OF 681 REPORTING
VOTES PERCENT VOTES PERCENT
COUNTY EXECUTIVE
(Vote for ) 1
01 = CHARLIE A. DOOLEY 39,038 30.52
02 = STEVE STENGER 84,993 66.46 03 = RONALD E. LEVY 3,862 3.02
------------------
01 02 03
------------------
0101 AP1,2,7,43 59 134 19
0103 AP3,27 NRW2,8,15,29 154 18 5
0104 AP4 7 25 2
0105 AP5,18,21,39 55 133 9
0106 AP6 0 0 0
0108 AP8,20 28 50 7
0109 AP9,25 21 57 6
0110 AP10 56 54 1
0111 AP11,24 53 54 1
0112 AP12 19 41 1
0113 AP13 23 46 2
0114 AP14,15,16 NOR31 25 56 4
...
2819 WH19,20,22 25 162 7
2825 WH25 17 109 9
2831 WH31 18 112 7
2832 WH32,38,44 0 22 1
2834 WH34,43 31 218 10
2835 WH35 16 41 3
====================================================================================================================================
(REPUBLICAN) WITH 681 OF 681 REPORTING
VOTES PERCENT
COUNTY EXECUTIVE
(Vote for ) 1
01 = TONY POUSOSA 16,439 32.10
02 = RICK STREAM 34,772 67.90
------------
01 02
------------
0101 AP1,2,7,43 24 37
0103 AP3,27 NRW2,8,15,29 1 4
0104 AP4 1 3
0105 AP5,18,21,39 13 28
0106 AP6 0 0
0108 AP8,20 16 28
0109 AP9,25 9 19
0110 AP10 13 19
0111 AP11,24 7 32
...
</pre>
<p>Some closing text that is irrelevant to this project.</p>
</html>
I am hoping to use Python to automate this process so I can run it on other similar webpages of election results.
Here is as far as I've been able to get. I was able to create a list of objects with each list item being one line of the data. I would like it to become a data frame with all the extra spaces and periods stripped out. I'm not sure how to do that from here, though. I imagine I may even be thinking about this from the wrong angle.
# STEP 1: Importing the Libraries
import requests
from bs4 import BeautifulSoup
# STEP 2: Collecting and Parsing the webpage
# Collect the election results page
page = requests.get('https://www.stlouisco.com/portals/8/docs/document%20library/elections/eresults/el140805/EXEC.htm')
# Parse the page and create a Beautiful Soup object
soup = BeautifulSoup(page.text, 'html.parser')
# STEP 3: Create an object with just the text
soup2 = soup.text
# Split the text at each line break \n; this creates a list object
[x.strip() for x in soup2.split('\n')]
Output:
[...
'0212 BON12 1678 . 685 . 376 . 295 . . 1 . . 0 . 13 40.82',
'0213 BON13,23,26,29 2174 . 796 . 500 . 261 . . 3 . . 2 . 30 36.61',
'0214 BON14 17 . . 4 . . 0 . . 0 . . 0 . . 0 . . 4 23.53',
'0215 BON15 1340 . 369 . 224 . 129 . . 2 . . 1 . 13 27.54',
'0216 BON16 204 . 104 . 68 . 36 . . 0 . . 0 . . 0 50.98',
'0217 BON17 589 . 93 . 71 . 16 . . 1 . . 0 . . 5 15.79',
'0218 BON18 195 . 48 . 28 . 17 . . 0 . . 1 . . 2 24.62',
'0219 BON19 CLA15 1340 . 443 . 255 . 172 . . 5 . . 0 . 11 33.06',
...]
I feel stuck and I would greatly appreciate any advice! (And if Python isn't the best way to automate getting this into a dataframe... I welcome that feedback too.) Thank you.
Answer 0 (score: 1)
Looking at the example you've cited, you'll need to write a parser since your data is complex and varies across each line (and most likely, each page).
Using this line as an example, hopefully I can explain why:
0101 AP1,2,7,43 1317 . 298 . 214 . 69 . . 3 . . 1 . 11 22.63
0101 is relatively consistent across each line: it appears to be a zero-padded integer index, followed by a single space.
AP1,2,7,43 follows certain rules, but its content varies. For example, the number of comma-separated values varies from line to line, and the field can itself contain whitespace (e.g. AP3,27 NRW2,8,15,29). It is followed by a run of whitespace up to the next section, which appears to hold the voting numbers.
In the voting numbers, the ". " delimiter is repeated and placed in the hundreds position.
22.63 is a regular floating-point number with 2 decimal places.
This does not yet touch the other lines, each of which has its own rules.
Given the complexity of your dataset, you're better off writing a simple grammar using tools like pyparsing or PLY to create mini-parsers that can automatically extract the information from each line, which can then be placed in a data structure and saved to a dataframe. The pyparsing example that parses street addresses is a good one to start from and is applicable here; the pyparsing project has more examples as well.
Notably, all of this could be dealt with by writing custom text manipulation functions and code, but given that you intend to automate things, a parser is your best bet since it will be reusable and more adaptable.
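To make the per-line rules above concrete, here is a minimal sketch of the "custom text manipulation" route using only the standard-library re module (not pyparsing or PLY). The names ROW_RE and parse_row are made up for illustration, and the pattern assumes that every part of the precinct label starts with an uppercase letter and that the "." fillers are separated from the numbers by whitespace, as in the lines quoted in the question. Check both assumptions against the real file.

```python
import re

# Assumed row shape: a 4-digit precinct id, one or more label parts that each
# start with an uppercase letter (e.g. "AP3,27 NRW2,8,15,29"), then the numbers.
ROW_RE = re.compile(r"^(\d{4})\s+((?:[A-Z][A-Z0-9,]*\s+)+)(.*)$")


def parse_row(line):
    """Return (precinct_id, label, values), or None if line is not a data row."""
    match = ROW_RE.match(line.strip())
    if match is None:
        return None
    precinct_id, label, rest = match.groups()
    values = []
    for token in rest.split():
        if token == ".":
            continue  # a lone period is only a visual filler
        # Tokens with a decimal point are percentages; the rest are counts.
        values.append(float(token) if "." in token else int(token))
    return precinct_id, label.strip(), values


sample_lines = [
    "01 02 03 04 05 06 07 08",
    "0101 AP1,2,7,43 1317 . 298 . 214 . 69 . . 3 . . 1 . 11 22.63",
    "0103 AP3,27 NRW2,8,15,29 154 18 5",
]
# Header, separator and summary lines do not match and are skipped.
rows = [row for row in map(parse_row, sample_lines) if row is not None]
for row in rows:
    print(row)
```

Each tuple can then be flattened into one list per precinct and handed to pandas.DataFrame. Because the number of columns differs between the statistics block and each party's block, the sections would probably need to be collected into separate dataframes.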
Answer 1 (score: 0)
Looking at the file format, I think this should be pretty doable. The lines you need seem to have fixed-width columns. If you read each line, you can split it at fixed character positions:
first 5 characters: number (e.g. 0101)
next 32 characters: description (e.g. AP3,27 NRW2,8,15,29 )
next 6 characters: column 01
and so on. But keep in mind that if the format of this file changes, your code will break.
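A rough sketch of that slicing idea is below. The widths 5, 32 and 6 are the ones listed above and have not been checked against the real file; WIDTHS, split_fixed_width and the padded sample line are made up for illustration.

```python
# Assumed column widths: id (5), description (32), then 6 characters per column.
WIDTHS = (5, 32, 6, 6)


def split_fixed_width(line, widths):
    """Cut line into consecutive slices of the given widths and strip padding."""
    fields = []
    start = 0
    for width in widths:
        fields.append(line[start:start + width].strip())
        start += width
    return fields


# A hypothetical line, padded to match the assumed widths.
sample = "0101 " + "AP1,2,7,43".ljust(32) + "1317".rjust(6) + "298".rjust(6)
print(split_fixed_width(sample, WIDTHS))
```

If the widths hold up, pandas.read_fwf with its widths argument does the same job and returns a dataframe directly. The filler periods would still need to be removed before converting the columns to numbers.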