Using <pre> preformatted text and no tags

Time: 2018-02-03 07:34:04

Tags: python html dataframe web-scraping beautifulsoup

I'm a newbie working on a web scraping project. I need to get these election results into a dataframe (or Excel) in order to analyze them.

The trickiest part is that the page is a .htm file with all the data in one big text block between "Preformatted Text" (PRE) tags, with no individual tags on the data itself. I am only interested in the parts of the data that are laid out like tables:

https://www.stlouisco.com/portals/8/docs/document%20library/elections/eresults/el140805/EXEC.htm

I have been attempting it in Python with BeautifulSoup. However, if you view the source code at the URL, you can see why BeautifulSoup isn't getting me very far -- because the data isn't structured using tags. The structure looks like this, basically:

<html>
<pre>
COUNTY EXECUTIVE                                  PRIMARY ELECTION                                   OFFICIAL FINAL RESULTS
                                              ST. LOUIS COUNTY, MISSOURI
RUN DATE:08/18/14 01:20 PM                        TUESDAY, AUGUST 5, 2014
                                              STATISTICS
                                                                 WITH 681 OF 681 PRECINCTS REPORTING
                                               TOTAL  PERCENT                                                       TOTAL  PERCENT
   01 = REGISTERED VOTERS - TOTAL                661,393             05 = BALLOTS CAST - LIBERTARIAN                 1,121     .58
   02 = BALLOTS CAST - TOTAL                     192,495             06 = BALLOTS CAST - CONSTITUTION                  314     .16 
   03 = BALLOTS CAST - DEMOCRATIC                129,918   67.49     07 = BALLOTS CAST - NONPARTISAN                 6,225    3.23
   04 = BALLOTS CAST - REPUBLICAN                 54,917   28.53     08 = VOTER TURNOUT - TOTAL                              29.10
                                     - - - - - - - - - - - - - - - - - - - - - - - -
                                       01    02    03    04    05    06    07    08
                                     - - - - - - - - - - - - - - - - - - - - - - - -
0101 AP1,2,7,43                      1317 . 298 . 214 .  69 . . 3 . . 1 .  11 22.63
0103 AP3,27 NRW2,8,15,29             1453 . 186 . 179 . . 5 . . 1 . . 0 . . 1 12.80
0104 AP4                              231 .  51 .  34 . . 4 . . 0 . . 0 .  13 22.08
0105 AP5,18,21,39                    1289 . 268 . 198 .  47 . . 4 . . 1 .  18 20.79
0106 AP6                                2 . . 1 . . 0 . . 0 . . 0 . . 0 . . 1 50.00
0108 AP8,20                           586 . 142 .  86 .  44 . . 4 . . 0 . . 8 24.23
0109 AP9,25                           533 . 119 .  85 .  29 . . 2 . . 3 . . 0 22.33
0110 AP10                            1044 . 158 . 114 .  34 . . 2 . . 0 . . 8 15.13

...

2832 WH32,38,44                       296 .  51 .  23 .  28 . . 0 . . 0 . . 0 17.23
2834 WH34,43                         2043 . 609 . 267 . 321 . . 1 . . 0 .  20 29.81
2835 WH35                             543 . 173 .  60 . 110 . . 0 . . 0 . . 3 31.86
    ====================================================================================================================================
              (DEMOCRATIC)                                           WITH   681 OF 681  REPORTING
                                               VOTES  PERCENT                                                    VOTES  PERCENT
COUNTY EXECUTIVE
  (Vote for )  1
   01 = CHARLIE A. DOOLEY                         39,038   30.52
   02 = STEVE STENGER                             84,993   66.46     03 = RONALD E. LEVY                             3,862    3.02
                                   ------------------
                                       01    02    03
                                   ------------------
0101 AP1,2,7,43                        59   134    19
0103 AP3,27 NRW2,8,15,29              154    18     5
0104 AP4                                7    25     2
0105 AP5,18,21,39                      55   133     9
0106 AP6                                0     0     0
0108 AP8,20                            28    50     7
0109 AP9,25                            21    57     6
0110 AP10                              56    54     1
0111 AP11,24                           53    54     1
0112 AP12                              19    41     1
0113 AP13                              23    46     2
0114 AP14,15,16 NOR31                  25    56     4

...

2819 WH19,20,22                        25   162     7
2825 WH25                              17   109     9
2831 WH31                              18   112     7
2832 WH32,38,44                         0    22     1
2834 WH34,43                           31   218    10
2835 WH35                              16    41     3
====================================================================================================================================
                  (REPUBLICAN)                                           WITH 681 OF 681  REPORTING
                                               VOTES  PERCENT
COUNTY EXECUTIVE
  (Vote for )  1
   01 = TONY POUSOSA                              16,439   32.10
   02 = RICK STREAM                               34,772   67.90
                                   ------------
                                       01    02
                                   ------------
0101 AP1,2,7,43                        24    37
0103 AP3,27 NRW2,8,15,29                1     4
0104 AP4                                1     3
0105 AP5,18,21,39                      13    28
0106 AP6                                0     0
0108 AP8,20                            16    28
0109 AP9,25                             9    19
0110 AP10                              13    19
0111 AP11,24                            7    32

...

</pre>
<p>Some closing text that is irrelevant to this project.</p>
</html>

I am hoping to use Python to automate this process so I can run it on other similar webpages of election results.

Here is as far as I've been able to get. I was able to create a list in which each item is one line of the data. I would like to turn it into a dataframe with all the extra spaces and periods stripped out. I'm not sure how to do that from here, though, and I may even be approaching this from the wrong angle.

# STEP 1: Importing the Libraries

import requests
from bs4 import BeautifulSoup


# STEP 2: Collecting and Parsing the webpage

# Collect the election results page
page = requests.get('https://www.stlouisco.com/portals/8/docs/document%20library/elections/eresults/el140805/EXEC.htm')

# Parse the page and create a Beautiful Soup object
soup = BeautifulSoup(page.text, 'html.parser')


# STEP 3: Create an object with just the text
soup2 = soup.text

# Split the text at each line break \n and keep the result as a list of stripped lines
lines = [x.strip() for x in soup2.split('\n')]
lines


Output: 

[...
'0212 BON12                           1678 . 685 . 376 . 295 . . 1 . . 0 .  13 40.82',
'0213 BON13,23,26,29                  2174 . 796 . 500 . 261 . . 3 . . 2 .  30 36.61',
'0214 BON14                             17 . . 4 . . 0 . . 0 . . 0 . . 0 . . 4 23.53',
'0215 BON15                           1340 . 369 . 224 . 129 . . 2 . . 1 .  13 27.54',
'0216 BON16                            204 . 104 .  68 .  36 . . 0 . . 0 . . 0 50.98',
'0217 BON17                            589 .  93 .  71 .  16 . . 1 . . 0 . . 5 15.79',
'0218 BON18                            195 .  48 .  28 .  17 . . 0 . . 1 . . 2 24.62',
'0219 BON19 CLA15                     1340 . 443 . 255 . 172 . . 5 . . 0 .  11 33.06',
...]

I feel stuck and I would greatly appreciate any advice! (And if Python isn't the best way to automate getting this into a dataframe... I welcome that feedback too.) Thank you.

2 answers:

Answer 0 (score: 1)

Looking at the example you've cited, you'll need to write a parser, since your data is complex and varies from line to line (and most likely from page to page).

Using this line as an example, hopefully I can explain why:

0101 AP1,2,7,43                      1317 . 298 . 214 .  69 . . 3 . . 1 .  11 22.63
  1. This part, 0101, is relatively consistent across lines; it appears to be a zero-padded integer index. It is followed by 1 space.
  2. The next portion (AP1,2,7,43) follows certain rules, but its content varies. For example, the number of comma-separated values differs from line to line, and the values can sometimes contain whitespace (e.g. AP3,27 NRW2,8,15,29). This is followed by a run of whitespace up to the next section, which appears to be the vote counts.
  3. For these columns of integers, each column is separated by whitespace followed by a combined separator of a dot and a space. If an integer is less than 10, it is padded so that the ". " delimiter is repeated and fills the hundreds position.
  4. The last column, 22.63, is a regular floating point number with 2 decimal places.

This does not yet cover the other kinds of lines, each of which has its own rules.
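For the precinct rows alone, the four parts described above can be sketched with a regular expression rather than a full grammar. This is only a minimal sketch under stated assumptions: the names `ROW_PATTERN` and `parse_precinct_row` are hypothetical, the description is assumed to be separated from the first number by at least two spaces, and the ". " runs are assumed to be pure filler.

```python
import re

# 4-digit code, one space, description (may contain single spaces),
# a run of 2+ whitespace characters, then the rest of the line.
ROW_PATTERN = re.compile(r"^(\d{4}) (.+?)\s{2,}(.+)$")


def parse_precinct_row(line):
    """Return [code, description, n1, n2, ...], or None if the line is not a precinct row."""
    match = ROW_PATTERN.match(line.strip())
    if match is None:
        return None
    code, description, rest = match.groups()
    # Pull out the numbers and ignore the ". " filler between them.
    tokens = re.findall(r"\d+\.\d+|\d+", rest)
    values = [float(t) if "." in t else int(t) for t in tokens]
    return [code, description] + values


sample = "0101 AP1,2,7,43                      1317 . 298 . 214 .  69 . . 3 . . 1 .  11 22.63"
print(parse_precinct_row(sample))
print(parse_precinct_row("COUNTY EXECUTIVE"))  # header lines are rejected
```

Lines for which the function returns a list can be collected and passed to `pandas.DataFrame` to build the table; every other kind of line still needs its own rule.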

Given the complexity of your dataset, you're better off writing a simple grammar with a tool like pyparsing or PLY to create mini-parsers that extract the information from each line automatically; the results can then be placed in a data structure and saved to a dataframe. pyparsing's example of parsing street addresses is a good one that applies here, and pyparsing provides more examples as well.

Admittedly, all of this could be handled by writing custom text manipulation functions, but given that you intend to automate things, a parser is your best bet since it will be reusable and more adaptable.
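To give a sense of what the custom text manipulation route involves, here is a rough sketch using only the standard library. It splits the text at the long "=====" separator lines, reads the party label, and reads the "01 = NAME" legend that names each column. All three function names are hypothetical, and the patterns are guesses based on the sample in the question, so other result pages may need adjustments.

```python
import re


def split_sections(text):
    """Split the <pre> text into blocks at the long '=====' separator lines."""
    sections, current = [], []
    for line in text.splitlines():
        if re.fullmatch(r"\s*={10,}\s*", line):
            sections.append(current)
            current = []
        else:
            current.append(line)
    sections.append(current)
    # Drop blocks that contain only blank lines.
    return [block for block in sections if any(l.strip() for l in block)]


def party_label(block):
    """Return e.g. 'DEMOCRATIC' from a '(DEMOCRATIC)' line, or None if absent."""
    for line in block:
        match = re.match(r"\s*\(([A-Z ]+)\)", line)
        if match:
            return match.group(1)
    return None


def column_legend(block):
    """Map column numbers to labels from lines like '01 = CHARLIE A. DOOLEY   39,038'."""
    legend = {}
    for line in block:
        for number, label in re.findall(r"(\d{2}) = (.+?)\s{2,}", line):
            legend[number] = label
    return legend


sample = "\n".join([
    "   01 = REGISTERED VOTERS - TOTAL                661,393",
    "0101 AP1,2,7,43                      1317 . 298",
    "    " + "=" * 40,
    "              (DEMOCRATIC)                  WITH   681 OF 681  REPORTING",
    "   01 = CHARLIE A. DOOLEY                         39,038   30.52",
    "   02 = STEVE STENGER        84,993   66.46     03 = RONALD E. LEVY        3,862    3.02",
])
blocks = split_sections(sample)
for block in blocks:
    print(party_label(block), column_legend(block))
```

The legend gives readable column names for each block's table, and the party label can be stored as an extra column when the blocks are combined.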

Answer 1 (score: 0)

Looking at the file format, I think this should be pretty doable. The lines you need seem to use fixed-width columns. If you read each line, you can split it by character position.

first 5 characters: number (e.g. 0101)

next 32 characters: description (e.g. AP3,27 NRW2,8,15,29 )

next 6 characters: column 01

and so on. But keep in mind that if the format of this file changes, your code will break.
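A minimal sketch of this fixed-width idea, using only the standard library. Rather than hard-coding every offset, it reads the column positions from the "01    02    03" header line that precedes each table. The function names are hypothetical, and the 6-character column width is an assumption taken from the sample. Slice the raw lines, not lines that have already been through `strip()`, because removing leading spaces shifts the header offsets.

```python
import re


def column_ends(header_line):
    """End offset of each column label in a header such as '   01    02    03'."""
    return [m.end() for m in re.finditer(r"\d+", header_line)]


def slice_row(line, ends, width=6):
    """Cut one precinct row into [code, description, value, ...] by character position."""
    code = line[:4]
    description = line[5:ends[0] - width].strip()
    values = []
    for end in ends:
        tokens = line[end - width:end].split()
        # The last token is the number; anything before it is ". " filler.
        values.append(tokens[-1] if tokens else "")
    return [code, description] + values


# Sample lines rebuilt with explicit spacing so the alignment is visible.
header = " " * 39 + "01    02    03"
row = "0101 AP1,2,7,43" + " " * 24 + "59   134    19"
print(slice_row(row, column_ends(header)))
```

The values come back as strings, so convert them with `int` or `float` before analysis. If the column width differs on another page, only the `width` argument needs to change.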