From a <pre> tag

Time: 2017-01-29 08:43:32

Tags: python html python-2.7 file-io

I am writing some code in Python 2.7 to pull information from a website, extract the relevant data, and format it in a more useful way. Specifically, I want to take the text from an HTML <pre> tag, write it to a file, load that file into an array (using numpy), and then do my analysis on the array. I am stuck on the "write it to a file" part. It seems that what ends up in the file is read back as a 1x1 matrix or something similar, so it won't do what I hope it will. On an attempt before the code sample below, the error I got was:

IndexError: index 5 is out of bounds for axis 0 with size 0

I indexed the array only to test whether I would get any output from what I have so far.

Here is my code so far:

#Pulling data from GFS lamps

from lxml import html
import requests
import numpy as np

ICAO = raw_input("What station would you like GFS lamps data for? ")

page = requests.get('http://www.nws.noaa.gov/cgi-bin/lamp/getlav.pl?sta=' + ICAO)
tree = html.fromstring(page.content)
Lamp = tree.xpath('//pre/text()') #stores the text nodes of every <pre> element in list Lamp
gfsLamps = open('ICAO', 'w') #stores text of Lamp into a new file
gfsLamps.write(Lamp[0])

array = np.genfromtxt('ICAO') #puts file into an array

array[5]

You can use KOGD as the ICAO to test this. As is, I get a ValueError: "Some errors were detected", and it lists lines 2-23 (got 26 columns instead of 8). What is the first thing I am doing wrong for what I want to do? Or am I going about this all wrong?

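1 answer:

Answer 0 (score: 0)

The problem is not in the part that writes the data to the file, but in reading it back with genfromtxt. genfromtxt is a very strict function: unless you specify a lot of options to skip columns and rows, it mostly expects complete data with the same number of columns on every line. Use this instead: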

arrays = [np.array(map(str, line.split())) for line in open('ICAO')]

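The arrays variable will contain one array per line, holding each whitespace-separated element of that line. For example, if a line has the following data: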

a b cdef 124

The array for this line will be:

['a','b','cdef','124']
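To make the difference concrete, here is a minimal sketch that feeds the same ragged text to both approaches. The sample text is made up for illustration (it is not real LAMP output), and the code is written so that it also runs on Python 3.

```python
import io
import numpy as np

# Made-up sample with a different number of fields on each line
# (an assumption for illustration, not real LAMP output).
sample = u"a b cdef 124\nx y\n1 2 3 4 5\n"

# genfromtxt expects every line to have the same number of columns,
# so this ragged input raises ValueError.
try:
    np.genfromtxt(io.StringIO(sample), dtype=str)
    strict_ok = True
except ValueError:
    strict_ok = False

# Splitting each line on whitespace has no such requirement:
# every line simply becomes its own array, whatever its length.
rows = [np.array(line.split()) for line in sample.splitlines()]
field_counts = [len(row) for row in rows]
```

If you would rather stay with genfromtxt, it also accepts invalid_raise=False, which skips inconsistent lines with a warning instead of raising, but that silently drops data.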

arrays will contain an array like this for every line, which you can then process further as needed. So the complete code is:

from lxml import html
import requests
import numpy as np

ICAO = raw_input("What station would you like GFS lamps data for? ")

page = requests.get('http://www.nws.noaa.gov/cgi-bin/lamp/getlav.pl?sta=' + ICAO)
tree = html.fromstring(page.content)
Lamp = tree.xpath('//pre/text()') #stores the text nodes of every <pre> element in list Lamp
gfsLamps = open('ICAO', 'w') #writes the text to a file literally named 'ICAO'
gfsLamps.write(Lamp[0])
gfsLamps.close() #flush and close the file before reading it back
arrays = [np.array(map(str, line.split())) for line in open('ICAO')]
print arrays
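As a possible variation, the file handling can be wrapped in with blocks so the file is always closed before it is read back, or the file can be skipped entirely by splitting the text in memory. The sketch below is only an illustration: the helper names save_text, rows_from_text and rows_from_file are hypothetical, the sample string is a made-up stand-in for Lamp[0], and naming the file after the station is an assumption (the code above writes to a file literally named 'ICAO').

```python
import os
import tempfile


def save_text(path, text):
    """Write text to path; the with block closes the file when done."""
    with open(path, "w") as handle:
        handle.write(text)


def rows_from_text(text):
    """Return one list of whitespace-separated fields per non-blank line."""
    return [line.split() for line in text.splitlines() if line.strip()]


def rows_from_file(path):
    """Read the whole file and split it the same way as rows_from_text."""
    with open(path) as handle:
        return rows_from_text(handle.read())


# Made-up stand-in for Lamp[0]; not real LAMP output.
sample = "KOGD GFS LAMP GUIDANCE\nTMP 31 30 29\n\nDPT 20 21\n"

# Hypothetical file name built from the station code, in a temp directory.
path = os.path.join(tempfile.mkdtemp(), "KOGD.txt")
save_text(path, sample)
rows = rows_from_file(path)
```

With the real page, rows_from_text(Lamp[0]) would give the same kind of list without touching the disk at all.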