Python: open a .txt file of URLs and parse each one

Date: 2017-01-03 09:29:10

Tags: python loops parsing urllib2

I suspect this is a simple, basic question, but I haven't found a clear and simple answer. Here is my problem:

I have a .txt file with one URL per line (about 300 of them), generated by a Python script. I want to open these URLs one by one and run the following script on each of them to extract some information I'm interested in:

import urllib2
from bs4 import BeautifulSoup
page = urllib2.urlopen("http://www.aerodromes.fr/aeroport-de-saint-martin-grand-case-ttfg-a413.html")
soup = BeautifulSoup(page, "html.parser")
info_tag = soup.find_all('b')
info_nom = info_tag[2].string
info_pos = info_tag[4].next_sibling
info_alt = info_tag[5].next_sibling
info_pis = info_tag[6].next_sibling
info_vil = info_tag[7].next_sibling
print(info_nom + "," + info_pos + "," + info_alt + "," + info_pis + "," + info_vil)
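As an aside, the script above targets Python 2, where the module is called urllib2; on Python 3 the same function lives in urllib.request. If the script ever needs to run on both, a portable import along these lines works:

```python
try:
    # Python 3: urlopen moved into urllib.request
    from urllib.request import urlopen
except ImportError:
    # Python 2: fall back to the original urllib2 module
    from urllib2 import urlopen

# From here on, call urlopen(...) directly instead of urllib2.urlopen(...)
```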

aero-url.txt

http://www.aerodromes.fr/aeroport-de-la-reunion-roland-garros-fmee-a416.html,
http://www.aerodromes.fr/aeroport-de-saint-pierre---pierrefonds-fmep-a417.html,
http://www.aerodromes.fr/base-aerienne-de-moussoulens-lf34-a433.html,
http://www.aerodromes.fr/aerodrome-d-yvetot-lf7622-a469.html,
http://www.aerodromes.fr/aerodrome-de-dieppe---saint-aubin-lfab-a1.html,
http://www.aerodromes.fr/aeroport-de-calais---dunkerque-lfac-a2.html,
http://www.aerodromes.fr/aerodrome-de-compiegne---margny-lfad-a3.html,
http://www.aerodromes.fr/aerodrome-d-eu---mers---le-treport-lfae-a4.html,
http://www.aerodromes.fr/aerodrome-de-laon---chambry-lfaf-a5.html,
http://www.aerodromes.fr/aeroport-de-peronne---saint-quentin-lfag-a6.html,
http://www.aerodromes.fr/aeroport-de-nangis-les-loges-lfai-a7.html,
...
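Note that each line of aero-url.txt ends with a comma and a newline; passed verbatim to urlopen, those trailing characters would corrupt the request. A minimal sketch of cleaning a line first (clean_url is a hypothetical helper name, and the sample line just mirrors the file above):

```python
def clean_url(line):
    # Drop surrounding whitespace/newlines, then any trailing comma.
    return line.strip().rstrip(',')

# A sample line in the same shape as the entries in aero-url.txt.
raw = "http://www.aerodromes.fr/aerodrome-de-dieppe---saint-aubin-lfab-a1.html,\n"
print(clean_url(raw))
```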

I think I need to use a loop, something like this:

import urllib2
from bs4 import BeautifulSoup

# Open the file for reading
infile = open("aero-url.txt", 'r')

# Read every single line of the file into an array of lines
lines = infile.readline().rstrip('\n\r')

for line in infile

page = urllib2.urlopen(lines)
soup = BeautifulSoup(page, "html.parser")

#find the places of each info
info_tag = soup.find_all('b')
info_nom =info_tag[2].string
info_pos =info_tag[4].next_sibling
info_alt =info_tag[5].next_sibling
info_pis =info_tag[6].next_sibling
info_vil =info_tag[7].next_sibling

#Print them on the terminal.
print(info_nom +","+ info_pos+","+ info_alt +","+ info_pis +","+info_vil)

Afterwards I will write the results to a txt file. But my question is: how do I apply my parsing script to my text file of URLs?

1 answer:

Answer 0 (score: 0)

Use line instead of lines in urlopen:

page = urllib2.urlopen(line)

Since you are iterating over infile in the loop, you don't need this line:

lines = infile.readline().rstrip('\n\r')

The indentation inside the loop is also wrong.
With these corrections, the code should look like this:

import urllib2
from bs4 import BeautifulSoup

# Open the file for reading
infile = open("aero-url.txt", 'r')

for line in infile:

    # Drop the newline and the trailing comma before opening the URL.
    page = urllib2.urlopen(line.strip().rstrip(','))
    soup = BeautifulSoup(page, "html.parser")

    # Find the places of each piece of info.
    info_tag = soup.find_all('b')
    info_nom = info_tag[2].string
    info_pos = info_tag[4].next_sibling
    info_alt = info_tag[5].next_sibling
    info_pis = info_tag[6].next_sibling
    info_vil = info_tag[7].next_sibling

    # Print them on the terminal.
    print(info_nom + "," + info_pos + "," + info_alt + "," + info_pis + "," + info_vil)
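The question also mentions writing the results to a txt file afterwards. One way is to collect each parsed row and write the fields joined by commas, mirroring the layout of the print statement in the loop (the output file name and the sample values below are illustrative, not real parsed data):

```python
# Hypothetical rows in the shape produced by the parsing loop:
# (nom, position, altitude, piste, ville)
rows = [
    ("Aerodrome de Dieppe - Saint-Aubin", "49.88 N 1.08 E", "344 ft", "grass", "Dieppe"),
    ("Aeroport de Calais - Dunkerque", "50.96 N 1.95 E", "12 ft", "paved", "Calais"),
]

outfile = open("aero-info.txt", "w")
for row in rows:
    # Same comma-separated layout as the print() in the loop.
    outfile.write(",".join(row) + "\n")
outfile.close()
```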