Web scraping with mechanize and BeautifulSoup - can't write to the output file

Date: 2015-11-10 02:37:25

Tags: python beautifulsoup output mechanize

I need to collect a large amount of river-cruise-specific data, so I am using Alteryx, and for the scraping itself I want to run Python from the command line. I need to write the output to a JSON or CSV file, but the output file comes out empty. The hash marks in the code are there so Alteryx can parse the output file, because the scraped text already contains commas. Ideally I would map the output to JSON. My code is below:

from mechanize import Browser
from bs4 import BeautifulSoup
import lxml

mech = Browser()

url = 'http://www.cruiseshipschedule.com/viking-river-cruises/viking-aegir-schedule/'
page = mech.open(url)

html = page.read()
html = html.replace('charset="ISO-8859-1"', 'charset=utf-8')
s = BeautifulSoup(html, "lxml")
content = s.findAll('div', id="content")
link = s.findAll("a")
h1 = s.findAll("h1")

table = s.findAll("table", border="1")

for link in s.findAll("a"):
    linktext = link.text
    linkhref = link.get("href")

for h1 in s.findAll("h1"):
    ship = h1.text

h2_1 = s.h2
h2_1.text
h2_2 = h2_1.find_next('h2')
itinerary_1 = h2_2.text
h2_3 = h2_2.find_next('h2')
itinerary_2 = h2_3.text
h2_4 = h2_3.find_next('h2')
itinerary_3 = h2_4.text

for table in content:
    table0 = s.findAll("table", border='0')

    for tr in s.findAll("table", border='1'):
        trs1 = s.findAll("tr")
        table1 = tr.text.replace('\n','|')
        tds1 = s.findAll('td')
        uls1 = s.findAll('ul')
        lis1 = s.findAll('li')



    for tr in s.findAll("table", border='0'):
        trs2 = s.findAll("tr")
        table2 = tr.text.replace('\n','|')
        tds2 = s.findAll('td')
        uls2 = s.findAll('ul')
        lis2 = s.findAll('li')

all_data = ship + "#" + table1 + "#" + table2 + "#" + itinerary_1 + "#" + itinerary_2 + "#" + itinerary_3


all_data = open("Z:/txt files/all_data.txt", "w")
print all_data >> "Z:/txt files/all_data.txt"
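
Note that Python strings are immutable: `str.replace` returns a new string rather than modifying the original in place, so the charset substitution on `html` above only takes effect if its result is reassigned to `html`. A minimal illustration:

```python
# str.replace() returns a new string; the original string is never changed.
s = 'charset="ISO-8859-1"'
t = s.replace('ISO-8859-1', 'utf-8')

print(s)  # unchanged: charset="ISO-8859-1"
print(t)  # charset="utf-8"
```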

1 answer:

Answer 0 (score: 0)

To get the output written to your file, try this in place of the last two lines of your code above:

with open('all_data.txt', 'w') as f:
    f.write(all_data.encode('utf8'))
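
Since the question asks for JSON or CSV output, here is a minimal sketch (in Python 3, using only the standard library) of writing scraped fields in either format; the field values are hypothetical placeholders, not actual scraped results:

```python
import csv
import io
import json

# Hypothetical scraped values standing in for ship, itinerary_1, table1, etc.
record = {
    "ship": "Viking Aegir",
    "itineraries": ["Itinerary 1", "Itinerary 2"],
}

# JSON avoids the embedded-comma problem that motivated the "#" separators.
json_text = json.dumps(record, ensure_ascii=False)

# CSV alternative: the csv module automatically quotes fields containing commas.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["ship, with comma", record["ship"]])
csv_text = buf.getvalue()

print(json_text)
print(csv_text)
```

Writing the dict with `json.dump(record, f)` to an open file would give Alteryx structured input to parse, with no need for the `#` delimiter workaround.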