Question

I'm trying to build a quick little script that scrapes data from a website and saves the results to a formatted CSV.

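So far, using BeautifulSoup, I've been able to get the data I want from the site and encode it so that it can be saved to a CSV, but it comes out as one long run of text with no logical format (that I can see), and I'm trying to figure out how to convert it.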

Code:

# import libraries
import urllib2
from bs4 import BeautifulSoup

import csv
from datetime import datetime

# specify the url
quote_page = 'url'

# query the website and return the html to the variable 'page'
page = urllib2.urlopen(quote_page)

# parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(page, 'html.parser')

# Take out the <div> of name and get its value
name_box = soup.find('ul', attrs={'id': 'list-store-detail'})

name = name_box.text.strip() # strip() removes leading and trailing whitespace
print name

# open a csv file with append, so old data will not be erased
with open('index.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([name.encode('utf-8')])

Desired Output:

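As you can see, the current output has huge gaps between the values, and as far as I can tell it doesn't actually contain any \n or \r characters.

I'm assuming I'll have to somehow split the string into lines, loop over them, and then format them properly for the CSV?

Any help would be greatly appreciated.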

Answer 1

Your assumption is correct! There may be a more efficient way to do this, but this approach requires very few code changes.

Split the string:

split_name = name.split("\n")

Get rid of the blank lines:

no_blanks = [ x for x in split_name if len(x) > 0 ]

Write to the CSV:

with open('index.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    line = []
    for i in range(len(no_blanks)):
        # collect fields until there are 8, then write them as one row
        line.append(no_blanks[i].strip())
        if len(line) == 8:
            writer.writerow(line)
            line = []

Output:

Name,Address 1,Address 2,Country,Name + Address,Phone Number,Street View,Direction
Name,Address 1,Address 2,Country,Name + Address,Phone Number,Street View,Direction
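For reference, the three steps above can be folded into one small helper. This is only a sketch under some assumptions: it is written for Python 3 (the question's code is Python 2), the name `text_to_rows`, the `fields_per_row` parameter, and the sample text are made up for illustration, and it writes to an in-memory buffer instead of index.csv. It also differs from the loop above in two ways: lines that contain only spaces are filtered out, and a final row with fewer fields than `fields_per_row` is kept instead of being dropped.

```python
import csv
import io


def text_to_rows(text, fields_per_row=8):
    """Split scraped text into rows of up to `fields_per_row` non-blank values."""
    # splitlines() handles \n, \r\n and \r; strip() removes the padding, and
    # the filter discards lines that are empty or whitespace-only.
    values = [line.strip() for line in text.splitlines() if line.strip()]
    # Slice the flat list into consecutive chunks, one chunk per CSV row.
    # A shorter final chunk is kept rather than silently dropped.
    return [values[i:i + fields_per_row]
            for i in range(0, len(values), fields_per_row)]


# Hypothetical sample standing in for name_box.text
sample = ("\n  Store A \n\n   \n 1 Main St \n\n London \n UK \n\n"
          " Store B \n 2 High St \n\n Leeds \n UK \n")

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(text_to_rows(sample, fields_per_row=4))
print(buffer.getvalue())
```

To write to a real file in Python 3, the buffer can be swapped for open('index.csv', 'a', newline=''), which stops the csv module from emitting extra blank lines on Windows.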

Convert web scraped string list to formatted CSV
