Question

I'm trying to scrape dates from a series of URLs listed in a CSV, then write those dates to a new CSV.

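I have the basic Python code working, but I can't figure out how to load the CSV (instead of pulling the URLs from an array), scrape each URL, and then write the results to a new CSV. From reading a few posts, I think I want to use Python's csv module, but I can't get it to work.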

Here is the scraping part of my code:

import urllib
import re

exampleurls =["http://www.domain1.com","http://www.domain2.com","http://www.domain3.com"]

i=0
while i<len(exampleurls):
    url = exampleurls[i]
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = 'on [0-9][0-9]\.[0-9][0-9]\.[0-9][0-9]'
    pattern = re.compile(regex)
    date = re.findall(pattern,htmltext)
    print date
    i+=1

Any help is much appreciated!

Answer 1

If your CSV looks like this:

"http://www.domain1.com","other column","yet another"
"http://www.domain2.com","other column","yet another"
...

read the URL from the first column of each row like this:

import urllib
import csv

with open('urlFile.csv') as f:
    reader = csv.reader(f)

    for rec in reader:
        htmlfile = urllib.urlopen(rec[0])
        ...
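
As a minimal sketch of the same idea in Python 3 (the snippet above uses Python 2's `urllib.urlopen`), the first column can be collected into a list before any fetching happens. The helper name `read_urls` is hypothetical and only used for illustration.

```python
import csv


def read_urls(path):
    """Return the first column of every non-empty row in the CSV at `path`."""
    # newline="" is what the csv module documentation recommends
    # for files handed to csv.reader.
    with open(path, newline="") as f:
        return [row[0] for row in csv.reader(f) if row]
```

Calling `read_urls('urlFile.csv')` on the sample file above should give a plain list of URL strings, with the other columns ignored.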

If your URL file looks like this:

http://www.domain1.com
http://www.domain2.com
...

you can do something even neater with a list comprehension:

urls = [x.strip() for x in open('urlFile')]  # strip() drops the trailing newline on each line

Edit: in reply to a comment

You can open a file for writing in Python like this:

f = open('myurls.csv', 'w')
...
for rec in reader:
    ...
    f.write(urlstring + '\n')
f.close()
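
Putting the pieces together, here is one possible end-to-end sketch in Python 3. It assumes the input CSV has the URL in its first column and reuses the regex from the question. The names `extract_dates`, `fetch`, and `scrape_to_csv` are hypothetical, and the UTF-8 decoding is an assumption about the pages being scraped.

```python
import csv
import re
from urllib.request import urlopen

# Same pattern as in the question: "on NN.NN.NN"
DATE_PATTERN = re.compile(r"on [0-9][0-9]\.[0-9][0-9]\.[0-9][0-9]")


def extract_dates(html):
    """Return every date-like match found in the page text."""
    return DATE_PATTERN.findall(html)


def fetch(url):
    """Download a page as text (assumes UTF-8; bad bytes are replaced)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_to_csv(in_path, out_path, fetch=fetch):
    """Write one output row per input URL: the URL, then each date found."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if not row:
                continue  # skip blank lines
            url = row[0]
            writer.writerow([url] + extract_dates(fetch(url)))
```

For example, `scrape_to_csv('urlFile.csv', 'myurls.csv')` would read the URLs and write the results. Using `csv.writer` rather than `f.write` means quoting and line endings are handled by the csv module. The `fetch` parameter is there so the download step can be swapped out, for instance when trying the function without network access.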

Alternatively, if you're on Unix/Linux, just use print in your code and redirect the output in bash:

python your_scraping_script.py > someoutfile.csv
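
If you go the redirect route, one option (a sketch, not the only way) is to point `csv.writer` at standard output instead of hand-formatting rows with print, so that fields containing commas or quotes are still escaped correctly. The helper name `emit` and the sample row are placeholders.

```python
import csv
import sys


def emit(rows, out=sys.stdout):
    """Write each row as one CSV line to `out` (standard output by default)."""
    writer = csv.writer(out)
    for row in rows:
        writer.writerow(row)


if __name__ == "__main__":
    # Placeholder row; in the real script these would be the scraped results.
    emit([["http://www.domain1.com", "on 12.03.14"]])
```

The script can then be run with the same redirect as above.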

Scrape HTML from URLs in a CSV, then write the results to a CSV using Python
