Missing a column and redundant whitespace/newlines when web scraping with lxml in Python 2.7

Asked: 2015-12-08 19:21:25

Tags: python csv web-scraping lxml

I am trying to scrape this page in Python to turn the largest table on that page into a CSV. I mostly followed the answer here.

But I am facing two problems:

  • The "Strike Price" column is missing
  • Writing the data to CSV is misaligned due to aberrant strings containing multitudes of "\r" and ending with a single "\n", which puts a lot of whitespace characters into the output

Below is the code I am using. Please help me solve these two problems.

from urllib2 import Request, urlopen
from lxml import etree
import csv

ourl = "http://www.nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp?segmentLink=17&instrument=OPTIDX&symbol=NIFTY&date=31DEC2015"
headers = {'Accept': '*/*',
           'Accept-Language': 'en-US,en;q=0.5',
           'Host': 'nseindia.com',
           'Referer': 'http://www.nseindia.com/live_market/dynaContent/live_market.htm',
           'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/35.0',
           'X-Requested-With': 'XMLHttpRequest'}

req = Request(ourl, None, headers)
response = urlopen(req)
the_page = response.read()

ptree = etree.HTML(the_page)
tr_nodes = ptree.xpath('//table[@id="octable"]/tr')
header = [i[0].text for i in tr_nodes[0].xpath("th")]
td_content = [[td.text for td in tr.xpath('td')] for tr in tr_nodes[1:]]

with open("nseoc.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(td_content)

1 answer:

Answer 0 (score: 1)

"Writing the data to CSV is misaligned due to an aberrant string containing multitudes of "\r" and ending with a single "\n""

First of all, I would use the lxml.html package, get the text_content() of every cell, and apply strip() afterwards:

from lxml.html import fromstring

ptree = fromstring(the_page)

# the slice here drops the first table row, and the second slice in the
# comprehension below drops one more, skipping the header rows
tr_nodes = ptree.xpath('//table[@id="octable"]//tr')[1:]
td_content = [[td.text_content().strip() for td in tr.xpath('td')] for tr in tr_nodes[1:]]

Printed out, td_content now holds clean, aligned rows. Note that the "Strike Price" is there (2700 and 2800).
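To see why text_content() brings the missing column back: .text only returns the text that appears before a cell's first child element, so a strike price wrapped in an <a> tag comes back as None, while text_content() concatenates all descendant text. A minimal, self-contained demo (the HTML snippet is invented and only mimics the structure of the real table):

from lxml.html import fromstring

# hypothetical fragment: the strike price sits inside a child <a>,
# another cell carries raw carriage-return/newline padding
table = fromstring('<table><tr>'
                   '<td><a href="#">2700.00</a></td>'
                   '<td>\r\r\n 1.05\r\n</td>'
                   '</tr></table>')
td_strike, td_other = table.xpath('//td')

print td_strike.text                     # None: .text stops at the <a> child
print td_strike.text_content().strip()   # 2700.00
print repr(td_other.text)                # still padded with whitespace/newlines
print td_other.text_content().strip()    # 1.05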
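For completeness, a minimal sketch of the whole script with the fix folded in, reusing the URL, headers, and nseoc.csv output file from the question (Python 2.7, matching the tags):

from urllib2 import Request, urlopen
from lxml.html import fromstring
import csv

ourl = "http://www.nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp?segmentLink=17&instrument=OPTIDX&symbol=NIFTY&date=31DEC2015"
headers = {'Accept': '*/*',
           'Accept-Language': 'en-US,en;q=0.5',
           'Host': 'nseindia.com',
           'Referer': 'http://www.nseindia.com/live_market/dynaContent/live_market.htm',
           'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/35.0',
           'X-Requested-With': 'XMLHttpRequest'}

the_page = urlopen(Request(ourl, None, headers)).read()

ptree = fromstring(the_page)
tr_nodes = ptree.xpath('//table[@id="octable"]//tr')[1:]

# text_content() pulls text out of nested tags (fixes the missing
# strike price); strip() removes the surrounding "\r"/"\n" noise
td_content = [[td.text_content().strip() for td in tr.xpath('td')]
              for tr in tr_nodes[1:]]

with open("nseoc.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(td_content)

Only the parsing and cell extraction change; the request and CSV-writing parts are exactly as in the question.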