I'm trying to collect data using chromedriver.
I'm using the URL http://web.mta.info/developers/turnstile.html to get my data: I extract the file links, then load each file into one of two tables depending on the date of the data. This is the code I'm trying to run:
record_cnt = 0
for link in data_list_post:
    data = pd.read_table(link, sep=',')
    print('%s: %s rows, %s columns' % (link[-10:-4], data.shape[0], data.shape[1]))
    record_cnt += data.shape[0]
    data.to_sql(name='post', con=conPost, flavor='sqlite', if_exists='append')
Traceback:
---------------------------------------------------------------------------
OperationalError Traceback (most recent call last)
<ipython-input-9-6f5adea38bf9> in <module>()
3 data = pd.read_table(link, sep=',')
4 record_cnt += data.shape[0]
----> 5 data.to_sql(name='post', con=conPost, flavor='sqlite', if_exists='append')
/Users/xx/anaconda/lib/python3.4/site-packages/pandas/core/generic.py in to_sql(self, name, con, flavor, schema, if_exists, index, index_label, chunksize, dtype)
1199 sql.to_sql(self, name, con, flavor=flavor, schema=schema,
1200 if_exists=if_exists, index=index, index_label=index_label,
-> 1201 chunksize=chunksize, dtype=dtype)
1202
1203 def to_pickle(self, path):
/Users/xx/anaconda/lib/python3.4/site-packages/pandas/io/sql.py in to_sql(frame, name, con, flavor, schema, if_exists, index, index_label, chunksize, dtype)
468 pandas_sql.to_sql(frame, name, if_exists=if_exists, index=index,
469 index_label=index_label, schema=schema,
--> 470 chunksize=chunksize, dtype=dtype)
471
472
/Users/xx/anaconda/lib/python3.4/site-packages/pandas/io/sql.py in to_sql(self, frame, name, if_exists, index, index_label, schema, chunksize, dtype)
1501 dtype=dtype)
1502 table.create()
-> 1503 table.insert(chunksize)
1504
1505 def has_table(self, name, schema=None):
/Users/xx/anaconda/lib/python3.4/site-packages/pandas/io/sql.py in insert(self, chunksize)
662
663 chunk_iter = zip(*[arr[start_i:end_i] for arr in data_list])
--> 664 self._execute_insert(conn, keys, chunk_iter)
665
666 def _query_iterator(self, result, chunksize, columns, coerce_float=True,
/Users/xx/anaconda/lib/python3.4/site-packages/pandas/io/sql.py in _execute_insert(self, conn, keys, data_iter)
1289 def _execute_insert(self, conn, keys, data_iter):
1290 data_list = list(data_iter)
-> 1291 conn.executemany(self.insert_statement(), data_list)
1292
1293 def _create_table_setup(self):
OperationalError: table post has no column named A002
Answer 0 (score: 0)
Your problem is that you want to pull the table from every link on that page and compile them into a single database table... but the tables behind the links differ. Links near the top of the list, such as
http://web.mta.info/developers/data/nyct/turnstile/turnstile_160312.txt
have this first/header row:
C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
whereas links toward the bottom of the page, such as
http://web.mta.info/developers/data/nyct/turnstile/turnstile_121222.txt
have a very different-looking first row, for example:
A002,R051,02-00-00,12-15-12,03:00:00,REGULAR,003911852,001349428,12-15-12,07:00:00,REGULAR,003911868,001349432,12-15-12,11:00:00,REGULAR,003911930,001349538,12-15-12,15:00:00,REGULAR,003912146,001349600,12-15-
At first glance it looks like the second file is just missing a header row, but its top row (and every row after it) doesn't look like the data rows in the first set either. Can you work out what all the fields in those second-set rows are called?
Basically, some of the links (generally the ones lower in the list) need to be processed differently from the top links, because the tables are different.
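One way to handle the split is to peek at the first line of each file: if it starts with the C/A,UNIT,SCP header, it is the newer one-reading-per-row format; otherwise assume the older layout of C/A,UNIT,SCP followed by repeating DATE,TIME,DESC,ENTRIES,EXITS groups. This is only a sketch: the field names for the older format are an assumption inferred from the newer header, and `load_turnstile` is a hypothetical helper, not part of the asker's code.

```python
import io
import pandas as pd

def parse_old_row(line):
    """Split one old-style row (C/A,UNIT,SCP then repeating 5-field
    DATE,TIME,DESC,ENTRIES,EXITS groups -- an assumed layout) into
    one dict per reading, using the newer format's column names."""
    fields = [f.strip() for f in line.strip().rstrip(',').split(',')]
    ca, unit, scp = fields[:3]
    records = []
    # Step through complete 5-field groups; any truncated trailing group is skipped.
    for i in range(3, len(fields) - 4, 5):
        date, time, desc, entries, exits = fields[i:i + 5]
        records.append({'C/A': ca, 'UNIT': unit, 'SCP': scp,
                        'DATE': date, 'TIME': time, 'DESC': desc,
                        'ENTRIES': int(entries), 'EXITS': int(exits)})
    return records

def load_turnstile(text):
    """Return a DataFrame from either file format (hypothetical helper)."""
    first_line = text.lstrip().splitlines()[0]
    if first_line.startswith('C/A,UNIT,SCP'):
        # Newer format: has a real header row, so read_csv handles it directly.
        return pd.read_csv(io.StringIO(text))
    # Older format: unpack the repeating groups into long-form rows.
    rows = [rec for line in text.splitlines() if line.strip()
            for rec in parse_old_row(line)]
    return pd.DataFrame(rows)
```

With the old rows normalized to the new column layout, both sets of files can be appended to the same SQL table without the "no column named A002" error, since A002 becomes a value in the C/A column rather than a header.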