I want to read all the tables from a series of links and build DataFrames from them. Say I have:
list_links = ['url1.com', 'url2.com', 'url3.com',...,'urln.com']
Then:
for url in lis:
    try:
        df = pd.read_html(url, index_col=None, header=0)
        lis.append(df)
        frame = pd.concat(url, ignore_index=True)
    except:
        pass
However, I never get a DataFrame; nothing happens:
In: frame
Out:
In: print(frame)
Out:
What is the right way to append all the tables from every link into a single table? Note that some links have no tables... which is why I tried pass. I also tried this:
import multiprocessing
def process_url(url):
    df_url = pd.read_html(url)
    df = pd.concat(df_url, ignore_index=True)
    return df_url
pool = multiprocessing.Pool(processes=4)
pool.map(process_url, lis)
Then:
ValueError Traceback (most recent call last)
<ipython-input-3-46e04cfd0bfe> in <module>()
7
8 pool = multiprocessing.Pool(processes=4)
----> 9 pool.map(process_url, lis)
/usr/local/Cellar/python3/3.5.2_1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/pool.py in map(self, func, iterable, chunksize)
258 in a list that is returned.
259 '''
--> 260 return self._map_async(func, iterable, mapstar, chunksize).get()
261
262 def starmap(self, func, iterable, chunksize=None):
/usr/local/Cellar/python3/3.5.2_1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/pool.py in get(self, timeout)
606 return self._value
607 else:
--> 608 raise self._value
609
610 def _set(self, i, obj):
ValueError: No tables found
I also tried this:
import multiprocessing
def process_url(url):
    df_url = pd.read_html(url)
    df = pd.concat(df_url, ignore_index=True)
    return df_url
pool = multiprocessing.Pool(processes=4)
try:
    dfs_ = pool.map(process_url, lis)
except:
    pass
Nothing happens.
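For reference, here is a minimal sketch of the multiprocessing variant (assuming pandas is imported as pd and that list_links is the list of URLs above). It catches, inside the worker, the ValueError that read_html raises for pages with no tables, so a single bad page does not abort the whole pool.map call:

import multiprocessing

import pandas as pd

def process_url(url):
    try:
        # read_html returns a list of DataFrames, one per table on the page
        tables = pd.read_html(url, index_col=None, header=0)
        return pd.concat(tables, ignore_index=True)
    except ValueError:
        # "No tables found": skip this link instead of killing the pool
        return None

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    results = pool.map(process_url, list_links)  # list_links as defined above
    pool.close()
    pool.join()
    frames = [r for r in results if r is not None]
    frame = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    print(frame.shape)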
Answer 0 (score: 0)
You aren't actually concatenating the DataFrames. What if you try this:
df_list = []
for url in list_links:
    try:
        # read_html returns a list of DataFrames, one per table on the page,
        # so extend the collection rather than appending the list itself
        tables = pd.read_html(url, index_col=None, header=0)
        df_list.extend(tables)
    except Exception:
        # some links have no tables (ValueError) or fail to load; skip them
        pass
df = pd.concat(df_list, ignore_index=True)
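One small caveat, assuming the loop above: if every link failed, df_list stays empty and pd.concat raises "No objects to concatenate", so it can be worth guarding the final step:

if df_list:
    df = pd.concat(df_list, ignore_index=True)
    print(df.shape)  # rows and columns of the combined table
else:
    print('No tables were collected from any link')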