I have already done the web scraping with the following code:
from io import StringIO
import pandas as pd

# "soup" and "item_text" come from the earlier BeautifulSoup scraping step
Number = soup.find('th', text="Number of samples").find_next_sibling("td").text

for x in range(1, int(Number) + 1):  # loop to parse each sample into the format I want
    item = item_text.split('tooltip')[x].split("class")[0].replace('"', '').replace(',', '').replace(':', '').replace("<br>", " ").replace("/", "").replace("\\", "")
    # print(item)
    TESTDATA = StringIO(item)
    df = pd.read_csv(TESTDATA, sep=" ", header=None)
    print(df)
The output currently looks like this:
0 1 2 3 4 5 6 7 8 9 \
0 TCGA-KK-A7B3-01A Male NaN Stage not reported NaN Alive FPKM 5.5
10 11 12 13 14
0 Living days 899 (2.5 years)
0 1 2 3 4 5 6 7 8 9 \
0 TCGA-G9-6347-01A Male NaN Stage not reported NaN Alive FPKM 14.2
10 11 12 13 14
0 Living days 2089 (5.7 years)
...
My question now is: how can I combine these separate DataFrames into a single DataFrame, so that it is easier to save everything to one CSV file?

Thanks
Answer 0 (score: 0):
all_dataframes = []
for x in range(1, int(Number) + 1):
    # ... same parsing code as in the question ...
    df = pd.read_csv(TESTDATA, sep=" ", header=None)
    all_dataframes.append(df)
concat_df = pd.concat(all_dataframes)
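To go from the combined DataFrame to a single CSV, a minimal sketch along these lines should work; "samples.csv" is just a placeholder filename, and ignore_index=True is optional but avoids every row carrying the index 0 it had in its own one-row frame:

import pandas as pd

# Combine all per-sample frames into one and write a single CSV file.
# ignore_index=True renumbers rows 0..N-1 instead of repeating each frame's index.
concat_df = pd.concat(all_dataframes, ignore_index=True)

# header=False omits the numeric column names (0, 1, 2, ...) that come from header=None;
# drop it if you want those written as a header row.
concat_df.to_csv("samples.csv", index=False, header=False)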