I am trying to build a DataFrame with a single column of URLs from a list of lists that I read from a file. This is what I have:
one_df = pd.DataFrame()
with open(r"product_Url.txt", 'r') as infile:
    l = [x.split(',') for x in infile]
for x in zip(*l):
    df = pd.DataFrame(list(x), columns=['url'])
    one_df = one_df.append(df, ignore_index=True)
print(one_df)
one_df.to_csv(outfile)
The problem is that my output contains several rows with 2 URLs in them, like this (for example):
0, ['http://www.ex.com/prod1'
1, 'http://www.ex.com/prod2'
2, 'http://www.ex.com/prod3']['http://www.ex.com/prod25'
3, 'http://www.ex.com/prod43'['http://www.ex.com/prod99']
The raw data I read from the file looks like this (but with many more URLs):
[" ['https://www.ex.com/prod1', 'https://www.ex.com/prod2','https://www.ex.com/prod3']['https://www.ex.com/prod2','https://www.ex.com/prod3']['https://www.ex.com/prod25,'https://www.ex.com/prod43']['http://www.ex.com/prod99']"]
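When I try to read the DataFrame directly from the file, I get an empty DataFrame with one column per URL. That is why I tried to build the DataFrame with a loop instead.
What do I need to do to get rid of these cases where I have 2 URLs in a row instead of 1 URL per row?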
Answer 0 (score: 1)
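This may not be the most efficient way, but judging from the sample you provided, it may work if you replace the [ and ] characters and then create the DataFrame: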
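import pandas as pd

one_df = pd.DataFrame()
with open("product_Url.txt", 'r') as infile:
    # Turn every ']' into a separator, drop '[', '"' and newlines, then split on commas
    l = [x.replace(']', ',').replace("[", '').replace('"', '').replace('\n', '').strip().split(',') for x in infile]
for x in zip(*l):
    df = pd.DataFrame(list(x), columns=['url'])
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job here
    one_df = pd.concat([one_df, df], ignore_index=True)
# Drop the empty strings left behind by the trailing separators
one_df = one_df[one_df.url.str.len() > 0]
print(one_df)
one_df.to_csv(outfile)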
Result:
url
0 'https://www.ex.com/prod1'
1 'https://www.ex.com/prod2'
2 'https://www.ex.com/prod3'
3 'https://www.ex.com/prod2'
4 'https://www.ex.com/prod3'
5 'https://www.ex.com/prod25
6 'https://www.ex.com/prod43'
7 'http://www.ex.com/prod99'
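To see why the length filter is needed, here is a minimal sketch of the same replace/split chain applied to one in-memory line. The shortened sample line and the helper name split_urls are assumptions made for illustration, and the final cleanup step (trimming whitespace and the single quotes) is an extra that the answer itself does not perform.

```python
# Shortened line shaped like the one in the question (an assumption for illustration).
raw_line = "[\" ['https://www.ex.com/prod1', 'https://www.ex.com/prod2']['http://www.ex.com/prod99']\"]\n"


def split_urls(line):
    """Apply the answer's replace/split chain to a single line of text."""
    cleaned = (
        line.replace(']', ',')   # every closing bracket becomes a separator
            .replace('[', '')    # opening brackets are dropped
            .replace('"', '')    # the outer double quotes are dropped
            .replace('\n', '')   # the trailing newline is dropped
            .strip()
    )
    return cleaned.split(',')


pieces = split_urls(raw_line)

# Extra cleanup, not part of the answer: trim whitespace and single quotes,
# and skip the empty pieces produced by the trailing separators.
urls = [p.strip().strip("'") for p in pieces if p.strip()]

print(pieces)
print(urls)
```

The trailing separators produce empty strings at the end of pieces, which is what the one_df.url.str.len() > 0 filter removes in the DataFrame version.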
A cleaner solution might be:
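import json
import pandas as pd

with open('product_Url.txt') as data_file:
    # Assumes the file is valid JSON, as in the sample: a list holding one long string
    data = json.load(data_file)
# Drop '[', turn ']' into a separator, then split each string on commas
all_data = [element.replace('[', '').replace(']', ',').strip().split(',') for element in data]
one_df = pd.DataFrame({'url': all_data[0]})
# Remove the empty strings left by the trailing separators
one_df = one_df[one_df.url.str.len() > 0]
one_df.to_csv(outfile)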
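Both versions above leave the single quotes and some leading spaces inside the url values. As an alternative sketch (a different technique from the answer, not the answer's own method), the URLs could be pulled out with a regular expression after parsing the JSON. The names file_text, URL_PATTERN and extract_urls are hypothetical, and the pattern assumes that URLs never contain quotes, commas, brackets or whitespace.

```python
import json
import re

# Text shaped like the file in the question, including the unbalanced quote
# after prod25; in practice this would come from reading product_Url.txt.
file_text = (
    "[\" ['https://www.ex.com/prod1', 'https://www.ex.com/prod2']"
    "['https://www.ex.com/prod25,'https://www.ex.com/prod43']"
    "['http://www.ex.com/prod99']\"]"
)

# Assumption: a URL runs until the next quote, comma, bracket or whitespace.
URL_PATTERN = re.compile(r"https?://[^\s'\",\[\]]+")


def extract_urls(text):
    """Parse the JSON list and collect every URL found in its strings."""
    urls = []
    for element in json.loads(text):
        urls.extend(URL_PATTERN.findall(element))
    return urls


urls = extract_urls(file_text)
print(urls)
```

The resulting list can then be passed to pd.DataFrame({'url': urls}) as in the answer, with no length filter needed because no empty strings are produced.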