Question

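I am working with a large dataset that is built iteratively: for a given parent URL, n child URLs are extracted.

I initially recorded the data in Excel (really just to test my code), but that turned out not to be practical because the output is very large.

For example, suppose I have two sets of data: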

amazon.com: ['a','b','c','d','e']
a         : ['k','j','e','f']

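In the first case, amazon.com is the parent URL and the list holds its child URLs.
In the second case, a becomes the parent URL and the list holds its child URLs.

What I actually need is a dataframe like this: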

               a    b    c    d    e    k    j    f
 amazon.com    1    1    1    1    1
     a                             1    1    1    1

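Here a value of 1 indicates a parent-child relationship, e.g. that a is a child of amazon.com.

The problem is that I do not have the data up front as shown above. It is fetched dynamically as I crawl the website.

So the flow will be: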

Open a website URL.
Record that URL (the parent URL).
Record all the URLs present on the page (the child URLs for that parent URL), which is what populates the list/dictionary and hence the dataframe.
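
The flow above could be sketched roughly as follows. This is only an illustration: `get_child_urls` and `FAKE_SITE` are hypothetical stand-ins for whatever actually fetches and parses a page, and `max_pages` is an assumed cap so that the loop stops.

```python
from collections import deque

# Hypothetical stand-in for a real page fetcher: maps a URL to the links on it.
FAKE_SITE = {
    "amazon.com": ["a", "b", "c", "d", "e"],
    "a": ["k", "j", "e", "f"],
}


def get_child_urls(url):
    """Return the URLs found on `url` (stubbed with FAKE_SITE for illustration)."""
    return FAKE_SITE.get(url, [])


def crawl(start_url, max_pages=2):
    """Visit pages breadth-first, recording parent -> children as they are found."""
    links = {}                  # parent URL -> list of child URLs
    queue = deque([start_url])  # pages still to visit
    seen = {start_url}          # URLs already queued, to avoid revisiting
    while queue and len(links) < max_pages:
        parent = queue.popleft()
        children = get_child_urls(parent)
        links[parent] = children
        for child in children:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return links


links = crawl("amazon.com")
print(links)
```

With the stub data, `links` ends up holding the two example sets from above, keyed by parent URL.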

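Note that no column header appears more than once: e is a child of both amazon.com and a, but it gets a single column.

Can someone help me solve this?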

Answer 1

Hope this helps:

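
One possible approach, sketched here under assumptions rather than as a definitive solution: collect the relationships in a nested dictionary while crawling, and only build the dataframe from it when needed. The names `adjacency`, `columns`, and `record` are made up for this example, and `record` is assumed to be called once per visited page.

```python
import pandas as pd

adjacency = {}  # parent URL -> {child URL: 1}
columns = {}    # child URLs in first-seen order; dict keys keep each one unique


def record(parent, children):
    """Store one visited page: mark every child URL of `parent` with a 1."""
    row = adjacency.setdefault(parent, {})
    for child in children:
        row[child] = 1
        columns.setdefault(child, None)


# In a real crawl these calls would happen as each page is visited.
record("amazon.com", ["a", "b", "c", "d", "e"])
record("a", ["k", "j", "e", "f"])

df = (
    pd.DataFrame.from_dict(adjacency, orient="index")
    .reindex(index=list(adjacency), columns=list(columns))  # fix row/column order
    .fillna(0)      # cells with no recorded relationship become 0
    .astype(int)
)
print(df)
```

Accumulating plain dictionaries and converting once tends to be cheaper than growing a dataframe cell by cell, since each enlargement of a dataframe can copy data. Missing relationships show up as 0 here; drop the `fillna(0).astype(int)` step if empty cells are preferred.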
