Question

我正在尝试使用像这样的Beautiful Soup来抓取表单字段ID

 for link in BeautifulSoup(content, parseOnlyThese=SoupStrainer('input')):
    if link.has_key('id'):
        print link['id']

让我们假设它返回类似

的东西

username
email
password
passwordagain
terms
button_register

我想把它写进Sqlite3 DB。

我将在我的应用程序中执行的操作是...使用这些表单字段的ID并尝试执行POST。问题是..有很多像这样的网站，我已经刮了表格字段ID。所以关系是这样的......

Domain1 - First list of Form Fields for this Domain1
Domain2 - Second list of Form Fields for this Domain2
.. and so on

我不确定这里是......我应该如何为这种目的设计我的专栏？如果我只是创建一个包含两列的表 - 比如说

，那会没关系

COL 1 - Domain URL (as TEXT)
COL 2 - List of Form Field IDs (as TEXT)

要记住的一件事是......在我的应用程序中，我需要做这样的事情......

伪代码

If Domain is "http://somedomain.com":
    For ever item in the COL2 (which is a list of form field ids):
         Assign some set of values to each of the form fields & then make a POST request

请问任何一个指南吗？

2011年7月22日编辑 - 我的数据库设计是否正确？

我决定有这样的解决方案。你们觉得怎么样？

我将有三张桌子如下

表1

Key Column (Auto Generated Integer) - Primary Key
Domain as TEXT

示例数据类似于：

1   http://url1.com
2   http://url2.com
3   http://url3.com

表2

Domain (Here I will be using the Key Number from Table 1)
RegLink - This will have the registeration link (as TEXT)
Form Fields (as Text)

示例数据类似于：

1   http://url1.com/register    field1
1   http://url1.com/register    field2
1   http://url1.com/register    field3
2   http://url2.com/register    field1
2   http://url2.com/register    field2
2   http://url2.com/register    field3
3   http://url3.com/register    field1
3   http://url3.com/register    field2
3   http://url3.com/register    field3

表3

Domain (Here I will be using the Key Number from Table 1)
Status (as TEXT)
User (as TEXT)
Pass (as TEXT)

示例数据类似于：

1   Pass    user1   pass1
2   Fail    user2   pass2
3   Pass    user3   pass3

你觉得这个桌子设计好吗？或者是否可以进行任何改进？

Answer 1

您的表格中存在规范化问题。

使用2个表

TABLE domains
int id primary key
text name

TABLE field_ids
int id primary key
int domain_id foreign key ref domains
text value

是一个更好的解决方案。

Answer 2

正确的数据库设计会建议您有一个URL表和一个字段表，每个字段都引用一个URL记录。但是，根据您想要对它们执行的操作，您可以将列表打包到单个列中。请参阅the docs for how to go about that。

sqlite是一个要求吗？它可能不是存储数据的最佳方式。例如。如果您需要通过URL进行随机访问查找，shelve module可能是更好的选择。如果您只需要记录它们并迭代这些站点，那么以CSV格式存储可能会更简单。

Answer 3

试试这个来获取ID：

ids = (link['id'] for link in
        BeautifulSoup(content, parseOnlyThese=SoupStrainer('input')) 
         if link.has_key('id'))

这应该告诉你如何保存它们，加载它们，并为每个人做些什么。这使用单个表，并为每个域的每个字段插入一行。这是最简单的解决方案，非常适合相对较少的数据行。

from itertools import izip, repeat
import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('''create table domains
(domain text, linkid text)''')

domain_to_insert = 'domain_name'
ids = ['id1', 'id2']
c.executemany("""insert into domains
      values (?, ?)""", izip(repeat(domain_to_insert), ids))
conn.commit()

domain_to_select = 'domain_name'
c.execute("""select * from domains where domain=?""", (domain_to_select,))

# this is just an example
def some_function_of_row(row):
    return row[1] + ' value'

fields = dict((row[1], some_function_of_row(row)) for row in c)
print fields
c.close()

将列表存储到Python Sqlite3中

3 个答案: