Question

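I'm iterating over hundreds of thousands of words across several files, looking to measure how often English contractions are used. I've formatted the documents appropriately; now it's a matter of writing the right function and storing the data properly. For each document, I need to record which contractions are used in the file and how often each one occurs. Ideally, my dataframe would look like this: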

filename    contraction    count
file1       it's           34
file1       they're        13
file1       she's          9
file2       it's           14
file2       we're          15
file3       it's           4
file4       it's           45
file4       she's          13

How can I best approach this?

Edit: here is my code so far:

for i in contractions_list:     # for each of the 144 contractions in my list
    for l in every_link:        # for each speech
        count = 0
        word_count = 0
        content_2 = processURL_short(l)
        for word in content_2.split():
            word = word.strip(p)
            word_count = word_count + 1
            if i in contractions:
                count = count + 1

processURL_short() is a function I wrote that scrapes a website and returns the speech as a str.

EDIT2:

link_store = {}
for i in contractions_list_test:     # for each of the 144 contractions
    for l in every_link_test:        # for each speech
        link_store[l] = {}
        count = 0
        word_count = 0
        content_2 = processURL_short(l)
        for word in content_2.split():
            word = word.strip(p)
            word_count = word_count + 1
            if word == i:
                count = count + 1
        if count: link_store[l][i] = count
        print(i, l, count)

Here is my file-naming code:

splitlink = l.split("/")
president = splitlink[4]
speech_num = splitlink[-1]
filename = "{0}_{1}".format(president,speech_num)

Answer 1

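Opening and reading are slow: don't loop through the entire list of files 144 times.

Exceptions are slow: raising an exception for every non-contraction in every speech will be painful.

Don't loop through your contraction list to check each word. Instead, use the built-in in operator to see whether the word is in the list, then use a dictionary to tally the entries, just as you would by hand.

Go through the file word by word. When you see a word that is on the contraction list, check whether it is already on your tally sheet. If it is, add a mark; if not, add it to the sheet with a count of 1.

Here is an example. I made up some very short speeches and a trivial processURL_short function.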

def processURL_short(string):
    return string.lower()

every_link = [
    "It's time for going to Sardi's",
    "We're in the mood; it's about DST",
    "They're he's it's don't",
    "I'll be home for Christmas"]

contraction_list = [
    "it's",
    "don't",
    "can't",
    "i'll",
    "he's",
    "she's",
    "they're"
]

for l in every_link:        # for each speech
    contraction_count = {}
    content = processURL_short(l)

    for word in content.split():
        if word in contraction_list:
            if word in contraction_count:
                contraction_count[word] += 1
            else:
                contraction_count[word] = 1

    for key, value in contraction_count.items():
        print(key, '\t', value)
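
The same tally can also be sketched with collections.Counter and a set, so that each membership test is a constant-time lookup and each speech is read only once. This is only an illustration under assumptions: the count_contractions helper, the punctuation stripping, and the speech1/speech2 keys are hypothetical and not part of the original code.

```python
from collections import Counter
import string

# Hypothetical sample data; a set gives constant-time membership tests.
contraction_set = {"it's", "don't", "can't", "i'll", "he's", "she's", "they're"}

def count_contractions(text, contractions):
    """Tally the contractions in one speech, reading the text only once."""
    counts = Counter()
    for word in text.lower().split():
        word = word.strip(string.punctuation)  # drop leading/trailing punctuation
        if word in contractions:
            counts[word] += 1
    return counts

# Hypothetical speeches keyed by a made-up name.
speeches = {
    "speech1": "It's time for going to Sardi's",
    "speech2": "They're he's it's don't; it's late",
}

results = {name: count_contractions(text, contraction_set)
           for name, text in speeches.items()}

for name, counts in results.items():
    for contraction, n in counts.items():
        print(name, contraction, n, sep="\t")
```

Keeping one Counter per speech means the per-file counts never get mixed, which is what the desired table needs.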

Answer 2

You could set up your structure like this:

links = {
'file1':{
    "it's":34,
    "they're":14,
    ...,
    },
'file2':{
    ....,
    },
...,
}

You end up with a nested data structure in which each file name maps to a dictionary of contraction counts.

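You can then easily iterate over it to write the necessary data to your file (I'm assuming you know how to do that, since it doesn't seem to be part of the question).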
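
For completeness, here is a minimal sketch of that iteration, assuming the nested links layout shown earlier in this answer. The counts are made-up sample values, and the output goes to an in-memory buffer rather than a real file.

```python
import csv
import io

# Hypothetical counts, in the nested layout: file name -> {contraction: count}.
links = {
    "file1": {"it's": 34, "they're": 13},
    "file2": {"it's": 14, "we're": 15},
}

# Flatten the nested dictionary into (filename, contraction, count) rows.
rows = [(filename, contraction, count)
        for filename, counts in links.items()
        for contraction, count in counts.items()]

# Write to an in-memory buffer here; a real file opened with
# open(path, "w", newline="") could be passed to csv.writer instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["filename", "contraction", "count"])
writer.writerows(rows)

print(buffer.getvalue())
```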

Answer 3

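Dictionaries seem to be the best option, because they make the data easier to work with. Your goal should be to index the results by file name (extracted from link, the URL of your speech text), with each file name mapping to the contractions and their counts.

Something like: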

{"file1": {"it's": 34, "they're": 13, "she's": 9},
 "file2": {"it's": 14, "we're": 15},
 "file3": {"it's": 4},
 "file4": {"it's": 45, "she's": 13}}

Here is the full code:

ret = {}
for link, text in ((l, processURL_short(l)) for l in every_link):
    contractions = {c:0 for c in contractions_list}
    for word in text.split():
        try:
            contractions[word] += 1
        except KeyError:
            # Word or contraction not found.
            pass
    ret[file_naming_code(link)] = contractions

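Let's go through each step.

First we initialize ret, which will be the result dictionary. Then we use a generator expression to run processURL_short() at each step of the loop (rather than processing the whole list of links at once). It yields tuples of (<link-name>, <speech-text>) so that the link name can be used later.
Next comes the contraction count mapping, initialized to 0s, which is used to count the contractions.
Then we split the text into words, and for each word we look it up in the contraction mapping. If it is found we count it; otherwise a KeyError is raised for each key that is not found.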
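
To illustrate the point about the generator expression, here is a small sketch showing that nothing is fetched until the loop asks for the next item. The fetch function is a hypothetical stand-in for processURL_short(), and the link names are made up.

```python
calls = []

def fetch(link):
    """Hypothetical stand-in for processURL_short(); records each call."""
    calls.append(link)
    return "text of " + link

every_link = ["link1", "link2", "link3"]

# Building the generator expression does not fetch anything yet.
pairs = ((l, fetch(l)) for l in every_link)
calls_before = list(calls)

# Asking for the first item triggers exactly one fetch.
first = next(pairs)
calls_after_first = list(calls)

print(calls_before)
print(first)
print(calls_after_first)
```

A list comprehension in the same place would call fetch for every link before the loop body ran even once.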

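(One drawback is that this will perform poorly, because an exception is raised for every word that is not a contraction. An alternative is to test membership with in rather than catching KeyError, for example:)
```
word in contractions
```
Finally, ret[file_naming_code(link)] = contractions is where each file name gets mapped to its dictionary of contraction counts. You can now easily build your table from ret.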
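
As a rough sketch of the membership-test variant, and of turning ret into table rows, something like the following could work. The process_url_stub function and its canned texts are hypothetical stand-ins for processURL_short(), and the link itself is used as the key instead of a file name.

```python
def process_url_stub(link):
    """Hypothetical stand-in for processURL_short(); returns canned text."""
    texts = {
        "link1": "it's late and she's tired but it's fine",
        "link2": "we're here",
    }
    return texts[link]

contractions_list = ["it's", "she's", "we're"]
every_link = ["link1", "link2"]

ret = {}
for link in every_link:
    contractions = {c: 0 for c in contractions_list}
    for word in process_url_stub(link).split():
        if word in contractions:      # membership test, no exception raised
            contractions[word] += 1
    ret[link] = contractions          # the real code would key on the file name

# One row per (file, contraction) pair, skipping contractions never seen.
rows = [(name, c, n)
        for name, counts in ret.items()
        for c, n in counts.items() if n]

for row in rows:
    print("{:<10}{:<14}{}".format(*row))
```

Filtering out zero counts keeps the table in the shape asked for in the question, where only contractions that actually occur in a file are listed.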
