Question

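I have the following script, which loops through a text file of CSS rules and stores each rule and its properties in a dictionary (improvements to the code are welcome, I've only just started using Python):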

findGroups.py

import re
import sys

source = sys.argv[1]
temp = open('pythonTestFile.txt', 'w+')  # output file; nothing is written to it yet
di = {}
with open(source, 'r') as infile:
    for line in infile:
        # a rule name: optional . or # (or a bare *), then word characters,
        # _ or -, optional whitespace and an opening brace
        if re.search(r'^\s*[.#]?[\w*-]+\s*\{', line):
            key = line.replace("{", "").strip()
            di[key] = []
            line = next(infile)
            while "}" not in line:
                # collapse runs of whitespace and drop the trailing \n
                line = ' '.join(line.split())
                di[key].append(line)
                line = next(infile)
temp.close()

Source.txt

* {
    min-height: 1000px;
    overflow: hidden;
}

.leftContainerDiv {
    font-family: Helvetica;
    font-size: 10px;
    background: white;
}

#cs_ht_panel{   
    font-size:10px;
    display:block;
    font-family:Helvetica;
    width:auto;
}
//...etc

Ideally, I'd like the output to look something like this (suggestions for readability are also welcome):

pythonTestFile.txt

Group 1, count(2) - font-family: Helvetica; + font-size: 10px;
Group 2: //...etc

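What I'd like to do now is figure out which CSS properties recur as groups. For example, if font-size: 10px and font-family: Helvetica occur together in one rule, does that group occur in any other rule, and how many times does it occur?

I'm not entirely sure where to go with this. I can't figure out how to start some sort of comparison algorithm, or whether a dictionary is the right data structure for storing the text.

Edit: in response to the comments, I can't use third-party libraries. This script will be used on a Red Hat VM, and only pre-approved software can be pushed to those; I can't just download a library or package.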

Answer 1

Assign a different prime number to each CSS property, for example:

{
    'display: block': 2,
    'font-size: 10px': 3,
    'font-family: Helvetica': 5,
    'min-height: 1000px': 7,
    'overflow: hidden': 11,
    'width: auto': 13,
    'background: white': 17,
}

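Then make a dict where the keys are the CSS selectors (what you call "rules") and each value is the product of the primes of all the properties that selector has: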

{
    '#cs_ht_panel': 390, # 2 * 3 * 5 * 13
    '*': 77, # 7 * 11
    '.leftContainerDiv': 255, # 3 * 5 * 17
}
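
A minimal sketch of how the two tables could be built from parsed rules. The names primes, encode and rules are hypothetical, and the rule data is typed in by hand in the shape the question's parser would produce once its values are normalized. Primes are handed out in the order properties are first seen, so the actual numbers differ from the tables above.

```python
from itertools import count

def primes():
    """Yield 2, 3, 5, ... by trial division against the primes found so far."""
    found = []
    for n in count(2):
        if all(n % p for p in found):
            found.append(n)
            yield n

def encode(rules):
    """Return (prime_of, products) for a dict of selector -> list of properties."""
    prime_of = {}
    products = {}
    gen = primes()
    for selector, props in rules.items():
        product = 1
        seen = set()
        for prop in props:
            if prop in seen:  # a duplicated declaration must not square its prime
                continue
            seen.add(prop)
            if prop not in prime_of:
                prime_of[prop] = next(gen)
            product *= prime_of[prop]
        products[selector] = product
    return prime_of, products

# Hypothetical sample input, mirroring the rules in the question.
rules = {
    '*': ['min-height: 1000px', 'overflow: hidden'],
    '.leftContainerDiv': ['font-family: Helvetica', 'font-size: 10px',
                          'background: white'],
    '#cs_ht_panel': ['font-size: 10px', 'display: block',
                     'font-family: Helvetica', 'width: auto'],
}
prime_of, products = encode(rules)
```

Skipping repeated declarations keeps every product square-free, which the divisibility checks rely on.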

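Now you can easily determine two things:

Which selectors ("rules") have property x (represented by its prime number) or a set of properties {x,y,z,..} (represented by the product of their primes): just check whether the selector's number is divisible by that number.
For example, which selectors have both 'font-family: Helvetica' (5) and 'font-size: 10px' (3)? All of those, and only those, that are divisible by 15.
All the properties two selectors have in common, by calculating the GCD (greatest common divisor) of their numbers. For example, GCD(390, 77) = 1 -> they have no properties in common; GCD(390, 255) = 15 -> factorize -> 3 * 5.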
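
Both checks can be sketched with the numbers from the tables above. The helper names selectors_with, factorize and common_properties are made up for this illustration, and math.gcd assumes Python 3.5 or later (on Python 2, fractions.gcd plays the same role).

```python
from math import gcd

# Values copied from the two tables above.
prime_of = {
    'display: block': 2, 'font-size: 10px': 3, 'font-family: Helvetica': 5,
    'min-height: 1000px': 7, 'overflow: hidden': 11, 'width: auto': 13,
    'background: white': 17,
}
products = {'#cs_ht_panel': 390, '*': 77, '.leftContainerDiv': 255}

def selectors_with(props):
    """Selectors whose number is divisible by the product of the given primes."""
    group = 1
    for p in props:
        group *= prime_of[p]
    return sorted(s for s, n in products.items() if n % group == 0)

def factorize(n):
    """Properties whose prime divides n."""
    return sorted(p for p, prime in prime_of.items() if n % prime == 0)

def common_properties(a, b):
    """Properties shared by selectors a and b, read off the GCD of their numbers."""
    return factorize(gcd(products[a], products[b]))

print(selectors_with(['font-family: Helvetica', 'font-size: 10px']))
print(common_properties('#cs_ht_panel', '.leftContainerDiv'))
```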

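You can also find the most common groups by iterating over all the selector values, finding all the proper divisors that are not prime, and keeping a dict that holds the number of times each divisor has been found. Each divisor is a group, and you can find its elements by factorizing it.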

390 -> 6 10 15 26 30 39 65 78 130 195
255 -> 15 51 85
77 ->

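So 15 appears twice and every other divisor appears once. This means there are 2 occurrences of group 15, which is the group made of properties 3 and 5.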
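
The divisor count can be sketched as follows. The helpers is_prime and group_divisors are hypothetical names, and the brute-force scan over range(2, n) is only meant for small numbers like the ones in this example; collections.Counter needs Python 2.7 or later.

```python
from collections import Counter

def is_prime(n):
    """Trial division; good enough for the small divisors used here."""
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def group_divisors(n):
    """Proper divisors of n that are neither 1 nor prime: groups of 2+ properties."""
    return [d for d in range(2, n) if n % d == 0 and not is_prime(d)]

# Selector values copied from the table above.
products = {'#cs_ht_panel': 390, '*': 77, '.leftContainerDiv': 255}

counts = Counter()
for n in products.values():
    counts.update(group_divisors(n))

print(counts.most_common(1))  # the most frequent group and how often it occurs
```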

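The last step costs 2^n in computation, where n is the number of properties in that CSS selector. This shouldn't be a problem, since most selectors have fewer than 10 properties, but beyond 20 properties you start to run into trouble. I'd suggest compressing the properties by stripping prefixes (moz-, webkit-) and suffixes (-left, -right, -top, -bottom).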
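
One possible way to do that compression, as a sketch only: the function name compress and the two regular expressions are assumptions, and the prefix and suffix lists contain just the ones named above. Note that it discards information on purpose, so margin-left: 10px and margin-right: 10px end up as the same property.

```python
import re

# Only the prefixes and suffixes mentioned in the answer; extend as needed.
VENDOR_PREFIX = re.compile(r'^-?(?:moz|webkit)-')
SIDE_SUFFIX = re.compile(r'-(?:left|right|top|bottom)$')

def compress(declaration):
    """Normalize 'name: value;' and strip vendor prefixes and side suffixes."""
    name, _, value = declaration.partition(':')
    name = VENDOR_PREFIX.sub('', name.strip())
    name = SIDE_SUFFIX.sub('', name)
    return name + ': ' + value.strip().rstrip(';').strip()

print(compress('-webkit-border-radius: 4px;'))
print(compress('margin-left:10px'))
```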

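You can (and probably should, for a real CSS file with hundreds of lines) do all of this using just sets and their operations (intersection, etc.) instead of numbers, products and primes; but isn't this way cooler? ;)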
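
For comparison, a sketch of the same two queries with plain sets: the subset test replaces divisibility and the intersection replaces the GCD. The rules dict is hand-written sample data matching the tables above, and selectors_with and shared are hypothetical names.

```python
from itertools import combinations

# Hand-written sample data: selector -> set of its properties.
rules = {
    '*': {'min-height: 1000px', 'overflow: hidden'},
    '.leftContainerDiv': {'font-family: Helvetica', 'font-size: 10px',
                          'background: white'},
    '#cs_ht_panel': {'font-size: 10px', 'display: block',
                     'font-family: Helvetica', 'width: auto'},
}

def selectors_with(group):
    """Selectors that contain every property in group (subset test)."""
    return sorted(s for s, props in rules.items() if set(group) <= props)

# Properties shared by each pair of selectors (set intersection).
shared = {}
for a, b in combinations(sorted(rules), 2):
    common = rules[a] & rules[b]
    if len(common) >= 2:  # keep only real groups of two or more properties
        shared[(a, b)] = sorted(common)

print(shared)
```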

Answer 2

A solution based on the idea above, except that instead of prime numbers I use sets and ordered lists. Maybe this is what you want?

import re
import itertools

# read the whole file; the with block closes it again
with open('css_test.txt', 'r') as f:
    lines = f.readlines()
lines_str = ' '.join([l.strip() for l in lines])
#print lines_str

r = re.compile(r'.*?{(.*?)}') # Get All strings between {}
groups = r.findall(lines_str)
#print groups

# remove any stray spaces in the string and create groups of attributes like
# style: value
grps = []
for grp in groups:
    # a list (not a lazy filter object), skipping empty or whitespace-only pieces
    grps.append([p for p in grp.strip().split(';') if p.strip()])


# clean up those style: value attributes so that we get 'style:value'
# without any spaces and also collect all such attributes (we'd later create
# a set of these attributes)
grps2 = []
all_keys = []
for grp in grps:
    grp2 = []
    for g in grp:
        x = ':'.join([x.strip() for x in g.split(':')])
        grp2.append(x)
        all_keys.append(x)
    grps2.append(grp2)
# sort the unique keys so the pairs below come out in a stable order
set_keys = sorted(set(all_keys))

print(set_keys)
print('***********')
set_dict = {}
# For each combination of 2 of keys in the set find intersection of this
# set with the set of keys in the cleaned up groups above
# if intersection is the set of 2 keys: initialize a dictionary or add 1
for x in itertools.combinations(set_keys, 2):
    set_x = set(x)
    for g in grps2:
        set_g = set(g)
        #print("set_g : ", set_g)
        if set_x <= set_g:  # both keys occur in this rule
            print(set_x)
            set_dict[x] = set_dict.get(x, 0) + 1

# print everything
print(set_dict)

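Even if this solution doesn't exactly match what you want, you may be able to follow the same line of thinking to get to what you want to do.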
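
The code above only counts pairs. As a sketch of one way to extend it, the following counts groups of any size using only the standard library and formats the result like the pythonTestFile.txt sample in the question. The names count_groups and report, and the hand-written rules dict, are assumptions for illustration. The cost grows as 2^n in the number of properties per rule, so it suits rules with a modest number of declarations.

```python
from collections import Counter
from itertools import combinations

def count_groups(rules, min_size=2):
    """Count every combination of min_size or more properties across all rules."""
    counts = Counter()
    for props in rules.values():
        unique = sorted(set(props))  # sorted so equal groups give equal tuples
        for size in range(min_size, len(unique) + 1):
            counts.update(combinations(unique, size))
    return counts

def report(counts, min_count=2):
    """Format groups seen at least min_count times, most frequent first."""
    repeated = [(g, c) for g, c in counts.items() if c >= min_count]
    repeated.sort(key=lambda gc: (-gc[1], gc[0]))
    return ['Group %d, count(%d) - %s'
            % (i, c, ' + '.join(p + ';' for p in group))
            for i, (group, c) in enumerate(repeated, 1)]

# Hand-written sample data: selector -> list of normalized declarations.
rules = {
    '*': ['min-height: 1000px', 'overflow: hidden'],
    '.leftContainerDiv': ['font-family: Helvetica', 'font-size: 10px',
                          'background: white'],
    '#cs_ht_panel': ['font-size: 10px', 'display: block',
                     'font-family: Helvetica', 'width: auto'],
}
counts = count_groups(rules)
for line in report(counts):
    print(line)
```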

Python - Find groups of repeating pairs/values in a dictionary
