Question

Not sure whether the title is good enough. Feel free to adjust it!

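Here is the situation: my DataFrame is basically a product catalog. Two columns matter here: one is the product ID, the other is a 12-digit category. Below is some sample data. The original data of course contains many more products, more columns, and many different categories.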

import pandas as pd

products = [
    {'category': 110401010601, 'product': 1000023},
    {'category': 110401020601, 'product': 1000024},
    {'category': 110401030601, 'product': 1000025},
    {'category': 110401040601, 'product': 1000026},
    {'category': 110401050601, 'product': 1000027}]

df = pd.DataFrame.from_records(products)

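The task is to use the 12-digit category number to form parent categories, and then to count how many products match each parent. Parent categories are formed in 2-digit steps. The count per parent is later used to find, for each product, a parent that has a minimum number of records (say 12 children). Naturally, the shorter the number, the more products match it. Here is an example parent structure: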

110401050601 # product category
1104010506 # 1st parent
11040105 # 2nd parent
110401 # 3rd parent
1104 # 4th parent
11 # 5th super-parent

As you can see, more products may match e.g. 1104 than just 110401050601.
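The parent structure above can be generated with a small helper. This is only a sketch: the function name `parent_categories` and its `step` parameter are made up for illustration and are not part of the original code.

```python
def parent_categories(category, step=2):
    """Return the category itself followed by its parents, longest prefix first.

    Hypothetical helper for illustration; `step` is the number of digits
    dropped per level (2 in the question).
    """
    category = str(category)
    return [category[:i] for i in range(len(category), 0, -step)]


print(parent_categories(110401050601))
# ['110401050601', '1104010506', '11040105', '110401', '1104', '11']
```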

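Idea 1 for small data: As long as the small or medium-sized data is fully loaded into a Pandas DataFrame, this is an easy task. I solved it with the code below. The drawback is that the code assumes all data is in memory, and every loop iteration performs another selection on the full DataFrame, which is not good for performance. Example: for 100,000 rows and 6 parent groups (formed from the 12 digits), you may end up with 600,000 selections via DataFrame.loc[...] in the worst case. To prevent this, I break out of the loop if the parent has been seen before. Remark: df.shape[0] is equivalent to len(df).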

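# the string slicing below requires the categories to be strings
df['category'] = df['category'].astype(str)
df = df.drop_duplicates()
categories = df['category'].unique()

counts = dict()
for cat in categories:
    # number of rows in exactly this 12-digit category
    counts[cat] = df.loc[df['category'] == cat].shape[0]

    # walk up the parents: prefixes of length 10, 8, 6, 4, 2
    for i in range(10, 1, -2):
        parent = cat[:i]

        if parent not in counts:
            counts[parent] = df.loc[df['category'].str.startswith(parent)].shape[0]
        else:
            # this parent, and therefore every shorter one, was already counted
            break

# keep only categories with at least MIN_COUNT records (MIN_COUNT is defined elsewhere)
counts = {key: value for key, value in counts.items() if value >= MIN_COUNT}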

The result looks like this (using a subset of the original data):

{'11': 100,
 '1103': 7,
 '110302': 7,
 '11030202': 7,
 '1103020203': 7,
 '110302020301': 7,
 '1104': 44,
 '110401': 15,
 '11040101': 15,
 '1104010106': 15,
 '110401010601': 15}

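Idea 2 for big data using flatmap-reduce: Now imagine you have much more data that is loaded row by row, and you want to achieve the same result as above. I was thinking of using flatmap to split the category number into its parents (one-to-many), emitting a counter of 1 for each parent, and then applying a group-by-key to get the counts for all possible parents. The advantage of this version is that it does not need all the data at once and does not perform any selections on the DataFrame. However, in the flatmap step the number of rows grows by a factor of 6 (because the 12-digit category number is split into 6 groups). Since Pandas has no flatten/flatmap method, I had to work around this with unstack (for an explanation, see this post).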

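df = df.drop_duplicates()
# flatmap step: emit a (prefix, 1) pair for the category itself and each of its parents
counts_stacked = df['category'].apply(lambda cat: [(cat[:i], 1) for i in range(12, 1, -2)])
# flatten the per-row lists into one long Series of (prefix, 1) tuples
counts = counts_stacked.apply(pd.Series).unstack().reset_index(drop=True)

# reduce step: group by prefix and count the pairs
df_counts = pd.DataFrame.from_records(list(counts), columns=['category', 'count'])
counts = df_counts.groupby('category').count().to_dict()['count']
# keep only categories with at least MIN_COUNT records (MIN_COUNT is defined elsewhere)
counts = {key: value for key, value in counts.items() if value >= MIN_COUNT}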
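Newer pandas releases (0.25 and later) provide Series.explode, which can stand in for the missing flatmap. The following is a minimal sketch of the same flatmap-then-count idea, assuming the categories are stored as strings; the sample frame repeats the data from the question, and the MIN_COUNT filter is left out.

```python
import pandas as pd

df = pd.DataFrame({
    "product": [1000023, 1000024, 1000025, 1000026, 1000027],
    "category": ["110401010601", "110401020601", "110401030601",
                 "110401040601", "110401050601"],
})

# Build, per row, the list of prefixes: the category itself plus its parents.
prefixes = df.drop_duplicates()["category"].apply(
    lambda cat: [cat[:i] for i in range(len(cat), 0, -2)]
)

# explode() puts every list element on its own row (the "flatmap" step);
# value_counts() then acts as the group-by-key count.
counts = prefixes.explode().value_counts().to_dict()

print(counts["1104"], counts["110401050601"])
# 5 1
```

This still needs the whole column in memory, so it addresses the missing flatmap rather than the row-wise loading.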

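Question: Both solutions work fine, but I wonder whether there is a more elegant way to achieve the same result. I feel like I am missing something.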

Answer 1

You can use cumsum here:

df.category.astype(str).str.split('(..)').apply(pd.Series).replace('',np.nan).dropna(axis=1).cumsum(axis=1).stack().value_counts()
Out[287]: 
11              5
1104            5
110401          5
11040102        1
110401050601    1
1104010206      1
110401040601    1
11040101        1
1104010106      1
110401010601    1
110401020601    1
11040104        1
110401030601    1
11040103        1
1104010406      1
1104010306      1
11040105        1
1104010506      1
dtype: int64
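The one-liner works because a cumulative sum over strings concatenates them, so the 2-digit chunks of a category turn into its successive prefixes. A plain-Python sketch of that step for a single category, using itertools.accumulate in place of the pandas cumsum:

```python
from itertools import accumulate

category = "110401050601"

# Split the category into 2-digit chunks.
chunks = [category[i:i + 2] for i in range(0, len(category), 2)]

# accumulate() adds the items up cumulatively; for strings, "+" concatenates,
# so each running total is one parent prefix.
prefixes = list(accumulate(chunks))

print(chunks)
# ['11', '04', '01', '05', '06', '01']
print(prefixes)
# ['11', '1104', '110401', '11040105', '1104010506', '110401050601']
```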

Answer 2

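Here is another solution using the Apache Beam SDK for Python. It follows the map-reduce paradigm and is therefore compatible with big data. The sample file should contain the product ID as the first column and the 12-digit category as the second column, with ; as the separator. The elegance of this code is that you can clearly see every transformation applied to each row.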

# Python 2.7

import apache_beam as beam
FILE_IN = 'my_sample.csv'
SEPARATOR = ';'
# MIN_COUNT (the minimum count per category) must also be defined before the pipeline runs

# the collector target must be created outside the Do-Function to be globally available
results = dict()

# a custom Do-Function that collects the results
class Collector(beam.DoFn):    
    def process(self, element):
        category, count = element
        results[category] = count
        return { category: count }


# This runs the pipeline locally.
with beam.Pipeline() as p:
    counts = (p
     | 'read file row-wise' >> beam.io.ReadFromText(FILE_IN, skip_header_lines=True)
     | 'split row' >> beam.Map(lambda line: line.split(SEPARATOR))
     | 'remove useless columns' >> beam.Map(lambda words: words[0:2])
     | 'remove quotes' >> beam.Map(lambda words: [word.strip('\"') for word in words])
     | 'convert from unicode' >> beam.Map(lambda words: [str(word) for word in words])
     | 'convert to tuple' >> beam.Map(lambda words: tuple(words))
     | 'remove duplicates' >> beam.RemoveDuplicates()
     | 'extract category' >> beam.Map(lambda (product, category): category)
     | 'create parent categories' >> beam.FlatMap(lambda cat: [cat[:i] for i in range(12,1,-2)])
     | 'group and count by category' >> beam.combiners.Count.PerElement()
     | 'filter by minimum count' >> beam.Filter(lambda count: count[1] >= MIN_COUNT)
     | 'collect results' >> beam.ParDo(Collector())
    )

# Leaving the `with` block runs the pipeline and waits until it finishes,
# so no explicit p.run() call is needed here.

# investigate the result;
# expected is a dict mapping each category to its count
print(results)

The code is written in Python 2.7 because the Apache Beam SDK for Python does not support Python 3 yet.
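To follow the pipeline without a Beam installation, the same sequence of steps can be mimicked in plain Python 3. This is a sketch only, not Beam code: the function name `count_parents`, the sample lines, and the threshold value are all assumptions made for illustration.

```python
from collections import Counter

SEPARATOR = ";"
MIN_COUNT = 2  # hypothetical threshold, chosen only for this example


def count_parents(lines, min_count=MIN_COUNT):
    """Mirror the pipeline steps on an in-memory list of CSV lines."""
    rows = set()  # a set removes duplicate (product, category) pairs
    for line in lines[1:]:  # skip the header line
        words = line.strip().split(SEPARATOR)[:2]  # keep the first two columns
        rows.add(tuple(word.strip('"') for word in words))  # remove quotes

    counter = Counter()
    for _product, category in rows:
        # flatmap step: the category itself plus its parents
        counter.update(category[:i] for i in range(12, 1, -2))

    # filter by minimum count
    return {cat: n for cat, n in counter.items() if n >= min_count}


sample = [
    'product;category',
    '"1000023";"110401010601"',
    '"1000024";"110401020601"',
    '"1000024";"110401020601"',  # duplicate row, counted once
    '"1000025";"110501030601"',
]
result = count_parents(sample)
print(sorted(result.items()))
# [('11', 3), ('1104', 2), ('110401', 2)]
```

Unlike the Beam version, this holds all distinct rows in memory, so it only illustrates the order of the transformations.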

DataFrame: extracting multiple parents from a single ID and counting occurrences
