Question

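The goal is to take multiple cells containing strings and concatenate them in the following format: "string1|string2". The code I am using does this, but the resulting string is significantly shorter than it should be.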

Unconcatenated length: 7,752,545 characters

Concatenated length: 2,118 characters

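In short, the dataset has two columns. One contains the names of different geographic regions, and the other contains company names. I need to use the regions as categories and concatenate all the company names for each region as described above.

What would cause most of the strings to be missing from the formatted dataset?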

# This is the code used to concatenate the strings
import pandas as pd

def preprocessing(dataset, title, keywords):
    # Protect spaces inside company names so split() does not break them apart
    dataset[keywords] = dataset[keywords].replace(' ', '_', regex=True)
    # str(x) turns the whole group Series into text, which is then split on whitespace
    df = dataset.groupby(title)[keywords].apply(lambda x: '|'.join(str(x).split()))
    df = pd.DataFrame(df)
    # Restore the original spaces
    df[keywords] = df[keywords].replace('_', ' ', regex=True)
    return df

region_prep = preprocessing(geo_region, 'Region', 'Company Name')
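A likely cause, judging from the code above: calling `str()` on a pandas Series returns its printable summary rather than its raw values, and pandas shortens that summary for long Series. The sketch below illustrates the difference on made-up data (the `company_N` names are hypothetical, not taken from the real dataset) and assumes default pandas display options.

```python
import pandas as pd

# Hypothetical data: 200 made-up company names in a single Series.
names = pd.Series([f"company_{i}" for i in range(200)], name="Company Name")

# str() gives the printable summary: index labels, a shortened middle
# section, and a footer with the name and dtype.
as_text = str(names)

# Joining the values themselves keeps every entry.
joined = "|".join(names.astype(str))

print(len(as_text), len(joined))
print("company_100" in as_text, "company_100" in joined)
```

If this is what happens in the real data, the short result would also contain index labels and footer words such as `dtype:` mixed in with the company names.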

Desired result

Given a file like this:

geo_region

Company Name   Region
walmart        north america
amazon         north america
google         north america

I am looking for a result that looks like this:

region_prep

north america    walmart|amazon|google
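One possible way to get this shape is to join the values of each group directly instead of converting the group to text first. This is only a minimal sketch using the three sample rows above; dropping missing names with `dropna()` and keeping the regions in their original order with `sort=False` are assumptions, not requirements stated in the question.

```python
import pandas as pd

def preprocessing(dataset, title, keywords):
    # Join the raw values of each group with "|".
    # Assumption: missing company names should be skipped.
    joined = dataset.groupby(title, sort=False)[keywords].agg(
        lambda names: "|".join(names.dropna().astype(str))
    )
    # Return a one-column DataFrame indexed by the grouping column.
    return joined.to_frame()

# Sample data from the question.
geo_region = pd.DataFrame({
    "Company Name": ["walmart", "amazon", "google"],
    "Region": ["north america", "north america", "north america"],
})

region_prep = preprocessing(geo_region, "Region", "Company Name")
print(region_prep)
```

Because nothing is split on whitespace here, company names containing spaces need no underscore substitution.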

Characters disappear when concatenating strings

0 answers: