Dataframe summary of unique pages

Time: 2016-06-24 03:58:11

Tags: python pandas dataframe group-by unique

This is my dataframe:

import pandas as pd
import re

!wget https://s3.amazonaws.com/todel162/elastic.csv

df=pd.read_csv('elastic.csv')

def mysearch(mystring):
    # raw string, so the backslash escapes reach the regex engine unchanged
    urls = re.findall(r'elastic.co/guide(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', mystring)
    return urls

df['mysearch']=df.Body.apply(mysearch)
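
To make the goal concrete, here is a minimal sketch of what `mysearch` returns for a single post body and how a page name can be cut out of each match. The `sample` text is invented for illustration and is not taken from elastic.csv; the pattern is the one above, written as a raw string.

```python
import re

# Same pattern as in the question, as a raw string
PATTERN = r'elastic.co/guide(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def mysearch(mystring):
    # findall returns the full matches because the pattern has no capturing groups
    return re.findall(PATTERN, mystring)

# Hypothetical post body (assumption, made up for this example)
sample = (
    'See https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html '
    'and also https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html for details.'
)

urls = mysearch(sample)

# The page name is whatever follows the last "/" of each match
pages = [u.rsplit('/', 1)[-1] for u in urls]

print(urls)
print(pages)
```

Each match starts at `elastic.co/guide` and stops at the first character outside the allowed set (here, the space after `.html`), so one body can yield several URLs.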

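Each row of the mysearch column can hold several URLs. I need to join every unique HTML page (the page name, not the full URL) to its corresponding ParentId values. The output should look something like this: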

query-dsl-term-query.html 35564374, 46568374
query-dsl-bool-query.html 35594195, 75694493
plugins-inputs-jdbc.html 34203007

1 answer:

Answer 0 (score: 1)

You can use:

import pandas as pd

#force column ParentId as string
df=pd.read_csv('https://s3.amazonaws.com/todel162/elastic.csv', dtype={'ParentId':str})
#print (df)

#find all patterns, create new dataframe
pat = r'elastic.co/guide(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
df1 = pd.DataFrame([x for x in df.Body.str.findall(pat)])

#see http://stackoverflow.com/a/37592047/2901002
df1 = df.drop('Body',axis=1).join(df1.stack().reset_index(drop=True, level=1).rename('Body'))

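#keep only rows whose URL contains a literal ".html"
#(regex=False stops "." from matching any character; na=False treats rows without a URL as non-matching)
df1 = df1[df1.Body.str.contains('.html', regex=False, na=False)]

#split on the last `/` and keep the page name (n must be passed by keyword in recent pandas)
df1['url'] = df1.Body.str.rsplit('/', n=1, expand=False).str[1]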
#print (df1)

#join by unique url
df2 = df1.groupby('url')['ParentId'].apply(lambda x: ','.join(x.astype(str))).reset_index()
print (df2)

                                                   url  \
0                                   _add_an_index.html   
1                                   _add_failover.html   
2                         _aggregation_test_drive.html   
3                                 _basic_concepts.html   
4                               _batch_processing.html   
5                                    _best_fields.html   
6                         _boosting_query_clauses.html   
7                            _bucket_aggregations.html   
8                         _buckets_inside_buckets.html   
9                                        _cat_api.html   
10                              _closer_is_better.html   
11                                _cluster_health.html   
12                _combining_queries_with_filters.html   
13                                _community_dsls.html   
14                        _community_integrations.html   
15                                 _configuration.html   
16                          _controlling_analysis.html   
17                           _coping_with_failure.html   
18                          _cross_fields_queries.html   
19   _dealing_with_json_arrays_and_objects_in_php.html   
20                      _dealing_with_null_values.html   
21                               _delete_an_index.html   
22                             _deleting_an_index.html   
23                            _deleting_documents.html   
24                _deploying_in_jboss_eap6_module.html   
25         _developer_guide_adding_a_new_protocol.html   
26                             _elasticsearch_net.html   
27                                  _empty_search.html   
28                            _exact_value_fields.html   
29                        _executing_aggregations.html   
..                                                 ...   
923                             suggester-context.html   
924                       synonyms-analysis-chain.html   
925                   synonyms-expand-or-contract.html   
926                                         tasks.html   
927                            term-level-queries.html   
928                                   term-vector.html   
929                             term-vs-full-text.html   
930                        terms-list-query-usage.html   
931                             testing-framework.html   
932                                    time-based.html   
933                                    time-units.html   
934                                   token-count.html   
935                                      top-hits.html   
936                                      translog.html   
937                              transport-client.html   
938                         unicode-normalization.html   
939                                    unit-tests.html   
940                                    update-doc.html   
941                                    user-based.html   
942              using-elasticsearch-test-classes.html   
943               using-kibana-for-the-first-time.html   
944                      using-language-analyzers.html   
945                               using-stopwords.html   
946                                using-synonyms.html   
947               verbatim-and-strict-query-usage.html   
948                                     visualize.html   
949                              watch-definition.html   
950                                watch-log-data.html   
951                          working-with-plugins.html   
952                               writing-queries.html   

                                              ParentId  
0                                                  nan  
1                                                  nan  
2                                                  nan  
3     35958492,nan,35374339,31180988,29818589,32869841  
4                                             34509058  
5                                             33398143  
6    33398143,31836937,34069554,31967672,34006197,3...  
7                                          nan,nan,nan  
8                                         nan,30063221  
9                                             29526147  
10                 31311687,34323428,34255519,30517904  
11                                            36026339  
12                  33395412,nan,28989479,36325156,nan  
13                                            34143066  
14                                            34143066  
15                                            30886182  
16             31591210,35914330,32246656,32463762,nan  
17                                        35078736,nan  
18                          33398143,34631940,36569635  
19                                                 nan  
20                                 nan,nan,nan,nan,nan  
21                                            32872677  
22                                        nan,22924300  
23                                                 nan  
24                                             nan,nan  
25                                            34132278  
26                                        nan,30956854  
27                                   31027308,33658619  
28                               29923047,33757901,nan  
29                            nan,nan,30280206,nan,nan  
..                                                 ...  
923  37189942,36802797,36802797,35683069,nan,362040...  
924                                           34358802  
925                                  33250379,34358802  
926                                           36508292  
927                                           34312196  
928                              32269054,nan,34680820  
929                36414571,32264571,32075616,32619266  
930                                  36697563,36565189  
931                                           30755194  
932            28984723,33827559,32635456,32718927,nan  
933                                           36752424  
934       36025764,34148626,32059804,34882813,34171223  
935                  nan,nan,nan,29896839,nan,31411664  
936                         33110371,33110371,35465922  
937  nan,35064511,35876176,31453270,nan,27170739,25...  
938                                                nan  
939                                                nan  
940                      nan,33218812,31424380,nan,nan  
941                                                nan  
942                                                nan  
943                                           33996619  
944                                  30195926,37218517  
945  31625943,33370591,36794324,30132959,32694958,3...  
946                          29254643,34255519,nan,nan  
947                                  37697866,37697866  
948                                           35347332  
949                                           31831689  
950                                  33831247,31831689  
951                                  37007206,31809884  
952                                                nan  

[953 rows x 2 columns]
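
The output above still contains `nan` entries and repeated ids (for example `37697866,37697866`), while the question asks for unique pages. A possible variant, sketched here on a small invented frame rather than on elastic.csv, drops missing ParentId values and duplicate (page, ParentId) pairs before joining. It assumes pandas 0.25 or later for `Series.explode`; the toy data and the `page` column name are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for elastic.csv; all ParentId and Body values are invented
df = pd.DataFrame({
    'ParentId': ['111', None, '222', '111', '333'],
    'Body': [
        'see https://www.elastic.co/guide/en/es/current/query-dsl-term-query.html',
        'see https://www.elastic.co/guide/en/es/current/query-dsl-term-query.html',
        'try https://www.elastic.co/guide/en/es/current/query-dsl-term-query.html '
        'or https://www.elastic.co/guide/en/es/current/query-dsl-bool-query.html',
        'again https://www.elastic.co/guide/en/es/current/query-dsl-term-query.html',
        'no link here',
    ],
})

pat = r'elastic.co/guide(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# One row per extracted URL; bodies without a match explode to NaN and are dropped
urls = df['Body'].str.findall(pat).explode().dropna()

pages = (
    df[['ParentId']]
    .join(urls.rename('url'))              # align each URL with its ParentId by index
    .dropna(subset=['ParentId', 'url'])    # discard missing ids and rows without a URL
)
pages = pages[pages['url'].str.endswith('.html')]
pages = pages.assign(page=pages['url'].str.rsplit('/', n=1).str[-1])

result = (
    pages.drop_duplicates(['page', 'ParentId'])  # each id listed once per page
    .groupby('page')['ParentId']
    .agg(', '.join)
    .reset_index()
)
print(result)
```

On the toy data, the repeated id `111` appears once for `query-dsl-term-query.html`, and the row with a missing ParentId contributes nothing.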