Here is my DataFrame:
import pandas as pd
import re
!wget https://s3.amazonaws.com/todel162/elastic.csv
df=pd.read_csv('elastic.csv')
def mysearch(mystring):
    # return every elastic.co/guide URL found in the text (raw string avoids invalid-escape warnings)
    urls = re.findall(r'elastic.co/guide(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', mystring)
    return urls
df['mysearch']=df.Body.apply(mysearch)
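Each row of the mysearch column can contain several URLs. I need to join every unique HTML page (the page name, not the full URL) to its corresponding ParentId values. The output should look like this: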
query-dsl-term-query.html 35564374, 46568374
query-dsl-bool-query.html 35594195, 75694493
plugins-inputs-jdbc.html 34203007
Answer 0 (score: 1)
You can use:
import pandas as pd
# force the ParentId column to be read as strings
df = pd.read_csv('https://s3.amazonaws.com/todel162/elastic.csv', dtype={'ParentId': str})
#print (df)
# find all matches of the pattern in each Body and build a new DataFrame (one column per match)
pat = r'elastic.co/guide(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
df1 = pd.DataFrame([x for x in df.Body.str.findall(pat)])
# see http://stackoverflow.com/a/37592047/2901002
# stack the matches into one URL per row and join them back to the remaining columns
df1 = df.drop('Body', axis=1).join(df1.stack().reset_index(drop=True, level=1).rename('Body'))
# keep only rows whose URL contains the literal text .html
# (regex=False stops '.' matching any character; na=False drops rows without a URL)
df1 = df1[df1.Body.str.contains('.html', regex=False, na=False)]
# split on the last `/` and keep the page name
df1['url'] = df1.Body.str.rsplit('/', n=1, expand=False).str[1]
#print (df1)
# join the ParentId values of each unique url
df2 = df1.groupby('url')['ParentId'].apply(lambda x: ','.join(x.astype(str))).reset_index()
print(df2)
url \
0 _add_an_index.html
1 _add_failover.html
2 _aggregation_test_drive.html
3 _basic_concepts.html
4 _batch_processing.html
5 _best_fields.html
6 _boosting_query_clauses.html
7 _bucket_aggregations.html
8 _buckets_inside_buckets.html
9 _cat_api.html
10 _closer_is_better.html
11 _cluster_health.html
12 _combining_queries_with_filters.html
13 _community_dsls.html
14 _community_integrations.html
15 _configuration.html
16 _controlling_analysis.html
17 _coping_with_failure.html
18 _cross_fields_queries.html
19 _dealing_with_json_arrays_and_objects_in_php.html
20 _dealing_with_null_values.html
21 _delete_an_index.html
22 _deleting_an_index.html
23 _deleting_documents.html
24 _deploying_in_jboss_eap6_module.html
25 _developer_guide_adding_a_new_protocol.html
26 _elasticsearch_net.html
27 _empty_search.html
28 _exact_value_fields.html
29 _executing_aggregations.html
.. ...
923 suggester-context.html
924 synonyms-analysis-chain.html
925 synonyms-expand-or-contract.html
926 tasks.html
927 term-level-queries.html
928 term-vector.html
929 term-vs-full-text.html
930 terms-list-query-usage.html
931 testing-framework.html
932 time-based.html
933 time-units.html
934 token-count.html
935 top-hits.html
936 translog.html
937 transport-client.html
938 unicode-normalization.html
939 unit-tests.html
940 update-doc.html
941 user-based.html
942 using-elasticsearch-test-classes.html
943 using-kibana-for-the-first-time.html
944 using-language-analyzers.html
945 using-stopwords.html
946 using-synonyms.html
947 verbatim-and-strict-query-usage.html
948 visualize.html
949 watch-definition.html
950 watch-log-data.html
951 working-with-plugins.html
952 writing-queries.html
ParentId
0 nan
1 nan
2 nan
3 35958492,nan,35374339,31180988,29818589,32869841
4 34509058
5 33398143
6 33398143,31836937,34069554,31967672,34006197,3...
7 nan,nan,nan
8 nan,30063221
9 29526147
10 31311687,34323428,34255519,30517904
11 36026339
12 33395412,nan,28989479,36325156,nan
13 34143066
14 34143066
15 30886182
16 31591210,35914330,32246656,32463762,nan
17 35078736,nan
18 33398143,34631940,36569635
19 nan
20 nan,nan,nan,nan,nan
21 32872677
22 nan,22924300
23 nan
24 nan,nan
25 34132278
26 nan,30956854
27 31027308,33658619
28 29923047,33757901,nan
29 nan,nan,30280206,nan,nan
.. ...
923 37189942,36802797,36802797,35683069,nan,362040...
924 34358802
925 33250379,34358802
926 36508292
927 34312196
928 32269054,nan,34680820
929 36414571,32264571,32075616,32619266
930 36697563,36565189
931 30755194
932 28984723,33827559,32635456,32718927,nan
933 36752424
934 36025764,34148626,32059804,34882813,34171223
935 nan,nan,nan,29896839,nan,31411664
936 33110371,33110371,35465922
937 nan,35064511,35876176,31453270,nan,27170739,25...
938 nan
939 nan
940 nan,33218812,31424380,nan,nan
941 nan
942 nan
943 33996619
944 30195926,37218517
945 31625943,33370591,36794324,30132959,32694958,3...
946 29254643,34255519,nan,nan
947 37697866,37697866
948 35347332
949 31831689
950 33831247,31831689
951 37007206,31809884
952 nan
[953 rows x 2 columns]
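The output above still contains nan entries and repeated IDs (for example 36802797,36802797). To check the logic without downloading the CSV, the same steps (find the URLs, keep those containing .html, take the text after the last `/`, group the ParentId values) can be sketched in plain Python. This is only an illustration, not the answer's method: the `rows` list and the `pages_to_parent_ids` helper are made up for the example, and skipping rows without a ParentId and dropping repeated IDs are assumptions about what the asker wants.

```python
import re
from collections import defaultdict

# Same pattern as in the answer, written as a raw string.
PAT = r'elastic.co/guide(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

BASE = "https://www.elastic.co/guide/en/"

# Hypothetical (ParentId, Body) rows standing in for elastic.csv.
rows = [
    ("35564374", "see " + BASE + "elasticsearch/reference/current/query-dsl-term-query.html for details"),
    ("46568374", BASE + "elasticsearch/reference/current/query-dsl-term-query.html and "
                 + BASE + "elasticsearch/reference/current/query-dsl-bool-query.html"),
    (None, BASE + "logstash/current/plugins-inputs-jdbc.html"),            # missing ParentId
    ("34203007", BASE + "logstash/current/plugins-inputs-jdbc.html"),
    ("34203007", "again " + BASE + "logstash/current/plugins-inputs-jdbc.html"),  # repeated ID
    ("11111111", "no page here: https://www.elastic.co/guide/index"),      # no .html page
]


def pages_to_parent_ids(rows):
    """Map each .html page name to a comma-separated string of unique ParentIds."""
    grouped = defaultdict(list)
    for parent_id, body in rows:
        if parent_id is None:
            # Assumption: rows without a ParentId are skipped rather than shown as "nan".
            continue
        for url in re.findall(PAT, body):
            if '.html' not in url:
                continue
            page = url.rsplit('/', 1)[1]  # text after the last '/'
            if parent_id not in grouped[page]:
                # Assumption: each ParentId is listed once per page.
                grouped[page].append(parent_id)
    return {page: ','.join(ids) for page, ids in grouped.items()}


result = pages_to_parent_ids(rows)
for page, ids in sorted(result.items()):
    print(page, ids)
```

In the pandas version, a comparable effect could be obtained by dropping rows with a missing ParentId and duplicate (url, ParentId) pairs before the groupby step.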