我是Google Cloud的新手,我在GCS中拥有以下文件,需要设计数据流以合并文件并替换产品,位置文件中的值,并将最终输出文件加载到BigQuery中。
本地计算机上的Python代码:
import pandas as pd
df1 = pd.read_csv("C:/Users/xxxx\\actual_file.csv")
df2 = pd.read_csv("C:/Users/xxxx_folder\\product.csv",header=None,names=['id', 'product_name'])
df3 = pd.merge(df1, df2, how='left', left_on='product_id', right_on='id')
df3.drop(['product_id_x', 'id'], axis=1,inplace=True)
df4 = pd.read_csv("C:/Users/xxxx_folder\\location.csv",header=None,names=['id', 'location_name'])
df5 = pd.merge(df3, df4, how='left', left_on='location_id', right_on='id')
df5.drop(['location_id_x', 'id'], axis=1,inplace=True)
df5.rename(columns={'product_name_y':'product_name','location_name_y':'location'}, inplace=True)
df5.to_csv('Final_file.csv', sep=',',encoding='utf-8', index=False)
感谢您的帮助。
答案 0 :(得分:0)
要加入这些行,您想使用GroupByKey
或CoGroupByKey
在文档https://beam.apache.org/documentation/programming-guide/#core-beam-transforms中查看第4.2.3节
emails_list = [
('amy', 'amy@example.com'),
('carl', 'carl@example.com'),
('julia', 'julia@example.com'),
('carl', 'carl@email.com'),
]
phones_list = [
('amy', '111-222-3333'),
('james', '222-333-4444'),
('amy', '333-444-5555'),
('carl', '444-555-6666'),
]
emails = p | 'CreateEmails' >> beam.Create(emails_list)
phones = p | 'CreatePhones' >> beam.Create(phones_list)
# The result PCollection contains one key-value element for each key in the
# input PCollections. The key of the pair will be the key from the input and
# the value will be a dictionary with two entries: 'emails' - an iterable of
# all values for the current key in the emails PCollection and 'phones': an
# iterable of all values for the current key in the phones PCollection.
results = ({'emails': emails, 'phones': phones}
| beam.CoGroupByKey())
def join_info(name_info):
(name, info) = name_info
return '%s; %s; %s' %\
(name, sorted(info['emails']), sorted(info['phones']))
contact_lines = results | beam.Map(join_info)