I have some files stored in a cloud storage bucket, and each file contains different variables. I would like to write a function that takes the names of the variables I am interested in and builds a master data set containing only those columns. The function should iterate through the files and, whenever a file contains one of the requested column names, grab that column (or columns) and join it to a master dataframe. Below is what I have so far. Any help in developing this further would be very much appreciated.
---
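import pandas as pd
import google.datalab.storage as storage  # assumed: the Datalab storage module
from tensorflow.python.lib.io import file_io

bucket_name = 'bucket_name'  # placeholder for the real bucket name

def get_my_data(list1):
    df = pd.DataFrame()  # master dataframe, still to be filled
    files = [o.key for o in storage.Objects(bucket_name, '', '')]
    for f in files:
        file1 = "gs://%s/%s" % (bucket_name, f)
        # read only the first row to get this file's column names
        with file_io.FileIO(file1, 'r') as fh:
            columns = pd.read_csv(fh, nrows=1).columns
        for l in list1:
            if l in columns:
                # reopen the file: the handle above has already been consumed
                with file_io.FileIO(file1, 'r') as fh:
                    data = pd.read_csv(fh, usecols=[l])
                print("%s\n%s" % (file1, data[l]))
                # append desired column to our new df
    return df

get_my_data(['var1', 'var2', 'var3'])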
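
To make the missing "append to the master dataframe" step concrete, here is a minimal sketch of one way it might look. It is not tied to the bucket: the name `collect_columns` and the `openers` dict (file name mapped to a callable that returns a fresh file handle) are hypothetical and exist only for illustration, and two small in-memory CSVs stand in for the stored files. It assumes the rows of the different files line up by position, so the columns can be joined on the row index; if the files share a key column instead, a merge on that key would be the better fit.

```python
import io
import pandas as pd

def collect_columns(openers, wanted):
    """Build one DataFrame from the wanted columns found across several CSVs.

    openers: dict mapping a file name to a zero-argument callable that returns
             a fresh readable handle (hypothetical interface, for illustration).
    wanted:  list of column names to collect.
    """
    pieces = []
    seen = set()
    for name, open_file in openers.items():
        # Header-only read: nrows=0 parses the column names and no data rows.
        with open_file() as fh:
            header = pd.read_csv(fh, nrows=0).columns
        # Keep wanted columns present here and not already taken from an earlier file.
        found = [c for c in wanted if c in header and c not in seen]
        if not found:
            continue
        # Fresh handle for the real read; usecols limits parsing to the needed columns.
        with open_file() as fh:
            pieces.append(pd.read_csv(fh, usecols=found)[found])
        seen.update(found)
    if not pieces:
        return pd.DataFrame()
    # Column-wise join on the row index (assumes rows line up across files).
    master = pd.concat(pieces, axis=1)
    # Order columns as requested, skipping any that were never found.
    return master[[c for c in wanted if c in master.columns]]


# Stand-in data: two small in-memory CSVs instead of bucket objects.
csv_a = "var1,other\n1,x\n2,y\n"
csv_b = "var2,var3,extra\n10,a,0\n20,b,0\n"
openers = {
    "a.csv": lambda: io.StringIO(csv_a),
    "b.csv": lambda: io.StringIO(csv_b),
}
master = collect_columns(openers, ["var1", "var2", "var3"])
print(master)
```

For the bucket case, each opener could presumably be something like `lambda p=path: file_io.FileIO(p, 'r')`, with `path` built from the bucket name and object key as in the function above. Reading the header first and then passing `usecols` means each file's full contents are parsed only when it actually holds a requested column.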
