I work with clients in “traditional” (non-Internet) industries. Often, they have neither centralized database management nor a well-documented database. The oil and gas clients I have mostly worked with share this issue. Since oil production is a complicated process involving a variety of teams, such as geoscientists, engineers, and business analysts, their data often exist in different forms and in different places. Normally they have a relational database, but sometimes their data are physical files (PDFs or Excel files). Thus, during a project, a client normally does a “data dump,” uploading unstructured data (files) to our or their FTP server.
Walk through the server
Obviously, downloading all of the data to a local drive is time-consuming and inefficient. Thus, I was looking for a way to efficiently navigate and filter files and run a download command in Python. Python’s os library provides the walk function, which walks through all the subdirectories and files in a target directory. If we can filter files or folders based on their names, this can be quite useful. I found a package called ftptool whose FTPHost class does the exact same thing as os.walk, but over FTP.
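To see what the walk pattern yields before touching FTP at all, here is a minimal sketch using plain os.walk on a throwaway directory tree (the directory and file names below are made up for the demo):

```python
import os
import tempfile

# Build a small throwaway directory tree just to demonstrate os.walk
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "reports", "vendor_a"))
with open(os.path.join(root, "reports", "vendor_a", "Acme - Q1.pdf"), "w") as f:
    f.write("dummy")

# os.walk yields a (dirname, subdirs, files) tuple for every directory under root
for dirname, subdirs, files in os.walk(root):
    print(os.path.relpath(dirname, root), subdirs, files)
```

ftptool’s FTPHost.walk follows the same (dirname, subdirs, files) shape, which is why the filtering logic below carries over directly.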
import os
from ftptool import FTPHost

# connect to the server
ip_address = "0.0.0.0"
login_id = 'ID'
login_password = 'PW'
a_host = FTPHost.connect(ip_address, user=login_id, password=login_password)

directory = '/TargetDirectory/'
file_summary = []
for (dirname, subdirs, files) in a_host.walk(directory):
    # select dirnames that have 'dir_substr' at the end (after the last '/')
    if dirname.split('/')[-1] == 'dir_substr':
        for file in files:
            # avoid hidden files like .DS_Store
            if not file.startswith('.'):
                # split the file name from its extension
                file_name, file_extension = os.path.splitext(file)
                # use string manipulation to extract information from the file name;
                # in this case, we are looking for a vendor name:
                # the first word before the first dash (after removing spaces)
                vendor_name = file_name.split('-')[0].strip()
                # combine everything together (make everything upper case for consistency);
                # we also save the full FTP file path for the download step
                file_summary.append((vendor_name.upper(),
                                     file_extension.upper(),
                                     os.path.join(dirname, file)))
Here, we loop over every folder and file in TargetDirectory and, where possible, restrict the search to directories matching a substring to reduce search time. Then we skip files that start with a dot and use string manipulation to extract information from the file names. Finally, we populate a list with the extracted information (vendor_name), the file extension (e.g., .PDF or .CSV), and the full file path on the FTP server, which we will use for the download.
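The string-manipulation step can be tested on its own, with no server connection. Below is a small sketch of that logic as a standalone function (the function name and sample file names are made up for illustration):

```python
import os

def summarize(dirname, file):
    """Extract (vendor, extension, full_path) from a name like 'Acme - report.pdf'."""
    file_name, file_extension = os.path.splitext(file)
    # vendor name: the chunk before the first dash, with surrounding spaces removed
    vendor_name = file_name.split('-')[0].strip()
    # FTP paths always use forward slashes, so join with '/' rather than os.path.join
    return (vendor_name.upper(), file_extension.upper(), dirname + '/' + file)

print(summarize('/TargetDirectory/dir_substr', 'Acme - 2019 report.pdf'))
# → ('ACME', '.PDF', '/TargetDirectory/dir_substr/Acme - 2019 report.pdf')
```

Testing this piece in isolation is handy because file-naming conventions vary by client, and this is the part that usually needs adjusting.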
Download selected files
Technically, it should be possible to download files with the same library, ftptool, but it didn’t seem to work for me. I found a different library, ftplib, with which downloading files from the server to a local drive works just fine.
# connect to the server, this time using ftplib.FTP
import os
from ftplib import FTP

ftp = FTP(ip_address)
ftp.login(login_id, login_password)

local_path = "/LocalPath/"
# paths_for_download is the list of full FTP file paths
# obtained in the previous step
for path_for_download in paths_for_download:
    # create a file name to save as on the local drive;
    # here, we keep the file name the same
    ftp_filename = path_for_download.split('/')[-1]
    local_filename = os.path.join(local_path, ftp_filename)
    # retrieve the file in binary transfer mode (basically writing a file)
    lf = open(local_filename, "wb")
    ftp.retrbinary("RETR " + path_for_download, lf.write)
    lf.close()  # make sure you close the local file after writing
Since we’re using a different FTP package, we need to reconnect to the server. We then use the full file paths from the previous step (the variable paths_for_download) to locate each file on the server. The remaining step is basically writing/retrieving each file.
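The only pure-Python piece of this step is mapping an FTP path to a local file name, and that mapping can be sketched and checked without a live server (the function name here is hypothetical, not part of ftplib):

```python
import os

def local_target(local_path, ftp_path):
    """Keep the FTP file's base name and place it under local_path."""
    # FTP paths always use forward slashes, so split on '/' rather than os.sep
    ftp_filename = ftp_path.split('/')[-1]
    return os.path.join(local_path, ftp_filename)

print(local_target('/LocalPath', '/TargetDirectory/dir_substr/report.pdf'))
```

Splitting on '/' explicitly (instead of os.path.basename) keeps the behavior consistent when the script runs on Windows, where os.sep is a backslash.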
Export files as a zip file
In my case, after analyzing the filtered files that I downloaded, I had to send back some files that showed abnormal behavior (usually because of their contents or formats). If the entire analysis process exists as Python code, it’d be nice to create a zip file of the flagged files by running code too, especially if this process requires multiple runs. We can use the zipfile package.
import os
import zipfile

def zip_files(src, dst, files):
    """
    Create a zip file (dst.zip) by zipping files from src.
    The input argument "files" is a collection of file names (no directory).
    """
    # dst defines the name of the zip file to be created
    zf = zipfile.ZipFile("%s.zip" % (dst), "w", zipfile.ZIP_DEFLATED)
    abs_src = os.path.abspath(src)
    for filename in files:
        absname = os.path.abspath(os.path.join(src, filename))
        # archive name: the path relative to src, so the zip holds no absolute paths
        arcname = absname[len(abs_src) + 1:]
        zf.write(absname, arcname)
    zf.close()

# dst should include the directory; outliers is the list of file names to compress
zip_files(local_path, os.path.join(local_path, 'outlier'), outliers)
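To sanity-check this zipping approach, we can compress a couple of throwaway files and read the archive back. The sketch below uses a temporary directory and made-up file names, and writes archive entries relative to the source directory in the same way:

```python
import os
import tempfile
import zipfile

# Create two throwaway files to stand in for the flagged outliers
src = tempfile.mkdtemp()
outliers = ['bad_format.csv', 'bad_content.pdf']
for name in outliers:
    with open(os.path.join(src, name), 'w') as f:
        f.write('dummy')

# Zip them with archive names relative to src (no absolute paths in the zip)
dst = os.path.join(src, 'outlier')
with zipfile.ZipFile(dst + '.zip', 'w', zipfile.ZIP_DEFLATED) as zf:
    for name in outliers:
        zf.write(os.path.join(src, name), name)

# Read the archive back to confirm its contents
with zipfile.ZipFile(dst + '.zip') as zf:
    print(sorted(zf.namelist()))
# → ['bad_content.pdf', 'bad_format.csv']
```

Trimming the archive names to be relative matters: otherwise the zip reproduces the full local directory tree when the client extracts it.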