Connect Python to an FTP server

I work with clients in “traditional” (non-Internet-based) industries. Often, they have neither centralized database management nor a well-documented database. The oil and gas clients I have mostly worked with share this issue. Since oil production is a complicated process that involves a variety of teams, such as geoscientists, engineers and business analysts, their data often exist in different forms and in different places. Normally they have a relational database, but sometimes their data are physical files (PDFs or Excel files). Thus, during a project, a client typically does a “data dump” in which they upload unstructured data (files) to our or their FTP server.

Walk through the server

Obviously, downloading all of the data to a local drive is time-consuming and inefficient. Thus, I was looking for a way to efficiently navigate and filter files and run a download command in Python. Python’s os library provides the walk function, which can walk through all the subdirectories and files in a target directory. If we can filter files or folders based on their names, this can be quite useful. I found a package called ftptool whose FTPHost class provides a walk method that does the same thing as os.walk, but on the server.

import os
from ftptool import FTPHost

# connect to the server
ip_address = "0.0.0.0"
login_id = 'ID'
login_password = 'PW'

a_host = FTPHost.connect(ip_address, user=login_id, password=login_password)
directory = '/TargetDirectory/'

file_summary = []
for (dirname, subdirs, files) in a_host.walk(directory):

    # select dirnames whose last path component (after '/') is 'dir_substr'
    if dirname.split('/')[-1] == 'dir_substr':

        for file in files:

            # skip hidden files like .DS_Store
            if not file.startswith('.'):

                # split the file name and its extension
                file_name, file_extension = os.path.splitext(file)

                # Use string manipulation to extract information from the file name
                # in this case, we are looking for a vendor name:
                # the first word before the first dash (after removing spaces)
                vendor_name = file_name.split('-')[0].strip()

                # combine everything together (make everything upper case for consistency)
                # we also save the full file path on the FTP server for download
                file_summary.append((vendor_name.upper(),
                                     file_extension.upper(),
                                     os.path.join(dirname, file)))

Here, we loop over every folder and file in TargetDirectory and, where possible, restrict the search to a specific directory filtered by a substring to reduce search time. Then we skip files that start with a dot and use string manipulation to extract information from each file name. Finally, we populate a list with the extracted information (vendor_name), the file extension (e.g., .PDF or .csv) and the full file path on the FTP server, which we will use to download the files.
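The download step below expects a flat list of server-side paths. As a minimal bridging sketch (assuming file_summary holds the tuples built above; the paths_for_download name is reused from the next block):

# pull out the third element (the full FTP path) of each tuple in file_summary
# this helper list is only a bridge between the two code blocks
paths_for_download = [path for (_, _, path) in file_summary]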

Download selected files

Technically, it should be possible to download files with the same library, ftptool, but it didn’t seem to work for me. I found that a different library, ftplib, handles downloading files from the server to a local drive just fine.

# connect to the server using ftplib.FTP
import os
from ftplib import FTP

ftp = FTP(ip_address)
ftp.login(login_id, login_password)

local_path = "/LocalPath/"
# paths_for_download is a list of full file paths
# obtained from the previous step
for path_for_download in paths_for_download:

    # create a file name to save as on the local drive
    # here, keep the same file name as on the server
    ftp_filename = path_for_download.split('/')[-1]
    local_filename = os.path.join(local_path, ftp_filename)

    # retrieve the file in binary transfer mode (basically writing a file)
    lf = open(local_filename, "wb")
    ftp.retrbinary("RETR " + path_for_download, lf.write)
    lf.close()  # make sure you close the local file after writing

Since we’re using a different FTP package, we need to reconnect to the server. Then we use the full file paths from the previous step (the paths_for_download list) to locate each file on the server. The remaining step is simply to retrieve each file and write it to the local drive.
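As an aside, the same download loop can be written with a context manager so the local file is closed even if the transfer fails, and the session can be closed with ftplib’s quit call once the loop finishes. A minimal sketch reusing the variables defined above:

# alternative form of the download loop using a context manager,
# so the local file handle is closed even if retrbinary raises
for path_for_download in paths_for_download:
    ftp_filename = path_for_download.split('/')[-1]
    local_filename = os.path.join(local_path, ftp_filename)
    with open(local_filename, "wb") as lf:
        ftp.retrbinary("RETR " + path_for_download, lf.write)

# politely close the FTP session when all downloads are done
ftp.quit()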

Export files as a zip file

In my case, after analyzing the filtered files I downloaded, I had to send back some files that showed abnormal behavior (usually because of their contents or formats). If the entire analysis process exists as Python code, it’d be nice to create a zip file of the flagged files by running code too, especially if this process requires multiple runs. We can use the zipfile package.

import os
import zipfile

def zip_files(src, dst, files):
    """
    Create a zip file (dst.zip) by zipping the given files from src.
    The "files" argument is a collection of file names (no directory part).
    """

    # dst defines the name of the zip file that will be created
    zf = zipfile.ZipFile("%s.zip" % (dst), "w", zipfile.ZIP_DEFLATED)
    abs_src = os.path.abspath(src)

    for filename in files:
        absname = os.path.abspath(os.path.join(src, filename))
        # store each file under a path relative to src inside the archive
        arcname = absname[len(abs_src) + 1:]
        zf.write(absname, arcname)
    zf.close()

# dst should include the directory; outliers is a list of file names to compress
zip_files(local_path, os.path.join(local_path, 'outlier'), outliers)
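A quick way to check the result is to list the archive’s contents with zipfile’s namelist method (the outlier.zip name follows from the call above):

# sanity check: list what ended up inside the archive we just created
with zipfile.ZipFile(os.path.join(local_path, 'outlier.zip')) as zf:
    print(zf.namelist())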
