Extracting Files from Raw Email with Python

This project was my very first major Python project. The first version was just me trying to reinvent the wheel, like any amateur new to the programming world: I was trying hard to separate the parts of a raw email file without using any modules. When I was introduced to modules, I was blown away by how easy and convenient it was to extract an attachment. Nonetheless, I learned the importance of modules, and a lot about Python and the structure of raw email.

In this project my goal was to automate the task of extracting any files from a raw email file, either from attachments or from URL links, and then upload them to AWS S3.

I used the email module to extract the different parts from the raw email file. To work with the file I first turn it into an object using:

msg = email.message_from_file(open(LOCAL_PATH))
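To try the parsing step without a file on disk, here is a minimal sketch that parses a small hand-written message with email.message_from_string. The sample message and the RAW_EMAIL name are made up for illustration; the script above reads from LOCAL_PATH instead.

```python
import email

# Hypothetical sample message, used only for illustration.
RAW_EMAIL = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Hello\n"
    "Content-Type: text/plain\n"
    "\n"
    "Just a short body.\n"
)

msg = email.message_from_string(RAW_EMAIL)

print(msg["Subject"])          # header lookup is case-insensitive
print(msg.get_content_type())  # MIME type of the message
print(msg.is_multipart())      # False for a single-part message
```

The returned object exposes headers like a dictionary and the body through get_payload(), which is what the rest of the steps rely on.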
 

Separating the contents

I started by separating different parts (payloads) from the raw email using the following code:

if msg.is_multipart():
  for payload in msg.get_payload():
    ls.extend(check_mulitpart(payload))
else:
  ls.extend([msg])
 

In a raw email, a payload can be the main message, an HTML version of that message, or an attachment. The email can be multipart, which means it has multiple payloads. However, a payload can itself be either single-part or multipart. msg.is_multipart() can be used to check whether an email or a payload is multipart, as I do in the code above. Since a payload can itself be multipart, I use the recursive function check_mulitpart, shown below. The recursion ends when a payload is not multipart. All such payloads are then added to the list ls.

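def check_mulitpart(payload):
  """
  Recursive function that expands a multipart payload
  into a list of all its non-multipart payloads.
  """
  if payload.is_multipart():
    parts = []
    for pay in payload.get_payload():
      # extend instead of returning inside the loop, so every sub-payload is kept
      parts.extend(check_mulitpart(pay))
    return parts
  else:
    return [payload]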
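A quick way to check a flattening function like this is to build a small nested message and compare the result with the standard library's msg.walk(), which visits every part depth-first. The sketch below assumes Python 3; the flatten name and the sample message are hypothetical and only serve the demonstration.

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def flatten(part):
    """Return every non-multipart leaf under part, in document order."""
    if part.is_multipart():
        leaves = []
        for sub in part.get_payload():
            leaves.extend(flatten(sub))
        return leaves
    return [part]


# Hypothetical message: mixed -> (alternative -> plain, html), attachment
body = MIMEMultipart("alternative")
body.attach(MIMEText("hello", "plain"))
body.attach(MIMEText("<p>hello</p>", "html"))

msg = MIMEMultipart("mixed")
msg.attach(body)
msg.attach(MIMEApplication(b"%PDF-", Name="report.pdf"))

leaf_types = [p.get_content_type() for p in flatten(msg)]
walk_types = [p.get_content_type() for p in msg.walk() if not p.is_multipart()]
print(leaf_types)
print(leaf_types == walk_types)
```

If the two lists differ, the recursive function is dropping or duplicating parts somewhere in the nesting.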
 

The ls list then contains three types of payloads:

  • Attachments: the files attached to the email, which are encoded in base64.
  • Plain text: mostly just the main body of the email, as simple text.
  • HTML text: the HTML body of the email, or an HTML version of the plain-text message.

Example of a raw email file with an attachment and message

Extracting from the payloads

I use a for loop to scan each payload and separate the three types of payloads with if statements.
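As a rough outline of that loop, the sketch below labels each payload using the same checks described in the following sections: Content-Disposition for attachments, then Content-Type for HTML and plain text. It assumes Python 3; the classify helper and the sample payloads are hypothetical stand-ins, not part of my script.

```python
from email.mime.application import MIMEApplication
from email.mime.text import MIMEText


def classify(payload):
    """Label a non-multipart payload using the same order of checks as the loop."""
    if payload.get('Content-Disposition'):
        return 'attachment'
    elif payload.get_content_type() == 'text/html':
        return 'html'
    elif payload.get_content_type() == 'text/plain':
        return 'plain'
    return 'other'


# Hypothetical payload list standing in for ls.
attachment = MIMEApplication(b"data")
attachment.add_header('Content-Disposition', 'attachment', filename='a.bin')
ls = [MIMEText("hi", "plain"), MIMEText("<p>hi</p>", "html"), attachment]

labels = [classify(p) for p in ls]
print(labels)
```

Checking Content-Disposition first matters: a text file sent as an attachment also has a text/plain Content-Type, and it should be treated as a file rather than as the message body.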

Attachments

Code for extracting attachments:

if payload.get('Content-Disposition'):
  att_filename = ''.join(payload.get_filename().splitlines())

  if check_file_valid(att_filename):
    print("Attachment to be uploaded: %s" % att_filename)
    upload_to_s3(payload.get_payload(decode=True),
                 att_filename,
                 s3, destbucket, destbucketprefix_temp)
 

Attachments have a Content-Disposition header, so I use it to detect them. Line 2 shows how I get the filename using payload.get_filename(). A long filename that wraps onto more than one line ends up with a "\n" (newline) in it, which left a large gap in the name of the downloaded file. To fix this, I use splitlines and join to strip the line breaks.
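The splitlines and join trick can be tried on its own. In this small sketch the clean_filename helper and the sample names are hypothetical. The None check is an extra precaution I am assuming is useful, since get_filename() returns None when a part has no filename.

```python
def clean_filename(raw_name):
    """Join a filename that was split over several lines into a single line."""
    if raw_name is None:
        return None
    return ''.join(raw_name.splitlines())


print(repr(clean_filename("quarterly_report_\n2019_final.pdf")))
print(repr(clean_filename("notes.txt")))
```

splitlines also handles "\r\n" line endings, which is why it is a little safer here than replacing "\n" alone.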

To get the content I use payload.get_payload(decode=True), which returns the payload's content and decodes it from base64 in the same step. You can then use the content and filename to save the file locally, as shown below, or upload it to S3 as I did.

content = payload.get_payload(decode=True)
with open(att_filename, 'wb') as f:  # the with block closes the file after writing
  f.write(content)
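To see what decode=True does, here is a self-contained sketch (assuming Python 3) that builds a base64 attachment, turns it into raw text, parses it back, and writes the decoded bytes to a temporary folder. The sample bytes and the sample.bin filename are made up for the example.

```python
import email
import os
import tempfile
from email.mime.application import MIMEApplication

original = b"\x00\x01binary bytes\xff"
part = MIMEApplication(original)  # base64-encoded by default
part.add_header('Content-Disposition', 'attachment', filename='sample.bin')

# Round-trip through raw text, as if the part came from a raw email file.
parsed = email.message_from_string(part.as_string())

encoded = parsed.get_payload()             # still base64 text
content = parsed.get_payload(decode=True)  # decoded bytes

path = os.path.join(tempfile.mkdtemp(), parsed.get_filename())
with open(path, 'wb') as f:
    f.write(content)

print(parsed['Content-Transfer-Encoding'])
print(content == original)
```

Without decode=True you would be writing the base64 text itself to disk, and the saved file would not open.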
 

HTML text

Extracting attachments is easy; extracting URLs and downloading files from them is a whole new challenge. To extract URLs from HTML, I check whether the Content-Type is "text/html" in an elif clause. Then I used code from https://pythonspot.com/en/extract-links-from-webpage-beautifulsoup/ with some alterations. Instead of the html_page variable I pass payload.get_payload(decode=True). The decode=True is necessary because the HTML text is encoded in quoted-printable format. I changed the regex "^http://" to "^https?://" because the original missed links using HTTPS. I then append all extracted URLs to the list url_ls.

Code for HTML text:

elif payload.get_content_type() == 'text/html':
  # https://pythonspot.com/en/extract-links-from-webpage-beautifulsoup/

  soup = BeautifulSoup(payload.get_payload(decode=True))   # payload as the argument instead of html_page variable

  # "^https?://" matches hrefs that start with http:// or https://
  for link in soup.findAll('a', attrs={'href': re.compile("^https?://")}):
    url_ls.append(link.get('href'))
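If BeautifulSoup is not available, the same idea can be sketched with only the standard library. This is a swapped-in technique, not what my script uses: it relies on Python 3's html.parser, and the LinkCollector class, the HTTP_LINK pattern "^https?://" and the sample HTML are all hypothetical names for illustration.

```python
import re
from html.parser import HTMLParser

HTTP_LINK = re.compile(r"^https?://")


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that start with http:// or https://."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value and HTTP_LINK.match(value):
                self.urls.append(value)


html = (
    '<p><a href="http://example.com/a.pdf">a</a>'
    '<a href="https://example.com/b.zip">b</a>'
    '<a href="mailto:someone@example.com">mail</a></p>'
)
collector = LinkCollector()
collector.feed(html)
print(collector.urls)
```

Note that feed() expects text, so the bytes returned by payload.get_payload(decode=True) would first need to be decoded, for example with the charset reported by payload.get_content_charset().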
 

Plain text

To extract URLs from plain text I took the function get_urls_from_plain_part from https://tutel.me/c/programming/questions/33380726/python++how+to+extract+urls+plainhtml+quoteprintablebase647bit+from+an+email+file, stored it in a separate file, and imported it into the main Python script:

elif payload.get_content_type() == 'text/plain':
    url_ls.extend(get_urls_from_plain_part(payload.get_payload(decode=True)))
 

I use Content-Type again, this time to match "text/plain". This function is pretty lengthy compared to the other regex-based methods on the internet; however, it has been the most reliable for me. I extend url_ls with the list the function returns.
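For comparison, this is roughly what one of the shorter regex-based approaches looks like. It is not the get_urls_from_plain_part function I used, just a simplified sketch with a hypothetical find_urls helper and a deliberately loose pattern, so expect it to miss or mangle unusual URLs.

```python
import re

# Loose pattern: http(s):// followed by anything up to whitespace, quotes or angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def find_urls(text):
    """Return http(s) URLs found in text, with trailing punctuation trimmed."""
    return [u.rstrip('.,;:!?)') for u in URL_PATTERN.findall(text)]


sample = ("Report: https://example.com/files/report.pdf. "
          "Mirror at http://example.org/r.pdf, thanks.")
print(find_urls(sample))
```

The trailing-punctuation trim is the kind of edge case that makes short regex solutions fragile, which is part of why a longer, more careful function can be more reliable.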

Downloading files from the URLs

All the URLs are now stored in the url_ls. To download files from the URLs I use the code below:

for url in list(set(url_ls)):
  content = None  # reset per URL so a skipped URL is never uploaded
  try:
    response = urllib2.urlopen(url)
    url = response.geturl()  # final URL after any redirects
    if url not in check_url_repeats:
      check_url_repeats.append(url)

      _, params = cgi.parse_header(response.headers.get('Content-Disposition', ''))
      content_type, _ = cgi.parse_header(response.headers.get('Content-Type', ''))

      filename = params.get('filename')
      if not filename:
        # no filename in the headers: fall back to the last segment of the URL path
        split_path = urlparse.urlsplit(url)
        filename = urllib2.unquote(split_path.path.split("/")[-1])

      if check_file_valid(filename) and content_type.split('/')[0] != 'image':
        content = response.read()

        print('Filename: %s' % filename)
        print('downloaded from URL: %s' % url)
  except Exception:
    print("Failed to download file from URL: {}".format(url))
    #traceback.print_exc()
  else:
    if content is not None:  # only upload when something was actually read
      upload_to_s3(content, filename, s3, destbucket, destbucketprefix)
 

Line 1 shows how I turn the list into a set and back into a list to remove any duplicates. I use the urllib2 module to get the content of each URL, with a simple content = response.read(). This is all easy; the real challenge is getting the filename. Sometimes the URL in the email is not the real location of the file (for example, it redirects), so I get the final URL with url = response.geturl() and use the check_url_repeats list to skip any URL that has already been downloaded. Duplicates are a big issue because the same file would be downloaded twice, which makes the code inefficient. Sometimes the URL does not contain the filename in its path at all, so I first use the cgi module to try to get the filename from the response headers. If that fails, I take the filename from the URL itself. Trying both gives a higher chance of getting a filename, and without a filename the file can't be saved. Note: the if statement on line 18 is unnecessary; it just adds extra requirements for the file to be downloaded.
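The filename lookup can be tested without touching the network. The sketch below assumes Python 3, where urllib2 and urlparse became urllib.request and urllib.parse, and where the cgi module was deprecated in 3.11 and removed in 3.13. As a stand-in for cgi.parse_header it parses the header with email.message.Message. The pick_filename helper and the example URLs are hypothetical.

```python
from email.message import Message
from urllib.parse import unquote, urlsplit


def pick_filename(content_disposition, url):
    """Prefer the filename from Content-Disposition, else the last URL path segment."""
    header = Message()
    # Fall back to a bare 'inline' value when the header is missing or empty.
    header['Content-Disposition'] = content_disposition or 'inline'
    name = header.get_filename()
    if name:
        return name
    return unquote(urlsplit(url).path.split('/')[-1])


print(pick_filename('attachment; filename="report.pdf"',
                    'https://example.com/download?id=7'))
print(pick_filename('', 'https://example.com/files/my%20notes.txt'))
```

The first call shows why the header is checked first: the URL path ends in "download", which says nothing about the real filename.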

At last, I upload the file to S3, but you could instead save the file locally using open in the else clause.