Six steps to mastering documentation for your python code

Good documentation is highly rated and regarded for a reason. Plain and simple, it enables others to understand, use and enhance our code.

But here’s another truth, and let’s accept it: we as developers suck at it! And that’s okay, because we are not authors or technical writers. We just like to build things and code!

So, how can we continue to do what we are good at as well as create something that is easily accepted by others (developers/users)?

Right! We are developers, so we obviously have some automated tools at our disposal. Let’s embrace one of them, called sphinx (authored by Georg Brandl [respect!])

Six steps to mastering documentation for your python code

  1. Step1: Python project structure
  2. Step2: Code comments and README
  3. Step3: Installing and Setting up
  4. Step4: Generating the documentation
  5. Step5: Uploading to github
  6. Step6: Hosting documentation online

Step1: Python project structure

For the sake of clarity and, more importantly, to avoid fighting the “module not found” error :), I will be using the following project structure:

pythonic_documentation/
    docs/
    src/
        ...other files...
    resources/

In this structure please make note of the following:

  1. docs/ is an empty folder at the beginning. It will contain:
    1.1 Source files (.rst and .md files)
    1.2 Build files (.html)
  2. src/ contains our python code
  3. resources/ may contain other supporting files like config files

Step2: Code comments and README

github has done a great job at making us all used to having at least a bare-minimum, so I won’t delve into it.

For the api reference, the inline comments (docstrings) in our code are transformed into the api doc. I assume you have some aspects of this covered in your code already. However, if you are unsure, I highly recommend going through this 1 min read:

Step3: Installing and Setting up

We will be using the following two packages for automated document generation:
1. sphinx: the heart of generating the documentation. By default, it reads .rst files instead of .md
2. recommonmark: a utility that works with sphinx to help process .md files

Step3.1: Install the required packages:

> cd docs\
> pip install sphinx
> pip install recommonmark

Step3.2: Setup sphinx:

> sphinx-quickstart
## Use all defaults when asked for except for the following for which you should choose 'y'
autodoc: automatically insert docstrings from modules (y/n) [n]: y

You will see a series of folders and files getting created in your docs\ folder. This is normal 🙂

Step3.3: Include the following in conf.py:

This is done to include the project in the import path, so that autodoc can pick up the docstrings from our code.
This also enables the CommonMarkParser that helps sphinx read the .md files:

import os
import sys
sys.path.insert(0, os.path.abspath('..'))

source_parsers = {
    '.md': 'recommonmark.parser.CommonMarkParser',
}

Step3.4: Amend the following in conf.py:

This is done to process .md files with sphinx

source_suffix = ['.rst', '.md'] # include .md here

Step3.5: Include the following in make.bat:

This is done to copy the README.md to our docs\ folder, to be picked up by index.rst:

if exist "..\README.md" (COPY /Y "..\README.md" .)

Alternatively, the same copy can be done in python:

from shutil import copy
from pathlib import Path

from_path = str((Path("..") / "README.md").resolve())
copy(from_path, '.')

Step3.6: Amend the following in index.rst:

This is done to list the README file on our index.html page:

.. toctree::
   :maxdepth: 1
   :caption: Contents:

   README

Step4: Generating the documentation (html files in our case)

The documentation that we will generate using sphinx will contain the following parts:
1. Various html files with the top page being index.html
2. A TOC side bar
3. A contents section that points to other html pages
4. An index and a module index (this is the api document!)

Generating this is as simple as:

> sphinx-apidoc -f -o . ..
> make html

The resulting html files can be found in docs\_build\html

Go ahead, open the index.html file in your favorite browser and see for yourself.

Step5: Uploading to github

This should be pretty straightforward, but in case you aren’t comfortable, find the commands below:

> cd C:\......\pythonic_documentation\
# now go to
# create new repo. keep all defaults. I named it 'pythonic_documentation'
# follow the instruction from github, in my case these are:
> git init
# i also created a .gitignore file so that I don't include resource and other libraries not required
# my .gitignore looks like:
#   venv/*
#   .idea/*
#   docs/_build/*
#   docs/_static/*
#   docs/_templates/*
> git add .
> git commit -m "first commit"
> git remote add origin
> git push -u origin master

This hosts our code on github. For me, at

Step6: Hosting documentation online

We will use readthedocs for this.
Assuming you have set up an account on readthedocs, import your project and
click -> build

And there, you have your documentation online!
For me, the URL is

Check the link for the api doc:

Some gotchas:

Pay close attention to the advanced setting of your project in readthedocs. Go to: project > Admin > Advanced Settings
For instance,
1. Make sure the python version is correct
2. Make sure a requirements file is specified if your build needs extra packages

Additionally, some issues may occur if your project depends on other packages that require C libraries (e.g. numpy). In that case, make use of the mock library and include the following code in your conf.py file:

import sys
from unittest.mock import MagicMock

class Mock(MagicMock):
    @classmethod
    def __getattr__(cls, name):
        return MagicMock()

MOCK_MODULES = ['numpy', 'scipy', 'scipy.linalg', 'scipy.signal']
sys.modules.update((mod_name, Mock()) for mod_name in MOCK_MODULES)

How to comment your code in python – a sample

# -*- coding: utf-8 -*-
"""
This is my module brief line.

This is a more complete paragraph documenting my module.

- A list item.
- Another list item.

This section can use any reST syntax.
"""

A_CONSTANT = 10  # value illustrative
"""This is an important constant."""

A_DICTIONARY = {  # name illustrative
    'this': 'that',
    'jam': 'eggs',
    'yet': {
        'things': [1, 2, 3, 'a'],
        'tuples': (A_CONSTANT, 4),
    },
}
"""Yet another public constant variable"""

def a_function(my_arg, another):
    """
    This is the brief description of my function.

    This is a more complete example of my function. It can include doctest,
    code blocks or any other reST structure.

    >>> a_function(10, [MyClass('a'), MyClass('b')])
    20

    :param int my_arg: The first argument of the function. Just a number.
    :param another: The other argument of the important function.
    :type another: A list of :class:`MyClass`
    :rtype: int
    :return: The length of the second argument times the first argument.
    """
    return my_arg * len(another)

class MyClass(object):
    """
    This is the brief of my main class.

    A more general description of what the class does.

    :param int param1: The first parameter of my class.
    :param param2: The second one.
    :type param2: int or float
    :var my_attribute: Just an instance attribute.
    :raises TypeError: if param2 is not None.
    """

    class_attribute = 625
    """This is a class attribute."""

    def __init__(self, param1, param2=None):
        self.param1 = param1
        if param2 is not None:
            raise TypeError()
        self.param2 = param2
        self.my_attribute = 100

    def my_method(self, param1, param2):
        """
        The brief of this method.

        This method does many many important things.

        :param int param1: A parameter.
        :param list param2: Another parameter.
        :rtype: list of int
        :return: A list of the first parameter as long as the length of the
            second parameter.
        """
        return [param1] * len(param2)

class AnotherClass(MyClass):
    """
    This is another class.

    Check the nice inheritance diagram. See :class:`MyClass`.
    """

class MyException(Exception):
    """
    This is my custom exception.

    This is a more complete description of what my exception does. Again, you
    can be as verbose as you want here.
    """

Effective (and easy) logging in python (part-2)

This post is a continued attempt to take baby steps towards mastering logging in python

In the previous post here, we created three loggers and configured them programmatically.

However, this has one BIG pitfall – every time you want to change the logging, you have to change the code! Imagine what that means when your code is running in production: just because you want to switch the logger level from WARN to DEBUG, you have to edit the code in production? Or push a new build into production just for this! Really!

Thankfully, there are ways of mitigating this with the powerful yet easily configurable logging configuration file.

A note to my fellow developers:

Don’t be afraid of using these config files. They may seem overwhelming at first, but they will be your best friend when you start building code for production environments. If you are interested in understanding further, I recommend reading about the python library called “configparser” and start using it in your code straight away

Now back to logging:

Python logging provides a few ways to configure loggers via config files. In this short write-up I am only going to cover the INI file format. More information can be found here

Let’s build the logging example from my previous post using config files.

To do that, let’s create a logging_config.ini file and place it next to your .py file, i.e. in the same folder.

Very important to note here

  1. We have to configure the root logger in the .ini file even though we aren’t using it. This is because of the way fileConfig() works – it requires the root logger to be included. As an alternative, the python docs recommend using the dictConfig() API, as it is the one that will be enhanced in the future – i.e. fileConfig() will become legacy and may be deprecated at some point. You can find more information here

  2. So far so good, but if you don’t realize this, your log messages will be output (at least) twice. This has more to do with logger hierarchies – which we will talk about in the next writeup.
    For now, let’s accept two facts:

2.1     every logger that we create using getLogger() is a child of the “root” logger (more on this when we talk about hierarchies)
2.2     all loggers propagate their log messages to their parent loggers unless told otherwise

So, to output the log message only once, i.e. using only our custom logger, we have to set the “propagate” entry in the .ini file for that logger. This can be done programmatically as well, using “logger.propagate = False”

The logging_config.ini file that I pasted below has the following parts:

  1. [loggers] Provides reference placeholders for the names of loggers. Such references are used later in the file to define properties
  2. [handlers] Provides placeholders for handlers (and similarly [formatters] for formatters)
  3. [logger_XXX] Defines properties for the logger referenced by XXX
  4. [handler_Y] Defines properties for the handler referenced by Y
  5. [formatter_Z] Defines properties for the formatter referenced by Z

[loggers]
keys=root,module1,module1_c1_init,module1_c1_call

[handlers]
keys=consoleHandler

[formatters]
keys=simpleFormatter

[logger_root]
level=WARNING
handlers=consoleHandler

[logger_module1]
level=INFO
handlers=consoleHandler
qualname=module1
propagate=0

[logger_module1_c1_init]
level=WARNING
handlers=consoleHandler
qualname=module1_c1_init
propagate=0

[logger_module1_c1_call]
level=WARNING
handlers=consoleHandler
qualname=module1_c1_call
propagate=0

[handler_consoleHandler]
class=StreamHandler
level=DEBUG
formatter=simpleFormatter
args=(sys.stdout,)

[formatter_simpleFormatter]
format=%(asctime)s %(levelname)8s %(name)s | %(message)s

For the most part the config file should be self-explanatory, but if you need more information or have a doubt, feel free to leave a comment and I will try to answer to the best of my ability.

The code in .py file will look like following:

import logging
from logging.config import fileConfig

fileConfig('logging_config.ini')

logger = logging.getLogger('module1')'logging info at module level')

class c1:
    def __init__(self):
        logger = logging.getLogger('module1_c1_init')
        logger.warning('logging warn for class c1 init method')

    def __call__(self, *args, **kwargs):
        logger = logging.getLogger('module1_c1_call')
        logger.warning('logging warn for class c1 call method')

if __name__ == '__main__':
    o1 = c1()
    o1()

### This outputs :
# 2018-07-13 11:01:33,628 INFO module1 | logging info at module level
# 2018-07-13 11:01:33,628 WARNING module1_c1_init | logging warn for class c1 init method
# 2018-07-13 11:01:33,628 WARNING module1_c1_call | logging warn for class c1 call method
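Since fileConfig() may become legacy, the same setup can also be expressed with dictConfig(). The following is only a sketch of an equivalent configuration – the logger name and format string come from the example above, the rest (handler and formatter names, levels) are my assumptions, not the author’s exact file:

```python
import logging
from logging.config import dictConfig

# Sketch: an equivalent of the ini file in dict form
dictConfig({
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'simple': {'format': '%(asctime)s %(levelname)8s %(name)s | %(message)s'},
    },
    'handlers': {
        'console': {'class': 'logging.StreamHandler', 'formatter': 'simple'},
    },
    'root': {'handlers': ['console'], 'level': 'WARNING'},
    'loggers': {
        # propagate=False so the message isn't duplicated via the root logger
        'module1': {'handlers': ['console'], 'level': 'INFO', 'propagate': False},
    },
})

logging.getLogger('module1').info('logging info at module level')
```

The same “propagate” rule applies here: without it, the message would bubble up to the root logger and be printed twice.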

Happy mastering logging!

Up next: Effective (and easy) logging in python (part-3) will cover logger hierarchies

Better logging on AWS lambda using a custom logger #python

AWS suggests logging according to

The problem with this, i.e. plain logging.getLogger(), is that it does not print the module/class name as desired in the logs. (It uses the root logger and prints some weird codes)


For a function:

def __init__(self):
    logger = logging.getLogger()
    logger.warning('logging warn for class c1 init method')

AWS log looks like this:

[WARNING]   2018-07-09T06:34:58.854Z    334b1941-8342-11e8-9b02-4d38ca4833e5    logging warn for class c1 init method

Clearly, I have no clue what that gibberish hashcode means. It gives me no indication of where the log message came from.

Now, if we add a custom logger, hoping to get some more meaningful logging, the code looks like:

def __init__(self):
    h = logging.StreamHandler()
    h.setFormatter(logging.Formatter('%(asctime)s %(levelname)8s %(name)s | %(message)s'))
    logger = logging.getLogger('module1_c1_init')
    logger.addHandler(h)
    logger.warning('logging warn for class c1 init method')

AWS log looks like this:

2018-07-09 06:34:58,854 WARNING module1_c1_init | logging warn for class c1 init method
[WARNING]   2018-07-09T06:34:58.854Z    334b1941-8342-11e8-9b02-4d38ca4833e5    logging warn for class c1 init method

Wow!! AWS now prints the logs twice! Unnecessary clutter!

So what’s the solution?

Use logger.propagate = False

So, with the following code

def __init__(self):
    h = logging.StreamHandler()
    h.setFormatter(logging.Formatter('%(asctime)s %(levelname)8s %(name)s | %(message)s'))
    logger = logging.getLogger('module1_c1_init')
    logger.addHandler(h)
    logger.propagate = False    #-- added this line
    logger.warning('logging warn for class c1 init method')

AWS log looks like:

2018-07-09 06:34:58,854 WARNING module1_c1_init | logging warn for class c1 init method

Ah! Much better!
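Putting the pieces together, one way to package this pattern is a small helper. The function name build_logger and the guard against stacking duplicate handlers on warm lambda starts are my additions, not from the AWS docs:

```python
import logging

def build_logger(name):
    """Hypothetical helper: a named logger with our format and propagation off."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid adding a second handler on warm starts
        h = logging.StreamHandler()
        h.setFormatter(logging.Formatter('%(asctime)s %(levelname)8s %(name)s | %(message)s'))
        logger.addHandler(h)
    logger.propagate = False  # don't let the root handler duplicate the line
    return logger

def handler(event, context):
    # example lambda entry point using the helper
    logger = build_logger('module1_c1_init')
    logger.warning('logging warn for class c1 init method')
```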

Effective (and easy) logging in python

Four steps to mastering logging in python:

  1. create a logger
  2. create a handler and set the formatter
  3. add handler to logger
  4. set level of logger
    …and log away to glory!!

import logging

# create a logger @module level
logger = logging.getLogger('module1')
# handler and formatter
h = logging.StreamHandler()
h.setFormatter(logging.Formatter('%(asctime)s %(levelname)8s %(name)s | %(message)s'))
# handler -> logger
logger.addHandler(h)
# level -> logger
logger.setLevel(logging.INFO)
# log away!!'logging info at module level')

class c1:
    def __init__(self):
        # create a logger @class level
        logger = logging.getLogger('module1_c1_init')
        logger.warning('logging warn for class c1 init method')

    def __call__(self, *args, **kwargs):
        # create a logger @object level
        logger = logging.getLogger('module1_c1_call')
        logger.warning('logging warn for class c1 call method')

if __name__ == '__main__':
    o1 = c1()   # calls __init__
    o1()        # calls __call__

### This outputs :
#   2018-07-09 14:31:36,327     INFO module1 | logging info at module level
#   2018-07-09 14:31:36,327  WARNING module1_c1_init | logging warn for class c1 init method
#   2018-07-09 14:31:36,327  WARNING module1_c1_call | logging warn for class c1 call method

Creating a python package and installing using pip (using pycharm and v-env)

Setup a new python project that we will package later

I typically use pycharm and create a new virtual environment (under the folder ‘venv’ – see below) that I use as the interpreter for that project:

  1. After you create a new project on pycharm with Name of project = name_of_package
  2. Create a new package and name it name_of_package
  3. Finally, create an empty at the root of the project
  4. Don’t forget to write your code in .py files inside the package

The folder structure should look like below:

name_of_package/
    name_of_package/
    venv/
Paste the following into

import setuptools
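The full body of setup.py is not shown above; a minimal sketch might look like the following, assuming the placeholder package name name_of_package used elsewhere in this post (name, version and description are illustrative, not the author’s actual values):

```python
import setuptools

setuptools.setup(
    name='name_of_package',      # placeholder: use your package name
    version='0.0.1',             # illustrative version
    description='A short description of the package',
    packages=setuptools.find_packages(exclude=['venv', 'venv.*']),
)
```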


Create the package

Go to Tools -> Run setup.py Task -> sdist -> ENTER

** Got an error? ** Checkout the gotcha section at the bottom

This will create two folders:
1. dist
2. name_of_package.egg-info

Checkout the dist folder, it will contain the package nicely compressed!

Install the newly created package from local project:

$ pip install name_of_package --no-index --find-links file://C:\Users\MannH\PycharmProjects\name_of_package\dist

Finally, upload your package into pypi

Google it 🙂 it’s that simple!


Gotchas:

1. If you get this error while creating an sdist: “error: package directory ‘venv\Lib\site-packages\pip-10.0.1-py3.6.egg\pip’ does not exist”

Just rename the folder from “pip-10.0.1-py3.6.egg” to “pip-10-0-1-py3-6-egg”. This folder is located under “venv\Lib\site-packages\”


2. If you push your project to github, you can install it directly from github:

pip install git+

Multiprocessing in python

A simple program that shows how to use python to run tasks in parallel (on multiple CPUs of the same node) using the multiprocessing module.

from multiprocessing.pool import Pool
import os
import time
import logging

def f(x):
    print(os.getpid())      # for information
    time.sleep(3)           # mimic an operation that takes a lot of time
    return x*x              # return the result of the operation

if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)"Entering parallel execution")
    p = Pool(4)                             # create 4 worker processes
    z = p.map_async(f, [1, 2, 3, 4, 5, 6])  # 6 tasks will run asynchronously on 4 workers
        print(z.get(timeout=60))            # wait for and collect the results
    except Exception:
        logger.error("Could not complete the multi processing operation")
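If you prefer to block until all results are in, the same work can be done with the synchronous map and the pool used as a context manager (a variation on the snippet above; the sleep is dropped so it runs instantly):

```python
from multiprocessing.pool import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    # the pool as a context manager tears the workers down for us
    with Pool(4) as p:
        results = p.map(square, [1, 2, 3, 4, 5, 6])  # blocking variant of map_async
    print(results)  # [1, 4, 9, 16, 25, 36]
```

map preserves the order of the inputs, so the results line up with the arguments even though the tasks ran on different workers.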

How to perform logging in a Spark Application

Typically I want to set the spark logger log-level to ERROR, but at the same time have a log-level of INFO for my application classes. This can be achieved doing the following:

import org.apache.spark.sql.SparkSession
import org.apache.spark.internal.Logging
import org.apache.log4j.{Level, LogManager}

object MyApp extends App with Logging {
    val spark = SparkSession.builder().appName("appName").master("local[*]").getOrCreate()

    LogManager.getLogger("org.apache.spark").setLevel(Level.ERROR)                // spark at ERROR
    LogManager.getLogger(getClass.getName.stripSuffix("$")).setLevel(Level.INFO)  // our app at INFO

    logInfo("<<<<<<<<<<<<<<< info message >>>>>>>>>>>>>>>>>>>")
}

A real life use case for stateful streaming processing using Apache Spark part-1 of 2

Getting the basics for spark streaming is easier than you might think – thanks to the easy to use API. There are multiple good tutorials out there to get you started. But how about upping it up a bit?

Taking Spark Streaming beyond word count

> In this writeup, I am going to cover a slightly complex use case that requires dealing with both of the following at the same time
>> (1) arbitrary stateful streaming processing – supported by spark using the mapGroupsWithState API
>> (2) maintaining the size of the state – using custom logic

To fully benefit from the time that you are going to spend here, I strongly recommend the following:
– Give it a try yourself: please read the scenario and try to build a working solution yourself
– Familiarize yourself with the following terms: processing time, event time, window, pane, watermark, trigger, lateness
– Obtain rudimentary knowledge of scala and the spark streaming API

The scenario / problem statement

  1. There exist multiple sensors that continuously send events. These events are key-value pairs consisting of a sensor-id and a reading
  2. It is required to alert the users if the average reading for any of the sensors over the last 5 seconds goes below the previous average for the same time period
  3. These windows are defined as tumbling windows based on the processing time, i.e. we are only concerned with the time at which the events reach the streaming engine, and we window them in 5 sec panes
  4. The output alert should at least include the sensor-id, previous average and new average
  5. Further, if no events are received from a sensor for 15 seconds, (1) an alert should be issued and (2) the average should start from 0 with the next event, whenever it is received [This task cannot be achieved by Spark 2.3 – read on to understand why]
  6. Modified: Alert if no events have been received for at least 15 seconds for key=k1 while events have been received for other keys [This is achievable using spark 2.3]

Appreciating the complexity of the problem statement

It is obvious we have to maintain the state of events for each key. This state must hold data for at least 2 windows – the current and the last.

The spark API provides a wonderful operation called mapGroupsWithState that helps us load and off-load the state for a particular key. But this API alone will not be sufficient to solve our problem: if we group by key+window, there won’t be a way left to compare 2 windows of the same key. Below you can find a high-level flow of the logic we will build:

 1. read the input stream [consists of (sensorId, value)]
 2. add the current timestamp to it
 3. window the events in 5 second panes on the current timestamp
 4. group by key=sensorId
 5. call mapGroupsWithState
 6. if a previous state for the key doesn't exist,
 6.1. group by window and store the aggregates (sum, no of events, average)
 6.2. set the timeout duration for the key
 6.3. return alert = false
 7. if a previous state for the key exists,
 7.1. group by window and store the aggregates (sum, no of events, average)
 7.2. combine this value with the previous state
 7.3. remove older windows (remember we are only interested in the current and last window)
 7.4. reset the timeout duration for the key
 7.5. return an alert by comparing the averages between windows
 8. if a timeout occurs
 8.1. drop all state for the key
 8.2. return alert = true

Do note that we have to build custom logic, as described in 7.3, to remove unnecessary state held in memory.
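To make steps 6–7 concrete, here is a pure-python sketch of the per-key state update. This is plain python standing in for the Spark mapGroupsWithState logic, not the actual solution code: WINDOW, update_state and the (sum, count) state layout are my assumptions.

```python
WINDOW = 5  # seconds, matching the 5-second tumbling windows in the scenario

def update_state(state, events, keep=2):
    """Fold (timestamp, value) events into per-window (sum, count) aggregates,
    prune to the newest `keep` windows (step 7.3), and return True when the
    current window's average drops below the previous window's average."""
    for ts, value in events:
        w = (ts // WINDOW) * WINDOW          # tumbling-window start for this event
        s, n = state.get(w, (0.0, 0))
        state[w] = (s + value, n + 1)
    for w in sorted(state)[:-keep]:          # step 7.3: drop all but the newest windows
        del state[w]
    windows = sorted(state)
    alert = False
    if len(windows) == 2:
        (s0, n0), (s1, n1) = state[windows[0]], state[windows[1]]
        alert = s1 / n1 < s0 / n0            # current average below the previous one
    return alert
```

The Spark version would keep this per-key dictionary inside the GroupState object and additionally handle the timeout path (step 8).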

Setting up test cases

Test case 1: When trigger = 1 second

Data generated at (in sec)    Program logic for each sensor key
t0                            no alert for this window = w1
t1 = t0 + 4                   data may be part of w1 or a new window w2 [assume same window w1]
t2 = t1 + 6                   data is put into w2. alert if w1.avg < w2.avg
t3 = t2 + 35                  w1 and w2 are dropped since older than 15 sec. data put into w3. alert since data was not received for this key

Test case 2: When trigger = 30 seconds

Data generated at (in sec)    Program logic for each sensor key
t0                            nothing happens
t1 = t0 + 4                   nothing happens
t2 = t1 + 6                   all data for t0, t1 and t2 is put into w1
t3 = t2 + 35                  w1 is dropped since older than 15 sec. data put into w2. alert since data was not received for this key

Importance of trigger operation on a streamwriter

The trigger time defined on the streamWriter governs the frequency of transformations on the input data that is being received by the engine.
The processing time of an event depends on when it enters the processing engine. Thus, the trigger value governs the processing time of events. Too big a value and we could miss alerts. Too small a value and we could get too many alerts.

An illustration: take our use case scenario and the test case data defined above. If the trigger is set to 100ms, when events come into the processor at time t2 and t2+1sec, while both these events are part of one window, they could trigger multiple alerts if avg@w1 < avg@w2@t2 and avg@w1 < avg@w2@t2+1sec

I wrote a small piece about this here if you are keen to read.

In the next writeup, I will post the actual code for the solution as well as setting up a mock stream all within your IDE

Does it really matter if Spark Streaming uses micro batching?

In this writeup we will get a hang on the importance of trigger in the context of Streaming.

Time required to read: 0.5 to 1 minute

Data flow of the streaming processor

source              = producer
input               = data entering into the processor
intermediate data   = transformed data/state held before finally writing into the 
                      result data (e.g. average over windows needs to store sum 
                      and number of elements)
result data         = resulting data (e.g. average over a window)
output              = data exiting the processor
sink                = consumer of processed stream

How a Spark streaming processor works

 --input--{@trigger}--> {intermediate data} -> {result data} -->output--> [sink]

Figure 1: Note that the text in {} is essentially inside the streaming engine
Allowed minimum value of @trigger = 100ms

How a continuous streaming processor works (e.g dataflow)

 --input--> {intermediate data} -> {result data} --{@trigger}-->output--> [sink]

Figure 2: Note that the text in {} is essentially inside the streaming engine
Allowed minimum value of @trigger = 0

Notice the difference

Essentially the difference is about the point when the trigger occurs.

In spark, this trigger governs the size of the micro-batch and the frequency of writes into the sink
In dataflow, this trigger governs the frequency of writes into the sink

One drawback of using micro-batches shows up when events are processed using processing-time rather than event-time: since spark uses micro-batches, the real timestamp at which an event arrived at the processing engine is not available.