Use Documentation to make your Machine Learning Pipelines Smooth

If your organization works with Data then this is the single greatest ROI decision you will make

Devansh
8 min read · May 30, 2022

Machine Learning has been a game-changer in multiple industries, thanks to its ability to take large amounts of data and extract meaningful information from it. It also allows us to handle complex, high-dimensional data and discover relationships between the features. This has a lot of valuable applications for businesses.


This has led to a large uptick in organizations trying to bring data into their own processes. Data-driven decision making has consistently proven to be one of the best choices an organization can make. However, as an organization scales up its data pipelines, one element that is often overlooked is documentation. Creating good documentation has some upfront costs, but it will pay for itself in spades (and very quickly).

In this post, I will break down how you can write good documentation for your Machine Learning/Data Science pipelines. If your group has any aspirations of working with Data Analysis, Machine Learning, Big Data, AI, or Deep Learning, this is for you. We will go over the different areas your documentation should cover, and how to handle each of them individually. Let’s get into it.


Why good data documentation matters for Tech

Before we get into the specifics of how to write good documentation, let’s briefly cover why good documentation matters. This will allow us to have guiding principles for our documentation. Remember, once you establish your “why”, it is easier to figure out your how.

This article was inspired by a “conversation” with Ken Jee in his video What Professional Data Scientists ACTUALLY Do

So why does good documentation matter in Software Engineering (and Machine Learning in particular)? After all, it can be expensive and time-consuming for a developer to spend all those hours writing documentation. Good documentation brings the following benefits:

  1. Helps people be on the same page. Good documentation gives people from different teams a common understanding.
  2. Makes the vision and plans clear. The correct actions vary based on an organization's plans, vision, and constraints. Having detailed documentation will allow everyone to figure out the next steps better. Remember, it’s hard to hit a target you can’t see. Documentation makes your targets a lot more concrete.
  3. Reduces the onboarding time. Every time I work on a new project, the first thing I do is look through what has been done already. This involves poring over the methods, information about the data collected, the rationale behind the ML pipelines, etc. Good documentation greatly reduces the time I would otherwise spend catching up.

If I had to summarize the benefits in a word, it would be clarity. Good documentation adds a lot of clarity across the board. This will save your business a lot of developer hours that would have otherwise been wasted looking up things repeatedly.

Documentation allows your team to define a target. Finding your path to that target becomes much easier after that.

In an increasingly remote world, asynchronicity becomes the norm. Fantastic Documentation promotes asynchronicity. If you hate needless meetings, then promote thorough documentation. This will allow you to cut a lot of dead time going over the same ideas repeatedly. If you want to be a remote worker, then learning documentation and/or these other skills will allow you to thrive.


The details of good documentation for Machine Learning

So now to answer the multi-million dollar question: “How do you write good documentation in Machine Learning?” To have good documentation, you need to address the following areas:

  1. Vision for Company and Products (Trust me this is very important)
  2. Resource/Situational Constraints
  3. Data Sources Used, Datasets available, and processing done
  4. Projects currently in progress
  5. The actual code you have

Let’s address each of these individually.

1: Company Vision and Plans

This should always be the number one priority within a company. As mentioned earlier, you can only hit the target you see. Having a clear direction and purpose will save your group a lot of energy wasted running around like a headless chicken.


So what does this look like? Documentation should clearly outline each product developed or in development: the business cases, how those products integrate into the larger ecosystem, and the ideal customer that your business/group is targeting. This might seem like something for the finance bros, but these are key for developers. Remember, in the end, we have to develop products that are useful to society in order to generate long-term value. Not knowing who you’re building a product for is setting yourself up for failure.

Let me drill this point home with an example of a Large Scale Automated Data Analysis company. This company takes data from their clients and does some analysis for them and returns nice insights. If this sounds familiar to my regulars, this is because the company I helped build, Clientell, does exactly this.


There are two ways our new hypothetical company can go. One is to try serving a lot of clients and make money by achieving a lot of volume (tons of clients/orders). The other is to only work for a few high-ticket clients and build very heavy solutions geared towards these clients. Either way will allow you to build a thriving business. However, the engineering challenges for each are very different. Having a clear vision will help your whole organization move towards the right goals.

2: Resource/Situational Constraints

With a clear vision, you also need to outline the constraints your team is currently working with. These constraints might be physical/resource-oriented (lack of manpower, cloud computing, finances), or domain-oriented (rules and regulations). They might even be self-imposed (meet certain baselines, use certain tools/solutions, integrate within a framework). Making these clear will be crucial, and all documentation should cover these.

3: Information about Data

Every time an organization tells me they don’t have this, I shake my head. Any serious Data Processing/Analysis/AI Company should do this. Your documentation should cover information about the data sources used, what the pipeline looks like, and what kind of processing is being done to features/information from our raw data.

Each feature used for Data Science/Machine Learning should have its own breakdown with information about its nature (categorical, boolean, numerical, etc.), the rationale for using it, and its expected range/distribution. This is also a good place to document any priors and how you came to them.
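One way to keep this kind of per-feature breakdown close to the pipeline is to store it as a small structured record. The following is only a minimal sketch under my own assumptions: the `FeatureDoc` schema, the `out_of_range` helper, and the `deal_size_usd` feature are all hypothetical names invented for illustration, not part of any real library or project.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class FeatureDoc:
    """Documentation record for a single feature (hypothetical schema)."""
    name: str
    kind: str                          # "numerical", "categorical", or "boolean"
    rationale: str                     # why the feature is used
    source: str                        # where the raw data comes from
    min_value: Optional[float] = None  # expected lower bound, if numerical
    max_value: Optional[float] = None  # expected upper bound, if numerical
    prior: Optional[str] = None        # any prior and how it was reached


def out_of_range(doc: FeatureDoc, values: Sequence[float]) -> List[float]:
    """Return the values that fall outside the documented expected range."""
    return [
        v for v in values
        if (doc.min_value is not None and v < doc.min_value)
        or (doc.max_value is not None and v > doc.max_value)
    ]


# Hypothetical feature, invented purely for illustration.
deal_size = FeatureDoc(
    name="deal_size_usd",
    kind="numerical",
    rationale="Larger deals tend to have longer sales cycles.",
    source="CRM export, opportunities table",
    min_value=0.0,
    max_value=5_000_000.0,
    prior="Right-skewed; based on a manual review of past deals.",
)

print(out_of_range(deal_size, [1200.0, -50.0, 9_000_000.0]))
# prints [-50.0, 9000000.0]
```

Because the expected range lives in the same record as the rationale, the documentation can double as a sanity check on incoming data instead of drifting out of date in a separate file.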


4: Projects currently in progress

An engineer working on one project should be able to look up the other projects in the pipeline. This can help developers develop a bird's-eye view of the organization and is a must for building cross-team collaborations. It can also help your engineers build solutions with the big picture in mind, which will pay many dividends.

Good Machine Learning/Software Engineering is a lot more than just the programming. Keep educating yourself on all the nuances.

5: The actual code you have

When most engineers think of documentation, this is all they think of. There are tons of good guides to writing good, well-documented code. However, I will cover some basic principles here. If you want a more in-depth breakdown of this, let me know and I can cover it.

Picking the low-hanging fruit, all variable/method names should be descriptive. Your function docstring should describe the parameters, what the function does, and the return value. I also write a description for each class, so anybody can understand its purpose immediately.
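To make these principles concrete, here is a minimal sketch in Python. The class `MeanImputer` and its methods are hypothetical names made up for illustration, not code from a real project or library. The point is the shape of the documentation: a class-level description, plus docstrings that cover the arguments, the behavior, and the return value.

```python
from typing import List, Optional


class MeanImputer:
    """Replaces missing values (None) in a numerical column with the column mean.

    The mean is learned once via `fit` and reused by `transform`, so the
    same fill value is applied to every column transformed afterwards.
    """

    def __init__(self) -> None:
        self.fill_value: Optional[float] = None

    def fit(self, column: List[Optional[float]]) -> "MeanImputer":
        """Learn the fill value from a column.

        Args:
            column: Raw values; None marks a missing entry.

        Returns:
            The fitted imputer, so calls can be chained.

        Raises:
            ValueError: If the column has no observed (non-None) values.
        """
        observed = [value for value in column if value is not None]
        if not observed:
            raise ValueError("Cannot compute a mean from an empty column.")
        self.fill_value = sum(observed) / len(observed)
        return self

    def transform(self, column: List[Optional[float]]) -> List[float]:
        """Return a copy of the column with missing entries filled.

        Args:
            column: Raw values; None marks a missing entry.

        Returns:
            A new list where every None is replaced by the learned mean.

        Raises:
            RuntimeError: If `fit` has not been called yet.
        """
        if self.fill_value is None:
            raise RuntimeError("Call fit() before transform().")
        return [self.fill_value if value is None else value for value in column]


imputer = MeanImputer().fit([1.0, None, 3.0])
print(imputer.transform([None, 5.0]))  # prints [2.0, 5.0]
```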

You also want to follow clear design principles like the Single Responsibility Principle. These lead to code that is easy to read, understand, and modify. That will allow you or others to work on older code with minimal effort wasted trying to reconstruct your original thoughts/flow.
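As a rough illustration of the Single Responsibility Principle in a data pipeline, the sketch below splits parsing, cleaning, and aggregation into separate functions, each with one job and a one-line docstring. The function names and the sample data are assumptions made up for this example.

```python
import csv
import io
from typing import Dict, List


def parse_rows(raw_csv: str) -> List[Dict[str, str]]:
    """Parse CSV text into a list of row dictionaries. Does no cleaning."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def drop_incomplete(rows: List[Dict[str, str]], required: List[str]) -> List[Dict[str, str]]:
    """Keep only the rows where every required column is non-empty."""
    return [row for row in rows if all(row.get(col) for col in required)]


def average(rows: List[Dict[str, str]], column: str) -> float:
    """Compute the mean of a numerical column. Assumes rows are already clean."""
    values = [float(row[column]) for row in rows]
    return sum(values) / len(values)


# Made-up sample data: the second row is missing its revenue value.
raw = "client,revenue\nacme,100\nglobex,\ninitech,300\n"
rows = drop_incomplete(parse_rows(raw), required=["client", "revenue"])
print(average(rows, "revenue"))  # prints 200.0
```

Since each function does one thing, a change to the cleaning rules touches only `drop_incomplete`, and each piece can be documented and tested on its own.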

That’s it for this article.

