Use Documentation to make your Machine Learning Pipelines Smooth
If your organization works with Data then this is the single greatest ROI decision you will make
Machine Learning has been a game-changer in multiple industries. The ability to take large amounts of data and extract meaningful information from it is enormously valuable. Machine Learning also allows us to handle complex, high-dimensional data and discover relationships between the features. This has a lot of valuable applications for businesses.
This has led to a large uptick in organizations trying to bring data into their own processes. Data-driven decision making has consistently proven to be one of the best choices an organization can make. However, as an organization scales up its data pipelines, one element that is often overlooked is documentation. Creating good documentation has some upfront costs associated with it, but it will pay for itself in spades (and very quickly).
In this post, I will break down how you can write good documentation for your Machine Learning/Data Science pipelines. If your group has any aspirations of working with Data Analysis, Machine Learning, Big Data, AI, or Deep Learning, this post is for you. We will go over the different areas your documentation should cover, and how to handle each of them individually. Let’s get into it.
Why good data documentation matters for Tech
Before we get into the specifics of how to write good documentation, let’s briefly cover why good documentation matters. This will allow us to have guiding principles for our documentation. Remember, once you establish your “why”, it is easier to figure out your how.
So why does good documentation matter in Software Engineering (and Machine Learning in particular)? After all, it can be expensive and time-consuming for a developer to go through and spend all those hours writing the documentation. Good documentation allows for the following benefits-
- Helps people be on the same page. Having good documentation allows people from different teams to share a common understanding.
- Makes the vision and plans clear. The correct actions vary based on an organization's plans, vision, and constraints. Having detailed documentation will allow everyone to figure out the next steps better. Remember, it’s hard to hit a target you can’t see. Documentation makes your targets a lot more concrete.
- Reduces the onboarding time. Every time I work on a new project, the first thing I do is look through what has been done already. This involves poring over the methods, information about the data collected, the rationale behind the ML pipelines, etc. Having good documentation will drastically reduce the time I would otherwise spend catching up.
If I had to summarize the benefits in a word, it would be clarity. Good documentation adds a lot of clarity across the board. This will save your business a lot of developer hours that would have otherwise been wasted looking up things repeatedly.
In an increasingly remote world, asynchronous work becomes the norm, and fantastic documentation is what makes asynchronous work possible. If you hate needless meetings, then promote thorough documentation. This will allow you to cut a lot of dead time spent going over the same ideas repeatedly. If you want to be a remote worker, then learning to write good documentation will help you thrive.
The details of good documentation for Machine Learning
So now to answer a multi-million dollar question, “How to write Good Documentation in Machine Learning?” Let’s get into it. To have good documentation you need to address the following areas-
- Vision for Company and Products (trust me, this is very important)
- Resource/Situational Constraints
- Data Sources Used, Datasets available, and processing done
- Projects currently in progress
- The actual code you have
Let’s address each of these individually.
1: Company Vision and Plans
This should always be the number one priority within a company. As mentioned earlier, you can only hit the target you see. Having a clear direction and purpose will save your group a lot of energy wasted running around like a headless chicken.
So what does this look like? Documentation should clearly outline each product developed or in development: the business cases, how those products integrate into the larger ecosystem, and the ideal customer that your business/group is targeting. This might seem like something for the finance bros, but these details are key for developers. Remember, in the end, we have to develop products that are useful to society in order to generate long-term value. Not knowing who you’re building a product for is setting yourself up for failure.
Let me drill this point home with an example of a Large Scale Automated Data Analysis company. This company takes data from their clients and does some analysis for them and returns nice insights. If this sounds familiar to my regulars, this is because the company I helped build, Clientell, does exactly this.
There are two ways our new hypothetical company can go. One is to try serving a lot of clients and make money by achieving a lot of volume (tons of clients/orders). The other is to only work for a few high-ticket clients and build very heavy solutions geared towards these clients. Either way will allow you to build a thriving business. However, the engineering challenges for each are very different. Having a clear vision will help your whole organization move towards the right goals.
2: Resource/Situational Constraints
With a clear vision, you also need to outline the constraints your team is currently working with. These constraints might be physical/resource-oriented (lack of manpower, cloud computing, finances), or domain-oriented (rules and regulations). They might even be self-imposed (meet certain baselines, use certain tools/solutions, integrate within a framework). Making these clear will be crucial, and all documentation should cover these.
3: Information about Data
Every time an organization tells me they don’t have this, I shake my head. Any serious Data Processing/Analysis/AI company should do this. Your documentation should cover the data sources used, what the pipeline looks like, and what kind of processing is applied to the features/information extracted from your raw data.
Each feature used for Data Science/Machine Learning should have its own breakdown with information about its nature (categorical, boolean, numerical, etc.), the rationale for using it, and its expected range/distribution. This would also be a good place to document any priors and how you came to them.
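To make the per-feature breakdown concrete, here is a minimal sketch of what one entry could look like if you kept it in code. Everything in it is an assumption made for illustration: the FeatureDoc class, its field names, and the deal_size_usd feature are hypothetical, not a standard schema or something from a real pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FeatureDoc:
    """Documentation entry for a single feature (hypothetical schema)."""

    name: str
    kind: str                              # e.g. "numerical", "categorical", "boolean"
    rationale: str                         # why the feature is used
    source: str                            # where the raw data comes from
    expected_min: Optional[float] = None   # lower bound of the expected range
    expected_max: Optional[float] = None   # upper bound of the expected range
    prior: Optional[str] = None            # any prior and how you came to it

    def in_expected_range(self, value: float) -> bool:
        """Return True if value lies inside the documented range (bounds inclusive)."""
        if self.expected_min is not None and value < self.expected_min:
            return False
        if self.expected_max is not None and value > self.expected_max:
            return False
        return True

    def to_text(self) -> str:
        """Render the entry as plain text suitable for a docs page."""
        lines = [
            f"Feature: {self.name}",
            f"Type: {self.kind}",
            f"Source: {self.source}",
            f"Rationale: {self.rationale}",
        ]
        if self.expected_min is not None or self.expected_max is not None:
            lines.append(f"Expected range: {self.expected_min} to {self.expected_max}")
        if self.prior:
            lines.append(f"Prior: {self.prior}")
        return "\n".join(lines)


# Illustrative entry; the feature and its values are made up for this sketch.
deal_size = FeatureDoc(
    name="deal_size_usd",
    kind="numerical",
    rationale="Hypothetical: used as a proxy for deal complexity.",
    source="CRM export (hypothetical)",
    expected_min=0.0,
    expected_max=1_000_000.0,
)
print(deal_size.to_text())
```

Keeping the entry structured like this means the same record can feed both your docs page and a sanity check on incoming data, so the documentation is less likely to drift away from what the pipeline actually does.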
4: Projects currently in progress
An engineer working on one project should be able to look up the other projects in the pipeline. This can help developers build a bird’s-eye view of the organization and is a must for cross-team collaboration. It can also help your engineers build solutions with the big picture in mind, which will pay many dividends.
5: The actual code you have
When most engineers think of documentation, this is all they think of. There are tons of good guides to writing good, well-documented code. However, I will cover some basic principles here. If you want a more in-depth breakdown of this, let me know and I can cover it.
Picking the low-hanging fruit, all variable/method names should be descriptive. Your function docstring should describe the parameters, what the function does, and the return value. I also write a description for each class, so anybody can understand what the class is for immediately.
You also want to follow clear design principles like the Single Responsibility Principle. These lead to code that is easy to read, understand, and modify. That will allow you or others to work on older code that you wrote with minimal effort wasted in trying to figure out your thoughts/flow.
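Here is a minimal sketch of what these principles can look like in Python. The function and class names (load_scores, normalize_scores, ScoreReport) are made up for this illustration, and the Args/Returns docstring layout is just one common convention, not the only valid one. Each piece does exactly one job: parsing, math, or formatting.

```python
from statistics import mean
from typing import List


def load_scores(raw_lines: List[str]) -> List[float]:
    """Parse raw text lines into scores.

    Args:
        raw_lines: One numeric value per line; blank lines are skipped.

    Returns:
        The parsed values as floats, in input order.
    """
    return [float(line) for line in raw_lines if line.strip()]


def normalize_scores(scores: List[float]) -> List[float]:
    """Center scores around zero by subtracting their mean.

    Args:
        scores: A non-empty list of numeric scores.

    Returns:
        A new list where each score has the mean subtracted.
    """
    average_score = mean(scores)
    return [score - average_score for score in scores]


class ScoreReport:
    """Formats normalized scores for display; does no loading or math itself."""

    def __init__(self, normalized_scores: List[float]) -> None:
        self.normalized_scores = normalized_scores

    def summary(self) -> str:
        """Return a one-line summary with the count and the largest deviation."""
        largest_deviation = max(abs(score) for score in self.normalized_scores)
        return f"{len(self.normalized_scores)} scores, max deviation {largest_deviation:.2f}"


scores = load_scores(["1.0", "2.0", "", "3.0"])
report = ScoreReport(normalize_scores(scores))
print(report.summary())
```

Because loading, normalizing, and reporting live in separate units, you can change the input format or the report layout without touching the math, and someone new can read any one docstring without needing the rest of the file in their head.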
That’s it for this article.
Feel free to reach out if you have any interesting jobs/projects/ideas for me as well. Always happy to hear you out.
Reach out to me
Use the links below to check out my other content, learn more about tutoring, or just to say hi.
Check out my other articles on Medium: https://rb.gy/zn1aiu
My YouTube: https://rb.gy/88iwdd
Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y
My Instagram: https://rb.gy/gmvuy9
My Twitter: https://twitter.com/Machine01776819
If you’re preparing for coding/technical interviews: https://codinginterviewsmadesimple.substack.com/