Interesting Content in AI, Software, Business, and Tech- 12/06/2023
Content to help you keep up with Machine Learning, Deep Learning, Data Science, Software Engineering, Finance, Business, and more
A lot of people reach out to me for reading recommendations. I figured I’d start sharing whatever AI Papers/Publications, interesting books, videos, etc I came across each week. Some will be technical, others not really. I will add whatever content I found really informative (and remembered throughout the week). These won’t always be the most recent publications- just the ones I’m paying attention to this week. Without further ado, here are interesting readings/viewings for 12/06/2023. If you missed last week’s readings, you can find them here.
Reminder- We started an AI Made Simple Subreddit. Come join us over here- https://www.reddit.com/r/AIMadeSimple/. If you’d like to stay on top of community events and updates, join the discord for our cult here: https://discord.com/invite/EgrVtXSjYf
Community Spotlight- The Fall of Civilizations Podcast
Some of you have asked me for podcast recommendations. I generally don’t engage with too many podcasts, especially because most of them have a very high variance in quality. The Fall of Civilizations Podcast is a rare exception. It explores the collapse of different societies through history in beautifully high-effort documentaries/lectures. Videos contain music, art, and poetry from the cultures, giving them a nice human touch. If you’re a history nerd like me- highly recommend it. If you’re not, this podcast will turn you into one.
If you’re doing interesting work and would like to be featured in the spotlight section, just drop your introduction in the comments/by reaching out to me. There are no rules- you could talk about a paper you’ve written, an interesting project you’ve worked on, some personal challenge you’re working on, ask me to promote your company/product, or anything else you consider important. The goal is to get to know you better, and possibly connect you with interesting people in our chocolate milk cult. No costs/obligations are attached.
These are pieces that I feel are particularly well done. If you don’t have much time, make sure you at least catch these works.
I appreciate Charley Johnson for his fresh takes. Whether or not I agree with him, his work usually leaves me thinking about something differently. Other times, he’s really good at articulating what I know but can’t quite get off the tip of my tongue.
“I was a civil servant in the federal government for eight years… The first is the ‘tech for good’ mindset I’ve written about before that “leads to the definition of a problem that requires a technology solution. It becomes a trojan horse for scaling the technology, and the interests and beliefs of its creators.” This is doubly seductive when we create organizational constructs that consider technology a vertical, and prefix team names with the word ‘digital’. But I digress.
The second mindset believes that systems are ordered …
At USAID, something called ‘the logical framework’ guided the design of every program, and it nurtured the belief that systems are ordered. In these frameworks, we would detail how inputs lead to outputs, and then how outputs would collectively achieve the program’s purpose. Essentially, we thought it was right to assume that we could align our interventions to specific outputs.
In actual fact, many systems are unordered so we have to adjust our assumptions and decision-making processes accordingly.”
This article echoes a lot of what I’ve been picking up from conversations/work in RAG: slapping a vector DB onto your RAG problem isn’t going to do much if you don’t have a well-designed retrieval setup around it.
“Vector databases (Pinecone, Milvus, etc) have risen in popularity lately as a means of storing and computing nearest neighbor on a large collection of documents. In this post, I’d like to make the case that you do not need a vector database for RAG.
The task of finding a small subset of relevant documents to answer a question is a well studied area of information retrieval (IR), a field of computer science. The solutions here predate vector databases.
The most obvious examples of systems that have implemented scalable solutions to this problem are search engines like Google, Bing, Apache Lucene, Apple Spotlight, and many others. As an industry, we’ve already created & iterated on highly scalable and highly available technologies using reverse indexes over the last few decades.
While semantic vectors are absolutely a great innovation in the field, they should be used and implemented in the context of the lessons we have learnt building scalable IR systems.”
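The excerpt’s point- that classic IR techniques predate vector databases- can be made concrete with a minimal sketch. Below is a toy inverted index plus BM25 scoring in pure Python (the corpus, document IDs, and default parameters are illustrative assumptions, not anything from the linked article):

```python
import math
from collections import Counter, defaultdict

# Hypothetical mini-corpus; in practice these would be your RAG chunks.
DOCS = {
    "d1": "vector databases store embeddings for nearest neighbor search",
    "d2": "inverted indexes map each term to the documents containing it",
    "d3": "bm25 ranks documents by term frequency and inverse document frequency",
}

def tokenize(text):
    return text.lower().split()

# Build the inverted index: term -> {doc_id: term frequency}.
index = defaultdict(dict)
doc_len = {}
for doc_id, text in DOCS.items():
    tokens = tokenize(text)
    doc_len[doc_id] = len(tokens)
    for term, tf in Counter(tokens).items():
        index[term][doc_id] = tf

avg_len = sum(doc_len.values()) / len(doc_len)
N = len(DOCS)

def bm25_search(query, k1=1.5, b=0.75):
    """Score every matching document against the query with BM25."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        # Inverse document frequency: rare terms weigh more.
        idf = math.log(1 + (N - len(postings) + 0.5) / (len(postings) + 0.5))
        for doc_id, tf in postings.items():
            # Saturating term-frequency term with length normalization.
            norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len[doc_id] / avg_len))
            scores[doc_id] += idf * norm
    return sorted(scores.items(), key=lambda kv: -kv[1])

results = bm25_search("inverted indexes for documents")
print(results[0][0])  # top hit: d2
```

Production systems like Lucene implement the same idea with far more engineering (compressed postings, sharding, phrase queries), and hybrid setups often combine a score like this with vector similarity.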
A while back, I wrote a piece on the actual societal risks from AI, and mentioned that the insane energy requirements of running these models at scale were a huge problem (one that people were overlooking). Sasha Luccioni, PhD and crew wrote an excellent paper that brings numbers to the conversation. Hopefully, this pours some cold water on the reckless adoption of Gen AI.
“Recent years have seen a surge in the popularity of commercial AI products based on generative, multi-purpose AI systems promising a unified approach to building machine learning (ML) models into technology. However, this ambition of “generality” comes at a steep cost to the environment, given the amount of energy these systems require and the amount of carbon that they emit. In this work, we propose the first systematic comparison of the ongoing inference cost of various categories of ML systems, covering both task-specific (i.e. finetuned models that carry out a single task) and ‘general-purpose’ models (i.e. those trained for multiple tasks). We measure deployment cost as the amount of energy and carbon required to perform 1,000 inferences on a representative benchmark dataset using these models. We find that multi-purpose, generative architectures are orders of magnitude more expensive than task-specific systems for a variety of tasks, even when controlling for the number of model parameters. We conclude with a discussion around the current trend of deploying multi-purpose generative ML systems, and caution that their utility should be more intentionally weighed against increased costs in terms of energy and emissions. All the data from our study can be accessed via an interactive demo to carry out further exploration and analysis.”
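To get an intuition for the paper’s “energy and carbon per 1,000 inferences” framing, here is a back-of-envelope sketch. Every number below is an illustrative assumption I picked for the example (not a measurement from the paper)- swap in figures for your own hardware and grid:

```python
# All constants are assumptions for illustration only.
GPU_POWER_WATTS = 300.0        # assumed average draw during inference
SECONDS_PER_INFERENCE = 0.5    # assumed latency per request
GRID_KG_CO2_PER_KWH = 0.4      # assumed grid carbon intensity

def cost_of_inferences(n):
    """Return (kWh, kg CO2) for n inferences under the assumptions above."""
    watt_seconds = GPU_POWER_WATTS * SECONDS_PER_INFERENCE * n
    kwh = watt_seconds / 3_600_000  # 1 kWh = 3.6 million watt-seconds
    return kwh, kwh * GRID_KG_CO2_PER_KWH

kwh, kg_co2 = cost_of_inferences(1_000)
print(f"{kwh:.4f} kWh, {kg_co2:.4f} kg CO2")
```

The point of the paper is that for multi-purpose generative models, the latency and power-draw terms above balloon by orders of magnitude relative to small task-specific models, so the per-query cost compounds brutally at scale.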
And this is why we check our assumptions/axioms.
“Honeybees in man-made hives may have been suffering the cold unnecessarily for over a century because commercial hive designs are based on erroneous science, my new research shows.
More recently, California beekeepers have even been putting bee colonies into cold storage during summer because they think it is good for brood health.
But my study shows that clustering is a distress behaviour, rather than a benign reaction to falling temperatures. Deliberately inducing clustering by practice or poor hive design may be considered poor welfare or even cruelty, in light of these findings.”
🌀 Luca Rossi writes some amazing pieces for Software Leaders. This was no exception.
“So here is what we will cover today:
- 💬 Names matter — why good names go a long way, and why bad ones are worse than you think.
- 📡 Systems and services — naming strategies for your big architecture pieces, and the feud between clever and descriptive names.
- 📂 Folder structures / architectures — discussing discoverability, using context, and a critique of the screaming architecture.
- 🔧 Naming classes and functions — discussing nouns, verbs, prefixes, and grep-ability.
- 📊 Keeping consistency — good rules are only half of the job.
Let’s dive in!”
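On the “grep-ability” point from the list above, a tiny contrast makes it obvious. Both functions below do the same thing; the names are hypothetical examples I made up, not from Luca’s piece:

```python
# Clever codename: searching the repo for "phoenix" tells you nothing
# about what it does, and call sites are indistinguishable from noise.
def phoenix(x, n):
    for _ in range(n):
        if x.run():
            return True
    return False

# Descriptive name: grepping for "retry" or "send_invoice_email"
# finds every call site, and the name documents the behavior.
def send_invoice_email_with_retry(mailer, max_attempts):
    """Attempt to send the invoice email up to max_attempts times."""
    for _ in range(max_attempts):
        if mailer.run():
            return True
    return False
```

The descriptive version costs a few more keystrokes once and pays off on every search, review, and stack trace afterwards.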
“But it would be a mistake to think of this current wave of innovation as something entirely new. Rather, this is just the next chapter in a long story about the relationship between life and information that began eons ago, shortly after the formation of the Earth.
These two topics — life and information — at first glance don’t seem to have anything in common. But upon closer inspection, they are revealed to be profoundly connected. Life requires information in order to form, and likewise, information requires life in order to propagate. It is only by understanding this connection that we can see clearly just exactly what these large language models (LLMs) really are and how they will ultimately impact our lives. Against this more expansive backdrop, the resolutions to the many debates about the role of AI in our society suddenly become clearer.”
Extremely important piece on how LLM-based search could be exploited by Sahar Mor
“Yet another announcement this week had a major impact on internet search as we know it. OpenAI announced GPTBot — its web crawler that collects internet data to improve future models. It filters sources requiring paywall access, known to collect personally identifiable information (PII), or containing text violating OpenAI’s policies. Its documentation page states that “allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety.” — an altruistic goal that doesn’t align with content owners. In fact, it remains unclear why everyone wouldn’t block GPTBot given the current lack of incentives. Maybe there are lessons to be learned from SEO that LLM companies should consider.”
“In this essay, we analyze why RLHF has been so useful. In short, its strength is in preventing accidental harms to everyday users. Then, we turn to its weaknesses. We argue that (1) despite its limitations, RLHF continues to be effective in protecting against casual adversaries (2) the fact that skilled and well-resourced adversaries can defeat it is irrelevant, because model alignment is not a viable strategy against such adversaries in the first place. To defend against catastrophic risks, we must look elsewhere.”
“To further explore this topic, I am surveying real-world serverless, multi-tenant data architectures to understand how different types of systems, such as OLTP databases, real-time OLAP, cloud data warehouses, event streaming systems, and more, implement serverless MT. It’s inspired by the book series The Architecture of Open Source Applications that I read over a decade ago. What I loved about those books, when reading them still relatively early in my career, was seeing how other people were building software. My aim for this analysis is the same but applied to cloud-based data systems that implement multi-tenancy and a serverless model.”
Written by- Fahim ul Haq
“E-voting systems come in various shapes and sizes. Despite their variety, they all share one truth: The stakes are undoubtedly high, as a single design flaw can compromise a process that determines the direction of a democracy. Therefore voting systems require excruciatingly careful attention to design.
Among other requirements, e-voting systems must be available, secure, and scalable. They make excellent system design case studies.
Today, I’ll share 3 takeaways that any developer can learn from the state of electronic online voting — and how you can build the skills required to shape the future of e-voting.”
Cult Member Adam Haney had some great insights on leading AI teams, ethics, and more.
Abhinav Upadhyay always does really good software deep dives. His work has made me so much better at programming.
“In the past, I’ve written a few articles covering the internals of CPython and they have been received quite well. I want to write more in this area, but in a more systematic way, so that we have a solid foundation when we dive deeper in future articles.
In this article, I plan to cover the basic idea behind how objects (or the data types) are implemented and represented within CPython. If you look at the CPython code, you will see a lot of references to PyObject; it plays a central role in the implementation of objects in CPython. This is a crucial detail to understand, and if we decode this part, everything else becomes simpler. So, let’s do this!”
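The PyObject the excerpt refers to is a C struct, but its two core fields- a reference count and a type pointer- are observable from Python itself. A quick sketch of that idea (the variable names are mine, not from the article; note that some objects, like small ints and None, are special-cased in recent CPython versions):

```python
import sys

# In CPython, every value is a PyObject carrying (at minimum) a
# reference count and a pointer to its type.
x = [1, 2, 3]

# The type pointer (ob_type) is what type() returns.
print(type(x))  # <class 'list'>

# The reference count (ob_refcnt) is reported by sys.getrefcount,
# which reads one higher than you might expect because the call
# itself holds a temporary reference.
before = sys.getrefcount(x)
y = x  # binding a second name adds exactly one reference
after = sys.getrefcount(x)
print(after - before)  # 1
```

Decoding that struct, as the article goes on to do, explains why every Python value behaves uniformly: ints, lists, functions, and classes are all just PyObjects at the C level.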
If you find AI Made Simple useful and would like to support my writing- please consider becoming a premium member of my cult by subscribing below. Subscribing gives you access to a lot more content and enables me to continue writing. This will cost you 400 INR (5 USD) monthly or 4000 INR (50 USD) per year and comes with a 60-day, complete refund policy. Understand the newest developments and develop your understanding of the most important ideas, all for the price of a cup of coffee.
Reach out to me
Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.
Check out my other articles on Medium: https://rb.gy/zn1aiu
My YouTube: https://rb.gy/88iwdd
Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y
My Instagram: https://rb.gy/gmvuy9
My Twitter: https://twitter.com/Machine01776819