Last night, I gave a talk at the very first Reproducibility and Productivity in Data Science (“RAPIDS”) meetup on the relative unimportance of code, the state of data management, and what we at Prodo.AI are trying to do about it.
As (almost) always, I wrote the majority of the content ahead of time, so if you weren’t there (and I know you weren’t), please take a look. I think you’ll enjoy it.
It focuses on an open-source tool we’ve been building to help us manage our own data problems, Plz, which I also encourage you to check out for all your data science and machine learning needs.
Here’s how it starts:
There’s a misconception that software developers have.
We think software is important.
Because of this, we often focus most of our attention on the code that makes up the software. You know, the Python, or Java, or Ruby that the programmers write. We spend a lot of time grooming it, making it more maintainable, more readable, and generally prettier.
Unfortunately, we spend almost no time at all on the data that flows through our programs. Typically we’re content with a few examples for our test cases, a couple of edge cases so that we can verify our systems work in some sort of staging environment… and that’s about it. We tend to look at “production” data only when there’s a bug.
This is kind of funny, when you think about it, because the software isn’t important. The data is.
Let me know your thoughts, and if you do try Plz, please file as many issues as you like. We want to help bring data science into the 21st century.
Check it out.