Blind spots in ML: Data Engineers & Data Scientists

Prasad Paravatha
Mar 4, 2021

Basic misunderstandings of each other's work

Background: For the past 3 years, I have been working at the intersection of Cloud (AWS, OpenShift/Kubernetes, Docker, Snowflake), Software Engineering (UI and Microservices), and Data Science (Python, Spark, H2O, R, SAS). It's been a great learning experience working with talented engineers to build the AIR9 Data Science Platform.

This post is about my observations from working with Data Engineers and Data Scientists/Analysts, and the blind spots each group has when it comes to machine learning projects.
Side note #1: There is an interesting backstory to the term “data scientist”.
Side note #2: IMO, Data Engineer is a simplistic term; Data Platform Engineer is more accurate.

  1. Resource requirements: CPU & RAM
    Engineers: Why do you need instances for more than 1 hour?
    Scientists: Data science is an iterative process. It’s not a straight line; things move in circular and/or zigzag patterns.
    I’d side with Data Scientists on this, as ML workloads are bursty and resource requirements fluctuate.
  2. Size of datasets: Number of columns and records
    Scientists: My code worked last week on a similar instance with a very similar dataset of 1 million records. Why is it not working now? Why is it taking longer to complete?
    Engineers: Is the number of columns the same as in last week's dataset? Are the column types similar? Is your dataset in compressed or uncompressed format?
    I’d side with Data Engineers on this, as Data Scientists tend to underestimate the resource requirements of their datasets. Every tool has different memory requirements: H2O needs RAM of 3–4 times the size of the dataset, while SAS Studio doesn’t need a high RAM-to-dataset ratio.
  3. Dependency management: Packages and Extensions
    Scientists: Why am I not allowed to download packages from the internet? Why are my new packages breaking my code?
    Engineers: Some packages and extensions have security vulnerabilities and backward compatibility issues. Circular/cyclic dependencies are also a big factor in Python (and R) w.r.t. security and backward compatibility (Java/Scala handle this a bit better than other languages).
    I’d side with Data Engineers on this, though Data Scientists are rapidly learning how to use packages and extensions and how to work around these issues.
  4. Coding best practices
    Engineers: Why did you hardcode this configuration? Why did you write this segment of code in this way?
    Scientists: I had to do it this way for the data preparation, feature engineering, and model tuning steps.
    I’d side with Data Scientists on this; Data Engineers are not knowledgeable about the typical ML lifecycle.
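
To make the dataset-size point in item 2 concrete, here is a rough back-of-the-envelope sketch in Python. It is only an illustration: the bytes-per-value figures, the 50-byte average string length, and the two column layouts are assumptions I made up for the example, and the overhead factor of 4 is simply the upper end of the 3–4x figure mentioned above for H2O. Real memory use depends on the tool, the encoding, and the data itself.

```python
# Rough, assumed bytes per value for a few common column types.
# These are illustrative numbers, not measurements from any tool.
BYTES_PER_VALUE = {
    "int64": 8,
    "float64": 8,
    "bool": 1,
    "string": 50,  # assumed average string size in bytes
}


def estimate_dataset_bytes(n_rows, column_types):
    """Rough uncompressed size of n_rows records with the given column types."""
    return n_rows * sum(BYTES_PER_VALUE[t] for t in column_types)


def estimate_ram_bytes(n_rows, column_types, overhead_factor):
    """Dataset size multiplied by a tool-specific RAM overhead factor."""
    return estimate_dataset_bytes(n_rows, column_types) * overhead_factor


# Hypothetical layouts: same record count, very different width.
n_rows = 1_000_000
last_week = ["int64"] * 10 + ["float64"] * 10   # 20 numeric columns
this_week = last_week + ["string"] * 30         # 30 extra text columns

last_week_bytes = estimate_dataset_bytes(n_rows, last_week)
this_week_bytes = estimate_dataset_bytes(n_rows, this_week)
this_week_ram = estimate_ram_bytes(n_rows, this_week, overhead_factor=4)

for label, size in [
    ("last week, dataset", last_week_bytes),
    ("this week, dataset", this_week_bytes),
    ("this week, RAM at 4x overhead", this_week_ram),
]:
    print(f"{label}: {size / 1e9:.2f} GB")
```

Both datasets have 1 million records, yet under these assumptions the wider one is roughly ten times larger before any tool overhead is applied. That is why the Engineers' questions about column count, column types, and compression matter more than the record count alone.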
