So you’ve taken a data scientist role and nobody told you that you’d actually have to write code like a software engineer? It’s a classic. Software engineers and data engineers complain that your code isn’t written “the right way”1. Worst of all, you know they’re onto something, but the only help anyone offers is telling you to learn software engineering.

I’ll show you a design technique that helps you connect your presentation skills with better code structure. Reading a lot of code will help you improve, but you don’t need another degree to start structuring your code better.

I’ll also show you how to use that technique on real code examples I’ve found in production.

Progressive disclosure

Progressive disclosure has nothing to do with software engineering. It comes from interaction design. Here is the best explanation I’ve found:

“Progressive disclosure is an interaction design pattern that sequences information and actions across several screens (e.g., a step-by-step signup flow). The purpose is to lower the chances that users will feel overwhelmed by what they encounter. By disclosing information progressively, interaction designers reveal only the essentials, and help users manage the complexity of feature-rich websites or applications.” (Interaction Design Foundation)

It doesn’t say anything about code, but keeping this technique in mind while structuring your code will help you end up with something more readable. It will help you achieve two properties of readable code:

  • It reveals information from abstract to specific and does not overwhelm the reader.
  • It reads nicely, like a story, while letting you jump into specific parts if you’re only interested in the details.

Nicely structured code reads like a good reference book. You read the titles and descriptions of all the chapters and you know what the book is about. If you want to know more about a specific topic, you read the chapter on that topic, or just part of it. You don’t have to read the chapter about Redshift’s execution engine if you only want to know which windowing functions are supported.

It’s not a completely new skill for you, though. You’re already doing this when presenting your analyses.

Imagine a presentation structured the same way a lot of data science code is structured: a linear report on the process.

  • Slide 1: For this analysis, we’re using RedShift. It has a lot of tables.
  • Slide 2: Unfortunately, we don’t have a record of sent invoices, but here is the SQL query I’ve used to approximate that.
  • Slide 3: Here is the second page of the SQL query. Notice that I had to use 4 joins to do that. It took a while to get the data.
  • Slide 4: Here are results shown in the table.
  • Slide 5: Here is the summary of our churn in one specific segment of our customers.
  • Slide 6: Here is the chart of our net new revenue.

It just doesn’t make sense. No one structures their presentations like that.

Instead, you start from the top, and then drill into specific areas. If you don’t have enough time, you can skip sections and still deliver the main message.

A better structure:

  • Slide 1: Here is our net new revenue chart.
  • Slide 2: Let’s break it down by expansion, contraction, churn, and new customers.
  • Slide 3: Here is the breakdown of contraction by reason and customer segments.
  • Slide n: Thank you (and a small link to the technical implementation of the analysis, in case someone wants to drill further themselves).

Structuring data science code with progressive disclosure in mind

When a structure doesn’t naturally emerge from the code I write, I ask myself a set of questions to help me find it. They are by no means comprehensive, and they partly overlap, but they help me remove common problems that make code harder to read.

Q: How can I describe the piece of code to somebody else?

Let’s take a look at this piece of code. Try to get a general feeling of what it’s doing, not the details.

import logging
import pandas as pd
from airflow.hooks.postgres_hook import PostgresHook  # import path may differ across Airflow versions


def predict(execution_date, templates_dict, **context):
    db = PostgresHook(postgres_conn_id='redshift')

    query_annual_deals = 'SELECT * FROM dw.ltv_annual_deals'

    annual_deals = db.get_pandas_df(query_annual_deals)
    logging.info('Got %d results for annual deals', len(annual_deals))

    annual_deals.sort_values(by=['app_id'], inplace=True)
    annual_deals.reset_index(drop=True, inplace=True)
    annual_deals.set_index('app_id', inplace=True)

    query_transactions = 'SELECT * FROM dw.ltv_invoices'

    transactions = db.get_pandas_df(query_transactions)
    logging.info('Got %d results', len(transactions))

    # sorts and indexes dataframe
    transactions.sort_values(by=['app_id', 'invoice_date'], inplace=True)
    transactions.reset_index(drop=True, inplace=True)

    # truncates 'cohort_date' on month into 'cohort'
    transactions['cohort'] = transactions['cohort_date'].apply(lambda d: '%04d-%02d' % (d.year, d.month))

    # calculates 'period', which is month difference between 'cohort_date' and 'invoice_date'
    transactions['period'] = (transactions['invoice_date'].dt.to_period('M') - transactions['cohort_date'].dt.to_period('M')).apply(lambda p: p.n)

    # checks if this 'period' has completed for all members of 'cohort'
    transactions['is_period_complete'] = transactions['invoice_date'].dt.to_period('M') < pd.to_datetime('now').to_period('M')

    # ...processing logic

One possible description of the code is: “It loads annual deals, then sorts the data frame and sets the index. Then it loads all the invoices, sorts them, and sets the index. Then it calculates the cohort and sets the period information.”

It’s a very robotic and unnatural description, but its structure is here to point something out.

Let’s rewrite the code to read like the sentence above.

def predict(execution_date, templates_dict, **context):
    annual_deals = load_annual_deals_from_redshift()
    sort_and_set_annual_deals_index(annual_deals)

    transactions = load_transactions_from_redshift()
    sort_and_set_transactions_index(transactions)
    set_cohort_and_period_information(transactions)
    # ...processing logic

This reads better already. If someone wants to understand how I load annual deals, they can read the function implementation. Same for sorting and setting indices.
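
If you’re curious what moved where, here is roughly what those extracted functions would contain. The bodies are just the lines from the original predict, moved out; I’m showing the annual-deals pair, and the transaction-related ones follow the same pattern:

def load_annual_deals_from_redshift():
    db = PostgresHook(postgres_conn_id='redshift')
    annual_deals = db.get_pandas_df('SELECT * FROM dw.ltv_annual_deals')
    logging.info('Got %d results for annual deals', len(annual_deals))
    return annual_deals

def sort_and_set_annual_deals_index(annual_deals):
    # sorting and indexing, unchanged from the original predict
    annual_deals.sort_values(by=['app_id'], inplace=True)
    annual_deals.reset_index(drop=True, inplace=True)
    annual_deals.set_index('app_id', inplace=True)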

Different descriptions of what the code does will shape the same code into different forms, but you’d likely come up with an explanation similar to this one.

Q: Can I group some operations together? Is it a detail or a very important piece of the story?

Another question that helps me is whether I can group things together. Just as designers group actions that operate on the same product area, we can group code that contributes to the same, more abstract process.

For example, sorting annual deals and setting the index looks like a detail of the loaded dataframe. It can be done while loading from Redshift.

The structure I’d use for that code is:

def load_annual_deals_df():
    db = PostgresHook(postgres_conn_id='redshift')
    query_annual_deals = 'SELECT * FROM dw.ltv_annual_deals'
    annual_deals = db.get_pandas_df(query_annual_deals)
    annual_deals.sort_values(by=['app_id'], inplace=True)
    annual_deals.reset_index(drop=True, inplace=True)
    annual_deals.set_index('app_id', inplace=True)
    return annual_deals

def load_transaction_deals_df():
    db = PostgresHook(postgres_conn_id='redshift')
    query_transactions = 'SELECT * FROM dw.ltv_invoices'
    transactions = db.get_pandas_df(query_transactions)
    transactions.sort_values(by=['app_id', 'invoice_date'], inplace=True)
    transactions.reset_index(drop=True, inplace=True)
    return transactions

def predict(execution_date, templates_dict, **context):
    annual_deals = load_annual_deals_df()
    logging.info('Got %d results for annual deals', len(annual_deals))

    transactions = load_transaction_deals_df()
    logging.info('Got %d results for transactions', len(transactions))
    set_cohort_and_period_information(transactions)
    # ...processing logic

You can see that I don’t use load_annual_deals_from_redshift and sort_and_set_annual_deals_index. Why is that?

  • load_annual_deals_df and load_transaction_deals_df are still short and easy to read. They group together everything around loading the dataframe and preparing it for the rest of the predict function. You can explain it as “it loads the annual deals from Redshift and returns them in a dataframe.”
  • There’s a cost to extracting many small functions. The machine won’t care, but your reader will have to jump between functions. I reckon this version doesn’t overwhelm the reader. If code starts piling up in these functions, I’ll look into breaking them up.

Can you go too abstract?

You can apply this rule at different zoom levels and go abstract all the way up. For example, you could structure it like this:

def predict():
    annual_deals, transactions = load_data()
    # ...processing

That’s still readable. We know that annual deals and transactions are results of loading the data.
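
If you do zoom out that far, load_data would most likely just delegate to the two loaders from before. A minimal sketch:

def load_data():
    annual_deals = load_annual_deals_df()
    transactions = load_transaction_deals_df()
    return annual_deals, transactions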

What doesn’t bring much value is this:

def predict():
    annual_deals, transactions = load_data()
    process(annual_deals, transactions)

Process is too generic for a function name. It doesn’t mean anything; programs are all about processing data. It just adds another step while exploring the code. Imagine clicking a signup button and the first step of the flow saying “This is a signup flow. You will have to type in your email address in the next step.”

Q: Do I repeat this code multiple times?

Another helpful question is whether the same code is written multiple times. If it is, it possibly represents something more generic that’s worth extracting.

An obvious example from the code I showed you is loading a dataframe from Redshift. Every time we load one, we write the following three lines:

db = PostgresHook(postgres_conn_id='redshift')
query_annual_deals = 'SELECT * FROM dw.ltv_annual_deals'
annual_deals = db.get_pandas_df(query_annual_deals)

We can extract it into a function:

def load_from_redshift(query):
    db = PostgresHook(postgres_conn_id='redshift')
    return db.get_pandas_df(query)

annual_deals = load_from_redshift('SELECT * FROM dw.ltv_annual_deals')
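
With that helper in place, the duplication disappears from the loaders too. For example, load_annual_deals_df from earlier could shrink to something like this:

def load_annual_deals_df():
    annual_deals = load_from_redshift('SELECT * FROM dw.ltv_annual_deals')
    annual_deals.sort_values(by=['app_id'], inplace=True)
    annual_deals.reset_index(drop=True, inplace=True)
    annual_deals.set_index('app_id', inplace=True)
    return annual_deals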

Q: Does this belong here? Am I doing something similar in other places?

When I experiment in Jupyter notebooks, I don’t know exactly what I’ll need when I start writing the code. Here’s an example I’ve found:

events = load_from_redshift(f"SELECT * FROM events WHERE 1 = 1 AND {extra_conditions}")
events['metadata_parsed'] = events['metadata'].apply(json.loads)
events['suggestion_type'] = events['metadata_parsed'].apply(extract_type_from_metadata)

# some other logic that does not change events['metadata_parsed']

events['trigger_type'] = events['metadata_parsed'].apply(extract_trigger_type_from_metadata)

Extracting the trigger type and the suggestion type are very similar operations: both pull something out of the metadata. In this case, I would not extract them into a function yet, but I’d definitely move them closer to each other:

events = load_from_redshift(f"SELECT * FROM events WHERE 1 = 1 AND {extra_conditions}")
events['metadata_parsed'] = events['metadata'].apply(json.loads)
events['suggestion_type'] = events['metadata_parsed'].apply(extract_type_from_metadata)
events['trigger_type'] = events['metadata_parsed'].apply(extract_trigger_type_from_metadata)

# some other logic that does not change events['metadata_parsed']
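
If more metadata-derived fields like these start piling up, the extraction I’d eventually reach for might look something like this (a sketch with a hypothetical name, not something I’d do yet):

def extract_metadata_fields(events):
    # parses the raw metadata once and derives the typed columns from it
    events['metadata_parsed'] = events['metadata'].apply(json.loads)
    events['suggestion_type'] = events['metadata_parsed'].apply(extract_type_from_metadata)
    events['trigger_type'] = events['metadata_parsed'].apply(extract_trigger_type_from_metadata)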

Q: Is this one-liner easy to read?

A common pattern I see in DS code is packing a lot of logic into df[col_name].apply(lambda x: ...). It often forces the reader to read the implementation and then figure out the meaning from it.

Here’s an example from the code you’ve already seen:

# calculates 'period', which is month difference between 'cohort_date' and 'invoice_date'
transactions['period'] = (transactions['invoice_date'].dt.to_period('M') - transactions['cohort_date'].dt.to_period('M')).apply(lambda p: p.n)

Note that it comes with a comment. The comment is there to help you understand what’s happening. It allows you to skip reading the implementation, which is good (progressive disclosure!).

There is a better way to do the same thing, though:

transactions['period'] = months_between(transactions, 'invoice_date', 'cohort_date')

def months_between(df, col1, col2):
    return (df[col1].dt.to_period('M') - df[col2].dt.to_period('M')).apply(lambda p: p.n)

There is no need for the comment anymore. If a reader wants to read the implementation, they will find the function and read it.
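
The other commented one-liners from the original predict can get the same treatment. For example, the cohort truncation could become something like this (truncate_to_month is a hypothetical name):

transactions['cohort'] = truncate_to_month(transactions, 'cohort_date')

def truncate_to_month(df, col):
    # truncates a date column to a 'YYYY-MM' string
    return df[col].apply(lambda d: '%04d-%02d' % (d.year, d.month))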


Where to go from here?

I’ve explained the main principle behind structuring code so it reads better. With that principle in mind, it’s a “rinse and repeat” exercise: write the code, read it, and refactor it by asking these questions. Find inspiration in code someone else has written and try using the same patterns in your own.

Extract some functions into separate .py files to lighten the notebook. Don’t force the reader to go through all the extracted functions before they get to the main analysis code.
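
A minimal sketch of what that might look like, assuming a hypothetical helpers.py sitting next to the notebook:

# helpers.py
from airflow.hooks.postgres_hook import PostgresHook  # import path may differ across Airflow versions

def load_from_redshift(query):
    db = PostgresHook(postgres_conn_id='redshift')
    return db.get_pandas_df(query)

# first notebook cell
from helpers import load_from_redshift

annual_deals = load_from_redshift('SELECT * FROM dw.ltv_annual_deals')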

And don’t forget that we engineers are often opinionated about code. Sometimes we’ll argue because we prefer different styles, even though we have no good arguments to defend them. :)

  1. I often hear phrases like “the right way”. Such a phrase means nothing without context. There are some properties that are good to maintain as long as the environment allows, and code readability is one of them.