NEW Seminar Series: What You Won’t Learn on Stack Overflow
- Constantly looking to Stack Overflow for answers to your Python, pandas, NumPy, and matplotlib questions?
- Unsure how to conceptually approach Python’s scientific stack?
- Feel like you’re not following “best practices” for Python’s scientific stack?
- Struggle to understand a snippet you’ve copy/pasted into your codebase?
If so, join us for this two-part seminar series!
In this seminar series, we will cover core concepts of NumPy, pandas, and Matplotlib that are often left out of answers on Stack Overflow. We’ll focus on building an understanding of these packages from the ground up rather than on rote mechanical details—instead of cataloguing the endless features these packages offer, we’ll share the concepts that guide experts to solutions for a wide range of data problems.
The first session (Friday, April 22) will focus on NumPy and pandas, discussing where their frameworks converge and diverge so you can organize your approach when working with these packages.
The second session (Friday, April 29) will cover Matplotlib: making sense of its APIs to structure your code and give you full control of your data visualizations.
What are you waiting for? Sign up for these seminars now!
FRI, APR 22, 2022 - What You Won't Learn on Stack Overflow: NumPy & pandas
$0.00 - $89.00
FRI, APR 29, 2022 - What You Won't Learn on Stack Overflow: Matplotlib
$0.00 - $89.00
DUTC: Training AND Consulting
Can't get enough of our courses and seminars? If you attended one of our public sessions and think this material would benefit you and your colleagues, let us know! We deliver individualized, targeted training for teams of all sizes.
Not only that, but did you know that we offer consulting services? Whatever Python issue you encounter, DUTC can help! Our instructors are experts in the field, and they can use their knowledge and experience to solve even the toughest of problems.
To schedule private sessions or learn more, contact us at info@dutc.io.
pandas: Writing a Custom Accessor
There's no doubt that pandas is one of the most flexible tabular data analysis tool kits out there.
On top of that flexibility has arisen a need for customized extensibility: from user-defined types to user-defined arrays, pandas offers highly customizable solutions for the data problems at hand.
However, this flexibility comes at a cost: pandas objects have a vast number of methods and attributes, and are likely to add even more as users request an increasing number of features.
The first step to avoiding feature bloat is simply to deny feature requests. However, this isn't always in the spirit of open source tooling—so the next best alternative is to better enable users to subclass arbitrary pandas objects. Relying on inheritance is a fairly straightforward way to extend the mechanics of a given object, but you might run into issues with overlapping attribute/method names, where some context is necessary to call the correct method or access the correct attribute. If I want to overwrite the .mean() method for a specific object, how can I do that while also keeping my code easily synchronized with the rest of the pandas API?
from pandas import Series

class CustomSeries(Series):
    def mean(self):
        '''
        calculates mean of the data
        and adds mean(ing of life)
        '''
        return super().mean() + 42

s = CustomSeries([1, 2, 3], dtype='int')
s.mean()
# 44.0
Now, if I want the previous mean functionality, I have to explicitly convert back.
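For instance—restating the CustomSeries from above so the snippet stands on its own—constructing a plain Series from the subclass restores the stock behavior:

```python
from pandas import Series

class CustomSeries(Series):
    def mean(self):
        return super().mean() + 42

s = CustomSeries([1, 2, 3], dtype='int')
print(s.mean())        # 44.0

# converting back to a plain Series restores the standard .mean
plain = Series(s)
print(plain.mean())    # 2.0
```

Note that this round-trip constructs a brand-new Series, which means paying reinitialization costs each time.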
I guess we can also use super to access the parent class's method... but I'm not sure how much better this is in terms of clarity (we do skip reinitialization costs, though!):
super(CustomSeries, s).mean()
# 2.0
As you can see, inheritance works for this use case: I am able to manipulate the underlying result of the .mean() method. But this implementation will lead to issues down the road when working with more data. For example, if I want to use other pandas entry points, such as read_csv, I am going to need to manually convert each Series within the resulting DataFrame to a CustomSeries in order to propagate this behavior. I also have to contend with the fact that, once I perform this conversion, I would have to convert back to a regular Series to restore the behavior of the standard .mean method.
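To make that conversion burden concrete, here's a sketch (reusing the CustomSeries subclass; the hand-built DataFrame stands in for one loaded via read_csv):

```python
from pandas import DataFrame, Series

class CustomSeries(Series):
    def mean(self):
        return super().mean() + 42

# stand-in for a frame produced by read_csv
df = DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# manual, column-by-column conversion to propagate the subclass...
converted = {name: CustomSeries(col) for name, col in df.items()}
print(converted['a'].mean())           # 44.0

# ...and conversion back whenever the standard behavior is needed
print(Series(converted['a']).mean())   # 2.0
```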
The added complexity of using inheritance as a means for extensibility comes with a price:
- The user has to jump through hoops to integrate this new class into existing pandas code
- The user loses convenient access to the original functionality
This concern is why pandas accessors arose. This feature offers a simple way to extend the behavior of pandas objects in a namespaced manner. The namespacing makes it unambiguous that a given attribute or method is user-defined, and it prevents the user from overwriting any standard exposed methods (like .mean).
In fact, chances are that you've already interacted with accessors in pandas: if you've ever used or seen Series.str, Series.dt, or Series.cat, you know what the accessor pattern looks like!
Those aforementioned accessors are dtype-specific—Series.str is only available on object, category (if the underlying categories are strings), and string dtyped Series objects. Similarly, .dt is only available on datetime64[ns], and .cat on category.
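You can see that dtype gating directly (assuming a reasonably recent pandas; the exact error message may vary by version):

```python
from pandas import Series

s = Series(['a', 'b'])
print(s.str.upper().tolist())   # ['A', 'B']

# the same accessor is simply unavailable on an incompatible dtype
try:
    Series([1, 2]).str
except AttributeError as e:
    print('AttributeError:', e)
```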
from pandas.api.extensions import register_series_accessor

@register_series_accessor("life")
class MeaningOfLifeAccessor:  # this is the object returned by Series().life
    def __init__(self, _series):
        # we want to be able to access the Series object in other methods
        self._series = _series

    def mean(self):
        return self._series.mean() + 42

s = Series([1, 2, 3])
print(
    s.mean(),
    s.life.mean(),
    sep='\n----\n'
)
# 2.0
# ----
# 44.0
Now we can namespace our custom methods and attributes onto a pandas.Series object! Obviously, estimating the meaning of life isn't a very practical example, so let's look at some common use cases for this type of approach.
A DataFrame accessor provides us with a robust place to hardcode some validation checks and perform specific operations on the assumption that certain columns exist:
from pandas.api.extensions import register_dataframe_accessor

@register_dataframe_accessor('stats')
class StatsAccessor:
    def __init__(self, _df):
        # we want to be able to access the DataFrame object in other methods
        self._df = _df

    def demean(self, cols=None):
        if cols is None:
            cols = self._df.columns
        return self._df[cols] - self._df[cols].mean()

    def mask_outliers(self, cols=None, n_sem=3):
        if cols is None:
            cols = self._df.columns
        demeaned = self.demean(cols=cols)
        return self._df[cols].where(demeaned.abs() < (n_sem * demeaned.sem()))

    def validate(self, cols=None, n_sem=3):
        if self.mask_outliers(cols=cols, n_sem=n_sem).isna().any(axis=None):
            raise ValueError('Data has outliers, do not proceed')
Here you can see a DataFrame accessor used to implement some commonly applied custom statistical functions. We can readily implement various operations that calculate or validate our underlying data. If desired, we can even hardcode column names into our accessor to enforce that specific columns are available to operate on.
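As a quick sketch of that hardcoding idea (the column names and the .prices namespace here are hypothetical; pandas' extension docs recommend raising AttributeError from the accessor's __init__ when validation fails):

```python
from pandas import DataFrame
from pandas.api.extensions import register_dataframe_accessor

@register_dataframe_accessor('prices')
class PricesAccessor:
    REQUIRED = {'ticker', 'close'}  # hypothetical required columns

    def __init__(self, df):
        # validate up front: the accessor refuses frames missing its columns
        missing = self.REQUIRED - set(df.columns)
        if missing:
            raise AttributeError(f'missing required columns: {missing}')
        self._df = df

    def latest(self):
        # last observed close per ticker
        return self._df.groupby('ticker')['close'].last()

prices = DataFrame({'ticker': ['A', 'A', 'B'], 'close': [1.0, 2.0, 3.0]})
print(prices.prices.latest())
```

Accessing .prices on a DataFrame without the required columns raises immediately, so every method on the accessor can assume they exist.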
Let's see this accessor in action though:
from numpy.random import default_rng
from string import ascii_lowercase
from pandas import DataFrame, concat

rng = default_rng(0)
df = DataFrame(
    (data := rng.uniform(1, 100, size=(15, 2))),
    columns=list(ascii_lowercase[:data.shape[1]])
)

parts = {
    'orig': df,
    'demeaned': df.stats.demean(),
    'no_outliers': df.stats.mask_outliers(),
}
print(concat(parts, axis='columns').round(2))
# orig demeaned no_outliers
# a b a b a b
# 0 64.06 27.71 2.31 -18.34 64.06 27.71
# 1 5.06 2.64 -56.69 -43.42 NaN NaN
# 2 81.51 91.36 19.77 45.31 81.51 NaN
# 3 61.06 73.22 -0.69 27.17 61.06 NaN
# 4 54.82 93.57 -6.93 47.52 54.82 NaN
# 5 81.77 1.27 20.02 -44.78 81.77 NaN
# 6 85.88 4.32 24.13 -41.73 NaN NaN
# 7 73.24 18.39 11.49 -27.66 73.24 NaN
# 8 86.45 54.60 24.71 8.55 NaN 54.60
# 9 30.67 42.85 -31.08 -3.21 NaN 42.85
# 10 3.80 13.30 -57.95 -32.75 NaN NaN
# 11 67.39 65.07 5.64 19.02 67.39 65.07
# 12 61.92 38.98 0.17 -7.07 61.92 38.98
# 13 99.72 98.10 37.98 52.05 NaN NaN
# 14 68.87 65.40 7.12 19.34 68.87 65.40
df.stats.validate()
# ---------------------------------------------------------------------------
# ValueError Traceback (most recent call last)
# /tmp/ipykernel_9283/81810956.py in <cell line: 1>()
# ----> 1 df.stats.validate()
#
# /tmp/ipykernel_9283/1009348556.py in validate(self, cols, n_sem)
# 23 def validate(self, cols=None, n_sem=3):
# 24 if self.mask_outliers(cols=cols, n_sem=n_sem).isna().any(axis=None):
# ---> 25 raise ValueError('Data has outliers, do not proceed')
# 26
#
# ValueError: Data has outliers, do not proceed
With all of the implementation covered, I think it's worth asking the question: what's the difference between this and having some module or a class full of staticmethods that perform these operations for us? I'd argue that there isn't much functional difference. However, by using the extension API, we can uniquely attach constraints and calculations onto the DataFrame or Series object itself—readily implementing custom features that fit our use case.
That's all the time we have for this week! Make sure you stay tuned for future editions, as I am planning to dive deeper into how this accessor pattern works at the object level: what are those magical register_*_accessor class decorators doing, and how do they work?
Additionally, I am planning to share some of the most absurd and unconventional applications of the accessor pattern—convenient things that pandas is capable of doing, but not necessarily designed to do. As any consumer has said at some point: “It's not always about what the product was designed to do, but what can I make it do.” Talk to you all later!
PyCamp Spain 2022
April 15–18, 2022
PyCamp Spain is a long weekend: 4 days and 3 nights with full board (breakfast, lunch, and dinner) in Girona, Spain. You'll be able to spend as many hours as you like programming on that project or idea you've long wanted to code but could never find the perfect time or place for.
DUTC: Developing Expertise in Python & pandas
April 18–20, 2022
In this course, attendees will spend three days in hands-on live-coding and discussion sessions with James. Together we will uncover the concepts and mental models that lead to true expertise.
DUTC: What You Won't Learn on Stack Overflow: matplotlib
April 29, 2022
In this seminar, we'll answer these questions:
- How do I line things up?
- What if I need more dimensions?
- How can I reshape things?
- How should I think about groups vs. grids?
- How can I extend NumPy? How can I extend pandas?
PyCon US 2022
April 27–May 5, 2022
PyCon US is the largest annual gathering for the community using and developing the open-source Python programming language. It is produced and underwritten by the Python Software Foundation, the 501(c)(3) nonprofit organization dedicated to advancing and promoting Python. Through PyCon US, the PSF advances its mission of growing the international community of Python programmers.
PyCon LT 2022
May 26–27, 2022
DUTC's James Powell will be attending as a keynote speaker!
PyCon LT is a community event that brings together new and experienced Python users. Their goals are to grow and support a community of Python users, encourage learning and knowledge sharing, and popularize Python tools/libraries and open source in general. You can find more information on their website or Facebook page.
PyCon IT 2022
June 2–5, 2022
PyCon Italia is the Italian conference on Python. Organised by Python Italia, it has now become one of the most important Python conferences in Europe. With over 700 attendees, the next edition will be the 12th.
GeoPython 2022
June 20–22, 2022
The conference is focused on Python and Geo, its toolkits and applications. GeoPython 2022 will be the continuation of the yearly conference series. The conference started in 2016 and this is now the 7th edition. Key subjects for the conference series are the combination of the Python programming language and Geo.
EuroPython 2022
July 11–17, 2022
DUTC's James Powell and Cameron Riddell will be participating in the EuroPython mentorship program!
Welcome to the 21st EuroPython. We're the oldest and longest-running, volunteer-led Python programming conference on the planet! Join us in July in the beautiful and vibrant city of Dublin. We'll be together, face to face and online, to celebrate our shared passion for Python and its community!