Time complexity of dict vs attributes

heras · Aug-23-2018, 08:12 PM

Hi,

I was going over some data. To store results I manually added empty list attributes to a class as needed. In order to make this dynamic I added a dict attribute that would store the series name(key) and series data(value). This turned out to be slow for large data sets. Then I decided to add attributes dynamically with setattr. This was fast again, like manually setting attributes. But I don't understand why a dict would be that much slower. I think attributes are stored in a dict anyway, so storing my results in a dict would only add one more look-up to each cycle, perhaps doubling execution time. Is there a loop in a loop somewhere that I'm missing?

The examples below are on a 10.000 value data set. With 100.000 values the attr script runs in less than a second, the dict version might take ... an hour? I can't be bothered to wait till it's done.

I tried to make it all as short as possible, please bear with me.

The script reads commands from a text file and runs these for each new data value.

# Operation              data source    destination    parameter

simple_moving_average    data           sma250         250
simple_moving_average    sma250         sma500         500

Here is using a dict:

#!/usr/bin/env python3

class DataFrame:
    def __init__(self):
        self.data = {}

    def load_cmd_register(self, register):
        self.cmd_register = []
        labels = ['operation', 'source', 'destination', 'parameter']
        with open(register) as reg:
            for line in reg:
                if line.startswith('#'):
                    continue
                elif not line.strip():
                    continue
                reg = dict(zip(labels, line.split()))
                reg['parameter'] = int(reg['parameter'])
                self.cmd_register.append(reg)

    def cycle_cmd_register(self):
        for reg in self.cmd_register:
            op    = getattr(self, reg['operation'])
            src   = reg['source']
            dest  = reg['destination']
            param = reg['parameter']
            op(src, dest, param)

    def add_value(self, label, value):
        try:
            self.data[label].append(value)
        except KeyError:
            self.data[label] = [value]

    def simple_moving_average(self, src, dest, window):
        try:
            cumsum = self.data[dest+'_cumsum'][-1] + self.data[src][-1]
        except KeyError:
            cumsum = self.data[src][0]
        self.add_value(dest+'_cumsum', cumsum)

        len_cumsum = len(self.data[dest+'_cumsum'][:])
        if len_cumsum <= window:
            sma = self.data[dest+'_cumsum'][-1] / len_cumsum
        else:
            sma = (self.data[dest+'_cumsum'][-1] \
                  -self.data[dest+'_cumsum'][-window - 1]) / window
        self.add_value(dest, sma)


def read_line(fname):
    with open(fname) as f:
        for data in f:
            yield float(data)

register = 'cmd_register.conf'
df = DataFrame()
df.load_cmd_register(register)

fname = 'data2.txt'
for value in read_line(fname):
    df.add_value('data', value)
    df.cycle_cmd_register()

$ python3 -m cProfile with_dict.py 
         180054 function calls in 2.803 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000    0.000    0.000 _bootlocale.py:23(getpreferredencoding)
        2    0.000    0.000    0.000    0.000 codecs.py:259(__init__)
        2    0.000    0.000    0.000    0.000 codecs.py:308(__init__)
       14    0.000    0.000    0.000    0.000 codecs.py:318(decode)
    10000    0.052    0.000    2.732    0.000 with_dict.py:20(cycle_cmd_register)
    50000    0.050    0.000    0.059    0.000 with_dict.py:28(add_value)
        1    0.024    0.024    2.803    2.803 with_dict.py:3(<module>)
        1    0.000    0.000    0.000    0.000 with_dict.py:3(DataFrame)
    20000    2.607    0.000    2.668    0.000 with_dict.py:34(simple_moving_average)
        1    0.000    0.000    0.000    0.000 with_dict.py:4(__init__)
    10001    0.039    0.000    0.039    0.000 with_dict.py:50(read_line)
        1    0.000    0.000    0.000    0.000 with_dict.py:7(load_cmd_register)
       14    0.000    0.000    0.000    0.000 {built-in method _codecs.utf_8_decode}
        2    0.000    0.000    0.000    0.000 {built-in method _locale.nl_langinfo}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.__build_class__}
        1    0.000    0.000    2.803    2.803 {built-in method builtins.exec}
    20000    0.012    0.000    0.012    0.000 {built-in method builtins.getattr}
    20000    0.011    0.000    0.011    0.000 {built-in method builtins.len}
        2    0.000    0.000    0.000    0.000 {built-in method io.open}
    49997    0.009    0.000    0.009    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 {method 'split' of 'str' objects}
        5    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
        4    0.000    0.000    0.000    0.000 {method 'strip' of 'str' objects}

Here is using attributes:

#!/usr/bin/env python3

class DataFrame:
    def load_cmd_register(self, register):
        self.cmd_register = []
        labels = ['operation', 'source', 'destination', 'parameter']
        with open(register) as reg:
            for line in reg:
                if line.startswith('#'):
                    continue
                elif not line.strip():
                    continue
                reg = dict(zip(labels, line.split()))
                reg['source'] = reg['source'].split(',')
                reg['destination'] = reg['destination'].split(',')
                reg['parameter'] = [int(p) for p in reg['parameter'].split(',')]
                self.cmd_register.append(reg)

    def cycle_cmd_register(self):
        for cmd in self.cmd_register:
            op    = getattr(self, cmd['operation'])
            src   = cmd['source']
            dest  = cmd['destination']
            param = cmd['parameter']
            op(*src, *dest, *param)

    def add_value(self, label, value):
        try:
            attr = getattr(self, label)
            attr.append(value)
        except AttributeError:
            setattr(self, label, [value])

    def simple_moving_average(self, src, dest, window):
        try:
            src_attr  = getattr(self, src)
            dest_attr = getattr(self, dest+'_cumsum')
            cumsum = src_attr[-1] + dest_attr[-1]
        except AttributeError:
            cumsum = src_attr[0]
        self.add_value(dest+'_cumsum', cumsum)

        dest_attr = getattr(self, dest+'_cumsum')
        len_cumsum = len(dest_attr)

        if len_cumsum <= window:
            sma = dest_attr[-1] / len_cumsum
        else:
            sma = (dest_attr[-1] - dest_attr[-window - 1]) / window
        self.add_value(dest, sma)


def read_line(fname):
    with open(fname) as f:
        for data in f:
            yield float(data)

register = 'cmd_register.conf'
df = DataFrame()
df.load_cmd_register(register)

fname = 'data2.txt'
for value in read_line(fname):
    df.add_value('data', value)
    df.cycle_cmd_register()

$ python3 -m cProfile with_attr.py 
         290066 function calls in 0.199 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000    0.000    0.000 _bootlocale.py:23(getpreferredencoding)
        2    0.000    0.000    0.000    0.000 codecs.py:259(__init__)
        2    0.000    0.000    0.000    0.000 codecs.py:308(__init__)
       14    0.000    0.000    0.000    0.000 codecs.py:318(decode)
        2    0.000    0.000    0.000    0.000 with_attr.py:16(<listcomp>)
    10000    0.031    0.000    0.166    0.000 with_attr.py:19(cycle_cmd_register)
    50000    0.033    0.000    0.054    0.000 with_attr.py:27(add_value)
        1    0.012    0.012    0.199    0.199 with_attr.py:3(<module>)
        1    0.000    0.000    0.000    0.000 with_attr.py:3(DataFrame)
    20000    0.064    0.000    0.129    0.000 with_attr.py:34(simple_moving_average)
        1    0.000    0.000    0.000    0.000 with_attr.py:4(load_cmd_register)
    10001    0.010    0.000    0.010    0.000 with_attr.py:53(read_line)
       14    0.000    0.000    0.000    0.000 {built-in method _codecs.utf_8_decode}
        2    0.000    0.000    0.000    0.000 {built-in method _locale.nl_langinfo}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.__build_class__}
        1    0.000    0.000    0.199    0.199 {built-in method builtins.exec}
   130000    0.039    0.000    0.039    0.000 {built-in method builtins.getattr}
    20000    0.003    0.000    0.003    0.000 {built-in method builtins.len}
        5    0.000    0.000    0.000    0.000 {built-in method builtins.setattr}
        2    0.000    0.000    0.000    0.000 {built-in method io.open}
    49997    0.007    0.000    0.007    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        8    0.000    0.000    0.000    0.000 {method 'split' of 'str' objects}
        5    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
        4    0.000    0.000    0.000    0.000 {method 'strip' of 'str' objects}

heras · (This post was last modified: Aug-24-2018, 08:13 PM by heras.)

I've simplified the code some more, hoping to make it more clear. This time I profiled both running over 20.000 values. Most of the time is spent in the simple_moving_average method of the dict version. Maybe a lot of dict look-ups are slow somehow, but why would it get exponentially slower? What can I do to narrow it down some more?

#!/usr/bin/env python3

class DataFrame:
    def __init__(self):
        self.data = {}

    def add_value(self, label, value):
        try:
            self.data[label].append(value)
        except KeyError:
            self.data[label] = [value]

    def simple_moving_average(self, src, dest, window):
        try:
            cumsum = self.data[dest+'_cumsum'][-1] + self.data[src][-1]
        except KeyError:
            cumsum = self.data[src][0]
        self.add_value(dest+'_cumsum', cumsum)

        len_cumsum = len(self.data[dest+'_cumsum'][:])
        if len_cumsum <= window:
            sma = self.data[dest+'_cumsum'][-1] / len_cumsum
        else:
            sma = (self.data[dest+'_cumsum'][-1] \
                  -self.data[dest+'_cumsum'][-window - 1]) / window
        self.add_value(dest, sma)


def read_line(fname):
    with open(fname) as f:
        for data in f:
            yield float(data)

df = DataFrame()
fname = 'data2.txt'
for value in read_line(fname):
    df.add_value('data', value)
    df.simple_moving_average('data'  , 'sma250', 250)
    df.simple_moving_average('sma250', 'sma500', 500)

$ python3 -m cProfile with_dict.py 
         300053 function calls in 14.282 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 _bootlocale.py:23(getpreferredencoding)
        1    0.000    0.000    0.000    0.000 codecs.py:259(__init__)
        1    0.000    0.000    0.000    0.000 codecs.py:308(__init__)
       23    0.000    0.000    0.000    0.000 codecs.py:318(decode)
    40000   13.913    0.000   14.067    0.000 with_dict.py:13(simple_moving_average)
    20001    0.108    0.000    0.108    0.000 with_dict.py:29(read_line)
        1    0.091    0.091   14.282   14.282 with_dict.py:3(<module>)
        1    0.000    0.000    0.000    0.000 with_dict.py:3(DataFrame)
        1    0.000    0.000    0.000    0.000 with_dict.py:4(__init__)
   100000    0.123    0.000    0.142    0.000 with_dict.py:7(add_value)
       23    0.000    0.000    0.000    0.000 {built-in method _codecs.utf_8_decode}
        1    0.000    0.000    0.000    0.000 {built-in method _locale.nl_langinfo}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.__build_class__}
        1    0.000    0.000   14.282   14.282 {built-in method builtins.exec}
    40000    0.028    0.000    0.028    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {built-in method io.open}
    99995    0.020    0.000    0.020    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

#!/usr/bin/env python3

class DataFrame:
    def add_value(self, label, value):
        try:
            attr = getattr(self, label)
            attr.append(value)
        except AttributeError:
            setattr(self, label, [value])

    def simple_moving_average(self, src, dest, window):
        try:
            src_attr  = getattr(self, src)
            dest_attr = getattr(self, dest+'_cumsum')
            cumsum = src_attr[-1] + dest_attr[-1]
        except AttributeError:
            cumsum = src_attr[0]
        self.add_value(dest+'_cumsum', cumsum)

        dest_attr = getattr(self, dest+'_cumsum')
        len_cumsum = len(dest_attr)

        if len_cumsum <= window:
            sma = dest_attr[-1] / len_cumsum
        else:
            sma = (dest_attr[-1] - dest_attr[-window - 1]) / window
        self.add_value(dest, sma)

def read_line(fname):
    with open(fname) as f:
        for data in f:
            yield float(data)

df = DataFrame()
fname = 'data2.txt'
for value in read_line(fname):
    df.add_value('data', value)
    df.simple_moving_average('data'  , 'sma250', 250)
    df.simple_moving_average('sma250', 'sma500', 500)

$ python3 -m cProfile with_attr.py 
         520057 function calls in 0.330 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 _bootlocale.py:23(getpreferredencoding)
        1    0.000    0.000    0.000    0.000 codecs.py:259(__init__)
        1    0.000    0.000    0.000    0.000 codecs.py:308(__init__)
       23    0.000    0.000    0.000    0.000 codecs.py:318(decode)
    40000    0.126    0.000    0.257    0.000 with_attr.py:11(simple_moving_average)
    20001    0.017    0.000    0.017    0.000 with_attr.py:29(read_line)
        1    0.036    0.036    0.330    0.330 with_attr.py:3(<module>)
        1    0.000    0.000    0.000    0.000 with_attr.py:3(DataFrame)
   100000    0.063    0.000    0.104    0.000 with_attr.py:4(add_value)
       23    0.000    0.000    0.000    0.000 {built-in method _codecs.utf_8_decode}
        1    0.000    0.000    0.000    0.000 {built-in method _locale.nl_langinfo}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.__build_class__}
        1    0.000    0.000    0.330    0.330 {built-in method builtins.exec}
   220000    0.067    0.000    0.067    0.000 {built-in method builtins.getattr}
    40000    0.007    0.000    0.007    0.000 {built-in method builtins.len}
        5    0.000    0.000    0.000    0.000 {built-in method builtins.setattr}
        1    0.000    0.000    0.000    0.000 {built-in method io.open}
    99995    0.013    0.000    0.013    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

**Gribouillis** · (This post was last modified: Aug-24-2018, 09:07 PM by Gribouillis.)

Does it change anything if you remove the [:] at line 20 in the first code?

heras · Aug-24-2018, 09:20 PM

You are brilliant!
A small test:

>>> a = {'as': [1,2,3,4,5,6]}
>>> a
{'as': [1, 2, 3, 4, 5, 6]}
>>> len(a['as'])
6
>>> len(a['as'][:])
6

I will see if I can figure out on my own what happens here. The dict version is totally fast now. Thanks!

**Gribouillis** · Aug-24-2018, 09:35 PM

(Aug-24-2018, 09:20 PM)heras Wrote: You are brilliant!

I only used a good diff and merge tool named kdiff3. This tiny [:] was the most obvious difference. Knowing that you were duplicating lists for no reason, the rest was easy.

heras · Aug-24-2018, 10:16 PM

So

len_cumsum = len(self.data[dest+'_cumsum'])

will ask whatever object is associated with dest+'_cumsum' for its length.

len_cumsum = len(self.data[dest+'_cumsum'][:])

will recreate the list inside len(). Will that list be a deep or a shallow copy? Is there a way to tell?

**Gribouillis** · Aug-24-2018, 10:18 PM

(Aug-24-2018, 10:16 PM)heras Wrote: Will that list be a deep or a shallow copy?

A list copy is a shallow copy. A new list is created pointing to the same items.

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	What is the run time complexity of this code and please explain?	samlee916	2	3,487	Nov-06-2020, 02:37 PM Last Post: deanhystad
	Sort a dict in dict	cherry_cherry	4	105,802	Apr-08-2020, 12:25 PM Last Post: perfringo
	Basic Time Complexity Help	Adakip	1	2,779	Aug-25-2019, 09:12 AM Last Post: ThomasL
	[split] Time Complexity of Counting	Mekire	9	11,208	Jan-10-2019, 11:09 AM Last Post: Gribouillis
	How to calculate the mccabe complexity in module falke8?	fuqianz	1	4,264	Oct-16-2017, 02:08 AM Last Post: sparkz_alot

Time complexity of dict vs attributes

User Panel Messages

Announcements