kwarray.dataframe_light

A faster-than-pandas pandas-like interface to column-major data, in the case where the data only needs to be accessed by index.

For data that requires more complex ids, use pandas instead.

Module Contents

Classes

LocLight

DataFrameLight

Implements a subset of the pandas.DataFrame API

DataFrameArray

DataFrameLight assumes the backend is a Dict[list]; DataFrameArray assumes the backend is a Dict[ndarray]

Attributes

pd

__version__

kwarray.dataframe_light.pd
kwarray.dataframe_light.__version__ = 0.0.1
class kwarray.dataframe_light.LocLight(parent)

Bases: object

__getitem__(self, index)
class kwarray.dataframe_light.DataFrameLight(data=None, columns=None)

Bases: ubelt.NiceRepr

Implements a subset of the pandas.DataFrame API

The API is restricted to facilitate speed tradeoffs

Notes

Assumes underlying data is Dict[list|ndarray]. If the data is known to be a Dict[ndarray] use DataFrameArray instead, which has faster implementations for some operations.
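
To make the assumed layout concrete, here is a minimal sketch of row access on column-major data stored as a plain dict of lists. It does not use DataFrameLight itself; the helper name get_row is hypothetical and only illustrates the idea of gathering one value per column.

```python
# Hypothetical stand-in for the column-major backend: one list per column.
data = {'a': [0, 1, 2], 'b': [2, 3, 4]}


def get_row(data, index):
    """Collect the value at position ``index`` from every column."""
    return {key: column[index] for key, column in data.items()}


row = get_row(data, 0)
print(row)  # {'a': 0, 'b': 2}
```

Because each lookup is a plain list index per column, no index object has to be consulted, which is the restriction the module trades for speed.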

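pandas.DataFrame is slow for row-by-row access: in the benchmark below, 1000 iloc lookups take about 100 ms with pandas versus about 1 ms with DataFrameLight. The tradeoff is a more restrictive API.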

Example

>>> self = DataFrameLight({})
>>> print('self = {!r}'.format(self))
>>> self = DataFrameLight({'a': [0, 1, 2], 'b': [2, 3, 4]})
>>> print('self = {!r}'.format(self))
>>> item = self.iloc[0]
>>> print('item = {!r}'.format(item))
Benchmark:
>>> # BENCHMARK
>>> # xdoc: +REQUIRES(--bench)
>>> from kwarray.dataframe_light import *  # NOQA
>>> import ubelt as ub
>>> NUM = 1000
>>> print('NUM = {!r}'.format(NUM))
>>> # to_dict conversions
>>> print('==============')
>>> print('====== to_dict conversions =====')
>>> _keys = ['list', 'dict', 'series', 'split', 'records', 'index']
>>> results = []
>>> df = DataFrameLight._demodata(num=NUM).pandas()
>>> ti = ub.Timerit(verbose=False, unit='ms')
>>> for key in _keys:
>>>     result = ti.reset(key).call(lambda: df.to_dict(orient=key))
>>>     results.append((result.mean(), result.report()))
>>> key = 'series+numpy'
>>> result = ti.reset(key).call(lambda: {k: v.values for k, v in df.to_dict(orient='series').items()})
>>> results.append((result.mean(), result.report()))
>>> print('\n'.join([t[1] for t in sorted(results)]))
>>> print('==============')
>>> print('====== DFLight Conversions =======')
>>> ti = ub.Timerit(verbose=True, unit='ms')
>>> key = 'self.pandas'
>>> self = DataFrameLight(df)
>>> ti.reset(key).call(lambda: self.pandas())
>>> key = 'light-from-pandas'
>>> ti.reset(key).call(lambda: DataFrameLight(df))
>>> key = 'light-from-dict'
>>> ti.reset(key).call(lambda: DataFrameLight(self._data))
>>> print('==============')
>>> print('====== BENCHMARK: .LOC[] =======')
>>> ti = ub.Timerit(num=20, bestof=4, verbose=True, unit='ms')
>>> df_light = DataFrameLight._demodata(num=NUM)
>>> # xdoctest: +REQUIRES(module:pandas)
>>> df_heavy = df_light.pandas()
>>> series_data = df_heavy.to_dict(orient='series')
>>> list_data = df_heavy.to_dict(orient='list')
>>> np_data = {k: v.values for k, v in df_heavy.to_dict(orient='series').items()}
>>> for timer in ti.reset('DF-heavy.iloc'):
>>>     with timer:
>>>         for i in range(NUM):
>>>             df_heavy.iloc[i]
>>> for timer in ti.reset('DF-heavy.loc'):
>>>     with timer:
>>>         for i in range(NUM):
>>>             df_heavy.loc[i]
>>> for timer in ti.reset('dict[SERIES].loc'):
>>>     with timer:
>>>         for i in range(NUM):
>>>             {key: series_data[key].loc[i] for key in series_data.keys()}
>>> for timer in ti.reset('dict[SERIES].iloc'):
>>>     with timer:
>>>         for i in range(NUM):
>>>             {key: series_data[key].iloc[i] for key in series_data.keys()}
>>> for timer in ti.reset('dict[SERIES][]'):
>>>     with timer:
>>>         for i in range(NUM):
>>>             {key: series_data[key][i] for key in series_data.keys()}
>>> for timer in ti.reset('dict[NDARRAY][]'):
>>>     with timer:
>>>         for i in range(NUM):
>>>             {key: np_data[key][i] for key in np_data.keys()}
>>> for timer in ti.reset('dict[list][]'):
>>>     with timer:
>>>         for i in range(NUM):
>>>             {key: list_data[key][i] for key in list_data.keys()}
>>> for timer in ti.reset('DF-Light.iloc/loc'):
>>>     with timer:
>>>         for i in range(NUM):
>>>             df_light.iloc[i]
>>> for timer in ti.reset('DF-Light._getrow'):
>>>     with timer:
>>>         for i in range(NUM):
>>>             df_light._getrow(i)
NUM = 1000
==============
====== to_dict conversions =====
Timed best=0.022 ms, mean=0.022 ± 0.0 ms for series
Timed best=0.059 ms, mean=0.059 ± 0.0 ms for series+numpy
Timed best=0.315 ms, mean=0.315 ± 0.0 ms for list
Timed best=0.895 ms, mean=0.895 ± 0.0 ms for dict
Timed best=2.705 ms, mean=2.705 ± 0.0 ms for split
Timed best=5.474 ms, mean=5.474 ± 0.0 ms for records
Timed best=7.320 ms, mean=7.320 ± 0.0 ms for index
==============
====== DFLight Conversions =======
Timed best=1.798 ms, mean=1.798 ± 0.0 ms for self.pandas
Timed best=0.064 ms, mean=0.064 ± 0.0 ms for light-from-pandas
Timed best=0.010 ms, mean=0.010 ± 0.0 ms for light-from-dict
==============
====== BENCHMARK: .LOC[] =======
Timed best=101.365 ms, mean=101.564 ± 0.2 ms for DF-heavy.iloc
Timed best=102.038 ms, mean=102.273 ± 0.2 ms for DF-heavy.loc
Timed best=29.357 ms, mean=29.449 ± 0.1 ms for dict[SERIES].loc
Timed best=21.701 ms, mean=22.014 ± 0.3 ms for dict[SERIES].iloc
Timed best=11.469 ms, mean=11.566 ± 0.1 ms for dict[SERIES][]
Timed best=0.807 ms, mean=0.826 ± 0.0 ms for dict[NDARRAY][]
Timed best=0.478 ms, mean=0.492 ± 0.0 ms for dict[list][]
Timed best=0.969 ms, mean=0.994 ± 0.0 ms for DF-Light.iloc/loc
Timed best=0.760 ms, mean=0.776 ± 0.0 ms for DF-Light._getrow
property iloc(self)
property values(self)
property loc(self)
__eq__(self, other)

Example

>>> # xdoctest: +REQUIRES(module:pandas)
>>> self = DataFrameLight._demodata(num=7)
>>> other = self.pandas()
>>> assert np.all(self == other)
to_string(self, *args, **kwargs)
to_dict(self, orient='dict', into=dict)

Convert the data frame into a dictionary.

Parameters
  • orient (str) – Currently natively supports orient in {‘dict’, ‘list’}; otherwise we fall back to converting to pandas and calling its to_dict method.

  • into (type) – type of dictionary to transform into

Returns

dict

Example

>>> from kwarray.dataframe_light import *  # NOQA
>>> self = DataFrameLight._demodata(num=7)
>>> print(self.to_dict(orient='dict'))
>>> print(self.to_dict(orient='list'))
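
As a rough illustration of the two natively supported orientations, the sketch below works on a plain dict of lists rather than a real DataFrameLight. It assumes the pandas conventions: orient='list' maps each column to its list of values, and orient='dict' maps each column to an {index: value} dict. The helper name to_dict_sketch is hypothetical.

```python
def to_dict_sketch(data, orient='dict'):
    """Convert a dict of column lists following the pandas orient naming."""
    if orient == 'list':
        # {column: [values]}
        return {key: list(vals) for key, vals in data.items()}
    if orient == 'dict':
        # {column: {row_index: value}}
        return {key: dict(enumerate(vals)) for key, vals in data.items()}
    raise KeyError('orient={!r} is not handled in this sketch'.format(orient))


data = {'foo': [0, 0, 0], 'bar': [0, 1, 2]}
as_list = to_dict_sketch(data, orient='list')
as_dict = to_dict_sketch(data, orient='dict')
print(as_list)  # {'foo': [0, 0, 0], 'bar': [0, 1, 2]}
print(as_dict)  # {'foo': {0: 0, 1: 0, 2: 0}, 'bar': {0: 0, 1: 1, 2: 2}}
```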
pandas(self)

Convert back to pandas if you need the full API

Example

>>> # xdoctest: +REQUIRES(module:pandas)
>>> df_light = DataFrameLight._demodata(num=7)
>>> df_heavy = df_light.pandas()
>>> got = DataFrameLight(df_heavy)
>>> assert got._data == df_light._data
_pandas(self)

Deprecated, use self.pandas instead

classmethod _demodata(cls, num=7)

Example

>>> self = DataFrameLight._demodata(num=7)
>>> print('self = {!r}'.format(self))
>>> other = DataFrameLight._demodata(num=11)
>>> print('other = {!r}'.format(other))
>>> both = self.union(other)
>>> print('both = {!r}'.format(both))
>>> assert both is not self
>>> assert other is not self
__nice__(self)
__len__(self)
__contains__(self, item)
__normalize__(self)

Try to convert input data to Dict[List]

property columns(self)
sort_values(self, key, inplace=False, ascending=True)
keys(self)
_getrow(self, index)
_getcol(self, key)
_getcols(self, keys)
get(self, key, default=None)

Get item for given key. Returns default value if not found.

clear(self)

Removes all rows inplace

__getitem__(self, key)

Note

Only handles the case where key is a single column name.

Example

>>> df_light = DataFrameLight._demodata(num=7)
>>> sub1 = df_light['bar']
>>> # xdoctest: +REQUIRES(module:pandas)
>>> df_heavy = df_light.pandas()
>>> sub2 = df_heavy['bar']
>>> assert np.all(sub1 == sub2)
__setitem__(self, key, value)

Note

Only handles the case where key is a single column name and value is an array of all the values to set.

Example

>>> df_light = DataFrameLight._demodata(num=7)
>>> value = [2] * len(df_light)
>>> df_light['bar'] = value
>>> # xdoctest: +REQUIRES(module:pandas)
>>> df_heavy = df_light.pandas()
>>> df_heavy['bar'] = value
>>> assert np.all(df_light == df_heavy)
compress(self, flags, inplace=False)

NOTE: NOT A PART OF THE PANDAS API

take(self, indices, inplace=False)

Return the elements in the given positional indices along an axis.

Parameters

inplace (bool) – NOT PART OF PANDAS API

Notes

assumes axis=0

Example

>>> df_light = DataFrameLight._demodata(num=7)
>>> indices = [0, 2, 3]
>>> sub1 = df_light.take(indices)
>>> # xdoctest: +REQUIRES(module:pandas)
>>> df_heavy = df_light.pandas()
>>> sub2 = df_heavy.take(indices)
>>> assert np.all(sub1 == sub2)
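
The following sketch shows what positional selection looks like on a dict-of-lists backend. It is not the library's code: take_sketch and compress_sketch are hypothetical helpers, and the compress behaviour (keep the rows whose flag is truthy, in the manner of itertools.compress) is an assumption, since the entry above does not spell it out.

```python
from itertools import compress as iter_compress


def take_sketch(data, indices):
    """Return new columns holding only the rows at the given positions."""
    return {key: [vals[i] for i in indices] for key, vals in data.items()}


def compress_sketch(data, flags):
    """Return new columns holding only the rows whose flag is truthy."""
    return {key: list(iter_compress(vals, flags)) for key, vals in data.items()}


data = {'foo': [10, 11, 12, 13], 'bar': [0, 1, 2, 3]}
taken = take_sketch(data, [0, 2, 3])
kept = compress_sketch(data, [True, False, True, True])
print(taken)  # {'foo': [10, 12, 13], 'bar': [0, 2, 3]}
print(kept)   # {'foo': [10, 12, 13], 'bar': [0, 2, 3]}
```

Both helpers build new lists and leave the input untouched, which corresponds to inplace=False.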
copy(self)
extend(self, other)

Extend self inplace using another dataframe array

Parameters

other (DataFrameLight | dict[str, Sequence]) – values to concat to end of this object

Note

Not part of the pandas API

Example

>>> self = DataFrameLight(columns=['foo', 'bar'])
>>> other = {'foo': [0], 'bar': [1]}
>>> self.extend(other)
>>> assert len(self) == 1
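
For intuition, an in-place extend over a dict of lists could look like the sketch below. The helper extend_sketch is hypothetical and assumes that other provides a sequence for every column already present in the target.

```python
def extend_sketch(data, other):
    """Append the rows in ``other`` to the columns of ``data`` in place."""
    for key, vals in data.items():
        vals.extend(other[key])


data = {'foo': [], 'bar': []}
extend_sketch(data, {'foo': [0], 'bar': [1]})
extend_sketch(data, {'foo': [5, 6], 'bar': [7, 8]})
print(data)  # {'foo': [0, 5, 6], 'bar': [1, 7, 8]}
```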
union(self, *others)

Note

Not part of the pandas API

classmethod concat(cls, others)
classmethod from_pandas(cls, df)
classmethod from_dict(cls, records)
reset_index(self, drop=False)

No-op for compatibility; the light version doesn't store an index.

groupby(self, by=None, *args, **kwargs)

Group rows by the value of a column. Unlike pandas, this simply returns a zip object. To ensure compatibility, call list on the result of groupby.

Parameters
  • by (str) – column name to group by

  • *args – if specified, the dataframe is coerced to pandas

  • **kwargs – if specified, the dataframe is coerced to pandas

Example

>>> df_light = DataFrameLight._demodata(num=7)
>>> res1 = list(df_light.groupby('bar'))
>>> # xdoctest: +REQUIRES(module:pandas)
>>> df_heavy = df_light.pandas()
>>> res2 = list(df_heavy.groupby('bar'))
>>> assert len(res1) == len(res2)
>>> assert all([np.all(a[1] == b[1]) for a, b in zip(res1, res2)])
Ignore:
>>> self = DataFrameLight._demodata(num=1000)
>>> args = ['cx']
>>> self['cx'] = (np.random.rand(len(self)) * 10).astype(int)
>>> # As expected, our custom restricted implementation is faster
>>> # than pandas
>>> ub.Timerit(100).call(lambda: dict(list(self.pandas().groupby('cx')))).print()
>>> ub.Timerit(100).call(lambda: dict(self.groupby('cx'))).print()
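
The reason for wrapping the groupby result in list is that a zip object can only be iterated once. The sketch below shows this on a plain dict of lists; groupby_sketch is a hypothetical helper, and sorting the group keys is an assumption made to mirror the pandas default.

```python
def groupby_sketch(data, by):
    """Group rows by the values of column ``by``; returns a one-shot zip."""
    groups = {}
    for index, value in enumerate(data[by]):
        groups.setdefault(value, []).append(index)
    keys = sorted(groups)
    subframes = [
        {key: [vals[i] for i in groups[k]] for key, vals in data.items()}
        for k in keys
    ]
    return zip(keys, subframes)


data = {'foo': [10, 11, 12, 13], 'bar': [1, 0, 1, 0]}
result = groupby_sketch(data, 'bar')
first_pass = list(result)   # consumes the zip
second_pass = list(result)  # nothing left to iterate
print(first_pass)
print(second_pass)  # []
```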
rename(self, mapper=None, columns=None, axis=None, inplace=False)

Rename the columns (index renaming is not supported)

Example

>>> df_light = DataFrameLight._demodata(num=7)
>>> mapper = {'foo': 'fi'}
>>> res1 = df_light.rename(columns=mapper)
>>> res3 = df_light.rename(mapper, axis=1)
>>> # xdoctest: +REQUIRES(module:pandas)
>>> df_heavy = df_light.pandas()
>>> res2 = df_heavy.rename(columns=mapper)
>>> res4 = df_heavy.rename(mapper, axis=1)
>>> assert np.all(res1 == res2)
>>> assert np.all(res3 == res2)
>>> assert np.all(res3 == res4)
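
On a dict-of-lists backend, renaming columns amounts to rebuilding the dictionary with new keys. The sketch below is an illustration only; rename_sketch is a hypothetical helper that returns a new dict and leaves unmapped columns unchanged.

```python
def rename_sketch(data, columns):
    """Return a new dict whose keys are renamed according to ``columns``."""
    return {columns.get(key, key): vals for key, vals in data.items()}


data = {'foo': [0, 0], 'bar': [0, 1]}
renamed = rename_sketch(data, {'foo': 'fi'})
print(renamed)  # {'fi': [0, 0], 'bar': [0, 1]}
```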
iterrows(self)

Iterate over rows as (index, Dict) pairs.

Yields

Tuple[int, Dict] – the index and a dictionary representing a row

Example

>>> from kwarray.dataframe_light import *  # NOQA
>>> self = DataFrameLight._demodata(num=3)
>>> print(ub.repr2(list(self.iterrows())))
[
    (0, {'bar': 0, 'baz': 2.73, 'foo': 0}),
    (1, {'bar': 1, 'baz': 2.73, 'foo': 0}),
    (2, {'bar': 2, 'baz': 2.73, 'foo': 0}),
]
Benchmark:
>>> # xdoc: +REQUIRES(--bench)
>>> from kwarray.dataframe_light import *  # NOQA
>>> import ubelt as ub
>>> df_light = DataFrameLight._demodata(num=1000)
>>> df_heavy = df_light.pandas()
>>> ti = ub.Timerit(21, bestof=3, verbose=2, unit='ms')
>>> ti.reset('light').call(lambda: list(df_light.iterrows()))
>>> ti.reset('heavy').call(lambda: list(df_heavy.iterrows()))
>>> # xdoctest: +IGNORE_WANT
Timed light for: 21 loops, best of 3
    time per loop: best=0.834 ms, mean=0.850 ± 0.0 ms
Timed heavy for: 21 loops, best of 3
    time per loop: best=45.007 ms, mean=45.633 ± 0.5 ms
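
The (index, dict) pairs that iterrows yields can be sketched with a small generator over a dict of lists. This is an illustration under the assumption that all columns have equal length; iterrows_sketch is a hypothetical name, not the library's implementation.

```python
def iterrows_sketch(data):
    """Yield (index, row_dict) pairs from a dict of equal-length columns."""
    # Use the first column to determine the row count; empty data has 0 rows.
    num_rows = len(next(iter(data.values()), []))
    for index in range(num_rows):
        yield index, {key: vals[index] for key, vals in data.items()}


data = {'foo': [0, 0], 'bar': [0, 1]}
rows = list(iterrows_sketch(data))
print(rows)  # [(0, {'foo': 0, 'bar': 0}), (1, {'foo': 0, 'bar': 1})]
```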
class kwarray.dataframe_light.DataFrameArray(data=None, columns=None)

Bases: DataFrameLight

DataFrameLight assumes the backend is a Dict[list]; DataFrameArray assumes the backend is a Dict[ndarray].

Take and compress are much faster, but extend and union are slower.

__normalize__(self)

Try to convert input data to Dict[ndarray]

extend(self, other)

Extend self inplace using another dataframe array

Parameters

other (DataFrameLight | dict[str, Sequence]) – values to concat to end of this object

Note

Not part of the pandas API

Example

>>> self = DataFrameLight(columns=['foo', 'bar'])
>>> other = {'foo': [0], 'bar': [1]}
>>> self.extend(other)
>>> assert len(self) == 1
compress(self, flags, inplace=False)

NOTE: NOT A PART OF THE PANDAS API

take(self, indices, inplace=False)

Return the elements in the given positional indices along an axis.

Parameters

inplace (bool) – NOT PART OF PANDAS API

Notes

assumes axis=0

Example

>>> df_light = DataFrameLight._demodata(num=7)
>>> indices = [0, 2, 3]
>>> sub1 = df_light.take(indices)
>>> # xdoctest: +REQUIRES(module:pandas)
>>> df_heavy = df_light.pandas()
>>> sub2 = df_heavy.take(indices)
>>> assert np.all(sub1 == sub2)