`kwarray.dataframe_light`¶

A faster-than-pandas, pandas-like interface to column-major data, for the case where rows only need to be accessed by integer index.

For data that needs more complex ids, use pandas instead.
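
To make "column-major data accessed by index" concrete, here is a minimal pure-Python sketch. It is not the library's code: the helper name `get_row` is hypothetical, and the layout simply mirrors the `dict[list][]` case timed in the benchmark further down.

```python
# Column-major layout: one list per column, all lists the same length.
data = {
    'foo': [0, 0, 0],
    'bar': [0, 1, 2],
    'baz': [2.73, 2.73, 2.73],
}


def get_row(columns, index):
    """Hypothetical helper: gather the value at ``index`` from every column."""
    return {key: values[index] for key, values in columns.items()}


row = get_row(data, 1)
print(row)  # {'foo': 0, 'bar': 1, 'baz': 2.73}
```

Row access is a single list lookup per column, which is why no index structure is required.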

Module Contents¶

Classes¶

`LocLight`
`DataFrameLight`	Implements a subset of the pandas.DataFrame API
`DataFrameArray`	A DataFrameLight variant that assumes the backend is a Dict[ndarray]

Attributes¶

`pd`
`__version__`

kwarray.dataframe_light.pd¶

kwarray.dataframe_light.__version__ = 0.0.1¶

class kwarray.dataframe_light.LocLight(parent)¶

Bases: object

__getitem__(self, index)¶

class kwarray.dataframe_light.DataFrameLight(data=None, columns=None)¶

Bases: ubelt.NiceRepr

Implements a subset of the pandas.DataFrame API

The API is deliberately restricted, trading flexibility for speed.

Notes

Assumes the underlying data is a Dict[list|ndarray]. If the data is known to be a Dict[ndarray], use DataFrameArray instead, which has faster implementations for some operations.

pandas.DataFrame is slow for row-wise access; DataFrameLight is faster, but somewhat more restrictive.

Example

>>> self = DataFrameLight({})
>>> print('self = {!r}'.format(self))
>>> self = DataFrameLight({'a': [0, 1, 2], 'b': [2, 3, 4]})
>>> print('self = {!r}'.format(self))
>>> item = self.iloc[0]
>>> print('item = {!r}'.format(item))

Benchmark:

>>> # BENCHMARK
>>> # xdoc: +REQUIRES(--bench)
>>> from kwarray.dataframe_light import *  # NOQA
>>> import ubelt as ub
>>> NUM = 1000
>>> print('NUM = {!r}'.format(NUM))
>>> # to_dict conversions
>>> print('==============')
>>> print('====== to_dict conversions =====')
>>> _keys = ['list', 'dict', 'series', 'split', 'records', 'index']
>>> results = []
>>> df = DataFrameLight._demodata(num=NUM).pandas()
>>> ti = ub.Timerit(verbose=False, unit='ms')
>>> for key in _keys:
>>>     result = ti.reset(key).call(lambda: df.to_dict(orient=key))
>>>     results.append((result.mean(), result.report()))
>>> key = 'series+numpy'
>>> result = ti.reset(key).call(lambda: {k: v.values for k, v in df.to_dict(orient='series').items()})
>>> results.append((result.mean(), result.report()))
>>> print('\n'.join([t[1] for t in sorted(results)]))
>>> print('==============')
>>> print('====== DFLight Conversions =======')
>>> ti = ub.Timerit(verbose=True, unit='ms')
>>> key = 'self.pandas'
>>> self = DataFrameLight(df)
>>> ti.reset(key).call(lambda: self.pandas())
>>> key = 'light-from-pandas'
>>> ti.reset(key).call(lambda: DataFrameLight(df))
>>> key = 'light-from-dict'
>>> ti.reset(key).call(lambda: DataFrameLight(self._data))
>>> print('==============')
>>> print('====== BENCHMARK: .LOC[] =======')
>>> ti = ub.Timerit(num=20, bestof=4, verbose=True, unit='ms')
>>> df_light = DataFrameLight._demodata(num=NUM)
>>> # xdoctest: +REQUIRES(module:pandas)
>>> df_heavy = df_light.pandas()
>>> series_data = df_heavy.to_dict(orient='series')
>>> list_data = df_heavy.to_dict(orient='list')
>>> np_data = {k: v.values for k, v in df_heavy.to_dict(orient='series').items()}
>>> for timer in ti.reset('DF-heavy.iloc'):
>>>     with timer:
>>>         for i in range(NUM):
>>>             df_heavy.iloc[i]
>>> for timer in ti.reset('DF-heavy.loc'):
>>>     with timer:
>>>         for i in range(NUM):
>>>             df_heavy.loc[i]
>>> for timer in ti.reset('dict[SERIES].loc'):
>>>     with timer:
>>>         for i in range(NUM):
>>>             {key: series_data[key].loc[i] for key in series_data.keys()}
>>> for timer in ti.reset('dict[SERIES].iloc'):
>>>     with timer:
>>>         for i in range(NUM):
>>>             {key: series_data[key].iloc[i] for key in series_data.keys()}
>>> for timer in ti.reset('dict[SERIES][]'):
>>>     with timer:
>>>         for i in range(NUM):
>>>             {key: series_data[key][i] for key in series_data.keys()}
>>> for timer in ti.reset('dict[NDARRAY][]'):
>>>     with timer:
>>>         for i in range(NUM):
>>>             {key: np_data[key][i] for key in np_data.keys()}
>>> for timer in ti.reset('dict[list][]'):
>>>     with timer:
>>>         for i in range(NUM):
>>>             {key: list_data[key][i] for key in list_data.keys()}
>>> for timer in ti.reset('DF-Light.iloc/loc'):
>>>     with timer:
>>>         for i in range(NUM):
>>>             df_light.iloc[i]
>>> for timer in ti.reset('DF-Light._getrow'):
>>>     with timer:
>>>         for i in range(NUM):
>>>             df_light._getrow(i)
NUM = 1000
==============
====== to_dict conversions =====
Timed best=0.022 ms, mean=0.022 ± 0.0 ms for series
Timed best=0.059 ms, mean=0.059 ± 0.0 ms for series+numpy
Timed best=0.315 ms, mean=0.315 ± 0.0 ms for list
Timed best=0.895 ms, mean=0.895 ± 0.0 ms for dict
Timed best=2.705 ms, mean=2.705 ± 0.0 ms for split
Timed best=5.474 ms, mean=5.474 ± 0.0 ms for records
Timed best=7.320 ms, mean=7.320 ± 0.0 ms for index
==============
====== DFLight Conversions =======
Timed best=1.798 ms, mean=1.798 ± 0.0 ms for self.pandas
Timed best=0.064 ms, mean=0.064 ± 0.0 ms for light-from-pandas
Timed best=0.010 ms, mean=0.010 ± 0.0 ms for light-from-dict
==============
====== BENCHMARK: .LOC[] =======
Timed best=101.365 ms, mean=101.564 ± 0.2 ms for DF-heavy.iloc
Timed best=102.038 ms, mean=102.273 ± 0.2 ms for DF-heavy.loc
Timed best=29.357 ms, mean=29.449 ± 0.1 ms for dict[SERIES].loc
Timed best=21.701 ms, mean=22.014 ± 0.3 ms for dict[SERIES].iloc
Timed best=11.469 ms, mean=11.566 ± 0.1 ms for dict[SERIES][]
Timed best=0.807 ms, mean=0.826 ± 0.0 ms for dict[NDARRAY][]
Timed best=0.478 ms, mean=0.492 ± 0.0 ms for dict[list][]
Timed best=0.969 ms, mean=0.994 ± 0.0 ms for DF-Light.iloc/loc
Timed best=0.760 ms, mean=0.776 ± 0.0 ms for DF-Light._getrow

property iloc(self)¶

property values(self)¶

property loc(self)¶

__eq__(self, other)¶

Example

>>> # xdoctest: +REQUIRES(module:pandas)
>>> import numpy as np
>>> self = DataFrameLight._demodata(num=7)
>>> other = self.pandas()
>>> assert np.all(self == other)

to_string(self, *args, **kwargs)¶

to_dict(self, orient='dict', into=dict)¶

Convert the data frame into a dictionary.

Parameters

orient (str) – Currently natively supports orient in {'dict', 'list'}; otherwise falls back to a pandas conversion and calls its to_dict method.
into (type) – type of dictionary to transform into

Returns

dict

Example

>>> from kwarray.dataframe_light import *  # NOQA
>>> self = DataFrameLight._demodata(num=7)
>>> print(self.to_dict(orient='dict'))
>>> print(self.to_dict(orient='list'))
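
The two natively supported orientations can be illustrated with a small pure-Python sketch. This assumes the output shapes follow the pandas conventions for these orientations (`'list'` gives `{column: [values]}`, `'dict'` gives `{column: {index: value}}`); the function `to_dict_sketch` is a hypothetical stand-in, not the library's method.

```python
def to_dict_sketch(columns, orient='dict'):
    """Hypothetical stand-in for the two natively supported orientations."""
    if orient == 'list':
        # {column: [values]}
        return {key: list(values) for key, values in columns.items()}
    if orient == 'dict':
        # {column: {row_index: value}}
        return {key: dict(enumerate(values)) for key, values in columns.items()}
    # The real method is documented to fall back to pandas here.
    raise NotImplementedError(orient)


data = {'foo': [0, 0], 'bar': [0, 1]}
print(to_dict_sketch(data, 'list'))  # {'foo': [0, 0], 'bar': [0, 1]}
print(to_dict_sketch(data, 'dict'))  # {'foo': {0: 0, 1: 0}, 'bar': {0: 0, 1: 1}}
```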

pandas(self)¶

Convert back to pandas if you need the full API

Example

>>> # xdoctest: +REQUIRES(module:pandas)
>>> df_light = DataFrameLight._demodata(num=7)
>>> df_heavy = df_light.pandas()
>>> got = DataFrameLight(df_heavy)
>>> assert got._data == df_light._data

_pandas(self)¶: Deprecated, use self.pandas instead

classmethod _demodata(cls, num=7)¶

Example

>>> self = DataFrameLight._demodata(num=7)
>>> print('self = {!r}'.format(self))
>>> other = DataFrameLight._demodata(num=11)
>>> print('other = {!r}'.format(other))
>>> both = self.union(other)
>>> print('both = {!r}'.format(both))
>>> assert both is not self
>>> assert other is not self

__nice__(self)¶

__len__(self)¶

__contains__(self, item)¶

__normalize__(self)¶: Try to convert input data to Dict[List]

property columns(self)¶

sort_values(self, key, inplace=False, ascending=True)¶

keys(self)¶

_getrow(self, index)¶

_getcol(self, key)¶

_getcols(self, keys)¶

get(self, key, default=None)¶: Get item for given key. Returns default value if not found.

clear(self)¶: Removes all rows inplace

__getitem__(self, key)¶

Note

only handles the case where key is a single column name.

Example

>>> import numpy as np
>>> df_light = DataFrameLight._demodata(num=7)
>>> sub1 = df_light['bar']
>>> # xdoctest: +REQUIRES(module:pandas)
>>> df_heavy = df_light.pandas()
>>> sub2 = df_heavy['bar']
>>> assert np.all(sub1 == sub2)

__setitem__(self, key, value)¶

Note

Only handles the case where key is a single column name and value is an array of all the values to set.

Example

>>> import numpy as np
>>> df_light = DataFrameLight._demodata(num=7)
>>> value = [2] * len(df_light)
>>> df_light['bar'] = value
>>> # xdoctest: +REQUIRES(module:pandas)
>>> df_heavy = df_light.pandas()
>>> df_heavy['bar'] = value
>>> assert np.all(df_light == df_heavy)

compress(self, flags, inplace=False)¶: NOTE: NOT A PART OF THE PANDAS API
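
As an illustration of what a `compress`-style operation does on column-major data, here is a pure-Python sketch. It assumes `flags` is a boolean mask with one entry per row, in the spirit of `itertools.compress`; the helper `compress_rows` is hypothetical and is not the library's implementation.

```python
from itertools import compress


def compress_rows(columns, flags):
    """Hypothetical helper: keep only the rows whose flag is truthy."""
    return {key: list(compress(values, flags)) for key, values in columns.items()}


data = {'foo': [10, 11, 12, 13], 'bar': ['a', 'b', 'c', 'd']}
flags = [True, False, True, False]
kept = compress_rows(data, flags)
print(kept)  # {'foo': [10, 12], 'bar': ['a', 'c']}
```

This sketch always returns a new dictionary; it does not model the `inplace` option.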

take(self, indices, inplace=False)¶

Return the elements in the given positional indices along an axis.

Parameters: inplace (bool) – NOT PART OF PANDAS API

Notes

assumes axis=0

Example

>>> import numpy as np
>>> df_light = DataFrameLight._demodata(num=7)
>>> indices = [0, 2, 3]
>>> sub1 = df_light.take(indices)
>>> # xdoctest: +REQUIRES(module:pandas)
>>> df_heavy = df_light.pandas()
>>> sub2 = df_heavy.take(indices)
>>> assert np.all(sub1 == sub2)
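
For intuition, positional `take` along axis 0 on a dictionary of lists can be sketched in plain Python. The helper `take_rows` is hypothetical and only models the non-inplace case.

```python
def take_rows(columns, indices):
    """Hypothetical helper: select rows by position from every column."""
    return {key: [values[i] for i in indices] for key, values in columns.items()}


data = {'foo': [10, 11, 12, 13], 'bar': ['a', 'b', 'c', 'd']}
sub = take_rows(data, [0, 2, 3])
print(sub)  # {'foo': [10, 12, 13], 'bar': ['a', 'c', 'd']}
```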

copy(self)¶

extend(self, other)¶

Extend self in place using another data frame.

Parameters: other (DataFrameLight | dict[str, Sequence]) – values to concat to end of this object

Note

Not part of the pandas API

Example

>>> self = DataFrameLight(columns=['foo', 'bar'])
>>> other = {'foo': [0], 'bar': [1]}
>>> self.extend(other)
>>> assert len(self) == 1
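
The in-place behaviour can be pictured with a short pure-Python sketch over a dictionary of lists. The helper `extend_columns` is hypothetical; it assumes both inputs share the same column names and does no validation beyond a plain key lookup.

```python
def extend_columns(columns, other):
    """Hypothetical helper: append the values in ``other`` to ``columns`` in place."""
    for key, values in columns.items():
        values.extend(other[key])


data = {'foo': [], 'bar': []}
extend_columns(data, {'foo': [0], 'bar': [1]})
extend_columns(data, {'foo': [5, 6], 'bar': [7, 8]})
print(data)  # {'foo': [0, 5, 6], 'bar': [1, 7, 8]}
```

Appending to Python lists is cheap, which is consistent with the list-backed class being the better fit for repeated extension.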

union(self, *others)¶

Note

Not part of the pandas API
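
The `_demodata` example shows that `union` returns a new object rather than modifying `self`. A minimal sketch, assuming the operation behaves like "copy the first frame, then extend it with the others" (the helper `union_columns` is hypothetical):

```python
def union_columns(first, *others):
    """Hypothetical helper: concatenate column-major dicts into a new dict."""
    # Copy each column so the inputs are left untouched.
    both = {key: list(values) for key, values in first.items()}
    for other in others:
        for key, values in both.items():
            values.extend(other[key])
    return both


a = {'foo': [0, 1], 'bar': [2, 3]}
b = {'foo': [4], 'bar': [5]}
both = union_columns(a, b)
print(both)  # {'foo': [0, 1, 4], 'bar': [2, 3, 5]}
print(a)     # {'foo': [0, 1], 'bar': [2, 3]}
```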

classmethod concat(cls, others)¶

classmethod from_pandas(cls, df)¶

classmethod from_dict(cls, records)¶

reset_index(self, drop=False)¶: no-op for compatibility; the light version doesn't store an index

groupby(self, by=None, *args, **kwargs)¶

Group rows by the value of a column. Unlike pandas, this simply returns a zip object. To ensure compatibility, call list on the result of groupby.

Parameters

by (str) – column name to group by
*args – if specified, the dataframe is coerced to pandas
**kwargs – if specified, the dataframe is coerced to pandas

Example

>>> import numpy as np
>>> df_light = DataFrameLight._demodata(num=7)
>>> res1 = list(df_light.groupby('bar'))
>>> # xdoctest: +REQUIRES(module:pandas)
>>> df_heavy = df_light.pandas()
>>> res2 = list(df_heavy.groupby('bar'))
>>> assert len(res1) == len(res2)
>>> assert all([np.all(a[1] == b[1]) for a, b in zip(res1, res2)])
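
Because the result is a zip object, it can only be consumed once, which is why wrapping it in `list` is recommended. The pure-Python sketch below illustrates this. The function `groupby_sketch` is hypothetical, and sorting the group keys is an assumption made here to resemble the default pandas ordering.

```python
def groupby_sketch(columns, by):
    """Hypothetical helper: return a zip of (group_key, sub_frame) pairs."""
    # Collect the row positions belonging to each distinct value.
    positions = {}
    for index, value in enumerate(columns[by]):
        positions.setdefault(value, []).append(index)
    keys = sorted(positions)  # assumption: groups are ordered by key
    frames = [
        {name: [values[i] for i in positions[key]]
         for name, values in columns.items()}
        for key in keys
    ]
    return zip(keys, frames)


data = {'bar': [1, 0, 1, 0], 'foo': ['a', 'b', 'c', 'd']}
pairs = groupby_sketch(data, 'bar')
groups = list(pairs)
print(groups[0])    # (0, {'bar': [0, 0], 'foo': ['b', 'd']})
print(list(pairs))  # [] because the zip iterator is already exhausted
```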

Ignore:

>>> import numpy as np
>>> import ubelt as ub
>>> self = DataFrameLight._demodata(num=1000)
>>> args = ['cx']
>>> # np.int was removed in NumPy 1.24; use the builtin int instead
>>> self['cx'] = (np.random.rand(len(self)) * 10).astype(int)
>>> # As expected, our custom restricted implementation is faster
>>> # than pandas
>>> ub.Timerit(100).call(lambda: dict(list(self.pandas().groupby('cx')))).print()
>>> ub.Timerit(100).call(lambda: dict(self.groupby('cx'))).print()

rename(self, mapper=None, columns=None, axis=None, inplace=False)¶

Rename the columns (index renaming is not supported)

Example

>>> import numpy as np
>>> df_light = DataFrameLight._demodata(num=7)
>>> mapper = {'foo': 'fi'}
>>> res1 = df_light.rename(columns=mapper)
>>> res3 = df_light.rename(mapper, axis=1)
>>> # xdoctest: +REQUIRES(module:pandas)
>>> df_heavy = df_light.pandas()
>>> res2 = df_heavy.rename(columns=mapper)
>>> res4 = df_heavy.rename(mapper, axis=1)
>>> assert np.all(res1 == res2)
>>> assert np.all(res3 == res2)
>>> assert np.all(res3 == res4)
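
On a dictionary of columns, renaming amounts to rebuilding the dictionary with new keys. A minimal sketch, with the hypothetical helper `rename_columns` modelling only the non-inplace, columns-only case:

```python
def rename_columns(columns, mapper):
    """Hypothetical helper: return a new dict with column names remapped."""
    # Names missing from the mapper are kept; the value lists are shared, not copied.
    return {mapper.get(key, key): values for key, values in columns.items()}


data = {'foo': [0, 0], 'bar': [0, 1]}
renamed = rename_columns(data, {'foo': 'fi'})
print(renamed)  # {'fi': [0, 0], 'bar': [0, 1]}
print(data)     # {'foo': [0, 0], 'bar': [0, 1]}
```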

iterrows(self)¶

Iterate over rows as (index, Dict) pairs.

Yields: Tuple[int, Dict] – the index and a dictionary representing a row

Example

>>> from kwarray.dataframe_light import *  # NOQA
>>> import ubelt as ub
>>> self = DataFrameLight._demodata(num=3)
>>> print(ub.repr2(list(self.iterrows())))
[
    (0, {'bar': 0, 'baz': 2.73, 'foo': 0}),
    (1, {'bar': 1, 'baz': 2.73, 'foo': 0}),
    (2, {'bar': 2, 'baz': 2.73, 'foo': 0}),
]
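
The shape of the yielded pairs can be reproduced with a small generator over a dictionary of lists. This is only a sketch of the idea; `iterrows_sketch` is a hypothetical name, and the data is chosen to mirror the output above.

```python
def iterrows_sketch(columns):
    """Hypothetical helper: yield (index, row_dict) pairs from column-major data."""
    # Every column is assumed to have the same length; an empty dict yields nothing.
    num_rows = len(next(iter(columns.values()), []))
    for index in range(num_rows):
        yield index, {key: values[index] for key, values in columns.items()}


data = {'bar': [0, 1, 2], 'baz': [2.73, 2.73, 2.73], 'foo': [0, 0, 0]}
rows = list(iterrows_sketch(data))
print(rows[1])  # (1, {'bar': 1, 'baz': 2.73, 'foo': 0})
```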

Benchmark:

>>> # xdoc: +REQUIRES(--bench)
>>> from kwarray.dataframe_light import *  # NOQA
>>> import ubelt as ub
>>> df_light = DataFrameLight._demodata(num=1000)
>>> df_heavy = df_light.pandas()
>>> ti = ub.Timerit(21, bestof=3, verbose=2, unit='ms')
>>> ti.reset('light').call(lambda: list(df_light.iterrows()))
>>> ti.reset('heavy').call(lambda: list(df_heavy.iterrows()))
>>> # xdoctest: +IGNORE_WANT
Timed light for: 21 loops, best of 3
    time per loop: best=0.834 ms, mean=0.850 ± 0.0 ms
Timed heavy for: 21 loops, best of 3
    time per loop: best=45.007 ms, mean=45.633 ± 0.5 ms

class kwarray.dataframe_light.DataFrameArray(data=None, columns=None)¶

Bases: DataFrameLight

DataFrameLight assumes the backend is a Dict[list]; DataFrameArray assumes the backend is a Dict[ndarray].

Take and compress are much faster, but extend and union are slower

__normalize__(self)¶: Try to convert input data to Dict[ndarray]

extend(self, other)¶

Extend self in place using another data frame.

Parameters: other (DataFrameLight | dict[str, Sequence]) – values to concat to end of this object

Note

Not part of the pandas API

Example

>>> self = DataFrameLight(columns=['foo', 'bar'])
>>> other = {'foo': [0], 'bar': [1]}
>>> self.extend(other)
>>> assert len(self) == 1

compress(self, flags, inplace=False)¶: NOTE: NOT A PART OF THE PANDAS API

take(self, indices, inplace=False)¶

Return the elements in the given positional indices along an axis.

Parameters: inplace (bool) – NOT PART OF PANDAS API

Notes

assumes axis=0

Example

>>> import numpy as np
>>> df_light = DataFrameLight._demodata(num=7)
>>> indices = [0, 2, 3]
>>> sub1 = df_light.take(indices)
>>> # xdoctest: +REQUIRES(module:pandas)
>>> df_heavy = df_light.pandas()
>>> sub2 = df_heavy.take(indices)
>>> assert np.all(sub1 == sub2)
