kwarray.dataframe_light
A faster-than-pandas pandas-like interface to column-major data, in the case where the data only needs to be accessed by index.
For data where more complex ids are needed you must use pandas.
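As a rough illustration of what "column-major data accessed by index" means, the sketch below stores one list per column and gathers a row by position. It uses plain dictionaries only; `get_row` is a hypothetical helper written for this illustration and is not part of kwarray.

```python
# Column-major storage: one list per column, all of the same length.
data = {
    'foo': [0, 0, 0],
    'bar': [0, 1, 2],
    'baz': [2.73, 2.73, 2.73],
}


def get_row(data, index):
    """Gather the value at position ``index`` from every column."""
    return {key: column[index] for key, column in data.items()}


row = get_row(data, 1)
print(row)
```

Because rows are addressed purely by integer position, no index object has to be built or consulted, which is the kind of shortcut the module description alludes to.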
Module Contents¶
Classes¶
DataFrameLight | Implements a subset of the pandas.DataFrame API
DataFrameArray | DataFrameLight assumes the backend is a Dict[list]
Attributes¶
- kwarray.dataframe_light.pd¶
- kwarray.dataframe_light.__version__ = 0.0.1¶
- class kwarray.dataframe_light.DataFrameLight(data=None, columns=None)¶
Bases: ubelt.NiceRepr
Implements a subset of the pandas.DataFrame API
The API is deliberately restricted: it trades pandas features for speed.
Notes
Assumes underlying data is Dict[list|ndarray]. If the data is known to be a Dict[ndarray] use DataFrameArray instead, which has faster implementations for some operations.
Notes
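pandas.DataFrame is slow for row-by-row access. DataFrameLight is faster (about 100x for positional row lookups in the benchmark below: roughly 1 ms versus 101 ms for 1000 rows), but its API is more restrictive.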
Example
>>> self = DataFrameLight({})
>>> print('self = {!r}'.format(self))
>>> self = DataFrameLight({'a': [0, 1, 2], 'b': [2, 3, 4]})
>>> print('self = {!r}'.format(self))
>>> item = self.iloc[0]
>>> print('item = {!r}'.format(item))
- Benchmark:
>>> # BENCHMARK
>>> # xdoc: +REQUIRES(--bench)
>>> from kwarray.dataframe_light import *  # NOQA
>>> import ubelt as ub
>>> NUM = 1000
>>> print('NUM = {!r}'.format(NUM))
>>> # to_dict conversions
>>> print('==============')
>>> print('====== to_dict conversions =====')
>>> _keys = ['list', 'dict', 'series', 'split', 'records', 'index']
>>> results = []
>>> df = DataFrameLight._demodata(num=NUM).pandas()
>>> ti = ub.Timerit(verbose=False, unit='ms')
>>> for key in _keys:
>>>     result = ti.reset(key).call(lambda: df.to_dict(orient=key))
>>>     results.append((result.mean(), result.report()))
>>> key = 'series+numpy'
>>> result = ti.reset(key).call(lambda: {k: v.values for k, v in df.to_dict(orient='series').items()})
>>> results.append((result.mean(), result.report()))
>>> print('\n'.join([t[1] for t in sorted(results)]))
>>> print('==============')
>>> print('====== DFLight Conversions =======')
>>> ti = ub.Timerit(verbose=True, unit='ms')
>>> key = 'self.pandas'
>>> self = DataFrameLight(df)
>>> ti.reset(key).call(lambda: self.pandas())
>>> key = 'light-from-pandas'
>>> ti.reset(key).call(lambda: DataFrameLight(df))
>>> key = 'light-from-dict'
>>> ti.reset(key).call(lambda: DataFrameLight(self._data))
>>> print('==============')
>>> print('====== BENCHMARK: .LOC[] =======')
>>> ti = ub.Timerit(num=20, bestof=4, verbose=True, unit='ms')
>>> df_light = DataFrameLight._demodata(num=NUM)
>>> # xdoctest: +REQUIRES(module:pandas)
>>> df_heavy = df_light.pandas()
>>> series_data = df_heavy.to_dict(orient='series')
>>> list_data = df_heavy.to_dict(orient='list')
>>> np_data = {k: v.values for k, v in df_heavy.to_dict(orient='series').items()}
>>> for timer in ti.reset('DF-heavy.iloc'):
>>>     with timer:
>>>         for i in range(NUM):
>>>             df_heavy.iloc[i]
>>> for timer in ti.reset('DF-heavy.loc'):
>>>     with timer:
>>>         for i in range(NUM):
>>>             df_heavy.loc[i]
>>> for timer in ti.reset('dict[SERIES].loc'):
>>>     with timer:
>>>         for i in range(NUM):
>>>             {key: series_data[key].loc[i] for key in series_data.keys()}
>>> for timer in ti.reset('dict[SERIES].iloc'):
>>>     with timer:
>>>         for i in range(NUM):
>>>             {key: series_data[key].iloc[i] for key in series_data.keys()}
>>> for timer in ti.reset('dict[SERIES][]'):
>>>     with timer:
>>>         for i in range(NUM):
>>>             {key: series_data[key][i] for key in series_data.keys()}
>>> for timer in ti.reset('dict[NDARRAY][]'):
>>>     with timer:
>>>         for i in range(NUM):
>>>             {key: np_data[key][i] for key in np_data.keys()}
>>> for timer in ti.reset('dict[list][]'):
>>>     with timer:
>>>         for i in range(NUM):
>>>             {key: list_data[key][i] for key in list_data.keys()}
>>> for timer in ti.reset('DF-Light.iloc/loc'):
>>>     with timer:
>>>         for i in range(NUM):
>>>             df_light.iloc[i]
>>> for timer in ti.reset('DF-Light._getrow'):
>>>     with timer:
>>>         for i in range(NUM):
>>>             df_light._getrow(i)
NUM = 1000
==============
====== to_dict conversions =====
Timed best=0.022 ms, mean=0.022 ± 0.0 ms for series
Timed best=0.059 ms, mean=0.059 ± 0.0 ms for series+numpy
Timed best=0.315 ms, mean=0.315 ± 0.0 ms for list
Timed best=0.895 ms, mean=0.895 ± 0.0 ms for dict
Timed best=2.705 ms, mean=2.705 ± 0.0 ms for split
Timed best=5.474 ms, mean=5.474 ± 0.0 ms for records
Timed best=7.320 ms, mean=7.320 ± 0.0 ms for index
==============
====== DFLight Conversions =======
Timed best=1.798 ms, mean=1.798 ± 0.0 ms for self.pandas
Timed best=0.064 ms, mean=0.064 ± 0.0 ms for light-from-pandas
Timed best=0.010 ms, mean=0.010 ± 0.0 ms for light-from-dict
==============
====== BENCHMARK: .LOC[] =======
Timed best=101.365 ms, mean=101.564 ± 0.2 ms for DF-heavy.iloc
Timed best=102.038 ms, mean=102.273 ± 0.2 ms for DF-heavy.loc
Timed best=29.357 ms, mean=29.449 ± 0.1 ms for dict[SERIES].loc
Timed best=21.701 ms, mean=22.014 ± 0.3 ms for dict[SERIES].iloc
Timed best=11.469 ms, mean=11.566 ± 0.1 ms for dict[SERIES][]
Timed best=0.807 ms, mean=0.826 ± 0.0 ms for dict[NDARRAY][]
Timed best=0.478 ms, mean=0.492 ± 0.0 ms for dict[list][]
Timed best=0.969 ms, mean=0.994 ± 0.0 ms for DF-Light.iloc/loc
Timed best=0.760 ms, mean=0.776 ± 0.0 ms for DF-Light._getrow
- property iloc(self)¶
- property values(self)¶
- property loc(self)¶
- __eq__(self, other)¶
Example
>>> # xdoctest: +REQUIRES(module:pandas)
>>> import numpy as np
>>> self = DataFrameLight._demodata(num=7)
>>> other = self.pandas()
>>> assert np.all(self == other)
- to_string(self, *args, **kwargs)¶
- to_dict(self, orient='dict', into=dict)¶
Convert the data frame into a dictionary.
- Parameters
orient (str) – Currently natively supports orient in {‘dict’, ‘list’}; otherwise we fall back to the pandas conversion and call its to_dict method.
into (type) – type of dictionary to transform into
- Returns
dict
Example
>>> from kwarray.dataframe_light import *  # NOQA
>>> self = DataFrameLight._demodata(num=7)
>>> print(self.to_dict(orient='dict'))
>>> print(self.to_dict(orient='list'))
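To make the two natively supported orientations concrete, here is a minimal sketch of the shapes they produce, following the pandas convention for `to_dict` with a default integer index. It works on a plain dict of lists; treating this as exactly what DataFrameLight returns is an assumption.

```python
# A small column-major table held as a plain dict of lists.
columns = {'foo': [0, 0], 'bar': [0, 1]}

# orient='list': each column maps to a list of its values.
as_list = {key: list(col) for key, col in columns.items()}

# orient='dict': each column maps to an {index: value} dictionary.
as_dict = {key: dict(enumerate(col)) for key, col in columns.items()}

print(as_list)
print(as_dict)
```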
- pandas(self)¶
Convert back to pandas if you need the full API
Example
>>> # xdoctest: +REQUIRES(module:pandas)
>>> df_light = DataFrameLight._demodata(num=7)
>>> df_heavy = df_light.pandas()
>>> got = DataFrameLight(df_heavy)
>>> assert got._data == df_light._data
- _pandas(self)¶
Deprecated, use self.pandas instead
- classmethod _demodata(cls, num=7)¶
Example
>>> self = DataFrameLight._demodata(num=7)
>>> print('self = {!r}'.format(self))
>>> other = DataFrameLight._demodata(num=11)
>>> print('other = {!r}'.format(other))
>>> both = self.union(other)
>>> print('both = {!r}'.format(both))
>>> assert both is not self
>>> assert other is not self
- __nice__(self)¶
- __len__(self)¶
- __contains__(self, item)¶
- __normalize__(self)¶
Try to convert input data to Dict[List]
- property columns(self)¶
- sort_values(self, key, inplace=False, ascending=True)¶
- keys(self)¶
- _getrow(self, index)¶
- _getcol(self, key)¶
- _getcols(self, keys)¶
- get(self, key, default=None)¶
Get item for given key. Returns default value if not found.
- clear(self)¶
Removes all rows inplace
- __getitem__(self, key)¶
Note
Only handles the case where key is a single column name.
Example
>>> import numpy as np
>>> df_light = DataFrameLight._demodata(num=7)
>>> sub1 = df_light['bar']
>>> # xdoctest: +REQUIRES(module:pandas)
>>> df_heavy = df_light.pandas()
>>> sub2 = df_heavy['bar']
>>> assert np.all(sub1 == sub2)
- __setitem__(self, key, value)¶
Note
Only handles the case where key is a single column name and value is an array of all the values to set.
Example
>>> import numpy as np
>>> df_light = DataFrameLight._demodata(num=7)
>>> value = [2] * len(df_light)
>>> df_light['bar'] = value
>>> # xdoctest: +REQUIRES(module:pandas)
>>> df_heavy = df_light.pandas()
>>> df_heavy['bar'] = value
>>> assert np.all(df_light == df_heavy)
- compress(self, flags, inplace=False)¶
NOTE: NOT A PART OF THE PANDAS API
- take(self, indices, inplace=False)¶
Return the elements in the given positional indices along an axis.
- Parameters
inplace (bool) – NOT PART OF PANDAS API
Notes
assumes axis=0
Example
>>> import numpy as np
>>> df_light = DataFrameLight._demodata(num=7)
>>> indices = [0, 2, 3]
>>> sub1 = df_light.take(indices)
>>> # xdoctest: +REQUIRES(module:pandas)
>>> df_heavy = df_light.pandas()
>>> sub2 = df_heavy.take(indices)
>>> assert np.all(sub1 == sub2)
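The row-selection idea behind take (positional indices) and compress (boolean flags) can be sketched on a plain dict of lists as follows. The helpers `take_rows` and `compress_rows` are hypothetical names for this illustration, and reading compress as a boolean-mask filter is an assumption based on the NumPy function of the same name.

```python
from itertools import compress


def take_rows(data, indices):
    """Select rows by position, in the order given by ``indices``."""
    return {key: [col[i] for i in indices] for key, col in data.items()}


def compress_rows(data, flags):
    """Keep only the rows whose corresponding flag is truthy."""
    return {key: list(compress(col, flags)) for key, col in data.items()}


data = {'foo': [0, 0, 0, 0], 'bar': [0, 1, 2, 3]}
taken = take_rows(data, [0, 2, 3])
kept = compress_rows(data, [True, False, True, True])
print(taken)
print(kept)
```

Both helpers return new dictionaries and leave the input untouched, which corresponds to the inplace=False default in the signatures above.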
- copy(self)¶
- extend(self, other)¶
Extend self in place using another dataframe array.
- Parameters
other (DataFrameLight | dict[str, Sequence]) – values to concat to end of this object
Note
Not part of the pandas API
Example
>>> self = DataFrameLight(columns=['foo', 'bar'])
>>> other = {'foo': [0], 'bar': [1]}
>>> self.extend(other)
>>> assert len(self) == 1
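For a list-backed table, extending in place amounts to appending the new values to each existing column. The sketch below shows this on plain dictionaries; `extend_columns` is a hypothetical helper, and it assumes `other` provides exactly the same columns as `data`.

```python
def extend_columns(data, other):
    """Append the values in ``other`` to the matching columns of ``data``."""
    for key, values in other.items():
        # list.extend mutates the existing column, so no new table is built.
        data[key].extend(values)


data = {'foo': [], 'bar': []}
extend_columns(data, {'foo': [0], 'bar': [1]})
extend_columns(data, {'foo': [5, 6], 'bar': [7, 8]})
print(data)
```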
- union(self, *others)¶
Note
Not part of the pandas API
- classmethod concat(cls, others)¶
- classmethod from_pandas(cls, df)¶
- classmethod from_dict(cls, records)¶
- reset_index(self, drop=False)¶
No-op for compatibility; the light version doesn't store an index.
- groupby(self, by=None, *args, **kwargs)¶
Group rows by the value of a column. Unlike pandas, this simply returns a zip object. To ensure compatibility, call list on the result of groupby.
- Parameters
by (str) – column name to group by
*args – if specified, the dataframe is coerced to pandas
**kwargs – if specified, the dataframe is coerced to pandas
Example
>>> import numpy as np
>>> df_light = DataFrameLight._demodata(num=7)
>>> res1 = list(df_light.groupby('bar'))
>>> # xdoctest: +REQUIRES(module:pandas)
>>> df_heavy = df_light.pandas()
>>> res2 = list(df_heavy.groupby('bar'))
>>> assert len(res1) == len(res2)
>>> assert all([np.all(a[1] == b[1]) for a, b in zip(res1, res2)])
- Ignore:
>>> self = DataFrameLight._demodata(num=1000)
>>> args = ['cx']
>>> self['cx'] = (np.random.rand(len(self)) * 10).astype(int)
>>> # As expected, our custom restricted implementation is faster
>>> # than pandas
>>> ub.Timerit(100).call(lambda: dict(list(self.pandas().groupby('cx')))).print()
>>> ub.Timerit(100).call(lambda: dict(self.groupby('cx'))).print()
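Grouping rows by a column value yields (key, group) pairs, which can be sketched on a plain dict of lists as below. `group_rows` is a hypothetical helper and not the library's implementation; it emits groups in sorted key order to mirror the pandas default, which is an assumption about how the two are meant to line up.

```python
def group_rows(data, by):
    """Yield (value, sub-table) pairs, one per distinct value in column ``by``."""
    positions = {}
    for index, value in enumerate(data[by]):
        positions.setdefault(value, []).append(index)
    for value in sorted(positions):
        indices = positions[value]
        yield value, {key: [col[i] for i in indices] for key, col in data.items()}


data = {'bar': [1, 0, 1, 0], 'baz': [10, 11, 12, 13]}
groups = list(group_rows(data, 'bar'))
print(groups)
```

As with the zip object mentioned above, the generator is consumed once, so wrap it in list (or dict) if the groups are needed more than one time.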
- rename(self, mapper=None, columns=None, axis=None, inplace=False)¶
Rename the columns (index renaming is not supported)
Example
>>> import numpy as np
>>> df_light = DataFrameLight._demodata(num=7)
>>> mapper = {'foo': 'fi'}
>>> res1 = df_light.rename(columns=mapper)
>>> res3 = df_light.rename(mapper, axis=1)
>>> # xdoctest: +REQUIRES(module:pandas)
>>> df_heavy = df_light.pandas()
>>> res2 = df_heavy.rename(columns=mapper)
>>> res4 = df_heavy.rename(mapper, axis=1)
>>> assert np.all(res1 == res2)
>>> assert np.all(res3 == res2)
>>> assert np.all(res3 == res4)
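With column-major storage, renaming columns only touches the dictionary keys; the column data itself is reused. A minimal sketch with a hypothetical `rename_columns` helper, not the library's code:

```python
def rename_columns(data, mapper):
    """Return a new table whose keys are renamed according to ``mapper``."""
    # Columns missing from ``mapper`` keep their original name.
    return {mapper.get(key, key): col for key, col in data.items()}


data = {'foo': [0, 0], 'bar': [0, 1]}
renamed = rename_columns(data, {'foo': 'fi'})
print(renamed)
```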
- iterrows(self)¶
Iterate over rows as (index, Dict) pairs.
- Yields
Tuple[int, Dict] – the index and a dictionary representing a row
Example
>>> from kwarray.dataframe_light import *  # NOQA
>>> self = DataFrameLight._demodata(num=3)
>>> print(ub.repr2(list(self.iterrows())))
[
    (0, {'bar': 0, 'baz': 2.73, 'foo': 0}),
    (1, {'bar': 1, 'baz': 2.73, 'foo': 0}),
    (2, {'bar': 2, 'baz': 2.73, 'foo': 0}),
]
- Benchmark:
>>> # xdoc: +REQUIRES(--bench)
>>> from kwarray.dataframe_light import *  # NOQA
>>> import ubelt as ub
>>> df_light = DataFrameLight._demodata(num=1000)
>>> df_heavy = df_light.pandas()
>>> ti = ub.Timerit(21, bestof=3, verbose=2, unit='ms')
>>> ti.reset('light').call(lambda: list(df_light.iterrows()))
>>> ti.reset('heavy').call(lambda: list(df_heavy.iterrows()))
>>> # xdoctest: +IGNORE_WANT
Timed light for: 21 loops, best of 3
    time per loop: best=0.834 ms, mean=0.850 ± 0.0 ms
Timed heavy for: 21 loops, best of 3
    time per loop: best=45.007 ms, mean=45.633 ± 0.5 ms
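One plausible reason the light version iterates quickly is that each row is just a small dict built from positional lookups, with no Series object constructed per row. The generator below sketches that idea on a plain dict of lists; `iter_rows` is a hypothetical stand-in, not the actual implementation.

```python
def iter_rows(data):
    """Yield (index, row_dict) pairs from a dict of equal-length lists."""
    num_rows = len(next(iter(data.values()), []))
    for index in range(num_rows):
        yield index, {key: col[index] for key, col in data.items()}


data = {'foo': [0, 0, 0], 'bar': [0, 1, 2], 'baz': [2.73, 2.73, 2.73]}
rows = list(iter_rows(data))
print(rows)
```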
- class kwarray.dataframe_light.DataFrameArray(data=None, columns=None)¶
Bases: DataFrameLight
DataFrameLight assumes the backend is a Dict[list]. DataFrameArray assumes the backend is a Dict[ndarray].
Take and compress are much faster, but extend and union are slower.
- __normalize__(self)¶
Try to convert input data to Dict[ndarray]
- extend(self, other)¶
Extend self in place using another dataframe array.
- Parameters
other (DataFrameLight | dict[str, Sequence]) – values to concat to end of this object
Note
Not part of the pandas API
Example
>>> self = DataFrameLight(columns=['foo', 'bar'])
>>> other = {'foo': [0], 'bar': [1]}
>>> self.extend(other)
>>> assert len(self) == 1
- compress(self, flags, inplace=False)¶
NOTE: NOT A PART OF THE PANDAS API
- take(self, indices, inplace=False)¶
Return the elements in the given positional indices along an axis.
- Parameters
inplace (bool) – NOT PART OF PANDAS API
Notes
assumes axis=0
Example
>>> import numpy as np
>>> df_light = DataFrameLight._demodata(num=7)
>>> indices = [0, 2, 3]
>>> sub1 = df_light.take(indices)
>>> # xdoctest: +REQUIRES(module:pandas)
>>> df_heavy = df_light.pandas()
>>> sub2 = df_heavy.take(indices)
>>> assert np.all(sub1 == sub2)