:py:mod:`kwarray.dataframe_light`
=================================

.. py:module:: kwarray.dataframe_light

.. autoapi-nested-parse::

   A faster-than-pandas pandas-like interface to column-major data, in the case
   where the data only needs to be accessed by index.

   For data where more complex ids are needed you must use pandas.


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   kwarray.dataframe_light.LocLight
   kwarray.dataframe_light.DataFrameLight
   kwarray.dataframe_light.DataFrameArray


Attributes
~~~~~~~~~~

.. autoapisummary::

   kwarray.dataframe_light.pd
   kwarray.dataframe_light.__version__


.. py:data:: pd

.. py:data:: __version__
   :annotation: = 0.0.1

.. py:class:: LocLight(parent)

   Bases: :py:obj:`object`

   .. py:method:: __getitem__(self, index)


.. py:class:: DataFrameLight(data=None, columns=None)

   Bases: :py:obj:`ubelt.NiceRepr`

   Implements a subset of the pandas.DataFrame API.

   The API is restricted to facilitate speed tradeoffs.

   .. rubric:: Notes

   Assumes underlying data is Dict[list|ndarray]. If the data is known
   to be a Dict[ndarray] use DataFrameArray instead, which has faster
   implementations for some operations.

   .. rubric:: Notes

   pandas.DataFrame is slow. DataFrameLight is faster.
   It is a tad more restrictive though.

   .. rubric:: Example

   >>> self = DataFrameLight({})
   >>> print('self = {!r}'.format(self))
   >>> self = DataFrameLight({'a': [0, 1, 2], 'b': [2, 3, 4]})
   >>> print('self = {!r}'.format(self))
   >>> item = self.iloc[0]
   >>> print('item = {!r}'.format(item))

   Benchmark:
       >>> # BENCHMARK
       >>> # xdoc: +REQUIRES(--bench)
       >>> from kwarray.dataframe_light import *  # NOQA
       >>> import ubelt as ub
       >>> NUM = 1000
       >>> print('NUM = {!r}'.format(NUM))
       >>> # to_dict conversions
       >>> print('==============')
       >>> print('====== to_dict conversions =====')
       >>> _keys = ['list', 'dict', 'series', 'split', 'records', 'index']
       >>> results = []
       >>> df = DataFrameLight._demodata(num=NUM).pandas()
       >>> ti = ub.Timerit(verbose=False, unit='ms')
       >>> for key in _keys:
       >>>     result = ti.reset(key).call(lambda: df.to_dict(orient=key))
       >>>     results.append((result.mean(), result.report()))
       >>> key = 'series+numpy'
       >>> result = ti.reset(key).call(lambda: {k: v.values for k, v in df.to_dict(orient='series').items()})
       >>> results.append((result.mean(), result.report()))
       >>> print('\n'.join([t[1] for t in sorted(results)]))
       >>> print('==============')
       >>> print('====== DFLight Conversions =======')
       >>> ti = ub.Timerit(verbose=True, unit='ms')
       >>> key = 'self.pandas'
       >>> self = DataFrameLight(df)
       >>> ti.reset(key).call(lambda: self.pandas())
       >>> key = 'light-from-pandas'
       >>> ti.reset(key).call(lambda: DataFrameLight(df))
       >>> key = 'light-from-dict'
       >>> ti.reset(key).call(lambda: DataFrameLight(self._data))
       >>> print('==============')
       >>> print('====== BENCHMARK: .LOC[] =======')
       >>> ti = ub.Timerit(num=20, bestof=4, verbose=True, unit='ms')
       >>> df_light = DataFrameLight._demodata(num=NUM)
       >>> # xdoctest: +REQUIRES(module:pandas)
       >>> df_heavy = df_light.pandas()
       >>> series_data = df_heavy.to_dict(orient='series')
       >>> list_data = df_heavy.to_dict(orient='list')
       >>> np_data = {k: v.values for k, v in df_heavy.to_dict(orient='series').items()}
       >>> for timer in ti.reset('DF-heavy.iloc'):
       >>>     with timer:
       >>>         for i in range(NUM):
       >>>             df_heavy.iloc[i]
       >>> for timer in ti.reset('DF-heavy.loc'):
       >>>     with timer:
       >>>         for i in range(NUM):
       >>>             df_heavy.loc[i]
       >>> for timer in ti.reset('dict[SERIES].loc'):
       >>>     with timer:
       >>>         for i in range(NUM):
       >>>             {key: series_data[key].loc[i] for key in series_data.keys()}
       >>> for timer in ti.reset('dict[SERIES].iloc'):
       >>>     with timer:
       >>>         for i in range(NUM):
       >>>             {key: series_data[key].iloc[i] for key in series_data.keys()}
       >>> for timer in ti.reset('dict[SERIES][]'):
       >>>     with timer:
       >>>         for i in range(NUM):
       >>>             {key: series_data[key][i] for key in series_data.keys()}
       >>> for timer in ti.reset('dict[NDARRAY][]'):
       >>>     with timer:
       >>>         for i in range(NUM):
       >>>             {key: np_data[key][i] for key in np_data.keys()}
       >>> for timer in ti.reset('dict[list][]'):
       >>>     with timer:
       >>>         for i in range(NUM):
       >>>             {key: list_data[key][i] for key in list_data.keys()}
       >>> for timer in ti.reset('DF-Light.iloc/loc'):
       >>>     with timer:
       >>>         for i in range(NUM):
       >>>             df_light.iloc[i]
       >>> for timer in ti.reset('DF-Light._getrow'):
       >>>     with timer:
       >>>         for i in range(NUM):
       >>>             df_light._getrow(i)
       NUM = 1000
       ==============
       ====== to_dict conversions =====
       Timed best=0.022 ms, mean=0.022 ± 0.0 ms for series
       Timed best=0.059 ms, mean=0.059 ± 0.0 ms for series+numpy
       Timed best=0.315 ms, mean=0.315 ± 0.0 ms for list
       Timed best=0.895 ms, mean=0.895 ± 0.0 ms for dict
       Timed best=2.705 ms, mean=2.705 ± 0.0 ms for split
       Timed best=5.474 ms, mean=5.474 ± 0.0 ms for records
       Timed best=7.320 ms, mean=7.320 ± 0.0 ms for index
       ==============
       ====== DFLight Conversions =======
       Timed best=1.798 ms, mean=1.798 ± 0.0 ms for self.pandas
       Timed best=0.064 ms, mean=0.064 ± 0.0 ms for light-from-pandas
       Timed best=0.010 ms, mean=0.010 ± 0.0 ms for light-from-dict
       ==============
       ====== BENCHMARK: .LOC[] =======
       Timed best=101.365 ms, mean=101.564 ± 0.2 ms for DF-heavy.iloc
       Timed best=102.038 ms, mean=102.273 ± 0.2 ms for DF-heavy.loc
       Timed best=29.357 ms, mean=29.449 ± 0.1 ms for dict[SERIES].loc
       Timed best=21.701 ms, mean=22.014 ± 0.3 ms for dict[SERIES].iloc
       Timed best=11.469 ms, mean=11.566 ± 0.1 ms for dict[SERIES][]
       Timed best=0.807 ms, mean=0.826 ± 0.0 ms for dict[NDARRAY][]
       Timed best=0.478 ms, mean=0.492 ± 0.0 ms for dict[list][]
       Timed best=0.969 ms, mean=0.994 ± 0.0 ms for DF-Light.iloc/loc
       Timed best=0.760 ms, mean=0.776 ± 0.0 ms for DF-Light._getrow

   .. py:method:: iloc(self)
      :property:


   .. py:method:: values(self)
      :property:


   .. py:method:: loc(self)
      :property:


   .. py:method:: __eq__(self, other)

      .. rubric:: Example

      >>> # xdoctest: +REQUIRES(module:pandas)
      >>> self = DataFrameLight._demodata(num=7)
      >>> other = self.pandas()
      >>> assert np.all(self == other)


   .. py:method:: to_string(self, *args, **kwargs)


   .. py:method:: to_dict(self, orient='dict', into=dict)

      Convert the data frame into a dictionary.

      :Parameters:
          * **orient** (*str*) -- Currently natively supports orient in
            {'dict', 'list'}; otherwise we fall back to pandas conversion
            and call its to_dict method.
          * **into** (*type*) -- type of dictionary to transform into

      :returns: dict

      .. rubric:: Example

      >>> from kwarray.dataframe_light import *  # NOQA
      >>> self = DataFrameLight._demodata(num=7)
      >>> print(self.to_dict(orient='dict'))
      >>> print(self.to_dict(orient='list'))


   .. py:method:: pandas(self)

      Convert back to pandas if you need the full API.

      .. rubric:: Example

      >>> # xdoctest: +REQUIRES(module:pandas)
      >>> df_light = DataFrameLight._demodata(num=7)
      >>> df_heavy = df_light.pandas()
      >>> got = DataFrameLight(df_heavy)
      >>> assert got._data == df_light._data


   .. py:method:: _pandas(self)

      Deprecated, use self.pandas instead.


   .. py:method:: _demodata(cls, num=7)
      :classmethod:

      .. rubric:: Example

      >>> self = DataFrameLight._demodata(num=7)
      >>> print('self = {!r}'.format(self))
      >>> other = DataFrameLight._demodata(num=11)
      >>> print('other = {!r}'.format(other))
      >>> both = self.union(other)
      >>> print('both = {!r}'.format(both))
      >>> assert both is not self
      >>> assert other is not self


   .. py:method:: __nice__(self)


   .. py:method:: __len__(self)


   .. py:method:: __contains__(self, item)


   .. py:method:: __normalize__(self)

      Try to convert input data to Dict[List].


   .. py:method:: columns(self)
      :property:


   .. py:method:: sort_values(self, key, inplace=False, ascending=True)


   .. py:method:: keys(self)


   .. py:method:: _getrow(self, index)


   .. py:method:: _getcol(self, key)


   .. py:method:: _getcols(self, keys)


   .. py:method:: get(self, key, default=None)

      Get item for given key. Returns default value if not found.


   .. py:method:: clear(self)

      Removes all rows in place.


   .. py:method:: __getitem__(self, key)

      .. note:: Only handles the case where key is a single column name.

      .. rubric:: Example

      >>> df_light = DataFrameLight._demodata(num=7)
      >>> sub1 = df_light['bar']
      >>> # xdoctest: +REQUIRES(module:pandas)
      >>> df_heavy = df_light.pandas()
      >>> sub2 = df_heavy['bar']
      >>> assert np.all(sub1 == sub2)


   .. py:method:: __setitem__(self, key, value)

      .. note:: Only handles the case where key is a single column name
         and value is an array of all the values to set.

      .. rubric:: Example

      >>> df_light = DataFrameLight._demodata(num=7)
      >>> value = [2] * len(df_light)
      >>> df_light['bar'] = value
      >>> # xdoctest: +REQUIRES(module:pandas)
      >>> df_heavy = df_light.pandas()
      >>> df_heavy['bar'] = value
      >>> assert np.all(df_light == df_heavy)


   .. py:method:: compress(self, flags, inplace=False)

      NOTE: NOT A PART OF THE PANDAS API


   .. py:method:: take(self, indices, inplace=False)

      Return the elements in the given *positional* indices along an axis.

      :Parameters: **inplace** (*bool*) -- NOT PART OF PANDAS API

      .. rubric:: Notes

      Assumes axis=0.

      .. rubric:: Example

      >>> df_light = DataFrameLight._demodata(num=7)
      >>> indices = [0, 2, 3]
      >>> sub1 = df_light.take(indices)
      >>> # xdoctest: +REQUIRES(module:pandas)
      >>> df_heavy = df_light.pandas()
      >>> sub2 = df_heavy.take(indices)
      >>> assert np.all(sub1 == sub2)


   .. py:method:: copy(self)


   .. py:method:: extend(self, other)

      Extend ``self`` in place using another dataframe array.

      :Parameters: **other** (*DataFrameLight | dict[str, Sequence]*) -- values to concat to end of this object

      .. note:: Not part of the pandas API

      .. rubric:: Example

      >>> self = DataFrameLight(columns=['foo', 'bar'])
      >>> other = {'foo': [0], 'bar': [1]}
      >>> self.extend(other)
      >>> assert len(self) == 1


   .. py:method:: union(self, *others)

      .. note:: Not part of the pandas API


   .. py:method:: concat(cls, others)
      :classmethod:


   .. py:method:: from_pandas(cls, df)
      :classmethod:


   .. py:method:: from_dict(cls, records)
      :classmethod:


   .. py:method:: reset_index(self, drop=False)

      No-op for compatibility; the light version doesn't store an index.


   .. py:method:: groupby(self, by=None, *args, **kwargs)

      Group rows by the value of a column. Unlike pandas, this simply
      returns a zip object. To ensure compatibility, call ``list`` on the
      result of groupby.

      :Parameters:
          * **by** (*str*) -- column name to group by
          * **\*args** -- if specified, the dataframe is coerced to pandas
          * **\*\*kwargs** -- if specified, the dataframe is coerced to pandas

      .. rubric:: Example

      >>> df_light = DataFrameLight._demodata(num=7)
      >>> res1 = list(df_light.groupby('bar'))
      >>> # xdoctest: +REQUIRES(module:pandas)
      >>> df_heavy = df_light.pandas()
      >>> res2 = list(df_heavy.groupby('bar'))
      >>> assert len(res1) == len(res2)
      >>> assert all([np.all(a[1] == b[1]) for a, b in zip(res1, res2)])

      Ignore:
          >>> self = DataFrameLight._demodata(num=1000)
          >>> args = ['cx']
          >>> self['cx'] = (np.random.rand(len(self)) * 10).astype(int)
          >>> # As expected, our custom restricted implementation is faster
          >>> # than pandas
          >>> ub.Timerit(100).call(lambda: dict(list(self.pandas().groupby('cx')))).print()
          >>> ub.Timerit(100).call(lambda: dict(self.groupby('cx'))).print()


   .. py:method:: rename(self, mapper=None, columns=None, axis=None, inplace=False)

      Rename the columns (index renaming is not supported).

      .. rubric:: Example

      >>> df_light = DataFrameLight._demodata(num=7)
      >>> mapper = {'foo': 'fi'}
      >>> res1 = df_light.rename(columns=mapper)
      >>> res3 = df_light.rename(mapper, axis=1)
      >>> # xdoctest: +REQUIRES(module:pandas)
      >>> df_heavy = df_light.pandas()
      >>> res2 = df_heavy.rename(columns=mapper)
      >>> res4 = df_heavy.rename(mapper, axis=1)
      >>> assert np.all(res1 == res2)
      >>> assert np.all(res3 == res2)
      >>> assert np.all(res3 == res4)


   .. py:method:: iterrows(self)

      Iterate over rows as (index, Dict) pairs.

      :Yields: *Tuple[int, Dict]* -- the index and a dictionary representing a row

      .. rubric:: Example

      >>> from kwarray.dataframe_light import *  # NOQA
      >>> self = DataFrameLight._demodata(num=3)
      >>> print(ub.repr2(list(self.iterrows())))
      [
          (0, {'bar': 0, 'baz': 2.73, 'foo': 0}),
          (1, {'bar': 1, 'baz': 2.73, 'foo': 0}),
          (2, {'bar': 2, 'baz': 2.73, 'foo': 0}),
      ]

      Benchmark:
          >>> # xdoc: +REQUIRES(--bench)
          >>> from kwarray.dataframe_light import *  # NOQA
          >>> import ubelt as ub
          >>> df_light = DataFrameLight._demodata(num=1000)
          >>> df_heavy = df_light.pandas()
          >>> ti = ub.Timerit(21, bestof=3, verbose=2, unit='ms')
          >>> ti.reset('light').call(lambda: list(df_light.iterrows()))
          >>> ti.reset('heavy').call(lambda: list(df_heavy.iterrows()))
          >>> # xdoctest: +IGNORE_WANT
          Timed light for: 21 loops, best of 3
              time per loop: best=0.834 ms, mean=0.850 ± 0.0 ms
          Timed heavy for: 21 loops, best of 3
              time per loop: best=45.007 ms, mean=45.633 ± 0.5 ms


.. py:class:: DataFrameArray(data=None, columns=None)

   Bases: :py:obj:`DataFrameLight`

   DataFrameLight assumes the backend is a Dict[list].
   DataFrameArray assumes the backend is a Dict[ndarray].

   Take and compress are much faster, but extend and union are slower.

   .. py:method:: __normalize__(self)

      Try to convert input data to Dict[ndarray].


   .. py:method:: extend(self, other)

      Extend ``self`` in place using another dataframe array.

      :Parameters: **other** (*DataFrameLight | dict[str, Sequence]*) -- values to concat to end of this object

      .. note:: Not part of the pandas API

      .. rubric:: Example

      >>> self = DataFrameLight(columns=['foo', 'bar'])
      >>> other = {'foo': [0], 'bar': [1]}
      >>> self.extend(other)
      >>> assert len(self) == 1


   .. py:method:: compress(self, flags, inplace=False)

      NOTE: NOT A PART OF THE PANDAS API


   .. py:method:: take(self, indices, inplace=False)

      Return the elements in the given *positional* indices along an axis.

      :Parameters: **inplace** (*bool*) -- NOT PART OF PANDAS API

      .. rubric:: Notes

      Assumes axis=0.

      .. rubric:: Example

      >>> df_light = DataFrameLight._demodata(num=7)
      >>> indices = [0, 2, 3]
      >>> sub1 = df_light.take(indices)
      >>> # xdoctest: +REQUIRES(module:pandas)
      >>> df_heavy = df_light.pandas()
      >>> sub2 = df_heavy.take(indices)
      >>> assert np.all(sub1 == sub2)
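
The positional, column-major access pattern documented in this module (a Dict[list] backend with row lookup, ``take``, ``compress`` and ``groupby``) can be illustrated with a short pure-Python sketch. This is not kwarray's implementation: the functions ``getrow``, ``take``, ``compress`` and ``groupby`` below are hypothetical stand-ins that only borrow the documented method names, and the sample ``data`` is made up for illustration.

```python
from itertools import compress as _compress

# Column-major data: one list per column, rows addressed by position only.
# (Illustrative values; not the output of DataFrameLight._demodata.)
data = {'foo': [10, 11, 12, 13], 'bar': [0, 1, 0, 1]}


def getrow(data, index):
    """Hypothetical helper: build one row as a dict by indexing every column."""
    return {key: col[index] for key, col in data.items()}


def take(data, indices):
    """Hypothetical helper: keep the rows at the given positional indices."""
    return {key: [col[i] for i in indices] for key, col in data.items()}


def compress(data, flags):
    """Hypothetical helper: keep the rows whose flag is truthy."""
    return {key: list(_compress(col, flags)) for key, col in data.items()}


def groupby(data, by):
    """Hypothetical helper: map each value of column ``by`` to its sub-table."""
    groups = {}
    for index, value in enumerate(data[by]):
        groups.setdefault(value, []).append(index)
    return {value: take(data, indices) for value, indices in groups.items()}


row = getrow(data, 1)
sub = take(data, [0, 2])
kept = compress(data, [True, False, False, True])
grouped = groupby(data, 'bar')

print(row)
print(sub)
print(kept)
print(grouped)
```

With a Dict[ndarray] backend, as DataFrameArray assumes, the per-column list comprehension in ``take`` could presumably be replaced by a single array-indexing operation per column, which is consistent with the statement above that take and compress are faster there.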