:py:mod:`kwarray.dataframe_light`
=================================

.. py:module:: kwarray.dataframe_light

.. autoapi-nested-parse::

   A faster-than-pandas pandas-like interface to column-major data, in the case
   where the data only needs to be accessed by index.

   For data where more complex ids are needed you must use pandas.
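
As a rough illustration of what "column-major data accessed by index" means, the sketch below stores each column as a plain Python list and builds a row by positional lookup. It uses only the standard library; ``get_row`` and the column names are hypothetical and are not part of ``kwarray``.

```python
# Column-major layout: one list per column, rows are addressed by position.
# The column names and values here are made up for illustration.
columns = {
    'foo': [0, 1, 2],
    'bar': [10, 11, 12],
}


def get_row(data, index):
    """Gather the ``index``-th entry of every column into one row dict."""
    return {key: values[index] for key, values in data.items()}


row = get_row(columns, 1)
print(row)
```

Because a row lookup is just one list index per column, no label-based index has to be maintained, which is the restriction this module accepts in exchange for speed.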


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   kwarray.dataframe_light.LocLight
   kwarray.dataframe_light.DataFrameLight
   kwarray.dataframe_light.DataFrameArray


Attributes
~~~~~~~~~~

.. autoapisummary::

   kwarray.dataframe_light.pd
   kwarray.dataframe_light.__version__


.. py:data:: pd
   

.. py:data:: __version__
   :annotation: = 0.0.1

   
.. py:class:: LocLight(parent)

   Bases: :py:obj:`object`

   .. py:method:: __getitem__(self, index)


.. py:class:: DataFrameLight(data=None, columns=None)

   Bases: :py:obj:`ubelt.NiceRepr`

   Implements a subset of the pandas.DataFrame API.

   The API is deliberately restricted so that the supported operations can be fast.

   .. rubric:: Notes

   Assumes underlying data is Dict[list|ndarray]. If the data is known
   to be a Dict[ndarray] use DataFrameArray instead, which has faster
   implementations for some operations.

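   Row access on a pandas.DataFrame is slow. In the benchmark below,
   DataFrameLight.iloc takes about 1 ms for 1000 row lookups, whereas
   pandas.DataFrame.iloc takes about 100 ms. The tradeoff is a more
   restrictive API.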

   .. rubric:: Example

   >>> self = DataFrameLight({})
   >>> print('self = {!r}'.format(self))
   >>> self = DataFrameLight({'a': [0, 1, 2], 'b': [2, 3, 4]})
   >>> print('self = {!r}'.format(self))
   >>> item = self.iloc[0]
   >>> print('item = {!r}'.format(item))

   Benchmark:
       >>> # BENCHMARK
       >>> # xdoc: +REQUIRES(--bench)
       >>> from kwarray.dataframe_light import *  # NOQA
       >>> import ubelt as ub
       >>> NUM = 1000
       >>> print('NUM = {!r}'.format(NUM))
       >>> # to_dict conversions
       >>> print('==============')
       >>> print('====== to_dict conversions =====')
       >>> _keys = ['list', 'dict', 'series', 'split', 'records', 'index']
       >>> results = []
       >>> df = DataFrameLight._demodata(num=NUM).pandas()
       >>> ti = ub.Timerit(verbose=False, unit='ms')
       >>> for key in _keys:
       >>>     result = ti.reset(key).call(lambda: df.to_dict(orient=key))
       >>>     results.append((result.mean(), result.report()))
       >>> key = 'series+numpy'
       >>> result = ti.reset(key).call(lambda: {k: v.values for k, v in df.to_dict(orient='series').items()})
       >>> results.append((result.mean(), result.report()))
       >>> print('\n'.join([t[1] for t in sorted(results)]))
       >>> print('==============')
       >>> print('====== DFLight Conversions =======')
       >>> ti = ub.Timerit(verbose=True, unit='ms')
       >>> key = 'self.pandas'
       >>> self = DataFrameLight(df)
       >>> ti.reset(key).call(lambda: self.pandas())
       >>> key = 'light-from-pandas'
       >>> ti.reset(key).call(lambda: DataFrameLight(df))
       >>> key = 'light-from-dict'
       >>> ti.reset(key).call(lambda: DataFrameLight(self._data))
       >>> print('==============')
       >>> print('====== BENCHMARK: .LOC[] =======')
       >>> ti = ub.Timerit(num=20, bestof=4, verbose=True, unit='ms')
       >>> df_light = DataFrameLight._demodata(num=NUM)
       >>> # xdoctest: +REQUIRES(module:pandas)
       >>> df_heavy = df_light.pandas()
       >>> series_data = df_heavy.to_dict(orient='series')
       >>> list_data = df_heavy.to_dict(orient='list')
       >>> np_data = {k: v.values for k, v in df_heavy.to_dict(orient='series').items()}
       >>> for timer in ti.reset('DF-heavy.iloc'):
       >>>     with timer:
       >>>         for i in range(NUM):
       >>>             df_heavy.iloc[i]
       >>> for timer in ti.reset('DF-heavy.loc'):
       >>>     with timer:
       >>>         for i in range(NUM):
       >>>             df_heavy.loc[i]
       >>> for timer in ti.reset('dict[SERIES].loc'):
       >>>     with timer:
       >>>         for i in range(NUM):
       >>>             {key: series_data[key].loc[i] for key in series_data.keys()}
       >>> for timer in ti.reset('dict[SERIES].iloc'):
       >>>     with timer:
       >>>         for i in range(NUM):
       >>>             {key: series_data[key].iloc[i] for key in series_data.keys()}
       >>> for timer in ti.reset('dict[SERIES][]'):
       >>>     with timer:
       >>>         for i in range(NUM):
       >>>             {key: series_data[key][i] for key in series_data.keys()}
       >>> for timer in ti.reset('dict[NDARRAY][]'):
       >>>     with timer:
       >>>         for i in range(NUM):
       >>>             {key: np_data[key][i] for key in np_data.keys()}
       >>> for timer in ti.reset('dict[list][]'):
       >>>     with timer:
       >>>         for i in range(NUM):
       >>>             {key: list_data[key][i] for key in list_data.keys()}
       >>> for timer in ti.reset('DF-Light.iloc/loc'):
       >>>     with timer:
       >>>         for i in range(NUM):
       >>>             df_light.iloc[i]
       >>> for timer in ti.reset('DF-Light._getrow'):
       >>>     with timer:
       >>>         for i in range(NUM):
       >>>             df_light._getrow(i)
       NUM = 1000
       ==============
       ====== to_dict conversions =====
       Timed best=0.022 ms, mean=0.022 ± 0.0 ms for series
       Timed best=0.059 ms, mean=0.059 ± 0.0 ms for series+numpy
       Timed best=0.315 ms, mean=0.315 ± 0.0 ms for list
       Timed best=0.895 ms, mean=0.895 ± 0.0 ms for dict
       Timed best=2.705 ms, mean=2.705 ± 0.0 ms for split
       Timed best=5.474 ms, mean=5.474 ± 0.0 ms for records
       Timed best=7.320 ms, mean=7.320 ± 0.0 ms for index
       ==============
       ====== DFLight Conversions =======
       Timed best=1.798 ms, mean=1.798 ± 0.0 ms for self.pandas
       Timed best=0.064 ms, mean=0.064 ± 0.0 ms for light-from-pandas
       Timed best=0.010 ms, mean=0.010 ± 0.0 ms for light-from-dict
       ==============
       ====== BENCHMARK: .LOC[] =======
       Timed best=101.365 ms, mean=101.564 ± 0.2 ms for DF-heavy.iloc
       Timed best=102.038 ms, mean=102.273 ± 0.2 ms for DF-heavy.loc
       Timed best=29.357 ms, mean=29.449 ± 0.1 ms for dict[SERIES].loc
       Timed best=21.701 ms, mean=22.014 ± 0.3 ms for dict[SERIES].iloc
       Timed best=11.469 ms, mean=11.566 ± 0.1 ms for dict[SERIES][]
       Timed best=0.807 ms, mean=0.826 ± 0.0 ms for dict[NDARRAY][]
       Timed best=0.478 ms, mean=0.492 ± 0.0 ms for dict[list][]
       Timed best=0.969 ms, mean=0.994 ± 0.0 ms for DF-Light.iloc/loc
       Timed best=0.760 ms, mean=0.776 ± 0.0 ms for DF-Light._getrow


   .. py:method:: iloc(self)
      :property:


   .. py:method:: values(self)
      :property:


   .. py:method:: loc(self)
      :property:


   .. py:method:: __eq__(self, other)

      .. rubric:: Example

      >>> # xdoctest: +REQUIRES(module:pandas)
      >>> self = DataFrameLight._demodata(num=7)
      >>> other = self.pandas()
      >>> assert np.all(self == other)


   .. py:method:: to_string(self, *args, **kwargs)


   .. py:method:: to_dict(self, orient='dict', into=dict)

      Convert the data frame into a dictionary.

      :Parameters: * **orient** (*str*) -- Currently natively supports orient in
                     {'dict', 'list'}; otherwise we fall back to pandas conversion
                     and call its to_dict method.
                   * **into** (*type*) -- type of dictionary to transform into

      :returns: dict

      .. rubric:: Example

      >>> from kwarray.dataframe_light import *  # NOQA
      >>> self = DataFrameLight._demodata(num=7)
      >>> print(self.to_dict(orient='dict'))
      >>> print(self.to_dict(orient='list'))


   .. py:method:: pandas(self)

      Convert back to pandas if you need the full API

      .. rubric:: Example

      >>> # xdoctest: +REQUIRES(module:pandas)
      >>> df_light = DataFrameLight._demodata(num=7)
      >>> df_heavy = df_light.pandas()
      >>> got = DataFrameLight(df_heavy)
      >>> assert got._data == df_light._data


   .. py:method:: _pandas(self)

      Deprecated, use self.pandas instead


   .. py:method:: _demodata(cls, num=7)
      :classmethod:

      .. rubric:: Example

      >>> self = DataFrameLight._demodata(num=7)
      >>> print('self = {!r}'.format(self))
      >>> other = DataFrameLight._demodata(num=11)
      >>> print('other = {!r}'.format(other))
      >>> both = self.union(other)
      >>> print('both = {!r}'.format(both))
      >>> assert both is not self
      >>> assert other is not self


   .. py:method:: __nice__(self)


   .. py:method:: __len__(self)


   .. py:method:: __contains__(self, item)


   .. py:method:: __normalize__(self)

      Try to convert input data to Dict[List]


   .. py:method:: columns(self)
      :property:


   .. py:method:: sort_values(self, key, inplace=False, ascending=True)


   .. py:method:: keys(self)


   .. py:method:: _getrow(self, index)


   .. py:method:: _getcol(self, key)


   .. py:method:: _getcols(self, keys)


   .. py:method:: get(self, key, default=None)

      Get item for given key. Returns default value if not found.


   .. py:method:: clear(self)

      Remove all rows in place.


   .. py:method:: __getitem__(self, key)

      .. note:: Only handles the case where key is a single column name.

      .. rubric:: Example

      >>> df_light = DataFrameLight._demodata(num=7)
      >>> sub1 = df_light['bar']
      >>> # xdoctest: +REQUIRES(module:pandas)
      >>> df_heavy = df_light.pandas()
      >>> sub2 = df_heavy['bar']
      >>> assert np.all(sub1 == sub2)


   .. py:method:: __setitem__(self, key, value)

      .. note::

         Only handles the case where key is a single column name and value
         is an array of all the values to set.

      .. rubric:: Example

      >>> df_light = DataFrameLight._demodata(num=7)
      >>> value = [2] * len(df_light)
      >>> df_light['bar'] = value
      >>> # xdoctest: +REQUIRES(module:pandas)
      >>> df_heavy = df_light.pandas()
      >>> df_heavy['bar'] = value
      >>> assert np.all(df_light == df_heavy)


   .. py:method:: compress(self, flags, inplace=False)

      .. note:: Not part of the pandas API


   .. py:method:: take(self, indices, inplace=False)

      Return the elements in the given *positional* indices along an axis.

      :Parameters: **inplace** (*bool*) -- NOT PART OF PANDAS API

      .. rubric:: Notes

      assumes axis=0

      .. rubric:: Example

      >>> df_light = DataFrameLight._demodata(num=7)
      >>> indices = [0, 2, 3]
      >>> sub1 = df_light.take(indices)
      >>> # xdoctest: +REQUIRES(module:pandas)
      >>> df_heavy = df_light.pandas()
      >>> sub2 = df_heavy.take(indices)
      >>> assert np.all(sub1 == sub2)


   .. py:method:: copy(self)


   .. py:method:: extend(self, other)

      Extend ``self`` inplace using another dataframe array

      :Parameters: **other** (*DataFrameLight | dict[str, Sequence]*) -- values to concat to end of this object

      .. note:: Not part of the pandas API

      .. rubric:: Example

      >>> self = DataFrameLight(columns=['foo', 'bar'])
      >>> other = {'foo': [0], 'bar': [1]}
      >>> self.extend(other)
      >>> assert len(self) == 1


   .. py:method:: union(self, *others)

      .. note:: Not part of the pandas API


   .. py:method:: concat(cls, others)
      :classmethod:


   .. py:method:: from_pandas(cls, df)
      :classmethod:


   .. py:method:: from_dict(cls, records)
      :classmethod:


   .. py:method:: reset_index(self, drop=False)

      No-op for compatibility; the light version doesn't store an index.


   .. py:method:: groupby(self, by=None, *args, **kwargs)

      Group rows by the value of a column. Unlike pandas, this simply
      returns a zip object. To ensure compatibility, call list on the
      result of groupby.

      :Parameters: * **by** (*str*) -- column name to group by
                   * **\*args** -- if specified, the dataframe is coerced to pandas
                   * **\*\*kwargs** -- if specified, the dataframe is coerced to pandas

      .. rubric:: Example

      >>> df_light = DataFrameLight._demodata(num=7)
      >>> res1 = list(df_light.groupby('bar'))
      >>> # xdoctest: +REQUIRES(module:pandas)
      >>> df_heavy = df_light.pandas()
      >>> res2 = list(df_heavy.groupby('bar'))
      >>> assert len(res1) == len(res2)
      >>> assert all([np.all(a[1] == b[1]) for a, b in zip(res1, res2)])

      Ignore:
          >>> self = DataFrameLight._demodata(num=1000)
          >>> args = ['cx']
          >>> self['cx'] = (np.random.rand(len(self)) * 10).astype(int)
          >>> # As expected, our custom restricted implementation is faster
          >>> # than pandas
          >>> ub.Timerit(100).call(lambda: dict(list(self.pandas().groupby('cx')))).print()
          >>> ub.Timerit(100).call(lambda: dict(self.groupby('cx'))).print()


   .. py:method:: rename(self, mapper=None, columns=None, axis=None, inplace=False)

      Rename the columns (index renaming is not supported)

      .. rubric:: Example

      >>> df_light = DataFrameLight._demodata(num=7)
      >>> mapper = {'foo': 'fi'}
      >>> res1 = df_light.rename(columns=mapper)
      >>> res3 = df_light.rename(mapper, axis=1)
      >>> # xdoctest: +REQUIRES(module:pandas)
      >>> df_heavy = df_light.pandas()
      >>> res2 = df_heavy.rename(columns=mapper)
      >>> res4 = df_heavy.rename(mapper, axis=1)
      >>> assert np.all(res1 == res2)
      >>> assert np.all(res3 == res2)
      >>> assert np.all(res3 == res4)


   .. py:method:: iterrows(self)

      Iterate over rows as (index, Dict) pairs.

      :Yields: *Tuple[int, Dict]* -- the index and a dictionary representing a row

      .. rubric:: Example

      >>> from kwarray.dataframe_light import *  # NOQA
      >>> self = DataFrameLight._demodata(num=3)
      >>> print(ub.repr2(list(self.iterrows())))
      [
          (0, {'bar': 0, 'baz': 2.73, 'foo': 0}),
          (1, {'bar': 1, 'baz': 2.73, 'foo': 0}),
          (2, {'bar': 2, 'baz': 2.73, 'foo': 0}),
      ]

      Benchmark:
          >>> # xdoc: +REQUIRES(--bench)
          >>> from kwarray.dataframe_light import *  # NOQA
          >>> import ubelt as ub
          >>> df_light = DataFrameLight._demodata(num=1000)
          >>> df_heavy = df_light.pandas()
          >>> ti = ub.Timerit(21, bestof=3, verbose=2, unit='ms')
          >>> ti.reset('light').call(lambda: list(df_light.iterrows()))
          >>> ti.reset('heavy').call(lambda: list(df_heavy.iterrows()))
          >>> # xdoctest: +IGNORE_WANT
          Timed light for: 21 loops, best of 3
              time per loop: best=0.834 ms, mean=0.850 ± 0.0 ms
          Timed heavy for: 21 loops, best of 3
              time per loop: best=45.007 ms, mean=45.633 ± 0.5 ms


.. py:class:: DataFrameArray(data=None, columns=None)

   Bases: :py:obj:`DataFrameLight`

   DataFrameLight assumes the backend is a Dict[list], whereas
   DataFrameArray assumes the backend is a Dict[ndarray].

   Take and compress are much faster, but extend and union are slower.

   .. py:method:: __normalize__(self)

      Try to convert input data to Dict[ndarray]


   .. py:method:: extend(self, other)

      Extend ``self`` inplace using another dataframe array

      :Parameters: **other** (*DataFrameLight | dict[str, Sequence]*) -- values to concat to end of this object

      .. note:: Not part of the pandas API

      .. rubric:: Example

      >>> self = DataFrameLight(columns=['foo', 'bar'])
      >>> other = {'foo': [0], 'bar': [1]}
      >>> self.extend(other)
      >>> assert len(self) == 1


   .. py:method:: compress(self, flags, inplace=False)

      .. note:: Not part of the pandas API


   .. py:method:: take(self, indices, inplace=False)

      Return the elements in the given *positional* indices along an axis.

      :Parameters: **inplace** (*bool*) -- NOT PART OF PANDAS API

      .. rubric:: Notes

      assumes axis=0

      .. rubric:: Example

      >>> df_light = DataFrameLight._demodata(num=7)
      >>> indices = [0, 2, 3]
      >>> sub1 = df_light.take(indices)
      >>> # xdoctest: +REQUIRES(module:pandas)
      >>> df_heavy = df_light.pandas()
      >>> sub2 = df_heavy.take(indices)
      >>> assert np.all(sub1 == sub2)
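
The positional ``take`` and flag-based ``compress`` operations documented above can be sketched on a plain dict of lists, which is the ``DataFrameLight`` storage layout. This is a hedged, standard-library-only illustration: ``take_rows`` and ``compress_rows`` are hypothetical names, and the assumption that ``flags`` holds one boolean per row is mine rather than something the docstrings state.

```python
from itertools import compress


def take_rows(data, indices):
    """Select rows by position from a dict of equal-length lists."""
    return {key: [values[i] for i in indices] for key, values in data.items()}


def compress_rows(data, flags):
    """Keep the rows whose corresponding flag is truthy (assumed semantics)."""
    return {key: list(compress(values, flags)) for key, values in data.items()}


demo = {'foo': [0, 1, 2, 3], 'bar': [4, 5, 6, 7]}
taken = take_rows(demo, [0, 2, 3])
kept = compress_rows(demo, [True, False, False, True])
print(taken)
print(kept)
```

With ndarray columns, the per-column list comprehension would presumably be replaced by a single array indexing operation, which fits the statement that take and compress are faster in ``DataFrameArray``.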