Data Quality

class optimus.engines.base.columns.BaseColumns(root: optimus.helpers.types.DataFrameType)[source]

Base class for all Cols implementations

median(cols='*', relative_error=10000, tidy=True, compute=True)[source]

Returns the median of the values over the requested columns.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • relative_error

  • tidy – The result format. If True it will return a value if you process a column or column name and value if not. If False it will return the functions name, the column name.

  • compute

Returns

Returns the median of the values over the requested columns

mean(cols='*', tidy=True, compute=True)[source]

Return the mean of the values over the requested columns.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • tidy – The result format. If True it will return a value if you process a column or column name and value if not. If False it will return the functions name, the column name.

  • compute – Compute the final result. False imply to return a delayed object.

Returns

Column containing the cumulative sum.

var(cols='*', tidy=True, compute=True)[source]

Return unbiased variance over requested columns.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • tidy – The result format. If tidy it will return a value if you process a column or column name and value if not.

  • compute – Compute the final result. False imply to return a delayed object.

Returns

std(cols='*', tidy=True, compute=True)[source]

Return unbiased variance over requested columns.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • tidy – The result format. If tidy it will return a value if you process a column or column name and value if not.

  • compute – Compute the final result. False imply to return a delayed object.

Returns

min(cols='*', numeric=None, tidy: bool = True, compute: bool = True)[source]

Return the minimum value over one or one each column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • numeric – if True, cast to numeric before processing.

  • tidy – The result format. If True it will return a value if you process a column or column name and

value if not. If False it will return the functions name, the column name. and the value. :param compute: C :return:

max(cols='*', numeric=None, tidy: bool = True, compute: bool = True)[source]

Return the maximum value over one or one each column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • numeric – if True, cast to numeric before processing.

  • tidy – The result format. If True it will return a value if you process a column or column name and value if not. If False it will return the functions name, the column name.

  • compute – Compute the final result. False imply to return a delayed object.

Returns

percentile(cols='*', values=None, relative_error=10000, estimate=True, tidy=True, compute=True)[source]

Return values at the given percentile over requested column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • values – Percentiles values you want to calculate. 0.25,0.5,0.75

  • relative_error

  • tidy – The result format. If True it will return a value if you process a column or column name and value if not. If False it will return the functions name, the column name.

  • compute – Compute the final result. False imply to return a delayed object.

Returns

Return values at the given percentile over requested column.

iqr(cols='*', more=None, relative_error=10000, estimate=True)[source]

Return the column Inter Quartile Range value.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • more – Return info about q1 and q3

  • relative_error

Returns

Return the column Inter Quartile Range value.

hist(cols='*', buckets=32, compute=True) dict[source]

Return the histogram representation of the distribution of the data.

Parameters

cols – “*”, column name or list of column names to be processed.

:param buckets:Number of histogram bins to be used. :param compute: :return:

frequency(cols='*', n=32, percentage=False, total_rows=None, count_uniques=False, compute=True, tidy=False) dict[source]

Return the count of every element in the column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • n – numbers of bins to be returned.

  • percentage – if True calculate the

  • total_rows – If True returned the total count.

  • count_uniques – If True returned the number of uniques elements.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True it will return a value if you

process a column or column name and value if not. If False it will return the functions name, the column name and the value. :return: dict with the count of every element in the column.

count() int[source]

Returns the number of columns in the dataframe.

Returns

Returns the number of columns in the dataframe.