Data Quality
- class optimus.engines.base.columns.BaseColumns(root: optimus.helpers.types.DataFrameType)[source]
Base class for all Cols implementations
- median(cols='*', relative_error=10000, tidy=True, compute=True)[source]
Returns the median of the values over the requested columns.
- Parameters
cols – “*”, column name or list of column names to be processed.
relative_error –
tidy – The result format. If True, returns a single value when one column is processed, or column name and value pairs when several are processed. If False, returns the function name, the column name and the value.
compute – Compute the final result. If False, a delayed object is returned.
- Returns
Returns the median of the values over the requested columns
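As a point of reference, the quantity reported per column can be reproduced in plain Python. This is only a sketch that is independent of Optimus: the sample data and column names are made up, and it computes the exact median, whereas the undocumented `relative_error` parameter suggests that some engines may return an approximation.

```python
from statistics import median

# Hypothetical sample data; the column names are illustrative only.
columns = {
    "price": [10, 20, 30, 40],
    "qty": [1, 2, 3],
}

# Exact median per column: the middle value, or the mean of the two
# middle values when the column has an even number of entries.
medians = {name: median(values) for name, values in columns.items()}
```

With an even number of entries, as in `price`, the result is the average of the two middle values.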
- mean(cols='*', tidy=True, compute=True)[source]
Return the mean of the values over the requested columns.
- Parameters
cols – “*”, column name or list of column names to be processed.
tidy – The result format. If True, returns a single value when one column is processed, or column name and value pairs when several are processed. If False, returns the function name, the column name and the value.
compute – Compute the final result. If False, a delayed object is returned.
- Returns
The mean of the values over the requested columns.
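The `tidy` parameter appears on most methods in this class, so a small sketch of the result shapes it describes may help. `format_result` is a hypothetical helper written for this illustration, not part of Optimus, and the exact nesting of the untidy form is an assumption based on the wording "function name, column name and value".

```python
from statistics import mean

def format_result(func_name, per_column, tidy=True):
    """Hypothetical helper mimicking the documented `tidy` behaviour."""
    if not tidy:
        # Full form (assumed layout): function name -> column name -> value.
        return {func_name: per_column}
    if len(per_column) == 1:
        # One column processed: return the bare value.
        return next(iter(per_column.values()))
    # Several columns processed: column name -> value.
    return per_column

# Made-up data for illustration.
data = {"a": [1, 2, 3], "b": [10, 20]}
means = {name: mean(vals) for name, vals in data.items()}

single = format_result("mean", {"a": means["a"]})
several = format_result("mean", means)
full = format_result("mean", means, tidy=False)
```

Here `single` is a bare number, `several` maps column names to values, and `full` additionally wraps that mapping under the function name.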
- var(cols='*', tidy=True, compute=True)[source]
Return unbiased variance over requested columns.
- Parameters
cols – “*”, column name or list of column names to be processed.
tidy – The result format. If True, returns a single value when one column is processed, or column name and value pairs otherwise.
compute – Compute the final result. If False, a delayed object is returned.
- Returns
- std(cols='*', tidy=True, compute=True)[source]
Return the standard deviation over the requested columns.
- Parameters
cols – “*”, column name or list of column names to be processed.
tidy – The result format. If True, returns a single value when one column is processed, or column name and value pairs otherwise.
compute – Compute the final result. If False, a delayed object is returned.
- Returns
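For readers unsure what "unbiased" means for `var` and `std`: the unbiased (sample) variance divides the sum of squared deviations by n - 1 rather than n. The stdlib sketch below shows the difference on made-up values. It assumes that `std` corresponds to the square root of the unbiased variance, which the source does not state explicitly.

```python
import math
from statistics import pvariance, stdev, variance

# Made-up values: mean is 5, sum of squared deviations is 32.
values = [2, 4, 4, 4, 5, 5, 7, 9]

population_var = pvariance(values)  # divides by n      -> 32 / 8
sample_var = variance(values)       # divides by n - 1  -> 32 / 7
sample_std = stdev(values)          # square root of the sample variance
```

The two variances converge as the number of rows grows, so the distinction matters mostly for small columns.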
- min(cols='*', numeric=None, tidy: bool = True, compute: bool = True)[source]
Return the minimum value of each requested column.
- Parameters
cols – “*”, column name or list of column names to be processed.
numeric – if True, cast to numeric before processing.
tidy – The result format. If True, returns a single value when one column is processed, or column name and value pairs when several are processed. If False, returns the function name, the column name and the value.
compute – Compute the final result. If False, a delayed object is returned.
- Returns
- max(cols='*', numeric=None, tidy: bool = True, compute: bool = True)[source]
Return the maximum value of each requested column.
- Parameters
cols – “*”, column name or list of column names to be processed.
numeric – if True, cast to numeric before processing.
tidy – The result format. If True, returns a single value when one column is processed, or column name and value pairs when several are processed. If False, returns the function name, the column name and the value.
compute – Compute the final result. If False, a delayed object is returned.
- Returns
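The `numeric` flag on `min` and `max` matters when a column holds numbers stored as text. The plain-Python sketch below shows why a cast changes the answer. It does not reproduce how Optimus performs the cast or how it treats values that cannot be converted, neither of which is covered here.

```python
# Made-up column of numbers stored as strings.
raw = ["10", "9", "100"]

# Without a numeric cast, strings compare lexicographically,
# character by character.
text_min, text_max = min(raw), max(raw)

# With a numeric cast, the comparison uses the numbers themselves.
numbers = [float(v) for v in raw]
num_min, num_max = min(numbers), max(numbers)
```

Lexicographic ordering puts `"10"` first and `"9"` last, while the numeric ordering gives 9 and 100.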
- percentile(cols='*', values=None, relative_error=10000, estimate=True, tidy=True, compute=True)[source]
Return values at the given percentile over requested column.
- Parameters
cols – “*”, column name or list of column names to be processed.
values – Percentile values to calculate, for example 0.25, 0.5, 0.75.
relative_error –
tidy – The result format. If True, returns a single value when one column is processed, or column name and value pairs when several are processed. If False, returns the function name, the column name and the value.
compute – Compute the final result. If False, a delayed object is returned.
- Returns
Return values at the given percentile over requested column.
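To make the `values` argument concrete, here is a minimal sketch of an exact percentile computed with linear interpolation between ranks. It is one common definition, not necessarily the one Optimus uses: the `relative_error` and `estimate` parameters suggest that an approximate method may be applied on some engines. The function and data below are illustrative only.

```python
def percentile(sorted_values, q):
    """Value at fraction q (0..1), interpolating linearly between ranks."""
    if not sorted_values:
        raise ValueError("empty column")
    pos = q * (len(sorted_values) - 1)   # fractional rank
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])

# Made-up column; it must be sorted before calling percentile().
column = sorted([15, 20, 35, 40, 50])
result = {q: percentile(column, q) for q in (0.25, 0.5, 0.75)}
```

For this five-value column the requested fractions land exactly on ranks 1, 2 and 3, so no interpolation is needed.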
- iqr(cols='*', more=None, relative_error=10000, estimate=True)[source]
Return the column Inter Quartile Range value.
- Parameters
cols – “*”, column name or list of column names to be processed.
more – If True, also return information about q1 and q3.
relative_error –
- Returns
Return the column Inter Quartile Range value.
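The interquartile range is the distance between the third and first quartiles, q3 - q1. The stdlib sketch below illustrates that definition and one possible shape for the extended result. The dictionary layout returned when `more` is set is an assumption made for this example, and the quartile method (`"inclusive"`) may differ from what Optimus uses.

```python
from statistics import quantiles

def iqr(values, more=False):
    # Quartile cut points; "inclusive" treats the data as the full population.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    # The layout of the extended result is an assumption for illustration.
    return {"iqr": spread, "q1": q1, "q3": q3} if more else spread

# Made-up column.
column = [1, 2, 3, 4, 5, 6, 7, 8, 9]
simple = iqr(column)
detailed = iqr(column, more=True)
```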
- hist(cols='*', buckets=32, compute=True) → dict[source]
Return the histogram representation of the distribution of the data.
- Parameters
cols – “*”, column name or list of column names to be processed.
buckets – Number of histogram bins to be used.
compute –
- Returns
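A histogram with `buckets` bins can be pictured as equal-width intervals between the column minimum and maximum, each with a count. The sketch below implements that idea in plain Python. The `lower`/`upper`/`count` layout is an assumption for illustration, since the structure of the returned dict is not described here.

```python
def histogram(values, buckets=32):
    lo, hi = min(values), max(values)
    # Fall back to a width of 1 for constant columns to avoid dividing by zero.
    width = (hi - lo) / buckets or 1
    counts = [0] * buckets
    for v in values:
        # The maximum value falls into the last bucket.
        idx = min(int((v - lo) / width), buckets - 1)
        counts[idx] += 1
    return [
        {"lower": lo + i * width, "upper": lo + (i + 1) * width, "count": c}
        for i, c in enumerate(counts)
    ]

# Made-up column split into four buckets of width 1.
bins = histogram([1, 2, 2, 3, 3, 3, 4, 4, 5], buckets=4)
```

Every interval is closed on the left and open on the right, except the last one, which also includes the maximum.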
- frequency(cols='*', n=32, percentage=False, total_rows=None, count_uniques=False, compute=True, tidy=False) → dict[source]
Return the count of every element in the column.
- Parameters
cols – “*”, column name or list of column names to be processed.
n – Number of bins to be returned.
percentage – if True calculate the
total_rows – If True, return the total count.
count_uniques – If True, return the number of unique elements.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, returns a single value when one column is processed, or column name and value pairs when several are processed. If False, returns the function name, the column name and the value.
- Returns
A dict with the count of every element in the column.
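A frequency count of this kind can be sketched with `collections.Counter`. The example reads `n` as "keep the n most frequent values" and `percentage` as "add each value's share of all rows". Both readings are assumptions, as is the `value`/`count`/`percentage` layout of each entry.

```python
from collections import Counter

def frequency(values, n=32, percentage=False):
    counts = Counter(values)
    total = len(values)
    rows = []
    # most_common(n) yields the n most frequent values, highest count first.
    for value, count in counts.most_common(n):
        row = {"value": value, "count": count}
        if percentage:
            row["percentage"] = round(count * 100 / total, 2)
        rows.append(row)
    return rows

# Made-up column; keep only the two most frequent values.
top = frequency(["a", "b", "a", "c", "a", "b"], n=2, percentage=True)
```

Values beyond the first `n` are dropped, so the reported percentages need not add up to 100.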