Columns

class optimus.engines.base.columns.BaseColumns(root: optimus.helpers.types.DataFrameType)[source]

Base class for all Cols implementations

abs(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Return the absolute numeric value of each value in a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the absolute value of each element.

acos(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Apply the arccosine function to a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the arccosine of each element.

acosh(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Apply the inverse hyperbolic cosine function to a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the inverse hyperbolic cosine of each element.

add(cols='*', output_col=None) optimus.helpers.types.DataFrameType[source]

Add two or more columns together.

Parameters
  • cols – ‘*’, list of column names or a single column name.

  • output_col – Single output column in case no value is passed.

Returns

Dataframe with the result of the arithmetic operation appended.
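
As a rough illustration of the row-wise addition described above, the following plain-Python sketch sums the selected columns of a small table and appends the result. It does not use Optimus: the dict-of-lists table, the `add_columns` helper and the output name `a_plus_b` are assumptions made for this example only.

```python
def add_columns(table, cols, output_col="sum"):
    """Return a copy of `table` with the row-wise sum of `cols` appended.

    `table` is a plain dict mapping column names to equal-length lists
    (a stand-in for a dataframe, not the Optimus type).
    """
    # zip(*...) walks the selected columns row by row.
    rows = zip(*(table[c] for c in cols))
    result = dict(table)  # shallow copy so the input is left untouched
    result[output_col] = [sum(row) for row in rows]
    return result


table = {"a": [1, 2, 3], "b": [10, 20, 30]}
added = add_columns(table, ["a", "b"], output_col="a_plus_b")
print(added["a_plus_b"])  # [11, 22, 33]
```

The original table is returned unchanged apart from the appended column, which mirrors the "result appended" wording of the Returns entry.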

agg_exprs(cols='*', funcs=None, *args, compute=True, tidy=True, parallel=False)[source]

Run a list of aggregation functions.

Parameters
  • cols – Columns over which to apply the aggregation functions.

  • funcs – List of aggregation functions.

  • args

  • compute – Compute the result or return a delayed function.

  • tidy – Compact the dict output.

  • parallel – Execute the function in every column or apply it over the whole dataframe.

Returns

The values calculated by the list of aggregation functions.
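
The idea of running several aggregation functions over several columns can be sketched in plain Python as below. This is not the Optimus implementation: the `run_aggregations` helper, the nested-dict result shape and the way `tidy` compacts the output are all assumptions chosen to make the example concrete.

```python
from statistics import mean


def run_aggregations(table, cols, funcs, tidy=True):
    """Apply every function in `funcs` to every column in `cols`.

    Returns {function_name: {column_name: value}} (a hypothetical shape).
    """
    result = {f.__name__: {c: f(table[c]) for c in cols} for f in funcs}
    if tidy and len(funcs) == 1 and len(cols) == 1:
        # Assumed compact form: one function on one column gives a bare value.
        return result[funcs[0].__name__][cols[0]]
    return result


table = {"price": [10, 20, 30], "qty": [1, 2, 3]}
full = run_aggregations(table, ["price", "qty"], [min, max, mean])
single = run_aggregations(table, ["price"], [max])
print(full["max"])  # {'price': 30, 'qty': 3}
print(single)       # 30
```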

any_greater_than(cols='*', value=None, inverse=False, tidy=True, compute=True)[source]
Parameters
  • cols

  • value

  • inverse

  • tidy

  • compute

Returns

abstract append(dfs: optimus.helpers.types.DataFrameTypeList) optimus.helpers.types.DataFrameType[source]

Appends one or more columns or dataframes.

Parameters

dfs – DataFrame, list of dataframes or list of columns to append to the dataframe

Returns

DataFrame

apply(cols='*', func=None, func_return_type=None, args=None, func_type=None, where=None, filter_col_by_data_types=None, output_cols=None, skip_output_cols_processing=False, meta_action='apply_cols', mode='vectorized', set_index=False, default=None, **kwargs) optimus.helpers.types.DataFrameType[source]
Parameters
  • cols – “*”, column name or list of column names to be processed.

  • func

  • func_return_type

  • args

  • func_type

  • where

  • filter_col_by_data_types

  • output_cols – Column name or list of column names where the transformed data will be saved.

  • skip_output_cols_processing

  • meta_action

  • mode

  • set_index

  • default

  • kwargs

Returns

apply_by_data_types(cols='*', func=None, args=None, data_type=None) optimus.helpers.types.DataFrameType[source]

Apply a function using pandas udf or udf if apache arrow is not available.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • func – Function to be applied to the columns.

  • args

  • func – pandas_udf or udf. If ‘None’ try to use pandas udf (Pyarrow needed)

  • data_type

Returns

asin(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Apply the arcsine function to a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the arcsine of each element.

asinh(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Apply the inverse hyperbolic sine function to a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the inverse hyperbolic sine of each element.

assign(cols: Optional[Union[str, list, dict]] = None, values=None, **kwargs)[source]

Assign new columns to a Dataframe.

Returns a DataFrame with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.

Parameters
  • cols – A dict with the form {“col_name”: “value”}, a list of columns or a single column

  • values – When no dict is passed to ‘cols’, uses this parameter to get the values.

  • kwargs

Returns

abstract static astype(*args, **kwargs)[source]

Alias of the cast function, for compatibility with the pandas API.

Parameters
  • args

  • kwargs

Returns

atan(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Apply the arctangent function to a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the arctangent of each element.

atanh(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Apply the inverse hyperbolic tangent function to a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the inverse hyperbolic tangent of each element.

bag_of_words(features, analyzer='word', ngram_range=2) optimus.helpers.types.DataFrameType[source]
Parameters
  • analyzer

  • features

  • ngram_range

Returns

boxplot(cols='*') dict[source]

Return the boxplot data in python dict format.

Parameters

cols – “*”, column name or list of column names to be processed.

Returns

dict with box plot data.
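
To give a feel for what "boxplot data in dict format" can contain, here is a plain-Python sketch that computes the usual box plot statistics for one column. The key names, the inclusive quartile method and the 1.5 × IQR whisker rule are assumptions for illustration; the dict Optimus actually returns may use different keys and rules.

```python
from statistics import quantiles


def boxplot_stats(values):
    """Compute quartiles, whiskers and outliers for a list of numbers."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Tukey fences: points beyond 1.5 * IQR from the box count as outliers.
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "fliers": [v for v in data if v < low_fence or v > high_fence],
    }


stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(stats["median"], stats["fliers"])  # 5.5 [100]
```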

calculate_pattern_counts(cols='*', n=10, mode=0, flush=False) optimus.helpers.types.DataFrameType[source]

Counts how many equal patterns there are in a column. Uses a cache to trigger the operation only if necessary. Saves the result to meta and returns the same dataframe.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • n – Return the Top n matches.

  • mode – Mode used to calculate the patterns.

  • flush – Flushes the cache to process again

Returns

capitalize(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Capitalize every word in a sentence.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

cast(cols=None, data_type=None, output_cols=None, *args, **kwargs) optimus.helpers.types.DataFrameType[source]

NOTE: There are two ways to cast the data. The native .astype() is faster, but it cannot handle some transformations, such as string to number, in which invalid values should output nan.

Cast the elements inside a column or a list of columns to a specific data type. Unlike ‘cast’, this does not change the column's data type.

Parameters
  • cols – Column names to be cast, or a dictionary or list of tuples of column names and types to be cast, with the following structure: cols = [(‘columnName1’, ‘integer’), (‘columnName2’, ‘float’), (‘columnName3’, ‘string’)]. The first element of each tuple is the column name; the second is the final data type of the column after the transformation.

  • output_cols – Column name or list of column names where the transformed data will be saved.

  • data_type – final data type

  • args – passed to cast function (df.cols.to_integer(…, -1)).

  • kwargs – passed to cast function (df.cols.to_integer(…, default=-1)).

Returns

The cast columns.
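
The note above distinguishes a strict native cast from a tolerant one that yields nan for values that cannot be converted. The tolerant behaviour can be sketched in plain Python as follows; the `to_float_or_nan` helper is hypothetical and only illustrates the idea, it is not the function Optimus uses internally.

```python
import math


def to_float_or_nan(value):
    """Convert `value` to float, returning nan when conversion fails."""
    try:
        return float(value)
    except (TypeError, ValueError):
        # float("abc") raises ValueError, float(None) raises TypeError.
        return math.nan


raw = ["1.5", "2", "abc", None]
converted = [to_float_or_nan(v) for v in raw]
print(converted)  # [1.5, 2.0, nan, nan]
```

A strict cast would raise on the third element instead, which is the trade-off the note describes: speed against tolerance for dirty values.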

ceil(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Round each number in a column up to the nearest integer.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the ceil of each element.

clip(cols='*', lower_bound=None, upper_bound=None, output_cols=None) optimus.helpers.types.DataFrameType[source]

Assigns values outside boundary to boundary values.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • lower_bound – Minimum threshold value. All values below this threshold will be set to it. A missing threshold (e.g NA) will not clip the value.

  • upper_bound – Maximum threshold value. All values above this threshold will be set to it. A missing threshold (e.g NA) will not clip the value.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns
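
The clipping rule described for `lower_bound` and `upper_bound` can be sketched in plain Python as below. The standalone `clip` function is an illustration written for this example, not the Optimus method; it treats a bound of None as "no threshold", following the parameter descriptions.

```python
def clip(values, lower_bound=None, upper_bound=None):
    """Return `values` with out-of-range entries set to the nearest bound."""
    clipped = []
    for v in values:
        if lower_bound is not None and v < lower_bound:
            v = lower_bound
        if upper_bound is not None and v > upper_bound:
            v = upper_bound
        clipped.append(v)
    return clipped


both = clip([-5, 0, 7, 12], lower_bound=0, upper_bound=10)
upper_only = clip([-5, 0, 7, 12], upper_bound=10)
print(both)        # [0, 0, 7, 10]
print(upper_only)  # [-5, 0, 7, 10]
```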

concat(dfs: optimus.helpers.types.DataFrameTypeList) optimus.helpers.types.DataFrameType[source]

Same as append.

Parameters

dfs – DataFrame, list of dataframes or list of columns to append to the dataframe

Returns

DataFrame

copy(cols='*', output_cols=None, columns=None) optimus.helpers.types.DataFrameType[source]

Copy one or multiple columns.

Parameters
  • cols – Source column to be copied

  • output_cols – Column name or list of column names where the transformed data will be saved.

  • columns – List of tuples pairing a column with the name of its copy, e.g. [(‘column1’, ‘column_copy’), (‘column1’, ‘column1_copy’)].

Returns

correlation(cols='*', method='pearson', compute=True, tidy=True)[source]

Compute pairwise correlation of columns, excluding NA/null values.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • method – Method of correlation:

      pearson : standard correlation coefficient
      kendall : Kendall Tau correlation coefficient
      spearman : Spearman rank correlation
      callable : callable with input two 1d ndarrays and returning a float. Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns
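
For reference, the default `pearson` method corresponds to the standard correlation coefficient, which can be computed pairwise as in the plain-Python sketch below. The `pearson` and `correlation_matrix` helpers and the nested-dict output are assumptions for illustration; the sketch also skips the NA/null exclusion that the method description mentions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(table):
    """Pairwise correlation of every column in a dict-of-lists table."""
    cols = list(table)
    return {a: {b: pearson(table[a], table[b]) for b in cols} for a in cols}


table = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]}
matrix = correlation_matrix(table)
print(matrix["x"]["y"], matrix["x"]["z"])
```

Here `y` is a positive multiple of `x` and `z` is `x` reversed, so the two printed coefficients are 1 and -1 up to floating-point rounding, and the matrix is symmetric with 1 on the diagonal.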

cos(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Apply cosine function to a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the cosine of each element.

cosh(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Apply the hyperbolic cosine function to a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the hyperbolic cosine of each element.

count() int[source]

Returns the number of columns in the dataframe.

Returns

Returns the number of columns in the dataframe.

count_array(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the number of lists in a column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_between(cols='*', lower_bound=None, upper_bound=None, equal=True, bounds=None, inverse=False, tidy=True, compute=True)[source]

Count the number of elements between a lower and an upper bound in a given column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • lower_bound – Lower bound.

  • upper_bound – Upper bound.

  • equal

  • bounds

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.
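
A plain-Python sketch of counting values between two bounds is shown below. It is not the Optimus implementation. In particular, `equal` is undocumented above, so treating it as "include the bounds themselves" is an assumption, as is the reading of `inverse` as "count the values that fall outside the range".

```python
def count_between(values, lower_bound, upper_bound, equal=True, inverse=False):
    """Count values inside [lower_bound, upper_bound] (or outside, if inverse)."""

    def in_range(v):
        if equal:
            return lower_bound <= v <= upper_bound  # bounds included
        return lower_bound < v < upper_bound        # bounds excluded

    # `in_range(v) != inverse` flips the selection when inverse is True.
    return sum(in_range(v) != inverse for v in values)


values = [1, 5, 10, 15, 20]
inclusive = count_between(values, 5, 15)
exclusive = count_between(values, 5, 15, equal=False)
outside = count_between(values, 5, 15, inverse=True)
print(inclusive, exclusive, outside)  # 3 1 2
```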

count_boolean(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the number of booleans in a column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_containing(cols='*', value=None, inverse=False, tidy=True, compute=True)[source]
Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • value – Value used to evaluate the function.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_data_type(cols='*', data_type=None, inverse=False, tidy=True, compute=True)[source]

Count the number of mismatch values in a given column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • data_type

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_datetime(cols='*', inverse=False, tidy=True, compute=True)[source]
Parameters
  • cols

  • inverse

  • tidy

  • compute

Returns

count_duplicated(cols='*', keep='first', inverse=False, tidy=True, compute=True)[source]
Parameters
  • cols

  • keep

  • inverse

  • tidy

  • compute

Returns

count_email(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the number of strings that look like an email address in a column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_empty(cols='*', inverse=False, tidy=True, compute=True)[source]

Count the number of empty values in a given column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_ending_with(cols='*', value=None, inverse=False, tidy=True, compute=True)[source]

Counts the number of elements that end with the given string.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • value – Value used to evaluate the function.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_equal(cols='*', value=None, inverse=False, tidy=True, compute=True)[source]

Count the number of elements equal to a value in a given column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • value – Value used to evaluate the function.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_expression(value=None, inverse=False, tidy=True, compute=True)[source]
Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • value – Value used to evaluate the function.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_float(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the number of floats in a column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_gender(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the number of strings that look like a gender in a column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_greater_than(cols='*', value=None, inverse=False, tidy=True, compute=True)[source]

Count the number of elements greater than a value in a given column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • value – Value used to evaluate the function.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_greater_than_equal(cols='*', value=None, inverse=False, compute=True, tidy=True)[source]

Count the number of elements greater than or equal to a value in a given column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • value – Value used to evaluate the function.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_http_code(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the number of strings that look like an HTTP code in a column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_int(cols='*', inverse=False, tidy=True, compute=True)[source]

Count the number of integers in a column.

Parameters
  • cols – ‘*’, list of column names or a single column name.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

count_ip(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the number of strings that look like an IP address in a column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_less_than(cols='*', value=None, inverse=False, tidy=True, compute=True)[source]

Count the number of elements smaller than a value in a given column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • value – Value used to evaluate the function.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_less_than_equal(cols='*', value=None, inverse=False, tidy=True, compute=True)[source]

Count the number of elements smaller than or equal to a value in a given column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • value – Value used to evaluate the function.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_match(cols='*', regex=None, data_type=None, inverse=False, tidy=True, compute=True)[source]

Counts the number of matching values in a given column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • data_type

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_match_pattern(cols='*', pattern=None, inverse=False, tidy=True, compute=True)[source]
Parameters
  • cols

  • pattern

  • inverse

  • tidy

  • compute

Returns

count_mismatch(cols='*', data_type=None, inverse=False, tidy=True, compute=True)[source]

Count the number of mismatch values in a given column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • data_type

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_missings(cols='*', inverse=False, tidy=True, compute=True)[source]

Count the number of missing values in a given column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_nan(cols='*', inverse=False, tidy=True, compute=True)[source]

Count the number of ‘nan’ values in a given column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_none(cols='*', inverse=False, tidy=True, compute=True)[source]

Count the number of ‘None’ values in a given column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_not_equal(cols='*', value=None, inverse=False, tidy=True, compute=True)[source]

Count the number of elements not equal to a value in a given column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • value – Value used to evaluate the function.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_nulls(cols='*', inverse=False, tidy=True, compute=True)[source]

Count the number of ‘null’ values in a given column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_numeric(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the numeric elements in a column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_object(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the number of Python objects in a column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_phone_number(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the number of strings that look like a phone number in a column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_regex(cols='*', regex=None, inverse=False, tidy=True, compute=True)[source]

Counts the number of elements that match a regular expression.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • regex – regular expression.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.
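
Counting regular-expression matches in a column can be sketched with the standard `re` module as below. This standalone `count_regex` function is an illustration only: whether Optimus matches anywhere in the string (as `re.search` does here) or requires a full match is not stated above, so that choice is an assumption.

```python
import re


def count_regex(values, regex, inverse=False):
    """Count the values whose string form matches `regex`."""
    pattern = re.compile(regex)
    # `!= inverse` flips the selection so non-matching values are counted.
    return sum((pattern.search(str(v)) is not None) != inverse for v in values)


values = ["AB-12", "CD-345", "no code", "EF-6"]
matching = count_regex(values, r"^[A-Z]{2}-\d+$")
non_matching = count_regex(values, r"^[A-Z]{2}-\d+$", inverse=True)
print(matching, non_matching)  # 3 1
```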

count_social_security_number(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the number of strings that look like a social security number in a column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_starting_with(cols='*', value=None, inverse=False, tidy=True, compute=True)[source]

Counts the number of elements that start with the given string.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • value – Value used to evaluate the function.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_str(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the number of strings in a column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_uniques(cols='*', estimate=False, compute=True, tidy=True) int[source]

Count the number of unique values in a column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • estimate

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, return a single value when one column is processed, or the column name and value otherwise. If False, return the function name, the column name and the value.

Returns
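
An exact unique-value count per column can be sketched in plain Python with sets, as below. The dict-of-lists table and the `count_uniques` helper are assumptions for this example; the undocumented `estimate` option, which presumably trades exactness for speed, is not modelled.

```python
def count_uniques(table, cols):
    """Return {column_name: number of distinct values} for each column."""
    return {c: len(set(table[c])) for c in cols}


table = {"city": ["Lima", "Quito", "Lima"], "code": [1, 2, 3]}
uniques = count_uniques(table, ["city", "code"])
print(uniques)  # {'city': 2, 'code': 3}
```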

count_url(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the number of strings that look like a URL in a column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, returns a single value when one column is processed, or each column name with its value otherwise. If False, returns the function name, the column name and the value.

Returns

The number of elements that match the function.

count_values_in(cols='*', values=None, inverse=False, tidy=True, compute=True)[source]
Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • values – Values used to evaluate the function.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, returns a single value when one column is processed, or each column name with its value otherwise. If False, returns the function name, the column name and the value.

Returns

The number of elements that match the function.

count_zeros(cols='*', tidy=True, compute=True)[source]

Return the count of zeros by column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • tidy

  • compute

Returns

count_zip_code(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the number of strings that look like zip codes in a column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • inverse – Inverse the function selection.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, returns a single value when one column is processed, or each column name with its value otherwise. If False, returns the function name, the column name and the value.

Returns

The number of elements that match the function.

cross_tab(col_x, col_y, output='dict', compute=True) dict[source]
Parameters
  • col_x

  • col_y

  • output

  • compute – Compute the result or return a delayed function.

Returns

cummax(cols='*', output_cols=None)[source]

Return cumulative maximum over a DataFrame or column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the cumulative maximum.

cummin(cols='*', output_cols=None)[source]

Return cumulative minimum over a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the cumulative minimum.

cumprod(cols='*', output_cols=None)[source]

Return cumulative product over a DataFrame or column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the cumulative product.

cumsum(cols='*', output_cols=None)[source]

Return cumulative sum over a DataFrame or column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the cumulative sum.
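
The four cumulative operations above (cummax, cummin, cumprod, cumsum) share one idea: each output element summarizes every input element up to that position. The following is a minimal plain-Python sketch of that idea using itertools.accumulate, not Optimus itself; the sample list is made up for illustration.

```python
from itertools import accumulate
import operator

values = [3, 1, 4, 1, 5]

# Each call walks the list once and keeps a running result.
running_sum = list(accumulate(values))                 # cumulative sum
running_prod = list(accumulate(values, operator.mul))  # cumulative product
running_max = list(accumulate(values, max))            # cumulative maximum
running_min = list(accumulate(values, min))            # cumulative minimum

print(running_sum)   # [3, 4, 8, 9, 14]
print(running_prod)  # [3, 3, 12, 12, 60]
print(running_max)   # [3, 3, 4, 4, 5]
print(running_min)   # [3, 1, 1, 1, 1]
```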

cut(cols='*', bins=None, labels=None, default=None, output_cols=None) optimus.helpers.types.DataFrameType[source]

Use cut when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges. Supports binning into an equal number of bins, or a pre-specified array of bins.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • bins

  • labels

  • default

  • output_cols

Returns
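
As a rough illustration of the binning that cut describes, the sketch below assigns ages to labelled ranges with the standard bisect module. It is not Optimus's implementation: the helper name cut_value, the bin edges, the labels and the choice of right-inclusive intervals are all assumptions made for the example.

```python
from bisect import bisect_left

def cut_value(value, bins, labels, default=None):
    """Return the label of the bin (lo, hi] that contains value.

    bins must be sorted edges and labels must have len(bins) - 1 entries.
    Values outside every bin get `default`.
    """
    if value <= bins[0] or value > bins[-1]:
        return default
    # bisect_left finds the first edge >= value; the bin index is one less.
    return labels[bisect_left(bins, value) - 1]

bins = [0, 18, 65, 120]
labels = ["minor", "adult", "senior"]
ages = [5, 18, 40, 70, 130]
groups = [cut_value(a, bins, labels, default="unknown") for a in ages]
print(groups)  # ['minor', 'minor', 'adult', 'senior', 'unknown']
```

Whether an edge value such as 18 belongs to the lower or the upper bin is a design choice; this sketch puts it in the lower one.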

data_type(cols='*', names=False, tidy=True) dict[source]

Return the column(s) data type as string.

Parameters
  • cols – Columns to be processed

  • names – Returns aliases for every type instead of its internal name

Returns

Return a dict of column and its respective data type.

date_format(cols='*', tidy=True, compute=True, cached=None, **kwargs)[source]

Get the date format from a column, compatible with ‘format_date’.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • tidy – The result format. If True, returns a single value when one column is processed, or each column name with its value otherwise.

  • compute – Compute the final result. If False, a delayed object is returned.

  • cached – {None, True, False}, Gets cached date_formats (True), calculates them (False) or a combination of both (None).

  • kwargs

Returns

date_formats(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Get the date format for every value in specified columns.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

BaseDataFrame

day(cols='*', format: Optional[str] = None, output_cols=None) optimus.helpers.types.DataFrameType[source]

Get the day from a date in a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • format – String format

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

days_between(cols='*', value=None, date_format=None, round=None, output_cols=None) optimus.helpers.types.DataFrameType[source]

Return the number of days between two dates.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • value

  • date_format

  • round

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

div(cols='*', output_col=None) optimus.helpers.types.DataFrameType[source]

Divide two or more columns.

Parameters
  • cols – ‘*’, list of columns names or a single column name

  • output_col – Single output column in case no value is passed

Returns

Dataframe with the result of the arithmetic operation appended.

domain(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Returns the domain string from a url. From https://www.hi-optimus.com it returns hi-optimus.com.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

double_metaphone(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

The Double Metaphone phonetic encoding algorithm is the second generation of this algorithm. It is called “Double” because it can return both a primary and a secondary code for a string; this accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

drop(cols=None, regex=None, data_type=None) optimus.helpers.types.DataFrameType[source]

Drop a list of columns.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • regex – Regex expression to select the columns

  • data_type

Returns

duplicate(cols='*', output_cols=None, columns=None) optimus.helpers.types.DataFrameType[source]

Alias of copy function.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

  • columns – List of (source, copy) tuples, e.g. [(‘column1’, ‘column_copy’), (‘column1’, ‘column1_copy’)].

Returns

email_domain(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Return the domain from an email address. From optimus@mail.col it will return ‘mail’.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

email_username(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Return the username from an email address. From optimus@mail.col it will return ‘optimus’.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

exec_agg(exprs, compute=True)[source]

Execute one or more aggregation functions.

Parameters
  • exprs

  • compute – Compute the result or return a delayed function.

Returns

exp(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Return Euler’s number, e (~2.718) raised to the power of each value in a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing e raised to the power of each element.

expand_contracted_words(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Expand contracted words.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

extract(cols='*', regex=None, output_cols=None) optimus.helpers.types.DataFrameType[source]

Extract a string that matches a regular expression.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • regex – Regular expression

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

fill_na(cols='*', value=None, output_cols=None) optimus.helpers.types.DataFrameType[source]

Replace null data with a specified value.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • value – value to replace the nan/None values

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Returns the column filled with given value.

fingerprint(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Create the fingerprint for a column

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

floor(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Round each number in a column down to the nearest integer.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the floor of each element.

static format_agg(exprs)[source]
Parameters

exprs

Returns

format_date(cols='*', current_format=None, output_format=None, output_cols=None) optimus.helpers.types.DataFrameType[source]

Convert the dates in a column from one format to another.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • current_format

  • output_format

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

frequency(cols='*', n=32, percentage=False, total_rows=None, count_uniques=False, compute=True, tidy=False) dict[source]

Return the count of every element in the column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • n – Number of bins to be returned.

  • percentage – If True, calculate the percentage of each element.

  • total_rows – If True, return the total count.

  • count_uniques – If True, return the number of unique elements.

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, returns a single value when one column is processed, or each column name with its value otherwise. If False, returns the function name, the column name and the value.

Returns

Dict with the count of every element in the column.
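
The counting that frequency describes can be sketched in plain Python with collections.Counter. This is only an illustration of the idea: the helper name top_frequencies and the shape of the returned entries are assumptions, not the structure Optimus returns.

```python
from collections import Counter

def top_frequencies(values, n=32, percentage=False):
    """Return the n most common values with their counts."""
    counts = Counter(values)
    total = len(values)
    result = []
    for value, count in counts.most_common(n):
        entry = {"value": value, "count": count}
        if percentage:
            # Share of all rows taken by this value, in percent.
            entry["percentage"] = round(100 * count / total, 2)
        result.append(entry)
    return result

colors = ["red", "blue", "red", "green", "red", "blue"]
print(top_frequencies(colors, n=2, percentage=True))
```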

get(cols='*', keys=None, output_cols=None) optimus.helpers.types.DataFrameType[source]

Return items from a dict over requested columns.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • keys – The value of the dict key that will be returned.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the value of the key selected.

groupby(by, agg) optimus.helpers.types.DataFrameType[source]

Helper function for managing column names in the aggregation output. It also handles column ordering, since Dask may order columns differently.

Parameters
  • by – Column name.

  • agg – List of tuples with the form [(“agg”, “col”)]

Returns

heatmap(col_x, col_y, bins_x=10, bins_y=10, compute=True) dict[source]
Parameters
  • col_x

  • col_y

  • bins_x

  • bins_y

  • compute

Returns

hist(cols='*', buckets=32, compute=True) dict[source]

Return the histogram representation of the distribution of the data.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • buckets – Number of histogram bins to be used.

  • compute

Returns
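
To show what a bucketed histogram contains, here is a small plain-Python sketch that splits the value range into equal-width buckets and counts the values in each. The helper name histogram and the lower/upper/count layout of each bucket are assumptions for illustration; the dict that hist actually returns may be organized differently.

```python
def histogram(values, buckets=32):
    """Count values in `buckets` equal-width intervals between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / buckets or 1  # avoid zero width when all values are equal
    counts = [0] * buckets
    for v in values:
        # The maximum value falls into the last bucket instead of a new one.
        idx = min(int((v - lo) / width), buckets - 1)
        counts[idx] += 1
    return [
        {"lower": lo + i * width, "upper": lo + (i + 1) * width, "count": c}
        for i, c in enumerate(counts)
    ]

for bucket in histogram([1, 2, 2, 3, 4, 5, 9], buckets=4):
    print(bucket)
```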

host(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Returns the host string from a url.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

hour(cols='*', format: Optional[str] = None, output_cols=None) optimus.helpers.types.DataFrameType[source]

Get the hour from a date in a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • format – String format

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

hours_between(cols='*', value=None, date_format=None, round=None, output_cols=None) optimus.helpers.types.DataFrameType[source]

Return the number of hours between two dates.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • value

  • date_format

  • round

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

impute(cols='*', data_type='auto', strategy='auto', fill_value=None, output_cols=None)[source]

Fill null values using a constant or any of the available strategies.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • data_type

    • If “auto”, detect if it’s continuous or categorical using the data type of the column.

    • If “continuous”, sets the data as continuous and if no ‘strategy’ is passed then the mean is used.

    • If “categorical”, sets the data as categorical and if no ‘strategy’ is passed then the most frequent value is used.

  • strategy

    • If “auto”, automatically selects a strategy depending on the data type passed or inferred on ‘data_type’.

    • If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.

    • If “median”, then replace missing values using the median along each column. Can only be used with numeric data.

    • If “most_frequent”, then replace missing using the most frequent value along each column. Can be used with strings or numeric data.

    • If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.

  • fill_value – constant to be used to fill null values

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Return the Column filled with the imputed values.
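
The strategies listed above can be sketched with the standard statistics module. This is a plain-Python illustration over a list, with None standing in for a missing value; it is not Optimus's implementation, and it leaves out the “auto” detection of data type and strategy.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean", fill_value=None):
    """Replace None entries according to the chosen strategy."""
    present = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(present)
    elif strategy == "median":
        fill = median(present)
    elif strategy == "most_frequent":
        fill = mode(present)
    elif strategy == "constant":
        fill = fill_value
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

print(impute([1.0, None, 3.0, 8.0], strategy="mean"))
print(impute(["a", None, "b", "a"], strategy="most_frequent"))
print(impute([1, None, 3], strategy="constant", fill_value=0))
```

As in the parameter description, the mean and median strategies only make sense for numeric data, while most_frequent and constant also work for strings.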

abstract static index_to_string(cols=None, output_cols=None) optimus.helpers.types.DataFrameType[source]

Maps a column of label indices back to a column containing the original labels as strings.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

infer_data_types(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]
Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols

Returns

infer_date_formats(cols='*', sample=200, tidy=True) dict[source]

Infer date formats in a dataframe from a sample. This function uses Pandas no matter which engine you are using.

Parameters

cols – Columns in which you want to infer the datatype.

Returns

dict with the column and the inferred date format

infer_type(cols='*', sample=200, tidy=True) dict[source]

Infer data types in a dataframe from a sample. First it identifies the data type of every value in every cell. Then it takes all the values and applies some heuristics to better identify the data type. This function uses Pandas no matter which engine you are using.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • sample

  • tidy – The result format. If True it will return a value if you process a column or column name and value if not. If False it will return the functions name, the column name and the value.

Returns

dict with the column and the inferred data type.

inferred_data_type(cols='*', use_internal=False, tidy=True)[source]

Get the inferred data types from the meta data.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • use_internal – If no inferred data type is found, return a translated internal data type instead of None.

  • tidy – The result format. If True, returns a single value when one column is processed, or each column name with its value otherwise. If False, returns the function name, the column name and the value.

Returns

Python dictionary with column names and their data types.

iqr(cols='*', more=None, relative_error=10000, estimate=True)[source]

Return the column Inter Quartile Range value.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • more – Return info about q1 and q3

  • relative_error

Returns

Return the column Inter Quartile Range value.
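
The interquartile range is the distance between the third and first quartiles, q3 - q1. A minimal sketch with the standard statistics module follows. The quantile method (“inclusive”) and the dict returned when more is set are assumptions for the example; Optimus may estimate the quartiles differently, as the relative_error and estimate parameters suggest.

```python
from statistics import quantiles

def iqr(values, more=False):
    """Return q3 - q1, optionally together with both quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    result = q3 - q1
    if more:
        return {"iqr": result, "q1": q1, "q3": q3}
    return result

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(iqr(data))             # 4.0
print(iqr(data, more=True))  # {'iqr': 4.0, 'q1': 3.0, 'q3': 7.0}
```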

item(cols='*', n=None, output_cols=None) optimus.helpers.types.DataFrameType[source]

Return items from a list over requested columns.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • n – The position of the element that will be returned.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the value of the item selected.

join(df_right: optimus.helpers.types.DataFrameType, how='left', on=None, left_on=None, right_on=None, key_middle=False) optimus.helpers.types.DataFrameType[source]

Join two dataframes using a column.

Parameters
  • df_right – The dataframe that will be used to join the actual dataframe.

  • how – {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘left’

  • on – The column that will be used to join the two dataframes.

  • left_on – The column in the current dataframe that will be used to make the join.

  • right_on – The column in the given dataframe that will be used to make the join.

  • key_middle – Order the columns so that the left dataframe's columns come before the key column and the right dataframe's columns come after it.

Returns

Dataframe

keep(cols=None, regex=None) optimus.helpers.types.DataFrameType[source]

Keep only the given columns and drop the rest.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • regex – Regex expression to select the columns

Returns

kurtosis(cols='*', tidy=True, compute=True)[source]

Returns the kurtosis of the values over the requested columns.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • tidy – The result format. If True, returns a single value when one column is processed, or each column name with its value otherwise. If False, returns the function name, the column name and the value.

  • compute – Compute the final result. If False, a delayed object is returned.

Returns

Returns the kurtosis of the values over the requested columns.

left(cols='*', n=None, output_cols=None) optimus.helpers.types.DataFrameType[source]

Get the substring from the first character to the nth, counting from the left.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • n – Number of characters to get, starting from position 0.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

lemmatize_verbs(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Find the lemma of each word depending on its meaning and context.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

len(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Return the length of every string in a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

levenshtein(cols='*', other_cols=None, value=None, output_cols=None)[source]

Calculate the Levenshtein distance to a specified column. The Levenshtein distance is a string metric for measuring the difference between two sequences.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • other_cols

  • value

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns
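
For reference, the Levenshtein distance between two strings is the minimum number of single-character insertions, deletions and substitutions needed to turn one into the other. The sketch below is the textbook dynamic-programming version in plain Python, shown to make the metric concrete; it is not necessarily how Optimus computes it.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("flaw", "lawn"))       # 2
```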

ln(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Return the natural logarithm of each value in a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the natural logarithm of each element.

log(cols='*', base=10, output_cols=None) optimus.helpers.types.DataFrameType[source]

Return the logarithm of each value in a column, using the given base (10 by default).

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • base

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the logarithm of each element in the given base.

lower(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Lowercase the specified columns.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

BaseDataFrame

mad(cols='*', relative_error=10000, more=False, estimate=True, tidy=True, compute=True)[source]
Parameters
  • cols – “*”, column name or list of column names to be processed.

  • relative_error

  • more

  • estimate

  • tidy – The result format. If True, returns a single value when one column is processed, or each column name with its value otherwise. If False, returns the function name, the column name and the value.

  • compute – Compute the result or return a delayed function.

match_rating_codex(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

The match rating approach (MRA) is a phonetic algorithm developed by Western Airlines in 1977 for the indexation and comparison of homophonous names.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

max(cols='*', numeric=None, tidy: bool = True, compute: bool = True)[source]

Return the maximum value of one column or of each column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • numeric – if True, cast to numeric before processing.

  • tidy – The result format. If True, returns a single value when one column is processed, or each column name with its value otherwise. If False, returns the function name, the column name and the value.

  • compute – Compute the final result. If False, a delayed object is returned.

Returns

max_abs_scaler(cols='*', output_cols=None)[source]

Scale each feature by its maximum absolute value.

Parameters
  • cols – ‘*’, list of columns names or a single column name

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

mean(cols='*', tidy=True, compute=True)[source]

Return the mean of the values over the requested columns.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • tidy – The result format. If True, returns a single value when one column is processed, or each column name with its value otherwise. If False, returns the function name, the column name and the value.

  • compute – Compute the final result. If False, a delayed object is returned.

Returns

The mean of the values over the requested columns.

median(cols='*', relative_error=10000, tidy=True, compute=True)[source]

Returns the median of the values over the requested columns.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • relative_error

  • tidy – The result format. If True, returns a single value when one column is processed, or each column name with its value otherwise. If False, returns the function name, the column name and the value.

  • compute

Returns

Returns the median of the values over the requested columns.

metaphone(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Apply the Metaphone algorithm to a specified column. Metaphone is a phonetic algorithm, published by Lawrence Philips in 1990, for indexing words by their English pronunciation.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

mid(cols='*', start=0, n=1, output_cols=None) optimus.helpers.types.DataFrameType[source]

Get the substring of n characters beginning at position start.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • start

  • n

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

min(cols='*', numeric=None, tidy: bool = True, compute: bool = True)[source]

Return the minimum value of one column or of each column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • numeric – if True, cast to numeric before processing.

  • tidy – The result format. If True, returns a single value when one column is processed, or each column name with its value otherwise. If False, returns the function name, the column name and the value.

  • compute – Compute the final result. If False, a delayed object is returned.

Returns

min_max_scaler(cols='*', output_cols=None)[source]

Transform features by scaling each feature to a given range.

Parameters
  • cols – ‘*’, list of columns names or a single column name

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns
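
Min-max scaling maps each value x to (x - min) / (max - min), optionally stretched to a target range. The plain-Python sketch below shows the arithmetic. The helper name min_max_scale, the feature_range argument and the default range of 0 to 1 are assumptions for illustration and are not part of the signature documented above.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Linearly rescale values so min maps to the low end and max to the high end."""
    lo, hi = min(values), max(values)
    new_lo, new_hi = feature_range
    span = hi - lo
    if span == 0:
        # A constant column has no spread; map everything to the low end.
        return [new_lo for _ in values]
    return [new_lo + (v - lo) * (new_hi - new_lo) / span for v in values]

print(min_max_scale([10, 20, 30]))             # [0.0, 0.5, 1.0]
print(min_max_scale([10, 20, 30], (-1.0, 1.0)))  # [-1.0, 0.0, 1.0]
```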

minute(cols='*', format: Optional[str] = None, output_cols=None) optimus.helpers.types.DataFrameType[source]

Get the minutes from a date in a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • format – String format

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

minutes_between(cols='*', value=None, date_format=None, round=None, output_cols=None) optimus.helpers.types.DataFrameType[source]

Return the number of minutes between two dates.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • value

  • date_format

  • round

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

mod(cols='*', divisor=2, output_cols=None) optimus.helpers.types.DataFrameType[source]

Return the modulo of each value in a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • divisor

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the modulo of each element.

mode(cols='*', tidy: bool = True, compute: bool = True)[source]

Return the mode of the values over the requested columns.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • tidy – The result format. If True, returns a single value when one column is processed, or each column name with its value otherwise. If False, returns the function name, the column name and the value.

  • compute – Compute the final result. If False, a delayed object is returned.

Returns

modified_z_score(cols='*', estimate=True, output_cols=None) optimus.helpers.types.DataFrameType[source]

Returns the modified z-score of the given columns.

Parameters
  • cols – ‘*’, list of columns names or a single column name

  • estimate

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Returns the modified z-score of the given columns.
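
The modified z-score is commonly defined as 0.6745 * (x - median) / MAD, where MAD is the median of the absolute deviations from the median. The sketch below implements that common definition in plain Python; whether Optimus uses exactly this constant and handles a zero MAD the same way is an assumption.

```python
from statistics import median

def modified_z_scores(values):
    """Return 0.6745 * (x - median) / MAD for every x in values."""
    med = median(values)
    # Median absolute deviation: median distance of the values from their median.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; the modified z-score is undefined")
    return [0.6745 * (v - med) / mad for v in values]

scores = modified_z_scores([1, 2, 3, 4, 100])
print(scores)  # the last value stands out with a score above 65
```

Because it is built on medians rather than the mean and standard deviation, the score of the outlier is not dampened by the outlier itself.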

month(cols='*', format: Optional[str] = None, output_cols=None) optimus.helpers.types.DataFrameType[source]

Get the month from a date in a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • format – String format

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

months_between(cols='*', value=None, date_format=None, round=None, output_cols=None) optimus.helpers.types.DataFrameType[source]

Return the number of months between two dates.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • value

  • date_format

  • round

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

move(column, position, ref_col=None) optimus.helpers.types.DataFrameType[source]

Move a column to a specific position.

Parameters
  • column – Column(s) to be moved

  • position – Column new position. Accepts ‘after’, ‘before’, ‘beginning’, ‘end’ or a numeric value, relative to ‘ref_col’.

  • ref_col – Column taken as reference

Returns

DataFrame

mul(cols='*', output_col=None) optimus.helpers.types.DataFrameType[source]

Multiply two or more columns.

Parameters
  • cols – ‘*’, list of columns names or a single column name

  • output_col – Single output column in case no value is passed

Returns

Dataframe with the result of the arithmetic operation appended.

names(cols='*', data_types=None, invert=False, is_regex=None) list[source]

Return the names of the columns.

Parameters
  • cols – Regex, “*” or columns to get.

  • data_types – returns only columns with matching data types

  • invert – invert column selection

  • is_regex – if True, forces cols to be treated as a regex

Returns

abstract static nest(cols, separator='', output_col=None, drop=True, shape='string') optimus.helpers.types.DataFrameType[source]

Concatenate two or more columns into one.

Parameters
  • cols – ‘*’, list of columns names or a single column name

  • separator

  • output_col – Column name or list of column names where the transformed data will be saved.

  • drop

  • shape

Returns

Columns with all the specified columns concatenated.

ngram_fingerprint(cols='*', n_size=2, output_cols=None) optimus.helpers.types.DataFrameType[source]

Calculate the ngram for a fingerprinted string.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • n_size – The ngram size.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

ngrams(cols='*', n_size=2, output_cols=None) optimus.helpers.types.DataFrameType[source]

Calculate the ngram for a fingerprinted string.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • n_size – The ngram size.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

normalize_chars(cols='*', output_cols=None)[source]

Remove diacritics from a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.
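
One common way to remove diacritics is Unicode decomposition followed by dropping the combining marks. The sketch below uses only the standard library; it shows the general technique and is not necessarily how Optimus implements it. The helper name strip_diacritics is hypothetical.

```python
import unicodedata

def strip_diacritics(text):
    # NFKD splits accented characters into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep everything else.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("café señor"))  # cafe senor
```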

Returns

normalize_spaces(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Remove extra whitespace between words and trim whitespace from the beginning and the end of each string.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

num_to_words(cols='*', language='en', output_cols=None) optimus.helpers.types.DataFrameType[source]

Convert numbers to their string representation.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • language

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column with number converted to its string representation.

nysiis(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Apply the NYSIIS (New York State Identification and Intelligence System) algorithm to a specified column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

one_hot_encode(cols='*', prefix=None, drop=True, **kwargs) optimus.helpers.types.DataFrameType[source]

Maps a categorical column to multiple binary columns, with at most a single one-value.

Parameters
  • cols – Columns to be encoded.

  • prefix – Prefix of the columns where the output is going to be saved.

  • drop

Returns

Dataframe with encoded columns.
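
To make the encoding concrete, here is a small plain-Python sketch of one-hot encoding. The helper name one_hot and the "prefix_category" naming of the output columns are assumptions for illustration; the real output column names may differ.

```python
def one_hot(values, prefix):
    """Return one binary column per distinct category."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }

encoded = one_hot(["red", "blue", "red"], prefix="color")
print(encoded)  # {'color_blue': [0, 1, 0], 'color_red': [1, 0, 1]}
```

Each row has exactly one 1 across the generated columns, matching the "at most a single one-value" description.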

pad(cols='*', width=0, fill_char='0', side='left', output_cols=None) optimus.helpers.types.DataFrameType[source]

Fill a string to match the given string length.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • width – Total length of the string.

  • fill_char – The char that will be used to fill the string.

  • side – Fill the left or the right side.

  • output_cols – Column name or list of column names where the transformed data will be saved.
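
The parameters map closely onto Python's own string padding methods. The sketch below assumes that side='left' means the fill characters are added on the left; the helper name pad_text is hypothetical.

```python
def pad_text(text, width=0, fill_char="0", side="left"):
    # str.rjust pads on the left, str.ljust pads on the right.
    if side == "left":
        return text.rjust(width, fill_char)
    return text.ljust(width, fill_char)

print(pad_text("42", width=5))                             # 00042
print(pad_text("42", width=5, fill_char="*", side="right"))  # 42***
```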

Returns

parse_inferred_types(col_data_type)[source]

Parse an engine-specific column data type to a profiler data type.

Parameters

col_data_type – Engine column specific data.

Returns

Dict

pattern(cols='*', output_cols=None, mode=0) optimus.helpers.types.DataFrameType[source]

Replace alphanumeric and punctuation characters with canned characters, to help find string patterns.

  • c = Any alpha char in lower or upper case

  • l = Any alpha char in lower case

  • U = Any alpha char in upper case

  • * = Any alphanumeric in lower or upper case. Used only in modes 2 and 3

  • # = Any numeric

  • ! = Any punctuation

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

  • mode – One of:
      0: Identify lower, upper, digits. Except spaces and special chars.
      1: Identify chars, digits. Except spaces and special chars.
      2: Identify any alphanumeric. Except spaces and special chars.
      3: Identify alphanumeric and special chars. Except white spaces.
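
The modes can be read as a character-by-character mapping. The following is a minimal sketch of that reading in plain Python, not the Optimus implementation. The helper name to_pattern is hypothetical, and mapping digits to * in modes 2 and 3 is an assumption drawn from the legend.

```python
import string

def to_pattern(text, mode=0):
    out = []
    for ch in text:
        if ch.isalpha():
            if mode == 0:
                out.append("l" if ch.islower() else "U")
            elif mode == 1:
                out.append("c")
            else:
                out.append("*")
        elif ch.isdigit():
            out.append("#" if mode in (0, 1) else "*")
        elif mode == 3 and ch in string.punctuation:
            out.append("!")
        else:
            out.append(ch)  # spaces and untouched characters pass through
    return "".join(out)

for mode in range(4):
    print(mode, to_pattern("Ab-12", mode))
# 0 Ul-##
# 1 cc-##
# 2 **-**
# 3 **!**
```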

pattern_counts(cols='*', n=10, mode=0, flush=False) dict[source]

Get how many equal patterns there are in a column. Triggers the operation only if necessary.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • n – Top n matches

  • mode

  • flush – Flushes the cache to process again

Returns

percentile(cols='*', values=None, relative_error=10000, estimate=True, tidy=True, compute=True)[source]

Return values at the given percentile over requested column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • values – Percentile values to calculate, e.g. 0.25, 0.5, 0.75.

  • relative_error

  • tidy – The result format. If True, returns a single value when one column is processed, or column names with their values otherwise. If False, returns the function name, the column name and the value.

  • compute – Compute the final result. If False, a delayed object is returned.

Returns

Return values at the given percentile over requested column.
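
For reference, this is what the 0.25, 0.5 and 0.75 percentiles look like on a tiny sample, computed exactly with the Python standard library. It is only an illustration of the quantity; Optimus may estimate percentiles (see relative_error and estimate), so values on large data can differ.

```python
import statistics

data = [1, 2, 3, 4, 5]
# n=4 splits the data into quartiles; "inclusive" treats data as the full population.
quartiles = statistics.quantiles(data, n=4, method="inclusive")
print(quartiles)  # [2.0, 3.0, 4.0]
```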

port(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Return the port string from a url.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

pos(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

A part-of-speech tagger, or POS-tagger, processes a sequence of words and attaches a part-of-speech tag to each word.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

pow(cols='*', power=2, output_cols=None) optimus.helpers.types.DataFrameType[source]

Return the power of each value in a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • power

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the power of each element.

profile(cols='*', bins: int = 32, flush: bool = False) dict[source]

Returns the profile of selected columns.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • bins – Number of buckets.

  • flush – Flushes the cache of the whole profile to process it again.

Returns

Returns the profile of selected columns.

qcut(cols='*', quantiles=None, output_cols=None)[source]

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • quantiles

  • output_cols

Returns

quality(cols='*', flush=False, compute=True) dict[source]

Return the data quality in the format {‘col_name’: {‘mismatch’: 0, ‘missing’: 9, ‘match’: 0, ‘inferred_data_type’: ‘object’}}

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • flush

  • compute

Returns

dict in the format {‘col_name’: {‘mismatch’: 0, ‘missing’: 9, ‘match’: 0, ‘inferred_data_type’: ‘object’}}

range(cols='*', tidy: bool = True, compute: bool = True)[source]

Return the minimum and maximum of the values over the requested columns.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • tidy – The result format. If True, returns a single value when one column is processed, or column names with their values otherwise. If False, returns the function name, the column name and the value.

  • compute – Compute the final result. If False, a delayed object is returned.

Returns

rdiv(cols='*', output_col=None) optimus.helpers.types.DataFrameType[source]

Divide two or more columns.

Parameters
  • cols – ‘*’, list of columns names or a single column name

  • output_col – Single output column in case no value is passed

Returns

Dataframe with the result of the arithmetic operation appended.

reciprocal(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Return the reciprocal (1/x) of each value in a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the reciprocal of each element.

remove(cols='*', search=None, search_by='chars', output_cols=None) optimus.helpers.types.DataFrameType[source]

Remove values from a string in a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • search

  • search_by – Search by ‘chars’,

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

remove_numbers(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Remove numbers from a string in a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

remove_special_chars(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Remove special chars from a string in a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

remove_stopwords(cols='*', language='english', output_cols=None) optimus.helpers.types.DataFrameType[source]

Remove stopwords from each string in a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • language – specify the stopwords language

  • output_cols – Column name or list of column names where the transformed data will be saved.
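
A minimal sketch of stopword removal on a single string. The tiny stopword set below is a hypothetical stand-in; a real stopword list for the chosen language would be much larger, and how Optimus tokenizes text is not specified here.

```python
# Hypothetical, deliberately tiny stopword list for the example.
STOPWORDS = {"the", "a", "of", "and"}

def remove_stopwords(text, stopwords=STOPWORDS):
    # Split on whitespace and drop words found in the stopword set (case-insensitive).
    return " ".join(w for w in text.split() if w.lower() not in stopwords)

print(remove_stopwords("The rise of the machines"))  # rise machines
```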

Returns

remove_urls(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Remove URLs from one or more columns.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

remove_white_spaces(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Remove all white spaces from string in a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

rename(cols: Union[str, list, dict] = '*', names: Optional[Union[str, list]] = None, func=None) optimus.helpers.types.DataFrameType[source]

Change the name of one or more columns in a DataFrame.

Parameters
  • cols – string, dictionary or list of strings or tuples. Each tuple may have following form: (oldColumnName, newColumnName).

  • names – string or list of strings with new names of columns. Ignored if a dictionary or list of tuples is passed to cols.

  • func – can be lower, upper or any string transformation function.

Returns

Dataframe with columns names replaced.
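
The accepted argument shapes can be summarized as a mapping from old to new names. The sketch below builds that mapping in plain Python. The helper name build_rename_map is hypothetical, and giving func precedence over names is an assumption, not documented behavior.

```python
def build_rename_map(columns, cols="*", names=None, func=None):
    """Return {old_name: new_name} for the accepted argument shapes."""
    if isinstance(cols, dict):
        return dict(cols)
    if isinstance(cols, list) and cols and all(isinstance(c, tuple) for c in cols):
        return dict(cols)  # list of (old, new) tuples
    if isinstance(cols, str):
        selected = list(columns) if cols == "*" else [cols]
    else:
        selected = list(cols)
    if func is not None:
        return {old: func(old) for old in selected}
    if names is None:
        return {}
    new_names = [names] if isinstance(names, str) else list(names)
    return dict(zip(selected, new_names))

columns = ["A", "B"]
print(build_rename_map(columns, {"A": "a"}))                # {'A': 'a'}
print(build_rename_map(columns, [("A", "x"), ("B", "y")]))  # {'A': 'x', 'B': 'y'}
print(build_rename_map(columns, "A", names="z"))            # {'A': 'z'}
print(build_rename_map(columns, "*", func=str.lower))       # {'A': 'a', 'B': 'b'}
```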

replace(cols='*', search=None, replace_by=None, search_by=None, ignore_case=False, output_cols=None) optimus.helpers.types.DataFrameType[source]

Replace a value or a list of values with a specified string.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • search – Values to look at to be replaced

  • replace_by – New value to replace the old one. Supports an array when searching by characters.

  • search_by – Can be “full”, “words”, “chars” or “values”.

  • ignore_case – Ignore case when searching for match

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

DataFrame
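
As an illustration of a character-level replacement with the ignore_case option, here is a sketch on a single string using the standard re module. It is not the Optimus implementation, and the helper name replace_chars is hypothetical.

```python
import re

def replace_chars(text, search, replace_by, ignore_case=False):
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape treats `search` literally; the lambda keeps `replace_by` literal too.
    return re.sub(re.escape(search), lambda _match: replace_by, text, flags=flags)

print(replace_chars("Foo foo", "foo", "bar"))                    # Foo bar
print(replace_chars("Foo foo", "foo", "bar", ignore_case=True))  # bar bar
```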

replace_regex(cols='*', search=None, replace_by=None, search_by=None, ignore_case=False, output_cols=None) optimus.helpers.types.DataFrameType[source]

Replace a value or a list of values using a specified regex.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • search – Values to look at to be replaced

  • replace_by – New value to replace the old one. Supports an array when searching by characters.

  • search_by – Can be “full”, “words”, “chars” or “values”.

  • ignore_case – Ignore case when searching for match

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

abstract static reverse(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Reverse the order of the characters of each string in a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

right(cols='*', n=None, output_cols=None) optimus.helpers.types.DataFrameType[source]

Get the substring from the last character to n.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • n

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

round(cols='*', decimals=0, output_cols=None) optimus.helpers.types.DataFrameType[source]

Round a DataFrame to a variable number of decimal places.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • decimals – The number of decimal places to round to.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the round of each element.

schema_data_type(cols='*', tidy=True)[source]

Return the column(s) data type as Type.

Parameters
  • cols – Columns to be processed

  • tidy – The result format. If True, returns a single value when one column is processed, or column names with their values otherwise.

Returns

second(cols='*', format: Optional[str] = None, output_cols=None) optimus.helpers.types.DataFrameType[source]

Get the seconds from a date in a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • format

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

seconds_between(cols='*', value=None, date_format=None, round=None, output_cols=None) optimus.helpers.types.DataFrameType[source]

Return the number of seconds between two dates.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • value

  • date_format

  • round

  • output_cols – Column name or list of column names where the transformed data will be saved.
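
A small sketch of the underlying computation using the standard datetime module. The helper name, the default date format and the sign convention (end minus start) are assumptions for the example; how value is combined with the column is not specified here.

```python
from datetime import datetime

def seconds_between(start, end, date_format="%Y-%m-%d %H:%M:%S"):
    # Parse both strings with the same format, then subtract.
    delta = datetime.strptime(end, date_format) - datetime.strptime(start, date_format)
    return delta.total_seconds()

print(seconds_between("2021-01-01 00:00:00", "2021-01-01 00:01:30"))  # 90.0
```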

Returns

select(cols='*', regex=None, data_type=None, invert=False, accepts_missing_cols=False) optimus.helpers.types.DataFrameType[source]

Select columns using index, column name, regex or data type.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • regex – Regular expression to filter the columns

  • data_type – Data type to be filtered for

  • invert – Invert the selection

  • accepts_missing_cols

Returns

set(cols='*', value_func=None, where: Optional[Union[str, optimus.helpers.types.MaskDataFrameType]] = None, args=None, default=None, eval_value: bool = False) optimus.helpers.types.DataFrameType[source]

Set a column value using a number, string or an expression.

Parameters
  • cols – Columns to set or create.

  • value_func – expression, function or value.

  • where – When the condition in ‘where’ is True, replace with ‘value_func’. Where False, replace with ‘default’ or keep the original value.

  • args – Argument when ‘value_func’ param is a function.

  • default – Value used for entries where ‘where’ is False.

  • eval_value – Parse ‘value_func’ param in case a string is passed.
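
The interaction of where and default can be sketched on plain lists. This only mirrors the rule stated above (value where the condition is True, otherwise default or the original value). The helper name set_where is hypothetical.

```python
def set_where(original, mask, value, default=None):
    """Apply `value` where mask is True; elsewhere use `default` or keep the original."""
    result = []
    for current, condition in zip(original, mask):
        if condition:
            result.append(value)
        elif default is not None:
            result.append(default)
        else:
            result.append(current)
    return result

prices = [5, 50, 500]
mask = [p > 10 for p in prices]
print(set_where(prices, mask, 0))              # [5, 0, 0]
print(set_where(prices, mask, 0, default=-1))  # [-1, 0, 0]
```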

Returns

set_data_type(cols: Union[str, list, dict] = '*', data_types: Optional[Union[str, list]] = None, inferred: bool = False) optimus.helpers.types.DataFrameType[source]

Set profiler data type.

Parameters
  • cols – A dict with the form {“col_name”: profiler datatype}, a list of columns or a single column.

  • data_types – If a string or a list passed to cols, uses this parameter to set the data types to those columns.

  • inferred – Whether it was inferred or not.

Returns

Dataframe with new data types in the meta data.

set_date_format(cols: Union[str, list, dict] = '*', date_formats: Optional[Union[str, list]] = None, inferred: bool = False) optimus.helpers.types.DataFrameType[source]

Set date format.

Parameters
  • cols – A dict with the form {“col_name”: “date format”}, a list of columns or a single column

  • date_formats – If a string or a list passed to cols, uses this parameter to set the date format to those columns.

  • inferred – Whether it was inferred or not.

Returns

sin(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Apply sine function to a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the sine of each element.

sinh(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Apply the hyperbolic sine function to a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the hyperbolic sine of each element.

skew(cols='*', tidy=True, compute=True)[source]

Return the skew of the values over the requested columns.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • tidy – The result format. If True, returns a single value when one column is processed, or column names with their values otherwise. If False, returns the function name, the column name and the value.

  • compute – Compute the final result. If False, a delayed object is returned.

Returns

Return the skew of the values over the requested columns.

slice(cols='*', start=None, stop=None, step=None, output_cols=None) optimus.helpers.types.DataFrameType[source]

Slice substrings from each element in a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • start – Start position for slice operation.

  • stop – Stop position for slice operation.

  • step – Step size for slice operation.

  • output_cols – Column name or list of column names where the transformed data will be saved.
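
The start, stop and step parameters behave like Python's own slice notation, assuming the same semantics are applied to each string. A plain-Python sketch with a hypothetical helper name:

```python
def slice_strings(values, start=None, stop=None, step=None):
    # Equivalent to value[start:stop:step] for every element.
    return [v[start:stop:step] for v in values]

print(slice_strings(["optimus", "prime"], start=1, stop=5, step=2))  # ['pi', 'rm']
print(slice_strings(["optimus", "prime"], stop=3))                   # ['opt', 'pri']
```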

Returns

sort(order: Union[str, list] = 'asc', cols=None) optimus.helpers.types.DataFrameType[source]

Sort one or multiple columns in asc or desc order.

Parameters
  • order – ‘asc’ or ‘desc’ accepted

  • cols

Returns

Dataframe with its columns sorted.

soundex(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Apply the Soundex algorithm to a specified column. Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • output_cols – Column name or list of column names where the transformed data will be saved.
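
For readers unfamiliar with the algorithm, the following is a compact sketch of standard American Soundex in plain Python. It illustrates the encoding only; Optimus may delegate to a library whose edge-case handling differs.

```python
# Consonant groups and their Soundex digits; vowels, h, w and y have no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            encoded += code
        previous = code  # a vowel resets the previous code
    return (encoded + "000")[:4]  # pad or truncate to four characters

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Ashcraft"))                   # A261
```

Names that sound alike, such as Robert and Rupert, receive the same code and can therefore be matched despite different spellings.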

Returns

sqrt(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Return the square root of each value in a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the square root of each element.

standard_scaler(cols='*', output_cols=None)[source]

Standardize features by removing the mean and scaling to unit variance.

Parameters
  • cols – ‘*’, list of columns names or a single column name

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

std(cols='*', tidy=True, compute=True)[source]

Return the sample standard deviation over requested columns.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • tidy – The result format. If True, returns a single value when one column is processed, or column names with their values otherwise.

  • compute – Compute the final result. If False, a delayed object is returned.

Returns

stem_verbs(cols='*', stemmer: str = 'porter', language: str = 'english', output_cols=None) optimus.helpers.types.DataFrameType[source]

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • stemmer – snowball, porter, lancaster

  • language

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

abstract static string_to_index(cols=None, output_cols=None) optimus.helpers.types.DataFrameType[source]

Encodes a string column of labels to a column of label indices.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

strip_html(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Remove HTML tags.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

sub(cols='*', output_col=None) optimus.helpers.types.DataFrameType[source]

Subtract two or more columns.

Parameters
  • cols – ‘*’, list of columns names or a single column name

  • output_col – Single output column in case no value is passed

Returns

Dataframe with the result of the arithmetic operation appended.

sub_domain(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Returns the subdomain string from a url. From https://www.hi-optimus.com it returns ‘www’.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

sum(cols='*', tidy=True, compute=True)[source]

Return the sum of the values over the requested column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • tidy – The result format. If True, returns a single value when one column is processed, or column names with their values otherwise. If False, returns the function name, the column name and the value.

  • compute – Compute the final result. If False, a delayed object is returned.

Returns

Column containing the sum of multiple columns.

tan(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Apply the tangent function to a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the tangent of each element.

tanh(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Apply the hyperbolic tangent function to a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the hyperbolic tangent of each element.

tf_idf(features) optimus.helpers.types.DataFrameType[source]

Parameters

features

Returns

time_between(cols='*', value=None, date_format=None, round=None, output_cols=None, func=None) optimus.helpers.types.DataFrameType[source]

Returns a TimeDelta of the units between two datetimes.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • value

  • date_format

  • round

  • output_cols – Column name or list of column names where the transformed data will be saved.

  • func – Custom function to pass to the apply, like self.F.days_between

Returns

title(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Capitalize the first word in a sentence.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

BaseDataFrame

to_boolean(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Cast the elements inside a column or a list of columns to boolean.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols

Returns

to_datetime(cols='*', format=None, output_cols=None, transform_format=True) optimus.helpers.types.DataFrameType[source]

Cast the elements inside a column or a list of columns to datetime.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • format

  • output_cols – Column name or list of column names where the transformed data will be saved.

  • transform_format

Returns

to_float(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Cast the elements inside a column or a list of columns to float.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

to_integer(cols='*', default=0, output_cols=None) optimus.helpers.types.DataFrameType[source]

Cast the elements inside a column or a list of columns to integer.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • default

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

to_string(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Cast the elements inside a column or a list of columns to string.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols

Returns

abstract static to_timestamp(cols, date_format=None, output_cols=None)[source]

Parameters
  • cols

  • date_format

  • output_cols

Returns

top_domain(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Returns the top domain string from a url. From ‘https://www.hi-optimus.com’ it returns ‘hi-optimus.com’.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

trim(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Remove leading and trailing characters.

Strip whitespaces (including newlines) or a set of specified characters from each string in the column from left and right sides.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

unique_values(cols='*', estimate=False, compute=True, tidy=True) list[source]

Return a list of unique values in a column.

Parameters
  • cols – ‘*’, list of columns names or a single column name.

  • estimate

  • compute – Compute the result or return a delayed function.

  • tidy – The result format. If True, returns a single value when one column is processed, or column names with their values otherwise. If False, returns the function name, the column name and the value.

unnest(cols='*', separator=None, splits=2, index=None, output_cols=None, drop=False, mode='string') optimus.helpers.types.DataFrameType[source]

Split the columns values (array or string) in different columns.

Parameters
  • cols – Columns to be un-nested

  • output_cols – Resulting one or multiple columns after the unnest operation, e.g. [(output_col_1_1, output_col_1_2), (output_col_2_1, output_col_2)]

  • separator – char or regex

  • splits – Number of columns splits.

  • index – Return a specific index per columns. [1,2]

  • drop

  • mode

unset_data_type(cols='*')[source]

Unset user set data type.

Parameters

cols – ‘*’, list of columns names or a single column name.

Returns

unset_date_format(cols='*')[source]

Unset user defined date format.

Parameters

cols – ‘*’, list of columns names or a single column name.

Returns

upper(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Uppercase the specified columns.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

BaseDataFrame

url_file(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Returns the file string from a url. From https://www.hi-optimus.com/index.html it returns ‘index.html’.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

url_fragment(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

url_path(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Returns the path string from a url.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

url_query(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Returns the query string from a url. From https://www.hi-optimus.com/?rollout=true it returns ‘rollout=true’.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

url_scheme(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Returns the scheme string from a url. From ‘https://www.hi-optimus.com’ it returns ‘https’.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.
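
For orientation, the URL components that the url_* and port methods refer to can be inspected with the standard library. The URL below extends the documentation's example host with a hypothetical port, file, query and fragment. This shows the components themselves, not how Optimus extracts them.

```python
from urllib.parse import urlsplit

# Hypothetical URL built on the host used elsewhere in this reference.
parts = urlsplit("https://www.hi-optimus.com:8080/index.html?rollout=true#top")

print(parts.scheme)    # https
print(parts.hostname)  # www.hi-optimus.com
print(parts.port)      # 8080
print(parts.path)      # /index.html
print(parts.query)     # rollout=true
print(parts.fragment)  # top
```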

Returns

var(cols='*', tidy=True, compute=True)[source]

Return unbiased variance over requested columns.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • tidy – The result format. If True, returns a single value when one column is processed, or column names with their values otherwise.

  • compute – Compute the final result. If False, a delayed object is returned.
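
"Unbiased" here refers to the sample variance, which divides the sum of squared deviations by n - 1 instead of n. A short worked example with the standard library, on made-up data:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5, squared deviations sum to 32

sample_var = statistics.variance(data)       # 32 / (8 - 1), the unbiased estimate
population_var = statistics.pvariance(data)  # 32 / 8

print(round(sample_var, 4))  # 4.5714
print(population_var)        # 4
```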

Returns

weekday(cols='*', format: Optional[str] = None, output_cols=None) optimus.helpers.types.DataFrameType[source]

Get the weekday from a date in a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • format

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

word_count(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Count the number of words in a paragraph.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

word_tokenize(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

year(cols='*', format: Optional[str] = None, output_cols=None) optimus.helpers.types.DataFrameType[source]

Get the year from a date in a column.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • format – String format

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

years_between(cols='*', value=None, date_format=None, round=None, output_cols=None) optimus.helpers.types.DataFrameType[source]

Return the number of years between two dates.

Parameters
  • cols – “*”, column name or list of column names to be processed.

  • value

  • date_format

  • round

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

z_score(cols='*', output_cols=None) optimus.helpers.types.DataFrameType[source]

Returns the z-score of the given columns.

Parameters
  • cols – ‘*’, list of columns names or a single column name

  • output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Dataframe with the z-score of the given columns appended.
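
The z-score of a value is its distance from the column mean measured in standard deviations, z = (x - mean) / std. The sketch below uses the sample standard deviation; whether Optimus uses the sample or the population form is an assumption here, and the helper name z_scores is hypothetical.

```python
import statistics

def z_scores(values):
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / std for v in values]

print(z_scores([1, 2, 3]))  # [-1.0, 0.0, 1.0]
```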