Columns

class optimus.engines.base.columns.BaseColumns(root: optimus.helpers.types.DataFrameType)[source]

Base class for all Cols implementations

abs(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Return the absolute numeric value of each value in a column.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the absolute value of each element.

acos(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Apply arccosine function to a column.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the arccosine of each element.

acosh(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Apply the arcus hyperbolic cosine function to a column.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the arcus hyperbolic cosine of each element.

add(cols='*', output_col=None) → optimus.helpers.types.DataFrameType[source]

Add the values of two or more columns.

Parameters

cols – ‘*’, list of columns names or a single column name.
output_col – Single output column in case no value is passed.

Returns

Dataframe with the result of the arithmetic operation appended.
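
As an illustration of the element-wise addition described above, here is a minimal sketch that uses plain Python lists as stand-ins for columns. The helper `add_columns` and the sample data are hypothetical and are not part of Optimus; how Optimus treats missing or non-numeric values is not covered here.

```python
def add_columns(columns):
    """Sum two or more equal-length columns element by element."""
    if len(columns) < 2:
        raise ValueError("need at least two columns")
    length = len(columns[0])
    if any(len(col) != length for col in columns):
        raise ValueError("all columns must have the same length")
    # zip(*columns) yields one tuple per row across all columns.
    return [sum(row) for row in zip(*columns)]


price = [10, 20, 30]  # hypothetical sample column
tax = [1, 2, 3]       # hypothetical sample column
total = add_columns([price, tax])
print(total)
```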

agg_exprs(cols='*', funcs=None, *args, compute=True, tidy=True, parallel=False)[source]

Run a list of aggregation functions.

Parameters

cols – Columns over which to apply the aggregation functions.
funcs – List of aggregation functions.
args –
compute – Compute the result or return a delayed function.
tidy – Compact the dict output.
parallel – Execute the function in every column or apply it over the whole dataframe.

Returns

The calculated values from a list of aggregation functions.

any_greater_than(cols='*', value=None, inverse=False, tidy=True, compute=True)[source]

Parameters

cols –
value –
inverse –
tidy –
compute –

Returns

abstract append(dfs: optimus.helpers.types.DataFrameTypeList) → optimus.helpers.types.DataFrameType[source]

Appends one or more columns or dataframes.

Parameters: dfs – DataFrame, list of dataframes or list of columns to append to the dataframe
Returns: DataFrame

apply(cols='*', func=None, func_return_type=None, args=None, func_type=None, where=None, filter_col_by_data_types=None, output_cols=None, skip_output_cols_processing=False, meta_action='apply_cols', mode='vectorized', set_index=False, default=None, **kwargs) → optimus.helpers.types.DataFrameType[source]

Parameters

cols – “*”, column name or list of column names to be processed.
func –
func_return_type –
args –
func_type –
where –
filter_col_by_data_types –
output_cols – Column name or list of column names where the transformed data will be saved.
skip_output_cols_processing –
meta_action –
mode –
set_index –
default –
kwargs –

Returns

apply_by_data_types(cols='*', func=None, args=None, data_type=None) → optimus.helpers.types.DataFrameType[source]

Apply a function using pandas udf or udf if apache arrow is not available.

Parameters

cols – “*”, column name or list of column names to be processed.
func – Function to be applied to a column.
args –
func – pandas_udf or udf. If ‘None’ try to use pandas udf (Pyarrow needed)
data_type –

Returns

asin(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Apply the arcsine function to a column.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the arcsine of each element.

asinh(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Apply the arcus hyperbolic sine function to a column.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the arcus hyperbolic sine of each element.

assign(cols: Optional[Union[str, list, dict]] = None, values=None, **kwargs)[source]

Assign new columns to a Dataframe.

Returns a DataFrame with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.

Parameters

cols – A dict with the form {“col_name”: “value”}, a list of columns or a single column
values – When no dict is passed to ‘cols’, uses this parameter to get the values.
kwargs –

Returns

abstract static astype(*args, **kwargs)[source]

Alias of the cast function, for compatibility with the pandas API.

Parameters

args –
kwargs –

Returns

atan(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Apply the arctangent function to a column.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the arctangent of each element.

atanh(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Apply the arcus hyperbolic tangent function to a column.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the arcus hyperbolic tangent of each element.

bag_of_words(features, analyzer='word', ngram_range=2) → optimus.helpers.types.DataFrameType[source]

Parameters

analyzer –
features –
ngram_range –

Returns

boxplot(cols='*') → dict[source]

Return the boxplot data in python dict format.

Parameters: cols – “*”, column name or list of column names to be processed.
Returns: dict with box plot data.
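
The kind of data a box plot needs can be sketched with the standard library. This is only an illustration: the helper `boxplot_stats`, the dictionary keys, and the 1.5 × IQR fence rule are assumptions, and the keys Optimus actually returns may differ.

```python
import statistics


def boxplot_stats(values):
    """Return box plot statistics for a list of numbers as a dict."""
    data = sorted(values)
    # quantiles(n=4) returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    # Tukey fences: points beyond 1.5 * IQR from the quartiles are outliers.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "fliers": [v for v in data if v < low_fence or v > high_fence],
    }


stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(stats)
```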

calculate_pattern_counts(cols='*', n=10, mode=0, flush=False) → optimus.helpers.types.DataFrameType[source]

Counts how many equal patterns there are in a column. Uses a cache to trigger the operation only if necessary. Saves the result to meta and returns the same dataframe.

Parameters

cols – “*”, column name or list of column names to be processed.
n – Return the Top n matches.
mode – Mode used to calculate the patterns.
flush – Flushes the cache to process again

Returns

capitalize(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Capitalize every word in a sentence.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

cast(cols=None, data_type=None, output_cols=None, *args, **kwargs) → optimus.helpers.types.DataFrameType[source]

NOTE: There are two ways to cast the data. The native .astype() is faster, but it cannot handle some transformations, such as string to number, where invalid values should output NaN.

Cast the elements inside a column or a list of columns to a specific data type. Unlike ‘cast’, this does not change the column’s data type.

Parameters

cols – Column names to be cast, or a dictionary or list of tuples of column names and types to be cast, with the following structure: cols = [(‘columnName1’, ‘integer’), (‘columnName2’, ‘float’), (‘columnName3’, ‘string’)]. The first element in each tuple is the column name; the second is the final data type of the column after the transformation.
output_cols – Column name or list of column names where the transformed data will be saved.
data_type – final data type
args – passed to cast function (df.cols.to_integer(…, -1)).
kwargs – passed to cast function (df.cols.to_integer(…, default=-1)).

Returns

The columns after casting.
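
The note above contrasts a fast native cast with a tolerant one that outputs NaN for values that cannot be converted. A minimal sketch of the tolerant behaviour in plain Python follows; the helper `to_float_or_nan` is hypothetical and is not the Optimus implementation.

```python
import math


def to_float_or_nan(value):
    """Convert a value to float, returning NaN when conversion fails."""
    try:
        return float(value)
    except (TypeError, ValueError):
        # A strict cast would raise here; the tolerant cast yields NaN.
        return math.nan


raw = ["1", "2.5", "abc", None]  # hypothetical sample column
casted = [to_float_or_nan(v) for v in raw]
print(casted)
```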

ceil(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Round each number in a column up to the nearest integer.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the ceil of each element.

clip(cols='*', lower_bound=None, upper_bound=None, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Assigns values outside boundary to boundary values.

Parameters

cols – “*”, column name or list of column names to be processed.
lower_bound – Minimum threshold value. All values below this threshold will be set to it. A missing threshold (e.g. NA) will not clip the value.
upper_bound – Maximum threshold value. All values above this threshold will be set to it. A missing threshold (e.g. NA) will not clip the value.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns
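
As an illustration of the clipping rule described for lower_bound and upper_bound, here is a minimal sketch over a plain Python list. The helper `clip_values` is hypothetical; it treats `None` as the missing threshold, which is an assumption about how a missing bound would be represented.

```python
def clip_values(values, lower_bound=None, upper_bound=None):
    """Set values outside the bounds to the nearest bound."""
    clipped = []
    for v in values:
        # A missing bound (None) leaves that side unclipped.
        if lower_bound is not None and v < lower_bound:
            v = lower_bound
        if upper_bound is not None and v > upper_bound:
            v = upper_bound
        clipped.append(v)
    return clipped


print(clip_values([-5, 0, 5, 10], lower_bound=0, upper_bound=5))
```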

concat(dfs: optimus.helpers.types.DataFrameTypeList) → optimus.helpers.types.DataFrameType[source]

Same as append.

Parameters: dfs – DataFrame, list of dataframes or list of columns to append to the dataframe
Returns: DataFrame

copy(cols='*', output_cols=None, columns=None) → optimus.helpers.types.DataFrameType[source]

Copy one or multiple columns.

Parameters

cols – Source column to be copied
output_cols – Column name or list of column names where the transformed data will be saved.
columns – List of tuples pairing each source column with the name of its copy, e.g. [(‘column1’, ‘column_copy’), (‘column1’, ‘column1_copy’)].

Returns

correlation(cols='*', method='pearson', compute=True, tidy=True)[source]

Compute pairwise correlation of columns, excluding NA/null values.

Parameters

cols – “*”, column name or list of column names to be processed.
method – Method of correlation:

pearson : standard correlation coefficient
kendall : Kendall Tau correlation coefficient
spearman : Spearman rank correlation
callable : callable with input two 1d ndarrays and returning a float. Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.

compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns
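
For reference, the pearson method on a single pair of columns can be sketched in plain Python. The helper `pearson` is hypothetical, and dropping any pair in which either value is `None` is an assumption about what "excluding NA/null values" means here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two columns, skipping pairs with a None."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    # A constant column has zero variance and raises ZeroDivisionError.
    return cov / math.sqrt(var_x * var_y)


print(pearson([1, 2, 3, None], [2, 4, 6, 100]))
```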

cos(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Apply cosine function to a column.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the cosine of each element.

cosh(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Apply the hyperbolic cosine function to a column.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the hyperbolic cosine of each element.

count() → int[source]

Returns the number of columns in the dataframe.

Returns: The number of columns in the dataframe.

count_array(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the number of lists in a column.

Parameters

cols – ‘*’, list of columns names or a single column name.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_between(cols='*', lower_bound=None, upper_bound=None, equal=True, bounds=None, inverse=False, tidy=True, compute=True)[source]

Count the number of elements between a lower and an upper bound in a given column.

Parameters

cols – ‘*’, list of columns names or a single column name.
lower_bound – Lower bound.
upper_bound – Upper bound.
equal –
bounds –
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.
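
The counting rule can be sketched over a plain Python list. The helper `count_between_sketch` is hypothetical. Since `equal` is undocumented above, the sketch assumes it makes both bounds inclusive, and it assumes `inverse` counts the elements that fall outside the range.

```python
def count_between_sketch(values, lower_bound, upper_bound, equal=True, inverse=False):
    """Count elements between two bounds (assumed semantics, see lead-in)."""
    def inside(v):
        if equal:
            return lower_bound <= v <= upper_bound
        return lower_bound < v < upper_bound

    matches = sum(1 for v in values if inside(v))
    # With inverse=True, count the elements that did not match instead.
    return len(values) - matches if inverse else matches


ages = [1, 2, 3, 4, 5]  # hypothetical sample column
print(count_between_sketch(ages, 2, 4))
```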

count_boolean(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the number of booleans in a column.

Parameters

cols – ‘*’, list of columns names or a single column name.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_containing(cols='*', value=None, inverse=False, tidy=True, compute=True)[source]

Parameters

cols – ‘*’, list of columns names or a single column name.
value – Value used to evaluate the function.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_data_type(cols='*', data_type=None, inverse=False, tidy=True, compute=True)[source]

Count the number of mismatch values in a given column.

Parameters

cols – ‘*’, list of columns names or a single column name.
data_type –
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_datetime(cols='*', inverse=False, tidy=True, compute=True)[source]

Parameters

cols –
inverse –
tidy –
compute –

Returns

count_duplicated(cols='*', keep='first', inverse=False, tidy=True, compute=True)[source]

Parameters

cols –
keep –
inverse –
tidy –
compute –

Returns

count_email(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the number of strings that look like an email address in a column.

Parameters

cols – ‘*’, list of columns names or a single column name.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_empty(cols='*', inverse=False, tidy=True, compute=True)[source]

Count the number of empty values in a given column.

Parameters

cols – ‘*’, list of columns names or a single column name.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_ending_with(cols='*', value=None, inverse=False, tidy=True, compute=True)[source]

Counts the number of elements that ends with the given string.

Parameters

cols – ‘*’, list of columns names or a single column name.
value – Value used to evaluate the function.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_equal(cols='*', value=None, inverse=False, tidy=True, compute=True)[source]

Count the number of elements equal to a value in given column.

Parameters

cols – ‘*’, list of columns names or a single column name.
value – Value used to evaluate the function.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_expression(value=None, inverse=False, tidy=True, compute=True)[source]

Parameters

cols – ‘*’, list of columns names or a single column name.
value – Value used to evaluate the function.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_float(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the number of floats in a column.

Parameters

cols – ‘*’, list of columns names or a single column name.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_gender(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the number of strings that look like a gender in a column.

Parameters

cols – ‘*’, list of columns names or a single column name.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_greater_than(cols='*', value=None, inverse=False, tidy=True, compute=True)[source]

Count the number of elements greater than a value in a given column.

Parameters

cols – ‘*’, list of columns names or a single column name.
value – Value used to evaluate the function.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_greater_than_equal(cols='*', value=None, inverse=False, compute=True, tidy=True)[source]

Count the number of elements greater than or equal to a value in given column.

Parameters

cols – ‘*’, list of columns names or a single column name.
value – Value used to evaluate the function.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_http_code(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the number of strings that look like an HTTP code in a column.

Parameters

cols – ‘*’, list of columns names or a single column name.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_int(cols='*', inverse=False, tidy=True, compute=True)[source]

Count the number of integers in a column.

Parameters

cols – ‘*’, list of columns names or a single column name.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

count_ip(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the number of strings that look like an IP address in a column.

Parameters

cols – ‘*’, list of columns names or a single column name.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_less_than(cols='*', value=None, inverse=False, tidy=True, compute=True)[source]

Count the number of elements smaller than a value in a given column.

Parameters

cols – ‘*’, list of columns names or a single column name.
value – Value used to evaluate the function.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_less_than_equal(cols='*', value=None, inverse=False, tidy=True, compute=True)[source]

Count the number of elements smaller than or equal to a value in given column.

Parameters

cols – ‘*’, list of columns names or a single column name.
value – Value used to evaluate the function.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_match(cols='*', regex=None, data_type=None, inverse=False, tidy=True, compute=True)[source]

Counts the number of matching values in a given column.

Parameters

cols – ‘*’, list of columns names or a single column name.
data_type –
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_match_pattern(cols='*', pattern=None, inverse=False, tidy=True, compute=True)[source]

Parameters

cols –
pattern –
inverse –
tidy –
compute –

Returns

count_mismatch(cols='*', data_type=None, inverse=False, tidy=True, compute=True)[source]

Count the number of mismatch values in a given column.

Parameters

cols – ‘*’, list of columns names or a single column name.
data_type –
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_missings(cols='*', inverse=False, tidy=True, compute=True)[source]

Count the number of missing values in given column.

Parameters

cols – ‘*’, list of columns names or a single column name.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_nan(cols='*', inverse=False, tidy=True, compute=True)[source]

Count the number of ‘nan’ values in a given column.

Parameters

cols – ‘*’, list of columns names or a single column name.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_none(cols='*', inverse=False, tidy=True, compute=True)[source]

Count the number of ‘None’ values in given column.

Parameters

cols – ‘*’, list of columns names or a single column name.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_not_equal(cols='*', value=None, inverse=False, tidy=True, compute=True)[source]

Count the number of elements not equal to a value in given column.

Parameters

cols – ‘*’, list of columns names or a single column name.
value – Value used to evaluate the function.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_nulls(cols='*', inverse=False, tidy=True, compute=True)[source]

Count the number of null values in a given column.

Parameters

cols – ‘*’, list of columns names or a single column name.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_numeric(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the numeric elements in a column.

Parameters

cols – ‘*’, list of columns names or a single column name.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_object(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the number of Python objects in a column.

Parameters

cols – ‘*’, list of columns names or a single column name.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_phone_number(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the number of strings that look like a phone number in a column.

Parameters

cols – ‘*’, list of columns names or a single column name.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_regex(cols='*', regex=None, inverse=False, tidy=True, compute=True)[source]

Counts the number of elements that match a regular expression.

Parameters

cols – ‘*’, list of columns names or a single column name.
regex – regular expression.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.
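
Counting regular-expression matches, with the inverse option, can be sketched in plain Python. The helper `count_regex_matches` is hypothetical. It assumes a partial match (`re.search`) counts as a match and that `inverse` counts the non-matching elements; Optimus may use different matching rules.

```python
import re


def count_regex_matches(values, regex, inverse=False):
    """Count elements whose string form contains a match for regex."""
    pattern = re.compile(regex)
    matches = sum(1 for v in values if pattern.search(str(v)))
    # With inverse=True, count the elements that did not match instead.
    return len(values) - matches if inverse else matches


codes = ["a1", "b22", "ccc"]  # hypothetical sample column
print(count_regex_matches(codes, r"\d+"))
```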

count_social_security_number(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the number of strings that look like a social security number in a column.

Parameters

cols – ‘*’, list of columns names or a single column name.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_starting_with(cols='*', value=None, inverse=False, tidy=True, compute=True)[source]

Counts the number of elements that start with the given string.

Parameters

cols – ‘*’, list of columns names or a single column name.
value – Value used to evaluate the function.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_str(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the number of strings in a column.

Parameters

cols – ‘*’, list of columns names or a single column name.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_uniques(cols='*', estimate=False, compute=True, tidy=True) → int[source]

Count the number of unique values in a column.

Parameters

cols – ‘*’, list of columns names or a single column name.
estimate –
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

count_url(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the number of strings that look like a URL in a column.

Parameters

cols – ‘*’, list of columns names or a single column name.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_values_in(cols='*', values=None, inverse=False, tidy=True, compute=True)[source]

Parameters

cols – ‘*’, list of columns names or a single column name.
values – Values used to evaluate the function.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a column, or column name and value if not. If False, it will return the function name, the column name and the value.

Returns

The number of elements that match the function.

count_zeros(cols='*', tidy=True, compute=True)[source]

Return the count of zeros by column.

Parameters

cols – “*”, column name or list of column names to be processed.
tidy –
compute –

Returns

count_zip_code(cols='*', inverse=False, tidy=True, compute=True)[source]

Counts the number of strings that look like zip codes in a column.

Parameters

cols – ‘*’, list of column names or a single column name.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it returns a single value when you process one column, or column names and values otherwise. If False, it returns the function name, the column name and the value.

Returns

The number of elements that match the function.

cross_tab(col_x, col_y, output='dict', compute=True) → dict[source]

Parameters

col_x –
col_y –
output –
compute – Compute the result or return a delayed function.

Returns

cummax(cols='*', output_cols=None)[source]

Return cumulative maximum over a DataFrame or column.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the cumulative maximum.

cummin(cols='*', output_cols=None)[source]

Return cumulative minimum over a column.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the cumulative minimum.

cumprod(cols='*', output_cols=None)[source]

Return cumulative product over a DataFrame or column.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the cumulative product.

cumsum(cols='*', output_cols=None)[source]

Return cumulative sum over a DataFrame or column.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the cumulative sum.
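
As a rough illustration of what the four cumulative operations above (cummax, cummin, cumprod and cumsum) compute, here is a plain-Python sketch built on itertools.accumulate. It is not Optimus code, and the sample list is made up for the example.

```python
from itertools import accumulate
import operator

values = [3, 1, 4, 1, 5]  # hypothetical column values

cumulative_sum = list(accumulate(values))                 # running total
cumulative_prod = list(accumulate(values, operator.mul))  # running product
cumulative_max = list(accumulate(values, max))            # largest value seen so far
cumulative_min = list(accumulate(values, min))            # smallest value seen so far
```

Each result has the same length as the input, which is why these methods can write their output into a column.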

cut(cols='*', bins=None, labels=None, default=None, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Use cut when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges. Supports binning into an equal number of bins, or a pre-specified array of bins.
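
The age-range example can be sketched in plain Python as follows. This only illustrates the binning idea and is not how Optimus implements it. The helper name cut_value is hypothetical, and the right-inclusive intervals (lo, hi] are an assumption borrowed from the pandas.cut default.

```python
from bisect import bisect_left

def cut_value(value, bins, labels, default=None):
    """Return the label of the bin (lo, hi] that contains value, else default."""
    if value <= bins[0] or value > bins[-1]:
        return default  # outside every bin
    return labels[bisect_left(bins, value) - 1]

bins = [0, 18, 65, 120]                 # bin edges (assumed sorted)
labels = ["minor", "adult", "senior"]   # one label per interval
ages = [5, 18, 30, 70, 130]
groups = [cut_value(age, bins, labels) for age in ages]
```

Here 130 falls outside the last edge, so it receives the default value.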

Parameters

cols – “*”, column name or list of column names to be processed.
bins –
labels –
default –
output_cols –

Returns

data_type(cols='*', names=False, tidy=True) → dict[source]

Return the column(s) data type as string.

Parameters

cols – Columns to be processed
names – Returns aliases for every type instead of its internal name

Returns

A dict of columns and their respective data types.

date_format(cols='*', tidy=True, compute=True, cached=None, **kwargs)[source]

Get the date format from a column, compatible with ‘format_date’.

Parameters

cols – “*”, column name or list of column names to be processed.
tidy – The result format. If True, it returns a single value when you process one column, or column names and values otherwise.
compute – Compute the final result. If False, return a delayed object.
cached – {None, True, False}, Gets cached date_formats (True), calculates them (False) or a combination of both (None).
kwargs –

Returns

date_formats(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Get the date format for every value in specified columns.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

BaseDataFrame

day(cols='*', format: Optional[str] = None, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Get the day from a date in a column.

Parameters

cols – “*”, column name or list of column names to be processed.
format – String format
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

days_between(cols='*', value=None, date_format=None, round=None, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Return the number of days between two dates.

Parameters

cols – “*”, column name or list of column names to be processed.
value –
date_format –
round –
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

div(cols='*', output_col=None) → optimus.helpers.types.DataFrameType[source]

Divide two or more columns.

Parameters

cols – ‘*’, list of column names or a single column name.
output_col – Single output column in case no value is passed

Returns

Dataframe with the result of the arithmetic operation appended.

domain(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Returns the domain string from a URL. For example, from https://www.hi-optimus.com it returns hi-optimus.com.
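
A minimal standard-library sketch of that example is shown below. It only strips a leading "www." from the host name, which is an assumption made to reproduce the example above. The helper name domain_of is hypothetical, and the real method may handle subdomains and public suffixes differently.

```python
from urllib.parse import urlparse

def domain_of(url):
    """Return the host of a URL without a leading 'www.' label."""
    host = urlparse(url).hostname or ""  # hostname is None when no host is found
    return host[4:] if host.startswith("www.") else host

result = domain_of("https://www.hi-optimus.com")
```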

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

double_metaphone(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

The Double Metaphone phonetic encoding algorithm is the second generation of this algorithm. It is called “Double” because it can return both a primary and a secondary code for a string; this accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry.

Parameters

cols – ‘*’, list of column names or a single column name.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

drop(cols=None, regex=None, data_type=None) → optimus.helpers.types.DataFrameType[source]

Drop a list of columns.

Parameters

cols – “*”, column name or list of column names to be processed.
regex – Regex expression to select the columns
data_type –

Returns

duplicate(cols='*', output_cols=None, columns=None) → optimus.helpers.types.DataFrameType[source]

Alias of copy function.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
columns – List of (column, copy name) tuples, e.g. [(‘column1’, ‘column_copy’), (‘column1’, ‘column1_copy’)]

Returns

email_domain(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Return the domain from an email address. From optimus@mail.col it will return ‘mail’.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

email_username(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Return the username from an email address. From optimus@mail.col it will return ‘optimus’.
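
The two email helpers (email_domain and email_username) can be pictured with a short plain-Python sketch. It assumes the address is split at the "@" sign and that the "domain" is the first label of the host, which matches the examples given here. The helper name split_email is hypothetical.

```python
def split_email(address):
    """Split an address into (username, host) at the first '@'."""
    username, _, host = address.partition("@")
    return username, host

username, host = split_email("optimus@mail.col")
first_label = host.split(".")[0]  # the part described as the domain above
```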

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

exec_agg(exprs, compute=True)[source]

Execute one or more aggregation functions.

Parameters

exprs –
compute – Compute the result or return a delayed function.

Returns

exp(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Return Euler’s number, e (~2.718) raised to the power of each value in a column.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing e raised to the power of each element.

expand_contracted_words(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Expand contracted words.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

extract(cols='*', regex=None, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Extract a string that matches a regular expression.

Parameters

cols – “*”, column name or list of column names to be processed.
regex – Regular expression
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

fill_na(cols='*', value=None, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Replace null data with a specified value.

Parameters

cols – ‘*’, list of column names or a single column name.
value – Value used to replace the NaN/None values.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Returns the column filled with given value.

fingerprint(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Create the fingerprint for a column.

Parameters

cols – ‘*’, list of column names or a single column name.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

floor(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Round each number in a column down to the nearest integer.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the floor of each element.

static format_agg(exprs)[source]

Parameters

exprs –

Returns

format_date(cols='*', current_format=None, output_format=None, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Convert the dates in a column from their current format to the given output format.

Parameters

cols – “*”, column name or list of column names to be processed.
current_format –
output_format –
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

frequency(cols='*', n=32, percentage=False, total_rows=None, count_uniques=False, compute=True, tidy=False) → dict[source]

Return the count of every element in the column.

Parameters

cols – “*”, column name or list of column names to be processed.
n – Number of bins to be returned.
percentage – if True calculate the
total_rows – If True, return the total count.
count_uniques – If True, return the number of unique elements.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it returns a single value when you process one column, or column names and values otherwise. If False, it returns the function name, the column name and the value.

Returns

dict with the count of every element in the column.
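
To make the idea of a frequency count concrete, here is a hedged plain-Python sketch based on collections.Counter. The output shape (a list of dicts with "value", "count" and "percentage" keys) and the helper name top_frequencies are assumptions for illustration, not the structure Optimus returns.

```python
from collections import Counter

def top_frequencies(values, n=32, percentage=False):
    """Return the n most common values with their counts."""
    total = len(values)
    rows = [{"value": v, "count": c} for v, c in Counter(values).most_common(n)]
    if percentage:
        for row in rows:
            # Share of all rows, as a percentage rounded to two decimals.
            row["percentage"] = round(row["count"] * 100 / total, 2)
    return rows

sample = ["a", "b", "a", "c", "a", "b"]
top_two = top_frequencies(sample, n=2, percentage=True)
```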

get(cols='*', keys=None, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Return items from a dict over requested columns.

Parameters

cols – “*”, column name or list of column names to be processed.
keys – The value of the dict key that will be returned.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the value of the key selected.

groupby(by, agg) → optimus.helpers.types.DataFrameType[source]

This helper function helps manage column names in the aggregation output. It also handles column ordering, because Dask can reorder columns.

Parameters

by – Column name.
agg – List of tuples with the form [(“agg”, “col”)]

Returns

heatmap(col_x, col_y, bins_x=10, bins_y=10, compute=True) → dict[source]

Parameters

col_x –
col_y –
bins_x –
bins_y –
compute –

Returns

hist(cols='*', buckets=32, compute=True) → dict[source]

Return the histogram representation of the distribution of the data.

Parameters

cols – “*”, column name or list of column names to be processed.
buckets – Number of histogram bins to be used.
compute –

Returns

host(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Returns the host string from a url.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

hour(cols='*', format: Optional[str] = None, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Get the hour from a date in a column.

Parameters

cols – “*”, column name or list of column names to be processed.
format – String format
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

hours_between(cols='*', value=None, date_format=None, round=None, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Return the number of hours between two dates.

Parameters

cols – “*”, column name or list of column names to be processed.
value –
date_format –
round –
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

impute(cols='*', data_type='auto', strategy='auto', fill_value=None, output_cols=None)[source]

Fill null values using a constant or any of the available strategies.

Parameters

cols – “*”, column name or list of column names to be processed.
data_type –
- If “auto”, detect if it’s continuous or categorical using the data type of the column.
- If “continuous”, sets the data as continuous and if no ‘strategy’ is passed then the mean is used.
- If “categorical”, sets the data as categorical and if no ‘strategy’ is passed then the most frequent value is used.
strategy –
- If “auto”, automatically selects a strategy depending on the data type passed or inferred on ‘data_type’.
- If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.
- If “median”, then replace missing values using the median along each column. Can only be used with numeric data.
- If “most_frequent”, then replace missing values using the most frequent value along each column. Can be used with strings or numeric data.
- If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.
fill_value – constant to be used to fill null values
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Return the Column filled with the imputed values.
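
The strategies listed above can be sketched for a single list of values with the statistics module. This is an illustration under assumptions, not the Optimus implementation: missing values are represented as None, the "auto" options are left out, and the helper name impute_values is hypothetical.

```python
from statistics import mean, median, mode

def impute_values(values, strategy="mean", fill_value=None):
    """Replace None entries using the chosen strategy."""
    present = [v for v in values if v is not None]
    if strategy == "constant":
        fill = fill_value
    elif strategy == "mean":
        fill = mean(present)
    elif strategy == "median":
        fill = median(present)
    elif strategy == "most_frequent":
        fill = mode(present)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

by_mean = impute_values([1, None, 3, 8])
by_mode = impute_values(["a", None, "b", "a"], strategy="most_frequent")
```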

abstract static index_to_string(cols=None, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Maps a column of label indices back to a column containing the original labels as strings.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

infer_data_types(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols –

Returns

infer_date_formats(cols='*', sample=200, tidy=True) → dict[source]

Infer date formats in a dataframe from a sample. This function uses pandas regardless of the engine you are using.

Parameters

cols – Columns in which you want to infer the date format.

Returns

dict with the column and the inferred date format.

infer_type(cols='*', sample=200, tidy=True) → dict[source]

Infer data types in a dataframe from a sample. First it identifies the data type of every value in every cell. After that it takes all the values and applies some heuristics to better identify the data type. This function uses pandas regardless of the engine you are using.

Parameters

cols – “*”, column name or list of column names to be processed.
sample –
tidy – The result format. If True it will return a value if you process a column or column name and value if not. If False it will return the functions name, the column name and the value.

Returns

dict with the column and the inferred data type.

inferred_data_type(cols='*', use_internal=False, tidy=True)[source]

Get the inferred data types from the meta data.

Parameters

cols – “*”, column name or list of column names to be processed.
use_internal – If no inferred data type is found, return a translated internal data type instead of None.
tidy – The result format. If True, it returns a single value when you process one column, or column names and values otherwise. If False, it returns the function name, the column name and the value.

Returns

Python dictionary with column names and their data types.

iqr(cols='*', more=None, relative_error=10000, estimate=True)[source]

Return the interquartile range (IQR) of a column.

Parameters

cols – “*”, column name or list of column names to be processed.
more – Return info about q1 and q3
relative_error –

Returns

The interquartile range (IQR) of the column.
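
For reference, the interquartile range is the difference between the third and first quartiles. The sketch below computes it with statistics.quantiles. The "inclusive" interpolation method is an assumption, and the estimate used by a given engine may give slightly different numbers. The helper name interquartile_range and the dict keys are hypothetical.

```python
from statistics import quantiles

def interquartile_range(values, more=False):
    """Return q3 - q1, optionally with the quartiles themselves."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return {"iqr": spread, "q1": q1, "q3": q3} if more else spread

data = [1, 2, 3, 4, 5, 6, 7, 8]
spread = interquartile_range(data)
details = interquartile_range(data, more=True)
```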

item(cols='*', n=None, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Return items from a list over requested columns.

Parameters

cols – “*”, column name or list of column names to be processed.
n – The position of the element that will be returned.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the value of the item selected.

join(df_right: optimus.helpers.types.DataFrameType, how='left', on=None, left_on=None, right_on=None, key_middle=False) → optimus.helpers.types.DataFrameType[source]

Join two dataframes using a column.

Parameters

df_right – The dataframe to join with the current dataframe.
how – {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘left’
on – The column that will be used to join the two dataframes.
left_on – The column in the current dataframe that will be used to make the join.
right_on – The column in the given dataframe that will be used to make the join.
key_middle – Order the columns putting the left df columns before the key column and the right df columns after it.

Returns

Dataframe

keep(cols=None, regex=None) → optimus.helpers.types.DataFrameType[source]

Keep only the specified columns.

Parameters

cols – “*”, column name or list of column names to be processed.
regex – Regex expression to select the columns

Returns

kurtosis(cols='*', tidy=True, compute=True)[source]

Returns the kurtosis of the values over the requested columns.

Parameters

cols – “*”, column name or list of column names to be processed.
tidy – The result format. If True, it returns a single value when you process one column, or column names and values otherwise. If False, it returns the function name and the column name.
compute – Compute the final result. If False, return a delayed object.

Returns

Returns the kurtosis of the values over the requested columns.

left(cols='*', n=None, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Get the first n characters of each string, starting from the left.

Parameters

cols – “*”, column name or list of column names to be processed.
n – Number of characters to get, starting from position 0.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

lemmatize_verbs(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Find the lemma of a word depending on its meaning and context.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

len(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Return the length of every string in a column.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

levenshtein(cols='*', other_cols=None, value=None, output_cols=None)[source]

Calculate the Levenshtein distance to a specified column. The Levenshtein distance is a string metric for measuring the difference between two sequences.
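
As background, the Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below is the textbook dynamic-programming version in plain Python, shown only to illustrate the metric. It is not the routine Optimus calls.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (two-row dynamic programming)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]

distance = levenshtein("kitten", "sitting")
```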

Parameters

cols – ‘*’, list of column names or a single column name.
other_cols –
value –
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

ln(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Return the natural logarithm of each value in a column.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the natural logarithm of each element.

log(cols='*', base=10, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Return the logarithm of each value in a column, in base 10 by default.

Parameters

cols – “*”, column name or list of column names to be processed.
base – Base of the logarithm.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the logarithm of each element in the given base.

lower(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Lowercase the specified columns.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

BaseDataFrame

mad(cols='*', relative_error=10000, more=False, estimate=True, tidy=True, compute=True)[source]

Parameters

cols – “*”, column name or list of column names to be processed.
relative_error –
more –
estimate –
tidy – The result format. If True, it returns a single value when you process one column, or column names and values otherwise. If False, it returns the function name, the column name and the value.
compute – Compute the result or return a delayed function.

match_rating_codex(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

The match rating approach (MRA) is a phonetic algorithm developed by Western Airlines in 1977 for the indexation and comparison of homophonous names.

Parameters

cols – ‘*’, list of column names or a single column name.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

max(cols='*', numeric=None, tidy: bool = True, compute: bool = True)[source]

Return the maximum value of one or more columns.

Parameters

cols – “*”, column name or list of column names to be processed.
numeric – if True, cast to numeric before processing.
tidy – The result format. If True, it returns a single value when you process one column, or column names and values otherwise. If False, it returns the function name and the column name.
compute – Compute the final result. If False, return a delayed object.

Returns

max_abs_scaler(cols='*', output_cols=None)[source]

Scale each feature by its maximum absolute value.
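
In other words, every value is divided by the largest absolute value in its column, so results fall in the range [-1, 1]. A minimal plain-Python sketch of that formula follows. The helper name max_abs_scale is hypothetical, and mapping an all-zero column to zeros is an assumption.

```python
def max_abs_scale(values):
    """Divide each value by the largest absolute value in the list."""
    largest = max(abs(v) for v in values)
    if largest == 0:
        return [0.0 for _ in values]  # assumed handling of an all-zero column
    return [v / largest for v in values]

scaled = max_abs_scale([-4, 2, 1])
```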

Parameters

cols – ‘*’, list of column names or a single column name.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

mean(cols='*', tidy=True, compute=True)[source]

Return the mean of the values over the requested columns.

Parameters

cols – “*”, column name or list of column names to be processed.
tidy – The result format. If True it will return a value if you process a column or column name and value if not. If False it will return the functions name, the column name.
compute – Compute the final result. If False, return a delayed object.

Returns

The mean of the values over the requested columns.

median(cols='*', relative_error=10000, tidy=True, compute=True)[source]

Returns the median of the values over the requested columns.

Parameters

cols – “*”, column name or list of column names to be processed.
relative_error –
tidy – The result format. If True it will return a value if you process a column or column name and value if not. If False it will return the functions name, the column name.
compute –

Returns

Returns the median of the values over the requested columns.

metaphone(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Apply the Metaphone algorithm to a specified column. Metaphone is a phonetic algorithm, published by Lawrence Philips in 1990, for indexing words by their English pronunciation.

Parameters

cols – ‘*’, list of column names or a single column name.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

mid(cols='*', start=0, n=1, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Get a substring of each string, beginning at position start.

Parameters

cols – “*”, column name or list of column names to be processed.
start –
n –
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

min(cols='*', numeric=None, tidy: bool = True, compute: bool = True)[source]

Return the minimum value of one or more columns.

Parameters

cols – “*”, column name or list of column names to be processed.
numeric – if True, cast to numeric before processing.
tidy – The result format. If True, it returns a single value when you process one column, or column names and values otherwise. If False, it returns the function name, the column name and the value.
compute –

Returns

min_max_scaler(cols='*', output_cols=None)[source]

Transform features by scaling each feature to a given range.
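
The usual min-max formula is x' = low + (x - min) * (high - low) / (max - min). The sketch below applies it to a plain list. The target range [0, 1] and the low and high parameters are assumptions for the example, since the signature above does not expose a range. The helper name min_max_scale is hypothetical.

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Linearly rescale values so that min maps to low and max maps to high."""
    smallest, largest = min(values), max(values)
    span = largest - smallest
    if span == 0:
        return [low for _ in values]  # assumed handling of a constant column
    return [low + (v - smallest) * (high - low) / span for v in values]

scaled = min_max_scale([10, 15, 20])
```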

Parameters

cols – ‘*’, list of column names or a single column name.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

minute(cols='*', format: Optional[str] = None, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Get the minutes from a date in a column.

Parameters

cols – “*”, column name or list of column names to be processed.
format – String format
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

minutes_between(cols='*', value=None, date_format=None, round=None, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Return the number of minutes between two dates.

Parameters

cols – “*”, column name or list of column names to be processed.
value –
date_format –
round –
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

mod(cols='*', divisor=2, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Return the modulo of each value in a column.

Parameters

cols – “*”, column name or list of column names to be processed.
divisor –
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the modulo of each element.

mode(cols='*', tidy: bool = True, compute: bool = True)[source]

Return the mode of the values over the requested columns.

Parameters

cols – “*”, column name or list of column names to be processed.
tidy – The result format. If True, it returns a single value when you process one column, or column names and values otherwise. If False, it returns the function name and the column name.
compute – Compute the final result. If False, return a delayed object.

Returns

modified_z_score(cols='*', estimate=True, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Returns the modified z-score of the given columns.

Parameters

cols – ‘*’, list of column names or a single column name.
estimate –
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Returns the modified z-score of the given columns.
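
The documentation does not state the formula. A common definition (Iglewicz and Hoaglin) is 0.6745 * (x - median) / MAD, where MAD is the median absolute deviation. The sketch below assumes that definition, so treat it as an illustration rather than a description of what Optimus computes. The helper name modified_z_scores is hypothetical.

```python
from statistics import median

def modified_z_scores(values):
    """Modified z-scores based on the median and the median absolute deviation."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero, so the score is undefined")
    return [0.6745 * (v - center) / mad for v in values]

scores = modified_z_scores([1, 2, 3, 4, 100])
```

Because the median and MAD are robust to extreme values, the outlier 100 receives a very large score while the other values stay small.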

month(cols='*', format: Optional[str] = None, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Get the month from a date in a column.

Parameters

cols – “*”, column name or list of column names to be processed.
format – String format
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

months_between(cols='*', value=None, date_format=None, round=None, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Return the number of months between two dates.

Parameters

cols – “*”, column name or list of column names to be processed.
value –
date_format –
round –
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

move(column, position, ref_col=None) → optimus.helpers.types.DataFrameType[source]

Move a column to a specific position.

Parameters

column – Column(s) to be moved
position – Column new position. Accepts ‘after’, ‘before’, ‘beginning’, ‘end’ or a numeric value, relative to ‘ref_col’.
ref_col – Column taken as reference

Returns

DataFrame

mul(cols='*', output_col=None) → optimus.helpers.types.DataFrameType[source]

Multiply two or more columns.

Parameters

cols – ‘*’, list of column names or a single column name.
output_col – Single output column in case no value is passed

Returns

Dataframe with the result of the arithmetic operation appended.

names(cols='*', data_types=None, invert=False, is_regex=None) → list[source]

Return the names of the columns.

Parameters

cols – Regex, “*” or columns to get.
data_types – returns only columns with matching data types
invert – invert column selection
is_regex – if True, forces cols to be treated as a regex

Returns

abstract static nest(cols, separator='', output_col=None, drop=True, shape='string') → optimus.helpers.types.DataFrameType[source]

Concatenate two or more columns into one.

Parameters

cols – ‘*’, list of column names or a single column name.
separator –
output_col – Column name or list of column names where the transformed data will be saved.
drop –
shape –

Returns

Column with all the specified columns concatenated.

ngram_fingerprint(cols='*', n_size=2, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Calculate the ngram for a fingerprinted string.

Parameters

cols – “*”, column name or list of column names to be processed.
n_size – The ngram size.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

ngrams(cols='*', n_size=2, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Calculate the ngram for a fingerprinted string.

Parameters

cols – ‘*’, list of column names or a single column name.
n_size – The ngram size.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

normalize_chars(cols='*', output_cols=None)[source]

Remove diacritics from a column.
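
One common way to remove diacritics, shown here as a standard-library sketch and not necessarily the approach Optimus takes, is to decompose each character with Unicode NFD normalization and then drop the combining marks. The helper name strip_diacritics is hypothetical.

```python
import unicodedata

def strip_diacritics(text):
    """Remove accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

cleaned = strip_diacritics("café Ñandú São")
```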

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

normalize_spaces(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Remove extra whitespace between words and trim whitespace from the beginning and the end of each string.
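
For a single string, the described behaviour can be sketched in one line of plain Python. The function name collapse_spaces is hypothetical, and treating tabs and newlines as whitespace is an assumption.

```python
def collapse_spaces(text):
    """Collapse runs of whitespace to one space and trim both ends."""
    return " ".join(text.split())

cleaned = collapse_spaces("  hello   world \t")
```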

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

num_to_words(cols='*', language='en', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Convert numbers to their string representation.

Parameters

cols – “*”, column name or list of column names to be processed.
language –
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column with numbers converted to their string representation.

nysiis(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Apply the NYSIIS (New York State Identification and Intelligence System) algorithm to a specified column.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

one_hot_encode(cols='*', prefix=None, drop=True, **kwargs) → optimus.helpers.types.DataFrameType[source]

Maps a categorical column to multiple binary columns, with at most a single one-value.

Parameters

cols – Columns to be encoded.
prefix – Prefix of the columns where the output is going to be saved.
drop –

Returns

Dataframe with encoded columns.
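
To illustrate the encoding itself, the sketch below turns one list of categories into a dict of binary lists, one per category. The "prefix_category" naming of the output columns and the helper name one_hot are assumptions for the example. The real output column names may differ.

```python
def one_hot(values, prefix="col"):
    """Map each distinct value to a 0/1 list marking where it occurs."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }

encoded = one_hot(["red", "blue", "red"], prefix="color")
```

Each row has a 1 in exactly one of the generated columns.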

pad(cols='*', width=0, fill_char='0', side='left', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Fill a string to match the given string length.

Parameters

cols – “*”, column name or list of column names to be processed.
width – Total length of the string.
fill_char – The char that will be used to fill the string.
side – Fill the left or the right side.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns
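
Example (illustrative only): the effect of `width`, `fill_char` and `side` on a single value, sketched with Python's built-in `str.rjust` and `str.ljust`. The helper name `pad_value` is hypothetical, and mapping `side='left'` to filling on the left of the string is an assumption based on the parameter description.

```python
def pad_value(value, width=0, fill_char="0", side="left"):
    """Hypothetical helper: fill a string up to `width` characters."""
    text = str(value)
    if side == "left":
        # Fill characters are added on the left, so the text is right-justified.
        return text.rjust(width, fill_char)
    if side == "right":
        return text.ljust(width, fill_char)
    raise ValueError("side must be 'left' or 'right'")


print(pad_value("42", width=5))                  # filled on the left
print(pad_value("42", width=5, fill_char="*", side="right"))
```

Strings already longer than `width` are returned unchanged by this sketch.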

parse_inferred_types(col_data_type)[source]

Parse an engine-specific column data type to a profiler data type.

Parameters: col_data_type – Engine column specific data.
Returns: Dict

pattern(cols='*', output_cols=None, mode=0) → optimus.helpers.types.DataFrameType[source]

Replace alphanumeric and punctuation chars with canned chars to help find string patterns:

c = Any alpha char in lower or upper case
l = Any alpha char in lower case
U = Any alpha char in upper case
* = Any alphanumeric in lower or upper case. Used only in modes 2 and 3
# = Any numeric
! = Any punctuation

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
mode –
0: Identify lower, upper, digits. Except spaces and special chars.
1: Identify chars, digits. Except spaces and special chars.
2: Identify any alphanumeric. Except spaces and special chars.
3: Identify alphanumeric and special chars. Except white spaces.
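
Example (illustrative only): a pure-Python sketch of how the four modes could map characters to the canned chars described above. The helper names `mask_char` and `pattern_mask` are hypothetical, and restricting the classes to ASCII letters, digits and `string.punctuation` is an assumption of this sketch.

```python
import string


def mask_char(ch, mode=0):
    """Hypothetical helper: map one character to its canned char."""
    is_lower = ch in string.ascii_lowercase
    is_upper = ch in string.ascii_uppercase
    is_digit = ch in string.digits
    if mode in (2, 3) and (is_lower or is_upper or is_digit):
        return "*"          # any alphanumeric
    if mode == 3 and ch in string.punctuation:
        return "!"          # any punctuation
    if mode in (0, 1) and is_digit:
        return "#"          # any numeric
    if mode == 0 and is_lower:
        return "l"
    if mode == 0 and is_upper:
        return "U"
    if mode == 1 and (is_lower or is_upper):
        return "c"
    return ch               # spaces and anything else are kept as-is


def pattern_mask(value, mode=0):
    return "".join(mask_char(ch, mode) for ch in str(value))


masks = [pattern_mask("Ab-12 c", mode) for mode in range(4)]
print(masks)
```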

pattern_counts(cols='*', n=10, mode=0, flush=False) → dict[source]

Get how many equal patterns there are in a column. Triggers the operation only if necessary.

Parameters

cols – “*”, column name or list of column names to be processed.
n – Top n matches
mode –
flush – Flushes the cache to process again

Returns

percentile(cols='*', values=None, relative_error=10000, estimate=True, tidy=True, compute=True)[source]

Return values at the given percentile over requested column.

Parameters

cols – “*”, column name or list of column names to be processed.
values – Percentile values to calculate, for example 0.25, 0.5, 0.75.
relative_error –
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
compute – Compute the final result. If False, return a delayed object.

Returns

Return values at the given percentile over requested column.
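
Example (illustrative only): an exact percentile computed with linear interpolation between the two nearest ranks. The helper name `percentile_of` is hypothetical and the interpolation rule is an assumption; because the method accepts `estimate` and `relative_error`, the values Optimus returns may be approximations that differ from this exact calculation.

```python
import math


def percentile_of(values, q):
    """Hypothetical helper: q is a fraction between 0 and 1."""
    ordered = sorted(values)
    # Fractional rank of the requested percentile in the sorted data.
    position = (len(ordered) - 1) * q
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    # Interpolate linearly between the two neighbouring values.
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


data = [15, 20, 35, 40, 50]
quartiles = {q: percentile_of(data, q) for q in (0.25, 0.5, 0.75)}
print(quartiles)
```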

port(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Return the port string from a url.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

pos(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

A part-of-speech tagger, or POS-tagger, processes a sequence of words and attaches a part-of-speech tag to each word.

Parameters

cols – ‘*’, list of columns names or a single column name.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

pow(cols='*', power=2, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Return the power of each value in a column.

Parameters

cols – “*”, column name or list of column names to be processed.
power –
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the power of each element.

profile(cols='*', bins: int = 32, flush: bool = False) → dict[source]

Returns the profile of selected columns.

Parameters

cols – “*”, column name or list of column names to be processed.
bins – Number of buckets.
flush – Flushes the cache of the whole profile to process it again.

Returns

Returns the profile of selected columns.

qcut(cols='*', quantiles=None, output_cols=None)[source]

Parameters

cols – “*”, column name or list of column names to be processed.
quantiles –
output_cols –

Returns

quality(cols='*', flush=False, compute=True) → dict[source]

Return the data quality in the format {‘col_name’: {‘mismatch’: 0, ‘missing’: 9, ‘match’: 0, ‘inferred_data_type’: ‘object’}}

Parameters

cols – “*”, column name or list of column names to be processed.
flush –
compute –

Returns

dict in the format {‘col_name’: {‘mismatch’: 0, ‘missing’: 9, ‘match’: 0, ‘inferred_data_type’: ‘object’}}
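
Example (illustrative only): a sketch of how the `match`, `mismatch` and `missing` counts in the format above could be produced for one column. The helper name `quality_counts`, the rule that `None` and empty strings count as missing, and the use of a cast function to decide a match are all assumptions; Optimus infers the data type itself and may classify values differently.

```python
def quality_counts(values, caster=int, type_name="int"):
    """Hypothetical helper: classify each value of a column."""
    counts = {"mismatch": 0, "missing": 0, "match": 0,
              "inferred_data_type": type_name}
    for value in values:
        if value is None or value == "":
            counts["missing"] += 1      # assumption: empty means missing
            continue
        try:
            caster(value)               # value is compatible with the type
            counts["match"] += 1
        except (TypeError, ValueError):
            counts["mismatch"] += 1
    return counts


report = {"age": quality_counts([31, "42", None, "n/a", ""])}
print(report)
```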

range(cols='*', tidy: bool = True, compute: bool = True)[source]

Return the minimum and maximum of the values over the requested columns.

Parameters

cols – “*”, column name or list of column names to be processed.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
compute – Compute the final result. If False, return a delayed object.

Returns

rdiv(cols='*', output_col=None) → optimus.helpers.types.DataFrameType[source]

Divide two or more columns.

Parameters

cols – ‘*’, list of columns names or a single column name
output_col – Single output column in case no value is passed

Returns

Dataframe with the result of the arithmetic operation appended.

reciprocal(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Return the reciprocal (1/x) of each value in a column.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the reciprocal of each element.

remove(cols='*', search=None, search_by='chars', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Remove values from a string in a column.

Parameters

cols – “*”, column name or list of column names to be processed.
search –
search_by – Search by ‘chars’,
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

remove_numbers(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Remove numbers from a string in a column.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

remove_special_chars(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Remove special chars from a string in a column.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

remove_stopwords(cols='*', language='english', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Remove stopwords from each string in a column.

Parameters

cols – “*”, column name or list of column names to be processed.
language – specify the stopwords language
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

remove_urls(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Remove URLs from one or more columns.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

remove_white_spaces(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Remove all whitespace from each string in a column.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

rename(cols: Union[str, list, dict] = '*', names: Optional[Union[str, list]] = None, func=None) → optimus.helpers.types.DataFrameType[source]

Change the name of one or more columns of the dataframe.

Parameters

cols – string, dictionary or list of strings or tuples. Each tuple may have following form: (oldColumnName, newColumnName).
names – string or list of strings with new names of columns. Ignored if a dictionary or list of tuples is passed to cols.
func – can be lower, upper or any string transformation function.

Returns

Dataframe with columns names replaced.
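
Example (illustrative only): a sketch of how the accepted argument forms (dictionary, list of tuples, names, or a function) could be resolved into one old-name to new-name mapping. The helper name `resolve_renames` is hypothetical, and the precedence of `func` over `names` is an assumption of this sketch.

```python
def resolve_renames(columns, cols="*", names=None, func=None):
    """Hypothetical helper: build an {old_name: new_name} mapping."""
    if isinstance(cols, dict):
        return dict(cols)                      # names is ignored
    if isinstance(cols, list) and cols and all(isinstance(c, tuple) for c in cols):
        return dict(cols)                      # (old, new) tuples, names ignored
    if cols == "*":
        selected = list(columns)
    else:
        selected = [cols] if isinstance(cols, str) else list(cols)
    if func is not None:
        return {name: func(name) for name in selected}
    if names is None:
        return {}
    new_names = [names] if isinstance(names, str) else list(names)
    return dict(zip(selected, new_names))


existing = ["a", "b"]
print(resolve_renames(existing, {"a": "x"}))
print(resolve_renames(existing, [("a", "x"), ("b", "y")]))
print(resolve_renames(existing, "a", "x"))
print(resolve_renames(existing, func=str.upper))
```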

replace(cols='*', search=None, replace_by=None, search_by=None, ignore_case=False, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Replace a value or a list of values with a specified string.

Parameters

cols – ‘*’, list of columns names or a single column name.
search – Values to look at to be replaced
replace_by – New value to replace the old one. Supports an array when searching by characters.
search_by – Can be “full”, “words”, “chars” or “values”.
ignore_case – Ignore case when searching for match
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

DataFrame
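
Example (illustrative only): one possible reading of three of the `search_by` options, sketched with the standard `re` module. The helper name `replace_in_text` is hypothetical, and the interpretations ("chars" matches substrings, "words" matches whole words, "full" matches the entire value) are assumptions; the "values" option is not covered here.

```python
import re


def replace_in_text(value, search, replace_by, search_by="chars", ignore_case=False):
    """Hypothetical helper: replace matches inside a single string."""
    flags = re.IGNORECASE if ignore_case else 0
    terms = [search] if isinstance(search, str) else list(search)
    # Escape the terms so they are matched literally, not as regexes.
    alternatives = "|".join(re.escape(term) for term in terms)
    if search_by == "chars":
        regex = f"(?:{alternatives})"
    elif search_by == "words":
        regex = rf"\b(?:{alternatives})\b"
    elif search_by == "full":
        regex = rf"\A(?:{alternatives})\Z"
    else:
        raise ValueError("unsupported search_by in this sketch")
    # A function replacement keeps replace_by literal (no backslash handling).
    return re.sub(regex, lambda _match: replace_by, value, flags=flags)


print(replace_in_text("cat concat", "cat", "dog", "chars"))
print(replace_in_text("cat concat", "cat", "dog", "words"))
print(replace_in_text("cat concat", "cat", "dog", "full"))
```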

replace_regex(cols='*', search=None, replace_by=None, search_by=None, ignore_case=False, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Replace a value or a list of values using a specified regex.

Parameters

cols – ‘*’, list of columns names or a single column name.
search – Values to look at to be replaced
replace_by – New value to replace the old one. Supports an array when searching by characters.
search_by – Can be “full”, “words”, “chars” or “values”.
ignore_case – Ignore case when searching for match
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

abstract static reverse(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Reverse the order of the characters of each string in a column.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

right(cols='*', n=None, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Get the substring made of the last n characters of each string.

Parameters

cols – “*”, column name or list of column names to be processed.
n –
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

round(cols='*', decimals=0, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Round a DataFrame to a variable number of decimal places.

Parameters

cols – “*”, column name or list of column names to be processed.
decimals – The number of decimal places to round to.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the rounded value of each element.

schema_data_type(cols='*', tidy=True)[source]

Return the column(s) data type as Type.

Parameters

cols – Columns to be processed
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise.

Returns

second(cols='*', format: Optional[str] = None, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Get the seconds from a date in a column.

Parameters

cols – “*”, column name or list of column names to be processed.
format –
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

seconds_between(cols='*', value=None, date_format=None, round=None, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Return the number of seconds between two dates.

Parameters

cols – “*”, column name or list of column names to be processed.
value –
date_format –
round –
output_cols – Column name or list of column names where the transformed data will be saved.

Returns
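
Example (illustrative only): the difference in seconds between two date strings, sketched with the standard `datetime` module. The helper name `seconds_apart` is hypothetical; treating `date_format` as a `strptime`-style format string and returning `value` minus the column date are assumptions of this sketch.

```python
from datetime import datetime


def seconds_apart(date, value, date_format="%Y-%m-%d %H:%M:%S"):
    """Hypothetical helper: seconds elapsed from `date` to `value`."""
    start = datetime.strptime(date, date_format)
    end = datetime.strptime(value, date_format)
    # Subtracting two datetimes gives a timedelta.
    return (end - start).total_seconds()


elapsed = seconds_apart("2021-01-01 00:00:00", "2021-01-01 00:01:30")
print(elapsed)
```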

select(cols='*', regex=None, data_type=None, invert=False, accepts_missing_cols=False) → optimus.helpers.types.DataFrameType[source]

Select columns using index, column name, regex or data type.

Parameters

cols – “*”, column name or list of column names to be processed.
regex – Regular expression to filter the columns
data_type – Data type to be filtered for
invert – Invert the selection
accepts_missing_cols –

Returns

set(cols='*', value_func=None, where: Optional[Union[str, optimus.helpers.types.MaskDataFrameType]] = None, args=None, default=None, eval_value: bool = False) → optimus.helpers.types.DataFrameType[source]

Set a column value using a number, string or an expression.

Parameters

cols – Columns to set or create.
value_func – expression, function or value.
where – When the condition in ‘where’ is True, replace with ‘value_func’. Where False, replace with ‘default’ or keep the original value.
args – Argument when ‘value_func’ param is a function.
default – Entries where ‘where’ is False are replaced with this value.
eval_value – Parse ‘value_func’ param in case a string is passed.

Returns
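
Example (illustrative only): the `where` / `default` semantics described above, sketched over a plain list. The helper name `set_values` is hypothetical, and modelling `where` as a Python predicate is an assumption; in Optimus it is a string expression or a mask.

```python
def set_values(values, value_func, where=None, default=None):
    """Hypothetical helper: conditionally overwrite the values of a column."""
    result = []
    for value in values:
        if where is None or where(value):
            # Condition is True: use the new value (or call the function).
            result.append(value_func(value) if callable(value_func) else value_func)
        else:
            # Condition is False: use `default`, or keep the original value.
            result.append(value if default is None else default)
    return result


clipped = set_values([1, -2, 3], 0, where=lambda v: v < 0)
scaled = set_values([1, -2, 3], lambda v: v * 10, where=lambda v: v > 0, default=-1)
print(clipped, scaled)
```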

set_data_type(cols: Union[str, list, dict] = '*', data_types: Optional[Union[str, list]] = None, inferred: bool = False) → optimus.helpers.types.DataFrameType[source]

Set profiler data type.

Parameters

cols – A dict with the form {“col_name”: profiler datatype}, a list of columns or a single column.
data_types – If a string or a list passed to cols, uses this parameter to set the data types to those columns.
inferred – Whether it was inferred or not.

Returns

Dataframe with new data types in the meta data.

set_date_format(cols: Union[str, list, dict] = '*', date_formats: Optional[Union[str, list]] = None, inferred: bool = False) → optimus.helpers.types.DataFrameType[source]

Set date format.

Parameters

cols – A dict with the form {“col_name”: “date format”}, a list of columns or a single column
date_formats – If a string or a list passed to cols, uses this parameter to set the date format to those columns.
inferred – Whether it was inferred or not.

Returns

sin(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Apply sine function to a column.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the sine of each element.

sinh(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Apply the hyperbolic sine function to a column.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the hyperbolic sine of each element.

skew(cols='*', tidy=True, compute=True)[source]

Return the skew of the values over the requested columns.

Parameters

cols – “*”, column name or list of column names to be processed.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
compute – Compute the final result. If False, return a delayed object.

Returns

Return the skew of the values over the requested columns.

slice(cols='*', start=None, stop=None, step=None, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Slice substrings from each element in a column.

Parameters

cols – “*”, column name or list of column names to be processed.
start – Start position for slice operation.
stop – Stop position for slice operation.
step – Step size for slice operation.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

sort(order: Union[str, list] = 'asc', cols=None) → optimus.helpers.types.DataFrameType[source]

Sort one or multiple columns in asc or desc order.

Parameters

order – ‘asc’ or ‘desc’ accepted
cols –

Returns

Dataframe with the sorted columns.

soundex(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Apply the Soundex algorithm to a specified column. Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.

Parameters

cols – ‘*’, list of columns names or a single column name.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns
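
Example (illustrative only): a compact implementation of the common American Soundex rules, showing how names that sound alike receive the same code. The helper name `soundex_code` is hypothetical and this is not the code Optimus runs; Soundex variants differ in small details, so results from the library may not match in every case.

```python
# Letter groups that share a Soundex digit.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex_code(name):
    """Hypothetical helper: 4-character American Soundex code."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    digits = []
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue                    # h and w do not separate equal codes
        code = _CODES.get(ch, "")       # vowels (and y) have no code
        if code and code != previous:
            digits.append(code)
        previous = code                 # a vowel resets the previous code
    # First letter, then up to three digits, padded with zeros.
    return (letters[0].upper() + "".join(digits) + "000")[:4]


codes = [soundex_code(n) for n in ("Robert", "Rupert", "Ashcraft", "Tymczak")]
print(codes)
```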

sqrt(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Return the square root of each value in a column.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the square root of each element.

standard_scaler(cols='*', output_cols=None)[source]

Standardize features by removing the mean and scaling to unit variance.

Parameters

cols – ‘*’, list of columns names or a single column name
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

std(cols='*', tidy=True, compute=True)[source]

Return the standard deviation over the requested columns.

Parameters

cols – “*”, column name or list of column names to be processed.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise.
compute – Compute the final result. If False, return a delayed object.

Returns

stem_verbs(cols='*', stemmer: str = 'porter', language: str = 'english', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Parameters

cols – “*”, column name or list of column names to be processed.
stemmer – snowball, porter, lancaster
language –
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

abstract static string_to_index(cols=None, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Encodes a string column of labels to a column of label indices.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns
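
Example (illustrative only): encoding string labels as integer indices. The helper name `labels_to_index` is hypothetical, and assigning indices in alphabetical order is an assumption; since this method is abstract, each engine may order the labels differently (for example by frequency).

```python
def labels_to_index(labels):
    """Hypothetical helper: replace each label with an integer index."""
    # Assumption: indices follow the alphabetical order of the distinct labels.
    mapping = {label: index for index, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels]


indices = labels_to_index(["b", "a", "b", "c"])
print(indices)
```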

strip_html(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Remove HTML tags.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

sub(cols='*', output_col=None) → optimus.helpers.types.DataFrameType[source]

Subtract two or more columns.

Parameters

cols – ‘*’, list of columns names or a single column name
output_col – Single output column in case no value is passed

Returns

Dataframe with the result of the arithmetic operation appended.

sub_domain(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Returns the subdomain string from a url. From https://www.hi-optimus.com it returns ‘www’.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

sum(cols='*', tidy=True, compute=True)[source]

Return the sum of the values over the requested column.

Parameters

cols – “*”, column name or list of column names to be processed.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
compute – Compute the final result. If False, return a delayed object.

Returns

Column containing the sum of multiple columns.

tan(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Apply the tangent function to a column.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the tangent of each element.

tanh(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Apply the hyperbolic tangent function to a column.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Column containing the hyperbolic tangent of each element.

tf_idf(features) → optimus.helpers.types.DataFrameType[source]

Parameters: features –
Returns

time_between(cols='*', value=None, date_format=None, round=None, output_cols=None, func=None) → optimus.helpers.types.DataFrameType[source]

Returns a TimeDelta of the units between two datetimes.

Parameters

cols – “*”, column name or list of column names to be processed.
value –
date_format –
round –
output_cols – Column name or list of column names where the transformed data will be saved.
func – Custom function to pass to the apply, like self.F.days_between

Returns

title(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Capitalize the first letter of each word in a string.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

BaseDataFrame

to_boolean(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Cast the elements inside a column or a list of columns to boolean.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols –

Returns

to_datetime(cols='*', format=None, output_cols=None, transform_format=True) → optimus.helpers.types.DataFrameType[source]

Cast the elements inside a column or a list of columns to datetime.

Parameters

cols – “*”, column name or list of column names to be processed.
format –
output_cols – Column name or list of column names where the transformed data will be saved.
transform_format –

Returns

to_float(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Cast the elements inside a column or a list of columns to float.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

to_integer(cols='*', default=0, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Cast the elements inside a column or a list of columns to integer.

Parameters

cols – “*”, column name or list of column names to be processed.
default –
output_cols – Column name or list of column names where the transformed data will be saved.

Returns
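
Example (illustrative only): one plausible role for the undocumented `default` parameter, sketched as a fallback for values that cannot be cast. The helper name `cast_to_integer` is hypothetical; both the fallback behaviour and the truncation of decimals are assumptions, not documented behaviour.

```python
def cast_to_integer(value, default=0):
    """Hypothetical helper: cast a value to int, falling back to `default`."""
    try:
        # Going through float lets strings such as "3.9" be parsed;
        # int() then truncates towards zero.
        return int(float(value))
    except (TypeError, ValueError, OverflowError):
        # None, non-numeric text, NaN and infinity end up here.
        return default


converted = [cast_to_integer(v) for v in ("42", "3.9", "abc", None)]
print(converted)
```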

to_string(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Cast the elements inside a column or a list of columns to string.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols –

Returns

abstract static to_timestamp(cols, date_format=None, output_cols=None)[source]

Parameters

cols –
date_format –
output_cols –

Returns

top_domain(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Returns the top domain string from a url. From ‘https://www.hi-optimus.com’ it returns ‘hi-optimus.com’.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

trim(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Remove leading and trailing characters.

Strip whitespaces (including newlines) or a set of specified characters from each string in the column from left and right sides.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

unique_values(cols='*', estimate=False, compute=True, tidy=True) → list[source]

Return a list of unique values in a column.

Parameters

cols – ‘*’, list of columns names or a single column name.
estimate –
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.

unnest(cols='*', separator=None, splits=2, index=None, output_cols=None, drop=False, mode='string') → optimus.helpers.types.DataFrameType[source]

Split the columns values (array or string) in different columns.

Parameters

cols – Columns to be un-nested
output_cols – One or multiple resulting columns after the unnest operation, for example [(output_col_1_1, output_col_1_2), (output_col_2_1, output_col_2)].
separator – char or regex
splits – Number of columns splits.
index – Return a specific index per columns. [1,2]
drop –
mode –
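
Example (illustrative only): splitting one string value into a fixed number of parts, as `separator` and `splits` suggest. The helper name `unnest_value` is hypothetical; keeping the unsplit remainder in the last part and padding short results with `None` are assumptions, and only a literal separator (not a regex) is handled.

```python
def unnest_value(value, separator, splits=2):
    """Hypothetical helper: split a string into exactly `splits` parts."""
    # Split at most splits - 1 times so the remainder stays in the last part.
    parts = value.split(separator, splits - 1)
    # Pad with None when the value has fewer parts than requested.
    return parts + [None] * (splits - len(parts))


print(unnest_value("a-b-c", "-", splits=2))
print(unnest_value("a", "-", splits=2))
```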

unset_data_type(cols='*')[source]

Unset user set data type.

Parameters: cols – ‘*’, list of columns names or a single column name.
Returns

unset_date_format(cols='*')[source]

Unset user defined date format.

Parameters: cols – ‘*’, list of columns names or a single column name.
Returns

upper(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Uppercase the specified columns.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

BaseDataFrame

url_file(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Returns the file string from a url. From https://www.hi-optimus.com/index.html it returns ‘index.html’.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

url_fragment(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

url_path(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Returns the path string from a url.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

url_query(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Returns the query string from a url. From https://www.hi-optimus.com/?rollout=true it returns ‘rollout=true’.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

url_scheme(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Returns the scheme string from a url. From ‘https://www.hi-optimus.com’ it returns ‘https’.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns
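
Example (illustrative only): the URL components that the url_* methods, port and sub_domain refer to, extracted with the standard `urllib.parse.urlsplit`. The helper name `url_parts` and the sample URL (with a port and fragment added) are made up for this sketch; Optimus may format some pieces differently, for instance returning the port as a string or the path without its leading slash.

```python
from urllib.parse import urlsplit


def url_parts(url):
    """Hypothetical helper: break a URL into its named components."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,        # int, or None when the URL has no port
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


pieces = url_parts("https://www.hi-optimus.com:8080/index.html?rollout=true#top")
print(pieces)
```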

var(cols='*', tidy=True, compute=True)[source]

Return unbiased variance over requested columns.

Parameters

cols – “*”, column name or list of column names to be processed.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise.
compute – Compute the final result. If False, return a delayed object.

Returns

weekday(cols='*', format: Optional[str] = None, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Get the weekday from a date in a column.

Parameters

cols – “*”, column name or list of column names to be processed.
format –
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

word_count(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Count the number of words in a paragraph.

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

word_tokenize(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Parameters

cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

year(cols='*', format: Optional[str] = None, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Get the Year from a date in a column.

Parameters

cols – “*”, column name or list of column names to be processed.
format – String format
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

years_between(cols='*', value=None, date_format=None, round=None, output_cols=None) → optimus.helpers.types.DataFrameType[source]

Return the number of years between two dates.

Parameters

cols – “*”, column name or list of column names to be processed.
value –
date_format –
round –
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

z_score(cols='*', output_cols=None) → optimus.helpers.types.DataFrameType[source]

Returns the z-score of the given columns.

Parameters

cols – ‘*’, list of columns names or a single column name
output_cols – Column name or list of column names where the transformed data will be saved.

Returns

Dataframe with the z-score of the given columns appended.
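
Example (illustrative only): the z-score of each value, computed as (value - mean) / standard deviation with the standard `statistics` module. The helper name `z_scores` is hypothetical, and using the sample standard deviation (`statistics.stdev`) is an assumption; a population standard deviation would give slightly different numbers.

```python
from statistics import mean, stdev


def z_scores(values):
    """Hypothetical helper: standardize a list of numbers."""
    center = mean(values)
    spread = stdev(values)      # assumption: sample standard deviation
    # Distance of each value from the mean, in standard deviations.
    return [(value - center) / spread for value in values]


scores = z_scores([2, 4, 6])
print(scores)
```

A column with fewer than two values, or with all values equal, has no usable standard deviation and is not handled by this sketch.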