Columns
- class optimus.engines.base.columns.BaseColumns(root: optimus.helpers.types.DataFrameType)[source]
Base class for all Cols implementations
- abs(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Return the absolute numeric value of each value in a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the absolute value of each element.
- acos(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Apply the arccosine function to a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the arccosine of each element.
- acosh(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Apply the inverse hyperbolic cosine function to a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the inverse hyperbolic cosine of each element.
- add(cols='*', output_col=None) optimus.helpers.types.DataFrameType [source]
Add two or more columns element-wise.
- Parameters
cols – ‘*’, list of column names or a single column name.
output_col – Single output column in case no value is passed.
- Returns
Dataframe with the result of the arithmetic operation appended.
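The element-wise addition described for add can be sketched in plain Python. This illustrates the semantics only and is not the Optimus implementation: the helper `add_columns`, the dict-of-lists table and the output name `total` are assumptions made for the example.

```python
def add_columns(table, cols, output_col):
    """Return a copy of `table` with the row-wise sum of `cols` appended.

    `table` is a hypothetical stand-in for a dataframe: a dict that maps
    column names to equally long lists of numbers.
    """
    result = dict(table)  # shallow copy so the input mapping is not modified
    result[output_col] = [sum(row) for row in zip(*(table[c] for c in cols))]
    return result


table = {"a": [1, 2, 3], "b": [10, 20, 30]}
summed = add_columns(table, ["a", "b"], "total")
print(summed["total"])  # [11, 22, 33]
```

The original columns are kept and the result is appended as a new column, matching the Returns description above.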
- agg_exprs(cols='*', funcs=None, *args, compute=True, tidy=True, parallel=False)[source]
Run a list of aggregation functions.
- Parameters
cols – Columns over which to apply the aggregation functions.
funcs – List of aggregation functions.
args –
compute – Compute the result or return a delayed function.
tidy – Compact the dict output.
parallel – Execute the function in every column or apply it over the whole dataframe.
- Returns
The values calculated by the list of aggregation functions.
- any_greater_than(cols='*', value=None, inverse=False, tidy=True, compute=True)[source]
- Parameters
cols –
value –
inverse –
tidy –
compute –
- Returns
- abstract append(dfs: optimus.helpers.types.DataFrameTypeList) optimus.helpers.types.DataFrameType [source]
Appends one or more columns or dataframes.
- Parameters
dfs – DataFrame, list of dataframes or list of columns to append to the dataframe
- Returns
DataFrame
- apply(cols='*', func=None, func_return_type=None, args=None, func_type=None, where=None, filter_col_by_data_types=None, output_cols=None, skip_output_cols_processing=False, meta_action='apply_cols', mode='vectorized', set_index=False, default=None, **kwargs) optimus.helpers.types.DataFrameType [source]
- Parameters
cols – “*”, column name or list of column names to be processed.
func –
func_return_type –
args –
func_type –
where –
filter_col_by_data_types –
output_cols – Column name or list of column names where the transformed data will be saved.
skip_output_cols_processing –
meta_action –
mode –
set_index –
default –
kwargs –
- Returns
- apply_by_data_types(cols='*', func=None, args=None, data_type=None) optimus.helpers.types.DataFrameType [source]
Apply a function using a pandas UDF, or a plain UDF if Apache Arrow is not available.
- Parameters
cols – “*”, column name or list of column names to be processed.
func – Function to be applied to the columns.
args –
func – pandas_udf or udf. If None, try to use a pandas UDF (PyArrow needed).
data_type –
- Returns
- asin(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Apply the arcsine function to a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the arcsine of each element.
- asinh(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Apply the inverse hyperbolic sine function to a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the inverse hyperbolic sine of each element.
- assign(cols: Optional[Union[str, list, dict]] = None, values=None, **kwargs)[source]
Assign new columns to a Dataframe.
Returns a DataFrame with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.
- Parameters
cols – A dict with the form {“col_name”: “value”}, a list of columns or a single column
values – When no dict is passed to ‘cols’, uses this parameter to get the values.
kwargs –
- Returns
- abstract static astype(*args, **kwargs)[source]
Alias of the cast function, for compatibility with the pandas API.
- Parameters
args –
kwargs –
- Returns
- atan(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Apply the arctangent function to a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the arctangent of each element.
- atanh(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Apply the inverse hyperbolic tangent function to a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the inverse hyperbolic tangent of each element.
- bag_of_words(features, analyzer='word', ngram_range=2) optimus.helpers.types.DataFrameType [source]
- Parameters
analyzer –
features –
ngram_range –
- Returns
- boxplot(cols='*') dict [source]
Return the boxplot data in python dict format.
- Parameters
cols – “*”, column name or list of column names to be processed.
- Returns
dict with box plot data.
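The kind of data a box plot needs can be sketched with the standard library. The dict keys, the 1.5 × IQR fence rule and the helper name `boxplot_stats` are assumptions chosen for illustration; the actual keys returned by boxplot are not documented here.

```python
import statistics


def boxplot_stats(values):
    """Compute quartiles, whiskers and outliers for one numeric column."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),   # smallest value inside the fences
        "whisker_high": max(inside),  # largest value inside the fences
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(stats["outliers"])  # [100]
```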
- calculate_pattern_counts(cols='*', n=10, mode=0, flush=False) optimus.helpers.types.DataFrameType [source]
Counts how many equal patterns there are in a column. Uses a cache to trigger the operation only if necessary. Saves the result to meta and returns the same dataframe.
- Parameters
cols – “*”, column name or list of column names to be processed.
n – Return the Top n matches.
mode – Mode used to calculate the patterns.
flush – Flush the cache to process again.
- Returns
- capitalize(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Capitalize every word in a sentence.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- cast(cols=None, data_type=None, output_cols=None, *args, **kwargs) optimus.helpers.types.DataFrameType [source]
NOTE: There are two ways to cast the data. The native .astype() is faster, but it cannot handle some transformations, such as string to number, where values that cannot be converted should become nan.
Cast the elements inside a column or a list of columns to a specific data type. Unlike ‘cast’, this does not change the column’s data type.
- Parameters
cols – Column names to be cast, or a dictionary or list of tuples of column names and types with the following structure: cols = [(‘columnName1’, ‘integer’), (‘columnName2’, ‘float’), (‘columnName3’, ‘string’)]. The first element of each tuple is the column name; the second is the final data type of the column after the transformation.
output_cols – Column name or list of column names where the transformed data will be saved.
data_type – final data type
args – passed to cast function (df.cols.to_integer(…, -1)).
kwargs – passed to cast function (df.cols.to_integer(…, default=-1)).
- Returns
The cast columns.
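The string-to-number case mentioned in the note can be sketched as a tolerant conversion in which values that cannot be parsed become nan or a caller-supplied default. This is a plain-Python illustration, not the Optimus code path; the helper name `to_float` is hypothetical, and its `default` argument only mirrors the `default=-1` usage shown for the args and kwargs parameters.

```python
import math


def to_float(values, default=math.nan):
    """Convert each value to float, substituting `default` on failure."""
    converted = []
    for value in values:
        try:
            converted.append(float(value))
        except (TypeError, ValueError):
            # Unparseable strings raise ValueError, None raises TypeError.
            converted.append(default)
    return converted


result = to_float(["1", "2.5", "abc", None])
with_default = to_float(["7", "n/a"], default=-1)
print(with_default)  # [7.0, -1]
```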
- ceil(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Round each number in a column up to the nearest integer.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the ceil of each element.
- clip(cols='*', lower_bound=None, upper_bound=None, output_cols=None) optimus.helpers.types.DataFrameType [source]
Assigns values outside boundary to boundary values.
- Parameters
cols – “*”, column name or list of column names to be processed.
lower_bound – Minimum threshold value. All values below this threshold will be set to it. A missing threshold (e.g. NA) will not clip the value.
upper_bound – Maximum threshold value. All values above this threshold will be set to it. A missing threshold (e.g. NA) will not clip the value.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
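The clipping rule described for clip, including the case where a bound is missing, can be sketched as follows. This is a minimal plain-Python illustration; the helper name `clip_values` is hypothetical and `None` stands in for a missing threshold.

```python
def clip_values(values, lower_bound=None, upper_bound=None):
    """Set values outside the bounds to the nearest bound."""
    clipped = []
    for value in values:
        if lower_bound is not None and value < lower_bound:
            value = lower_bound
        if upper_bound is not None and value > upper_bound:
            value = upper_bound
        clipped.append(value)
    return clipped


both = clip_values([-5, 0, 5, 10], lower_bound=0, upper_bound=5)
lower_only = clip_values([-5, 10], lower_bound=0)
print(both)        # [0, 0, 5, 5]
print(lower_only)  # [0, 10]
```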
- concat(dfs: optimus.helpers.types.DataFrameTypeList) optimus.helpers.types.DataFrameType [source]
Same as append.
- Parameters
dfs – DataFrame, list of dataframes or list of columns to append to the dataframe
- Returns
DataFrame
- copy(cols='*', output_cols=None, columns=None) optimus.helpers.types.DataFrameType [source]
Copy one or multiple columns.
- Parameters
cols – Source column to be copied
output_cols – Column name or list of column names where the transformed data will be saved.
columns – List of tuples pairing each source column with the name of its copy, e.g. [(‘column1’, ‘column_copy’), (‘column1’, ‘column1_copy’)].
- Returns
- correlation(cols='*', method='pearson', compute=True, tidy=True)[source]
Compute pairwise correlation of columns, excluding NA/null values.
- Parameters
cols – “*”, column name or list of column names to be processed.
method – Method of correlation:
pearson: standard correlation coefficient.
kendall: Kendall Tau correlation coefficient.
spearman: Spearman rank correlation.
callable: callable with input two 1d ndarrays and returning a float. Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
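For reference, the pearson method corresponds to the standard Pearson coefficient, which can be sketched in plain Python. This is not the Optimus implementation: the helper names and the dict-of-lists input are assumptions, and the sketch assumes columns of equal length without missing values.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise correlations as a nested dict, symmetric with 1.0 on the diagonal."""
    names = list(columns)
    matrix = {name: {} for name in names}
    for i, a in enumerate(names):
        matrix[a][a] = 1.0
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            matrix[a][b] = matrix[b][a] = r  # mirror to keep the matrix symmetric
    return matrix


columns = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]}
matrix = correlation_matrix(columns)
```

Here `x` and `y` are perfectly correlated, while `x` and `z` are perfectly anti-correlated.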
- cos(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Apply cosine function to a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the cosine of each element.
- cosh(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Apply the hyperbolic cosine function to a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the hyperbolic cosine of each element.
- count() int [source]
Returns the number of columns in the dataframe.
- Returns
The number of columns in the dataframe.
- count_array(cols='*', inverse=False, tidy=True, compute=True)[source]
Counts the number of lists in a column.
- Parameters
cols – ‘*’, list of column names or a single column name.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_between(cols='*', lower_bound=None, upper_bound=None, equal=True, bounds=None, inverse=False, tidy=True, compute=True)[source]
Count the number of elements between a lower and an upper bound in a given column.
- Parameters
cols – ‘*’, list of column names or a single column name.
lower_bound – Lower bound.
upper_bound – Upper bound.
equal –
bounds –
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
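A range count of this kind can be sketched in plain Python. The meaning of `equal` is not documented above, so treating it as "include the bounds themselves" is an assumption, as is the helper itself; `inverse` is taken to count the elements that fall outside the range.

```python
def count_between(values, lower_bound, upper_bound, equal=True, inverse=False):
    """Count the values inside (or, with inverse=True, outside) the range."""

    def inside(value):
        if equal:  # assumed meaning: bounds are inclusive
            return lower_bound <= value <= upper_bound
        return lower_bound < value < upper_bound

    return sum(1 for value in values if inside(value) != inverse)


values = [1, 2, 3, 4, 5]
print(count_between(values, 2, 4))                # 3
print(count_between(values, 2, 4, equal=False))   # 1
print(count_between(values, 2, 4, inverse=True))  # 2
```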
- count_boolean(cols='*', inverse=False, tidy=True, compute=True)[source]
Counts the number of booleans in a column.
- Parameters
cols – ‘*’, list of column names or a single column name.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_containing(cols='*', value=None, inverse=False, tidy=True, compute=True)[source]
- Parameters
cols – ‘*’, list of column names or a single column name.
value – Value used to evaluate the function.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_data_type(cols='*', data_type=None, inverse=False, tidy=True, compute=True)[source]
Count the number of mismatch values in a given column.
- Parameters
cols – ‘*’, list of column names or a single column name.
data_type –
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_datetime(cols='*', inverse=False, tidy=True, compute=True)[source]
- Parameters
cols –
inverse –
tidy –
compute –
- Returns
- count_duplicated(cols='*', keep='first', inverse=False, tidy=True, compute=True)[source]
- Parameters
cols –
keep –
inverse –
tidy –
compute –
- Returns
- count_email(cols='*', inverse=False, tidy=True, compute=True)[source]
Counts the number of strings that look like an email address in a column.
- Parameters
cols – ‘*’, list of column names or a single column name.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_empty(cols='*', inverse=False, tidy=True, compute=True)[source]
Count the number of empty values in a given column.
- Parameters
cols – ‘*’, list of column names or a single column name.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_ending_with(cols='*', value=None, inverse=False, tidy=True, compute=True)[source]
Counts the number of elements that end with the given string.
- Parameters
cols – ‘*’, list of column names or a single column name.
value – Value used to evaluate the function.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_equal(cols='*', value=None, inverse=False, tidy=True, compute=True)[source]
Count the number of elements equal to a value in a given column.
- Parameters
cols – ‘*’, list of column names or a single column name.
value – Value used to evaluate the function.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_expression(value=None, inverse=False, tidy=True, compute=True)[source]
- Parameters
cols – ‘*’, list of column names or a single column name.
value – Value used to evaluate the function.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_float(cols='*', inverse=False, tidy=True, compute=True)[source]
Counts the number of floats in a column.
- Parameters
cols – ‘*’, list of column names or a single column name.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_gender(cols='*', inverse=False, tidy=True, compute=True)[source]
Counts the number of strings that look like a gender in a column.
- Parameters
cols – ‘*’, list of column names or a single column name.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_greater_than(cols='*', value=None, inverse=False, tidy=True, compute=True)[source]
Count the number of elements greater than a value in a given column.
- Parameters
cols – ‘*’, list of column names or a single column name.
value – Value used to evaluate the function.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_greater_than_equal(cols='*', value=None, inverse=False, compute=True, tidy=True)[source]
Count the number of elements greater than or equal to a value in a given column.
- Parameters
cols – ‘*’, list of column names or a single column name.
value – Value used to evaluate the function.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_http_code(cols='*', inverse=False, tidy=True, compute=True)[source]
Counts the number of strings that look like an HTTP code in a column.
- Parameters
cols – ‘*’, list of column names or a single column name.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_int(cols='*', inverse=False, tidy=True, compute=True)[source]
Count the number of integers in a column.
- Parameters
cols – ‘*’, list of column names or a single column name.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
- count_ip(cols='*', inverse=False, tidy=True, compute=True)[source]
Counts the number of strings that look like an IP address in a column.
- Parameters
cols – ‘*’, list of column names or a single column name.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_less_than(cols='*', value=None, inverse=False, tidy=True, compute=True)[source]
Count the number of elements smaller than a value in a given column.
- Parameters
cols – ‘*’, list of column names or a single column name.
value – Value used to evaluate the function.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_less_than_equal(cols='*', value=None, inverse=False, tidy=True, compute=True)[source]
Count the number of elements smaller than or equal to a value in a given column.
- Parameters
cols – ‘*’, list of column names or a single column name.
value – Value used to evaluate the function.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_match(cols='*', regex=None, data_type=None, inverse=False, tidy=True, compute=True)[source]
Counts the number of matching values in a given column.
- Parameters
cols – ‘*’, list of column names or a single column name.
data_type –
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_match_pattern(cols='*', pattern=None, inverse=False, tidy=True, compute=True)[source]
- Parameters
cols –
pattern –
inverse –
tidy –
compute –
- Returns
- count_mismatch(cols='*', data_type=None, inverse=False, tidy=True, compute=True)[source]
Count the number of mismatch values in a given column.
- Parameters
cols – ‘*’, list of column names or a single column name.
data_type –
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_missings(cols='*', inverse=False, tidy=True, compute=True)[source]
Count the number of missing values in a given column.
- Parameters
cols – ‘*’, list of column names or a single column name.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_nan(cols='*', inverse=False, tidy=True, compute=True)[source]
Count the number of ‘nan’ values in a given column.
- Parameters
cols – ‘*’, list of column names or a single column name.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_none(cols='*', inverse=False, tidy=True, compute=True)[source]
Count the number of ‘None’ values in a given column.
- Parameters
cols – ‘*’, list of column names or a single column name.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_not_equal(cols='*', value=None, inverse=False, tidy=True, compute=True)[source]
Count the number of elements not equal to a value in a given column.
- Parameters
cols – ‘*’, list of column names or a single column name.
value – Value used to evaluate the function.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_nulls(cols='*', inverse=False, tidy=True, compute=True)[source]
Count the number of ‘null’ values in a given column.
- Parameters
cols – ‘*’, list of column names or a single column name.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_numeric(cols='*', inverse=False, tidy=True, compute=True)[source]
Counts the numeric elements in a column.
- Parameters
cols – ‘*’, list of column names or a single column name.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_object(cols='*', inverse=False, tidy=True, compute=True)[source]
Counts the number of Python objects in a column.
- Parameters
cols – ‘*’, list of column names or a single column name.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_phone_number(cols='*', inverse=False, tidy=True, compute=True)[source]
Counts the number of strings that look like a phone number in a column.
- Parameters
cols – ‘*’, list of column names or a single column name.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_regex(cols='*', regex=None, inverse=False, tidy=True, compute=True)[source]
Counts the number of elements that match a regular expression.
- Parameters
cols – ‘*’, list of column names or a single column name.
regex – Regular expression.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
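Counting regular-expression matches can be sketched with the standard `re` module. Whether the library matches anywhere in the string or only at its start is not stated above; this illustration assumes a search anywhere in the string, and the helper `count_regex` shown here is a plain-Python stand-in rather than the library method.

```python
import re


def count_regex(values, regex, inverse=False):
    """Count the values whose string form contains a match for `regex`."""
    pattern = re.compile(regex)
    return sum(
        1 for value in values
        if (pattern.search(str(value)) is not None) != inverse
    )


codes = ["a1", "b", "c22", ""]
print(count_regex(codes, r"\d"))                # 2
print(count_regex(codes, r"\d", inverse=True))  # 2
```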
- count_social_security_number(cols='*', inverse=False, tidy=True, compute=True)[source]
Counts the number of strings that look like a social security number in a column.
- Parameters
cols – ‘*’, list of column names or a single column name.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_starting_with(cols='*', value=None, inverse=False, tidy=True, compute=True)[source]
Counts the number of elements that start with the given string.
- Parameters
cols – ‘*’, list of column names or a single column name.
value – Value used to evaluate the function.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
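The prefix count, together with its suffix counterpart count_ending_with, can be sketched in plain Python. These helpers only illustrate the semantics; converting each element with `str()` before testing is an assumption, not documented behaviour.

```python
def count_starting_with(values, value, inverse=False):
    """Count the elements whose string form starts with `value`."""
    return sum(1 for v in values if str(v).startswith(value) != inverse)


def count_ending_with(values, value, inverse=False):
    """Count the elements whose string form ends with `value`."""
    return sum(1 for v in values if str(v).endswith(value) != inverse)


names = ["alpha", "alps", "beta", "gamma"]
print(count_starting_with(names, "al"))                # 2
print(count_starting_with(names, "al", inverse=True))  # 2
print(count_ending_with(names, "a"))                   # 3
```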
- count_str(cols='*', inverse=False, tidy=True, compute=True)[source]
Counts the number of strings in a column.
- Parameters
cols – ‘*’, list of column names or a single column name.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_uniques(cols='*', estimate=False, compute=True, tidy=True) int [source]
Count the number of unique values in a column.
- Parameters
cols – ‘*’, list of column names or a single column name.
estimate –
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
- count_url(cols='*', inverse=False, tidy=True, compute=True)[source]
Counts the number of strings that look like a URL in a column.
- Parameters
cols – ‘*’, list of column names or a single column name.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_values_in(cols='*', values=None, inverse=False, tidy=True, compute=True)[source]
- Parameters
cols – ‘*’, list of column names or a single column name.
values – Values used to evaluate the function.
inverse – Invert the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return a single value when one column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- count_zeros(cols='*', tidy=True, compute=True)[source]
Return the count of zeros by column.
- Parameters
cols – “*”, column name or list of column names to be processed.
tidy –
compute –
- Returns
- count_zip_code(cols='*', inverse=False, tidy=True, compute=True)[source]
Count the number of strings that look like zip codes in a column.
- Parameters
cols – ‘*’, list of columns names or a single column name.
inverse – Inverse the function selection.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a single column, or column names and values if not. If False, it will return the function name, the column name and the value.
- Returns
The number of elements that match the function.
- cross_tab(col_x, col_y, output='dict', compute=True) dict [source]
- Parameters
col_x –
col_y –
output –
compute – Compute the result or return a delayed function.
- Returns
- cummax(cols='*', output_cols=None)[source]
Return cumulative maximum over a DataFrame or column.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the cumulative maximum.
- cummin(cols='*', output_cols=None)[source]
Return cumulative minimum over a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the cumulative minimum.
- cumprod(cols='*', output_cols=None)[source]
Return cumulative product over a DataFrame or column.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the cumulative product.
- cumsum(cols='*', output_cols=None)[source]
Return cumulative sum over a DataFrame or column.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the cumulative sum.
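The four cumulative functions above (cummax, cummin, cumprod, cumsum) all compute a running reduction. As a rough illustration in plain Python, not Optimus itself, the semantics can be sketched with `itertools.accumulate`; the helper name `cumulative` is hypothetical.

```python
from itertools import accumulate
import operator


def cumulative(values, kind):
    """Return the running max/min/product/sum of a list (illustrative helper)."""
    reducers = {
        "cummax": max,
        "cummin": min,
        "cumprod": operator.mul,
        "cumsum": operator.add,
    }
    # accumulate() applies the reducer to the running result and the next value.
    return list(accumulate(values, reducers[kind]))


data = [3, 1, 4, 1, 5]
for kind in ("cummax", "cummin", "cumprod", "cumsum"):
    print(kind, cumulative(data, kind))
```

Each output list has the same length as the input, which is why these functions can write their result into a column.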
- cut(cols='*', bins=None, labels=None, default=None, output_cols=None) optimus.helpers.types.DataFrameType [source]
Use cut when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges. Supports binning into an equal number of bins, or a pre-specified array of bins.
- Parameters
cols – “*”, column name or list of column names to be processed.
bins –
labels –
default –
output_cols –
- Returns
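The binning idea behind cut can be sketched for a single value in plain Python. This is only an illustration: the helper `cut_value` is hypothetical, and treating each bin as right-closed, `(lower, upper]`, is an assumption rather than documented Optimus behavior.

```python
from bisect import bisect_left


def cut_value(value, bins, labels, default=None):
    """Map a number to the label of the bin (bins[i], bins[i + 1]] it falls in."""
    # bisect_left finds the first edge >= value; the bin index is one less.
    idx = bisect_left(bins, value) - 1
    if 0 <= idx < len(labels):
        return labels[idx]
    # Values outside every bin get the default.
    return default


age_bins = [0, 18, 65, 120]
age_labels = ["child", "adult", "senior"]
print([cut_value(age, age_bins, age_labels) for age in (10, 18, 30, 70, 200)])
```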
- data_type(cols='*', names=False, tidy=True) dict [source]
Return the column(s) data type as string.
- Parameters
cols – Columns to be processed
names – Returns aliases for every type instead of its internal name
- Returns
Return a dict of column and its respective data type.
- date_format(cols='*', tidy=True, compute=True, cached=None, **kwargs)[source]
Get the date format from a column, compatible with ‘format_date’.
- Parameters
cols – “*”, column name or list of column names to be processed.
tidy – The result format. If True, it will return a value if you process a single column, or column names and values if not.
compute – Compute the final result. If False, return a delayed object.
cached – {None, True, False}, Gets cached date_formats (True), calculates them (False) or a combination of both (None).
kwargs –
- Returns
- date_formats(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Get the date format for every value in specified columns.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
BaseDataFrame
- day(cols='*', format: Optional[str] = None, output_cols=None) optimus.helpers.types.DataFrameType [source]
Get the day from a date in a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
format – String format
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- days_between(cols='*', value=None, date_format=None, round=None, output_cols=None) optimus.helpers.types.DataFrameType [source]
Return the number of days between two dates.
- Parameters
cols – “*”, column name or list of column names to be processed.
value –
date_format –
round –
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- div(cols='*', output_col=None) optimus.helpers.types.DataFrameType [source]
Divide two or more columns.
- Parameters
cols – ‘*’, list of columns names or a single column name
output_col – Single output column in case no value is passed
- Returns
Dataframe with the result of the arithmetic operation appended.
- domain(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Returns the domain string from a url. From https://www.hi-optimus.com it returns hi-optimus.com.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
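The example in the description above can be reproduced with the standard library. This is a simplified sketch, not the Optimus implementation: the helper `url_domain` is hypothetical, and stripping only a leading `www.` is an assumption (robust domain extraction generally needs a public suffix list).

```python
from urllib.parse import urlsplit


def url_domain(url):
    """Return the host of a URL without a leading 'www.' (illustrative helper)."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


print(url_domain("https://www.hi-optimus.com"))
```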
- double_metaphone(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
The Double Metaphone phonetic encoding algorithm is the second generation of the Metaphone algorithm. It is called “Double” because it can return both a primary and a secondary code for a string; this accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry.
- Parameters
cols – ‘*’, list of columns names or a single column name.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- drop(cols=None, regex=None, data_type=None) optimus.helpers.types.DataFrameType [source]
Drop a list of columns.
- Parameters
cols – “*”, column name or list of column names to be processed.
regex – Regex expression to select the columns
data_type –
- Returns
- duplicate(cols='*', output_cols=None, columns=None) optimus.helpers.types.DataFrameType [source]
Alias of copy function.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
columns – List of (column, copy name) tuples, for example [(‘column1’, ‘column_copy’), (‘column1’, ‘column1_copy’)].
- Returns
- email_domain(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Return the domain from an email address. From optimus@mail.col it will return ‘mail’.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- email_username(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Return the username from an email address. From optimus@mail.col it will return ‘optimus’.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
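The two email helpers above can be mimicked on a single string as follows. This is a minimal sketch with a hypothetical helper `split_email`; taking only the first label of the domain part is an assumption made to match the example given for email_domain (‘mail’ from optimus@mail.col).

```python
def split_email(address):
    """Return (username, domain) for an email address (illustrative helper)."""
    # rpartition splits on the last '@', so the username may itself contain '@'.
    username, _, domain = address.rpartition("@")
    # Keep only the first label of the domain, e.g. 'mail' from 'mail.col'.
    return username, domain.split(".")[0]


print(split_email("optimus@mail.col"))
```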
- exec_agg(exprs, compute=True)[source]
Execute one or multiple aggregation functions.
- Parameters
exprs –
compute – Compute the result or return a delayed function.
- Returns
- exp(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Return Euler’s number, e (~2.718) raised to the power of each value in a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing e raised to the power of each element.
- expand_contracted_words(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Expand contracted words.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- extract(cols='*', regex=None, output_cols=None) optimus.helpers.types.DataFrameType [source]
Extract a string that matches a regular expression.
- Parameters
cols – “*”, column name or list of column names to be processed.
regex – Regular expression
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- fill_na(cols='*', value=None, output_cols=None) optimus.helpers.types.DataFrameType [source]
Replace null data with a specified value.
- Parameters
cols – ‘*’, list of columns names or a single column name.
value – value to replace the nan/None values
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Returns the column filled with given value.
- fingerprint(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Create the fingerprint for a column.
- Parameters
cols – ‘*’, list of columns names or a single column name.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
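The description above does not define the fingerprint. Assuming it follows the key-collision fingerprint popularized by OpenRefine (trim, lowercase, strip accents and punctuation, then sort and deduplicate the tokens), a plain-Python sketch looks like this. The function below is an illustration, not the Optimus code.

```python
import string
import unicodedata


def fingerprint(text):
    """Build an OpenRefine-style fingerprint key for a string (assumed recipe)."""
    text = text.strip().lower()
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, deduplicate, sort and rejoin.
    return " ".join(sorted(set(text.split())))


print(fingerprint("  Tom  Cruise, tom "))
print(fingerprint("Cruise, Tom"))
```

Strings that differ only in case, punctuation, accents or word order end up with the same key, which is what makes the fingerprint useful for clustering near-duplicates.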
- floor(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Round each number in a column down to the nearest integer.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the floor of each element.
- format_date(cols='*', current_format=None, output_format=None, output_cols=None) optimus.helpers.types.DataFrameType [source]
Convert the dates in a column from current_format to output_format.
- Parameters
cols – “*”, column name or list of column names to be processed.
current_format –
output_format –
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- frequency(cols='*', n=32, percentage=False, total_rows=None, count_uniques=False, compute=True, tidy=False) dict [source]
Return the count of every element in the column.
- Parameters
cols – “*”, column name or list of column names to be processed.
n – Number of bins to be returned.
percentage – If True, calculate the percentage.
total_rows – If True, return the total count.
count_uniques – If True, return the number of unique elements.
compute – Compute the result or return a delayed function.
tidy – The result format. If True, it will return a value if you process a single column, or column names and values if not. If False, it will return the function name, the column name and the value.
- Returns
dict with the count of every element in the column.
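The counting that frequency performs can be sketched with `collections.Counter`. The output shape used here (a list of dicts with `value`, `count` and `percentage` keys) is an assumption for illustration only and may differ from what Optimus returns.

```python
from collections import Counter


def frequency(values, n=32, percentage=False):
    """Return the n most common values with their counts (illustrative helper)."""
    total = len(values)
    rows = []
    for value, count in Counter(values).most_common(n):
        row = {"value": value, "count": count}
        if percentage:
            # Share of all rows, as a percentage rounded to two decimals.
            row["percentage"] = round(count * 100 / total, 2)
        rows.append(row)
    return rows


print(frequency(["a", "b", "a", "c", "a", "b"], n=2, percentage=True))
```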
- get(cols='*', keys=None, output_cols=None) optimus.helpers.types.DataFrameType [source]
Return items from a dict over requested columns.
- Parameters
cols – “*”, column name or list of column names to be processed.
keys – The value of the dict key that will be returned.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the value of the key selected.
- groupby(by, agg) optimus.helpers.types.DataFrameType [source]
Helper function that manages column names in the aggregation output. It also handles column ordering, because Dask can reorder columns.
- Parameters
by – Column name.
agg – List of tuples with the form [(“agg”, “col”)]
- Returns
- heatmap(col_x, col_y, bins_x=10, bins_y=10, compute=True) dict [source]
- Parameters
col_x –
col_y –
bins_x –
bins_y –
compute –
- Returns
- hist(cols='*', buckets=32, compute=True) dict [source]
Return the histogram representation of the distribution of the data.
- Parameters
cols – “*”, column name or list of column names to be processed.
buckets – Number of histogram bins to be used.
compute –
- Returns
- host(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Returns the host string from a url.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- hour(cols='*', format: Optional[str] = None, output_cols=None) optimus.helpers.types.DataFrameType [source]
Get the hour from a date in a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
format – String format
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- hours_between(cols='*', value=None, date_format=None, round=None, output_cols=None) optimus.helpers.types.DataFrameType [source]
Return the number of hours between two dates.
- Parameters
cols – “*”, column name or list of column names to be processed.
value –
date_format –
round –
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- impute(cols='*', data_type='auto', strategy='auto', fill_value=None, output_cols=None)[source]
Fill null values using a constant or any of the strategy available.
- Parameters
cols – “*”, column name or list of column names to be processed.
data_type –
If “auto”, detect if it’s continuous or categorical using the data type of the column.
If “continuous”, sets the data as continuous and if no ‘strategy’ is passed then the mean is used.
If “categorical”, sets the data as categorical and if no ‘strategy’ is passed then the most frequent value is used.
strategy –
If “auto”, automatically selects a strategy depending on the data type passed or inferred on ‘data_type’.
If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.
If “median”, then replace missing values using the median along each column. Can only be used with numeric data.
If “most_frequent”, then replace missing using the most frequent value along each column. Can be used with strings or numeric data.
If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.
fill_value – constant to be used to fill null values
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Return the Column filled with the imputed values.
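The strategies listed above can be sketched on a plain list with the standard `statistics` module. This is an illustration under simplifying assumptions: missing values are represented by `None`, the "auto" options are omitted, and the function is not the Optimus implementation.

```python
from statistics import mean, median, mode


def impute(values, strategy="mean", fill_value=None):
    """Replace None entries using the chosen strategy (illustrative helper)."""
    present = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(present)
    elif strategy == "median":
        fill = median(present)
    elif strategy == "most_frequent":
        fill = mode(present)
    elif strategy == "constant":
        fill = fill_value
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


print(impute([1, None, 3, 8], strategy="mean"))
print(impute(["a", None, "b", "a"], strategy="most_frequent"))
```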
- abstract static index_to_string(cols=None, output_cols=None) optimus.helpers.types.DataFrameType [source]
Maps a column of label indices back to a column containing the original labels as strings.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- infer_data_types(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols –
- Returns
- infer_date_formats(cols='*', sample=200, tidy=True) dict [source]
Infer date formats in a dataframe from a sample. This function uses Pandas no matter which engine you are using.
- Parameters
cols – Columns in which you want to infer the datatype.
- Returns
dict with the column and the inferred date format
- infer_type(cols='*', sample=200, tidy=True) dict [source]
Infer data types in a dataframe from a sample. First it identifies the data type of every value in every cell. After that it takes all the values and applies some heuristics to better identify the data type. This function uses Pandas no matter which engine you are using.
- Parameters
cols – “*”, column name or list of column names to be processed.
sample –
tidy – The result format. If True, it will return a value if you process a single column, or column names and values if not. If False, it will return the function name, the column name and the value.
- Returns
dict with the column and the inferred data type.
- inferred_data_type(cols='*', use_internal=False, tidy=True)[source]
Get the inferred data types from the meta data.
- Parameters
cols – “*”, column name or list of column names to be processed.
use_internal – If no inferred data type is found, return a translated internal data type instead of None.
tidy – The result format. If True, it will return a value if you process a single column, or column names and values if not. If False, it will return the function name, the column name and the value.
- Returns
Python dictionary with column names and their data types.
- iqr(cols='*', more=None, relative_error=10000, estimate=True)[source]
Return the column Inter Quartile Range value.
- Parameters
cols – “*”, column name or list of column names to be processed.
more – Return info about q1 and q3
relative_error –
- Returns
Return the column Inter Quartile Range value.
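The interquartile range is the difference between the third and first quartiles, IQR = Q3 - Q1. A minimal sketch using `statistics.quantiles` follows; the "inclusive" interpolation method and the dict returned when `more` is set are assumptions, and Optimus may estimate quartiles differently (see relative_error and estimate).

```python
from statistics import quantiles


def iqr(values, more=False):
    """Return Q3 - Q1, optionally together with q1 and q3 (illustrative helper)."""
    # n=4 yields the three quartile cut points: Q1, median, Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    result = q3 - q1
    if more:
        return {"iqr": result, "q1": q1, "q3": q3}
    return result


print(iqr([1, 2, 3, 4, 5, 6, 7, 8], more=True))
```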
- item(cols='*', n=None, output_cols=None) optimus.helpers.types.DataFrameType [source]
Return items from a list over requested columns.
- Parameters
cols – “*”, column name or list of column names to be processed.
n – The position of the element that will be returned.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the value of the item selected.
- join(df_right: optimus.helpers.types.DataFrameType, how='left', on=None, left_on=None, right_on=None, key_middle=False) optimus.helpers.types.DataFrameType [source]
Join two dataframes using a column.
- Parameters
df_right – The dataframe that will be used to join the actual dataframe.
how – {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘left’
on – The column that will be used to join the two dataframes.
left_on – The column in the actual dataframe that will be used to make the join.
right_on – The column in the given dataframe that will be used to make the join.
key_middle – Order the columns putting the left df columns before the key column and the right df columns after it.
- Returns
Dataframe
- keep(cols=None, regex=None) optimus.helpers.types.DataFrameType [source]
Keep only the specified columns and drop the rest.
- Parameters
cols – “*”, column name or list of column names to be processed.
regex – Regex expression to select the columns
- Returns
- kurtosis(cols='*', tidy=True, compute=True)[source]
Returns the kurtosis of the values over the requested columns.
- Parameters
cols – “*”, column name or list of column names to be processed.
tidy – The result format. If True, it will return a value if you process a single column, or column names and values if not. If False, it will return the function name, the column name and the value.
compute – Compute the final result. If False, return a delayed object.
- Returns
Returns the kurtosis of the values over the requested columns.
- left(cols='*', n=None, output_cols=None) optimus.helpers.types.DataFrameType [source]
Get the substring from the first character to the nth, counting from left to right.
- Parameters
cols – “*”, column name or list of column names to be processed.
n – Number of characters to get, starting from 0.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- lemmatize_verbs(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Find the lemma of a word depending on its meaning and context.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- len(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Return the length of every string in a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- levenshtein(cols='*', other_cols=None, value=None, output_cols=None)[source]
Calculate the Levenshtein distance to a specified column. The Levenshtein distance is a string metric for measuring the difference between two sequences.
- Parameters
cols – ‘*’, list of columns names or a single column name.
other_cols –
value –
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
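For reference, the Levenshtein distance between two strings is the minimum number of single-character insertions, deletions and substitutions needed to turn one into the other. The standard dynamic-programming computation can be sketched as follows; this is a generic implementation, not necessarily the one Optimus uses.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```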
- ln(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Return the natural logarithm of each value in a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the natural logarithm of each element.
- log(cols='*', base=10, output_cols=None) optimus.helpers.types.DataFrameType [source]
Return the logarithm of each value in a column, in the given base (10 by default).
- Parameters
cols – “*”, column name or list of column names to be processed.
base –
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the logarithm of each element in the given base (10 by default).
- lower(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Lowercase the specified columns.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
BaseDataFrame
- mad(cols='*', relative_error=10000, more=False, estimate=True, tidy=True, compute=True)[source]
- Parameters
cols – “*”, column name or list of column names to be processed.
relative_error –
more –
estimate –
tidy – The result format. If True, it will return a value if you process a single column, or column names and values if not. If False, it will return the function name, the column name and the value.
compute – Compute the result or return a delayed function.
- match_rating_codex(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
The match rating approach (MRA) is a phonetic algorithm developed by Western Airlines in 1977 for the indexation and comparison of homophonous names.
- Parameters
cols – ‘*’, list of columns names or a single column name.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- max(cols='*', numeric=None, tidy: bool = True, compute: bool = True)[source]
Return the maximum value of one or more columns.
- Parameters
cols – “*”, column name or list of column names to be processed.
numeric – if True, cast to numeric before processing.
tidy – The result format. If True, it will return a value if you process a single column, or column names and values if not. If False, it will return the function name, the column name and the value.
compute – Compute the final result. If False, return a delayed object.
- Returns
- max_abs_scaler(cols='*', output_cols=None)[source]
Scale each feature by its maximum absolute value.
- Parameters
cols – ‘*’, list of columns names or a single column name
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
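Scaling by the maximum absolute value divides every element by max(|x|), so results fall in [-1, 1] and zeros stay zero. A plain-Python sketch with a hypothetical helper name:

```python
def max_abs_scale(values):
    """Divide each value by the largest absolute value in the list."""
    largest = max(abs(v) for v in values)
    if largest == 0:
        # All values are zero; nothing to scale.
        return [0.0 for _ in values]
    return [v / largest for v in values]


print(max_abs_scale([-4, 2, 1]))
```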
- mean(cols='*', tidy=True, compute=True)[source]
Return the mean of the values over the requested columns.
- Parameters
cols – “*”, column name or list of column names to be processed.
tidy – The result format. If True, it will return a value if you process a single column, or column names and values if not. If False, it will return the function name, the column name and the value.
compute – Compute the final result. If False, return a delayed object.
- Returns
The mean of the values over the requested columns.
- median(cols='*', relative_error=10000, tidy=True, compute=True)[source]
Returns the median of the values over the requested columns.
- Parameters
cols – “*”, column name or list of column names to be processed.
relative_error –
tidy – The result format. If True, it will return a value if you process a single column, or column names and values if not. If False, it will return the function name, the column name and the value.
compute –
- Returns
Returns the median of the values over the requested columns.
- metaphone(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Apply the Metaphone algorithm to a specified column. Metaphone is a phonetic algorithm, published by Lawrence Philips in 1990, for indexing words by their English pronunciation.
- Parameters
cols – ‘*’, list of columns names or a single column name.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- mid(cols='*', start=0, n=1, output_cols=None) optimus.helpers.types.DataFrameType [source]
Get the substring of n characters starting at position start.
- Parameters
cols – “*”, column name or list of column names to be processed.
start –
n –
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- min(cols='*', numeric=None, tidy: bool = True, compute: bool = True)[source]
Return the minimum value of one or more columns.
- Parameters
cols – “*”, column name or list of column names to be processed.
numeric – if True, cast to numeric before processing.
tidy – The result format. If True, it will return a value if you process a single column, or column names and values if not. If False, it will return the function name, the column name and the value.
compute – Compute the final result. If False, return a delayed object.
- Returns
- min_max_scaler(cols='*', output_cols=None)[source]
Transform features by scaling each feature to a given range.
- Parameters
cols – ‘*’, list of columns names or a single column name
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
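Min-max scaling maps each value x to (x - min) / (max - min). The signature above exposes no range argument, so the sketch below assumes the common target range [0, 1]; the helper name is hypothetical.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span == 0:
        # A constant column has no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lowest) / span for v in values]


print(min_max_scale([10, 20, 30]))
```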
- minute(cols='*', format: Optional[str] = None, output_cols=None) optimus.helpers.types.DataFrameType [source]
Get the minutes from a date in a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
format – String format
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- minutes_between(cols='*', value=None, date_format=None, round=None, output_cols=None) optimus.helpers.types.DataFrameType [source]
Return the number of minutes between two dates.
- Parameters
cols – “*”, column name or list of column names to be processed.
value –
date_format –
round –
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- mod(cols='*', divisor=2, output_cols=None) optimus.helpers.types.DataFrameType [source]
Return the modulo of each value in a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
divisor –
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the modulo of each element.
- mode(cols='*', tidy: bool = True, compute: bool = True)[source]
Return the mode value over the requested columns.
- Parameters
cols – “*”, column name or list of column names to be processed.
tidy – The result format. If True, it will return a value if you process a single column, or column names and values if not. If False, it will return the function name, the column name and the value.
compute – Compute the final result. If False, return a delayed object.
- Returns
- modified_z_score(cols='*', estimate=True, output_cols=None) optimus.helpers.types.DataFrameType [source]
Returns the modified z-score of the given columns.
- Parameters
cols – ‘*’, list of columns names or a single column name
estimate –
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Returns the modified z-score of the given columns.
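The modified z-score is commonly defined as 0.6745 * (x - median) / MAD, where MAD is the median of the absolute deviations from the median. The sketch below uses that textbook definition; whether Optimus uses the same constant, or an estimated median when estimate=True, is an assumption.

```python
from statistics import median


def modified_z_scores(values):
    """Return the modified z-score of every value in a list."""
    med = median(values)
    # Median absolute deviation from the median.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median; report zero scores.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


print(modified_z_scores([1, 2, 3, 4, 100]))
```

Because the median and MAD are robust to extreme values, the outlier 100 receives a very large score while the remaining values stay close to zero.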
- month(cols='*', format: Optional[str] = None, output_cols=None) optimus.helpers.types.DataFrameType [source]
Get the month from a date in a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
format – String format
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- months_between(cols='*', value=None, date_format=None, round=None, output_cols=None) optimus.helpers.types.DataFrameType [source]
Return the number of months between two dates.
- Parameters
cols – “*”, column name or list of column names to be processed.
value –
date_format –
round –
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- move(column, position, ref_col=None) optimus.helpers.types.DataFrameType [source]
Move a column to a specific position.
- Parameters
column – Column(s) to be moved
position – Column new position. Accepts ‘after’, ‘before’, ‘beginning’, ‘end’ or a numeric value, relative to ‘ref_col’.
ref_col – Column taken as reference
- Returns
DataFrame
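The named positions accepted by move can be illustrated on a plain list of column names. This sketch uses a hypothetical helper and covers only ‘after’, ‘before’, ‘beginning’ and ‘end’; numeric positions are left out because their exact meaning is not spelled out above.

```python
def move_column(columns, column, position, ref_col=None):
    """Return a new column order with `column` moved (illustrative helper)."""
    remaining = [c for c in columns if c != column]
    if position == "beginning":
        index = 0
    elif position == "end":
        index = len(remaining)
    elif position == "before":
        index = remaining.index(ref_col)
    elif position == "after":
        index = remaining.index(ref_col) + 1
    else:
        raise ValueError(f"unsupported position: {position}")
    remaining.insert(index, column)
    return remaining


print(move_column(["a", "b", "c"], "c", "before", ref_col="b"))
```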
- mul(cols='*', output_col=None) optimus.helpers.types.DataFrameType [source]
Multiply two or more columns.
- Parameters
cols – ‘*’, list of columns names or a single column name
output_col – Single output column in case no value is passed
- Returns
Dataframe with the result of the arithmetic operation appended.
- names(cols='*', data_types=None, invert=False, is_regex=None) list [source]
Return the names of the columns.
- Parameters
cols – Regex, “*” or columns to get.
data_types – returns only columns with matching data types
invert – invert column selection
is_regex – if True, forces cols to be treated as a regex
- Returns
- abstract static nest(cols, separator='', output_col=None, drop=True, shape='string') optimus.helpers.types.DataFrameType [source]
Concatenate two or more columns into one.
- Parameters
cols – ‘*’, list of columns names or a single column name
separator –
output_col – Column name or list of column names where the transformed data will be saved.
drop –
shape –
- Returns
Columns with all the specified columns concatenated.
- ngram_fingerprint(cols='*', n_size=2, output_cols=None) optimus.helpers.types.DataFrameType [source]
Calculate the ngram for a fingerprinted string.
- Parameters
cols – “*”, column name or list of column names to be processed.
n_size – The ngram size.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
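Assuming this follows the n-gram fingerprint used for key-collision clustering in OpenRefine (lowercase, drop punctuation and whitespace, collect the character n-grams, then sort, deduplicate and join them), a plain-Python sketch could look like this. It is an illustration, not the Optimus code.

```python
import string


def ngram_fingerprint(text, n_size=2):
    """Build an n-gram fingerprint key for a string (assumed recipe)."""
    drop = str.maketrans("", "", string.punctuation + string.whitespace)
    cleaned = text.lower().translate(drop)
    # Collect every distinct run of n_size consecutive characters.
    grams = {cleaned[i:i + n_size] for i in range(len(cleaned) - n_size + 1)}
    return "".join(sorted(grams))


print(ngram_fingerprint("Paris"))
print(ngram_fingerprint("P a r i s."))
```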
- ngrams(cols='*', n_size=2, output_cols=None) optimus.helpers.types.DataFrameType [source]
Calculate the ngram for a fingerprinted string.
- Parameters
cols – ‘*’, list of columns names or a single column name.
n_size – The ngram size.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- normalize_chars(cols='*', output_cols=None)[source]
Remove diacritics from a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- normalize_spaces(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Remove extra whitespace between words and trim whitespace from the beginning and the end of each string.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- num_to_words(cols='*', language='en', output_cols=None) optimus.helpers.types.DataFrameType [source]
Convert numbers to their string representation.
- Parameters
cols – “*”, column name or list of column names to be processed.
language –
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column with numbers converted to their string representation.
- nysiis(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Apply the NYSIIS algorithm to a specified column. NYSIIS (New York State Identification and Intelligence System).
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- one_hot_encode(cols='*', prefix=None, drop=True, **kwargs) optimus.helpers.types.DataFrameType [source]
Maps a categorical column to multiple binary columns, with at most a single one-value.
- Parameters
cols – Columns to be encoded.
prefix – Prefix of the columns where the output is going to be saved.
drop –
- Returns
Dataframe with encoded columns.
- pad(cols='*', width=0, fill_char='0', side='left', output_cols=None) optimus.helpers.types.DataFrameType [source]
Fill a string to match the given string length.
- Parameters
cols – “*”, column name or list of column names to be processed.
width – Total length of the string.
fill_char – The char that will be used to fill the string.
side – Fill the left or the right side.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
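A rough sketch of the padding that `pad` describes, applied to one string with Python's built-in `str.rjust` and `str.ljust`. The helper name `pad_value` is hypothetical, and the sketch assumes `fill_char` is a single character and `side` is either 'left' or 'right'.

```python
def pad_value(text: str, width: int = 0, fill_char: str = "0", side: str = "left") -> str:
    """Pad a string up to `width` characters (illustrative helper, not the Optimus API)."""
    if side == "left":
        # rjust right-justifies the text, so the fill goes on the left.
        return text.rjust(width, fill_char)
    if side == "right":
        # ljust left-justifies the text, so the fill goes on the right.
        return text.ljust(width, fill_char)
    raise ValueError("side must be 'left' or 'right'")


print(pad_value("42", width=5))
print(pad_value("42", width=5, fill_char="*", side="right"))
```

Strings that are already at least `width` characters long are returned unchanged.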
- parse_inferred_types(col_data_type)[source]
Parse an engine-specific column data type to a profiler data type.
- Parameters
col_data_type – Engine-specific column data type.
- Returns
Dict
- pattern(cols='*', output_cols=None, mode=0) optimus.helpers.types.DataFrameType [source]
Replace alphanumeric and punctuation chars with placeholder chars, to help find string patterns:
c = Any alpha char in lower or upper case
l = Any alpha char in lower case
U = Any alpha char in upper case
* = Any alphanumeric char in lower or upper case. Used only in modes 2 and 3
# = Any numeric char
! = Any punctuation char
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
mode – One of:
0: Identify lower, upper and digits, except spaces and special chars.
1: Identify chars and digits, except spaces and special chars.
2: Identify any alphanumeric, except spaces and special chars.
3: Identify alphanumeric and special chars, except white spaces.
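The four modes of `pattern` can be sketched per string as follows. This is only one reading of the mode descriptions above: it assumes ASCII input, and the helper name `pattern_value` is hypothetical. The actual Optimus behavior, especially for non-ASCII text, may differ.

```python
import string


def pattern_value(text: str, mode: int = 0) -> str:
    """Replace characters with placeholder symbols (illustrative, ASCII only)."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            # mode 0 keeps case information, mode 1 only "alpha", modes 2-3 "alphanumeric"
            out.append({0: "l", 1: "c"}.get(mode, "*"))
        elif ch in string.ascii_uppercase:
            out.append({0: "U", 1: "c"}.get(mode, "*"))
        elif ch in string.digits:
            out.append("#" if mode in (0, 1) else "*")
        elif ch in string.punctuation and mode == 3:
            # Only mode 3 replaces punctuation.
            out.append("!")
        else:
            # Spaces (and punctuation in modes 0-2) are left untouched.
            out.append(ch)
    return "".join(out)


for m in range(4):
    print(m, pattern_value("Ab-12", mode=m))
```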
- pattern_counts(cols='*', n=10, mode=0, flush=False) dict [source]
Get how many equal patterns there are in a column. Triggers the operation only if necessary.
- Parameters
cols – “*”, column name or list of column names to be processed.
n – Top n matches
mode –
flush – Flushes the cache to process again
- Returns
- percentile(cols='*', values=None, relative_error=10000, estimate=True, tidy=True, compute=True)[source]
Return the values at the given percentiles over the requested columns.
- Parameters
cols – “*”, column name or list of column names to be processed.
values – Percentile values to calculate, for example 0.25, 0.5, 0.75.
relative_error –
tidy – The result format. If True, return only the value when a single column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
compute – Compute the final result. If False, return a delayed object.
- Returns
Values at the given percentiles over the requested columns.
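To make the `values` argument of `percentile` concrete, here is a small exact percentile computed with linear interpolation between neighboring ranks. This is one common definition, chosen for the example; Optimus may use a different or approximate method (for instance when `estimate=True`), and the helper name `percentile_value` is hypothetical.

```python
def percentile_value(values, q: float) -> float:
    """Exact percentile with linear interpolation, q in [0, 1] (illustrative only)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    position = q * (len(ordered) - 1)       # fractional rank in the sorted data
    lower = int(position)                   # floor, since position >= 0
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


data = [5, 1, 4, 2, 3]
quartiles = [percentile_value(data, q) for q in (0.25, 0.5, 0.75)]
print(quartiles)
```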
- port(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Return the port string from a url.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- pos(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Apply a part-of-speech (POS) tagger, which processes a sequence of words and attaches a part-of-speech tag to each word.
- Parameters
cols – ‘*’, list of columns names or a single column name.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- pow(cols='*', power=2, output_cols=None) optimus.helpers.types.DataFrameType [source]
Return the power of each value in a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
power –
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the power of each element.
- profile(cols='*', bins: int = 32, flush: bool = False) dict [source]
Returns the profile of selected columns.
- Parameters
cols – “*”, column name or list of column names to be processed.
bins – Number of buckets.
flush – Flushes the cache of the whole profile to process it again.
- Returns
Returns the profile of selected columns.
- qcut(cols='*', quantiles=None, output_cols=None)[source]
- Parameters
cols – “*”, column name or list of column names to be processed.
quantiles –
output_cols –
- Returns
- quality(cols='*', flush=False, compute=True) dict [source]
Return the data quality in the format {‘col_name’: {‘mismatch’: 0, ‘missing’: 9, ‘match’: 0, ‘inferred_data_type’: ‘object’}}
- Parameters
cols – “*”, column name or list of column names to be processed.
flush –
compute –
- Returns
dict in the format {‘col_name’: {‘mismatch’: 0, ‘missing’: 9, ‘match’: 0, ‘inferred_data_type’: ‘object’}}
- range(cols='*', tidy: bool = True, compute: bool = True)[source]
Return the minimum and maximum of the values over the requested columns.
- Parameters
cols – “*”, column name or list of column names to be processed.
tidy – The result format. If True, return only the value when a single column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
compute – Compute the final result. If False, return a delayed object.
- Returns
- rdiv(cols='*', output_col=None) optimus.helpers.types.DataFrameType [source]
Divide two or more columns.
- Parameters
cols – ‘*’, list of columns names or a single column name
output_col – Single output column in case no value is passed
- Returns
Dataframe with the result of the arithmetic operation appended.
- reciprocal(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Return the reciprocal (1/x) of each value in a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the reciprocal of each element.
- remove(cols='*', search=None, search_by='chars', output_cols=None) optimus.helpers.types.DataFrameType [source]
Remove values from a string in a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
search –
search_by – Search by ‘chars’,
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- remove_numbers(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Remove numbers from a string in a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- remove_special_chars(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Remove special chars from a string in a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- remove_stopwords(cols='*', language='english', output_cols=None) optimus.helpers.types.DataFrameType [source]
Remove stopwords from each string in a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
language – specify the stopwords language
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- remove_urls(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Remove URLs from one or more columns.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- remove_white_spaces(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Remove all white spaces from string in a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- rename(cols: Union[str, list, dict] = '*', names: Optional[Union[str, list]] = None, func=None) optimus.helpers.types.DataFrameType [source]
Change the name of one or more columns in the dataframe.
- Parameters
cols – string, dictionary or list of strings or tuples. Each tuple may have the following form: (oldColumnName, newColumnName).
names – string or list of strings with new names of columns. Ignored if a dictionary or list of tuples is passed to cols.
func – can be lower, upper or any string transformation function.
- Returns
Dataframe with columns names replaced.
- replace(cols='*', search=None, replace_by=None, search_by=None, ignore_case=False, output_cols=None) optimus.helpers.types.DataFrameType [source]
Replace a value or a list of values with a specified string.
- Parameters
cols – ‘*’, list of columns names or a single column name.
search – Values to search for and replace.
replace_by – New value to replace the old one. Supports an array when searching by characters.
search_by – Can be “full”, “words”, “chars” or “values”.
ignore_case – Ignore case when searching for a match.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
DataFrame
- replace_regex(cols='*', search=None, replace_by=None, search_by=None, ignore_case=False, output_cols=None) optimus.helpers.types.DataFrameType [source]
Replace values matching a regex, or a list of regexes, with a specified string.
- Parameters
cols – ‘*’, list of columns names or a single column name.
search – Values to search for and replace.
replace_by – New value to replace the old one. Supports an array when searching by characters.
search_by – Can be “full”, “words”, “chars” or “values”.
ignore_case – Ignore case when searching for a match.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- abstract static reverse(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Reverse the order of the characters of each string in a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- right(cols='*', n=None, output_cols=None) optimus.helpers.types.DataFrameType [source]
Get the last n characters of each string in a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
n –
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- round(cols='*', decimals=0, output_cols=None) optimus.helpers.types.DataFrameType [source]
Round each value in a column to the given number of decimal places.
- Parameters
cols – “*”, column name or list of column names to be processed.
decimals – The number of decimal places to round to.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the rounded value of each element.
- schema_data_type(cols='*', tidy=True)[source]
Return the column(s) data type as Type.
- Parameters
cols – Columns to be processed
tidy – The result format. If True, return only the value when a single column is processed, or column names and values otherwise.
- Returns
- second(cols='*', format: Optional[str] = None, output_cols=None) optimus.helpers.types.DataFrameType [source]
Get the seconds from a date in a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
format –
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- seconds_between(cols='*', value=None, date_format=None, round=None, output_cols=None) optimus.helpers.types.DataFrameType [source]
Return the number of seconds between two dates.
- Parameters
cols – “*”, column name or list of column names to be processed.
value –
date_format –
round –
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
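For `seconds_between`, the underlying arithmetic on two date strings can be sketched with the standard `datetime` module. The helper name `seconds_between_values`, the sign convention (second date minus first) and the sample format string are assumptions made for this example.

```python
from datetime import datetime


def seconds_between_values(first: str, second: str, date_format: str) -> float:
    """Seconds elapsed from `first` to `second` (illustrative helper, not the Optimus API)."""
    start = datetime.strptime(first, date_format)
    end = datetime.strptime(second, date_format)
    # Subtracting two datetimes gives a timedelta; total_seconds() flattens it.
    return (end - start).total_seconds()


elapsed = seconds_between_values(
    "2021-01-01 00:00:00", "2021-01-01 00:01:30", "%Y-%m-%d %H:%M:%S"
)
print(elapsed)
```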
- select(cols='*', regex=None, data_type=None, invert=False, accepts_missing_cols=False) optimus.helpers.types.DataFrameType [source]
Select columns using index, column name, regex or data type.
- Parameters
cols – “*”, column name or list of column names to be processed.
regex – Regular expression to filter the columns
data_type – Data type to be filtered for
invert – Invert the selection
accepts_missing_cols –
- Returns
- set(cols='*', value_func=None, where: Optional[Union[str, optimus.helpers.types.MaskDataFrameType]] = None, args=None, default=None, eval_value: bool = False) optimus.helpers.types.DataFrameType [source]
Set a column value using a number, string or an expression.
- Parameters
cols – Columns to set or create.
value_func – expression, function or value.
where – When the condition in ‘where’ is True, replace with ‘value_func’. Where False, replace with ‘default’ or keep the original value.
args – Argument when ‘value_func’ param is a function.
default – Entries where ‘where’ is False are replaced with this value.
eval_value – Parse ‘value_func’ param in case a string is passed.
- Returns
- set_data_type(cols: Union[str, list, dict] = '*', data_types: Optional[Union[str, list]] = None, inferred: bool = False) optimus.helpers.types.DataFrameType [source]
Set profiler data type.
- Parameters
cols – A dict with the form {“col_name”: profiler datatype}, a list of columns or a single column.
data_types – If a string or a list is passed to cols, this parameter sets the data types for those columns.
inferred – Whether it was inferred or not.
- Returns
Dataframe with new data types in the meta data.
- set_date_format(cols: Union[str, list, dict] = '*', date_formats: Optional[Union[str, list]] = None, inferred: bool = False) optimus.helpers.types.DataFrameType [source]
Set date format.
- Parameters
cols – A dict with the form {“col_name”: “date format”}, a list of columns or a single column.
date_formats – If a string or a list is passed to cols, this parameter sets the date format for those columns.
inferred – Whether it was inferred or not.
- Returns
- sin(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Apply sine function to a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the sine of each element.
- sinh(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Apply the hyperbolic sine function to a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the hyperbolic sine of each element.
- skew(cols='*', tidy=True, compute=True)[source]
Return the skew of the values over the requested columns.
- Parameters
cols – “*”, column name or list of column names to be processed.
tidy – The result format. If True, return only the value when a single column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
compute – Compute the final result. If False, return a delayed object.
- Returns
Return the skew of the values over the requested columns.
- slice(cols='*', start=None, stop=None, step=None, output_cols=None) optimus.helpers.types.DataFrameType [source]
Slice substrings from each element in a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
start – Start position for slice operation.
stop – Stop position for slice operation.
step – Step size for slice operation.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
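The `start`, `stop` and `step` parameters of `slice` read like Python's own slice notation. Assuming they follow those semantics (which this reference does not state explicitly), the per-string effect would look like this; `slice_value` is a hypothetical helper.

```python
def slice_value(text: str, start=None, stop=None, step=None) -> str:
    """Apply Python slice semantics to one string (illustrative only)."""
    # None for any bound means "use the default", exactly as in text[start:stop:step].
    return text[start:stop:step]


print(slice_value("optimus", start=0, stop=3))
print(slice_value("optimus", step=2))
print(slice_value("optimus", start=-3))
```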
- sort(order: Union[str, list] = 'asc', cols=None) optimus.helpers.types.DataFrameType [source]
Sort one or multiple columns in asc or desc order.
- Parameters
order – ‘asc’ or ‘desc’ accepted
cols –
- Returns
Dataframe with the columns sorted.
- soundex(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Apply the Soundex algorithm to a specified column. Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.
- Parameters
cols – ‘*’, list of columns names or a single column name.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
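To show how homophones end up with the same code in `soundex`, here is a compact sketch of the classic American Soundex rules in plain Python. It is not the Optimus implementation, which may rely on a library whose variant differs in edge cases.

```python
# Consonant groups that share a Soundex digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Classic American Soundex: first letter plus three digits (illustrative only)."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters that share a code
        code = CODES.get(ch, "")  # vowels (and y) have no code
        if code and code != previous:
            encoded += code
        previous = code  # a vowel resets the previous code
    return (encoded + "000")[:4]  # pad with zeros, keep four characters


for example in ("Robert", "Rupert", "Ashcraft", "Tymczak"):
    print(example, soundex(example))
```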
- sqrt(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Return the square root of each value in a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the square root of each element.
- standard_scaler(cols='*', output_cols=None)[source]
Standardize features by removing the mean and scaling to unit variance.
- Parameters
cols – ‘*’, list of columns names or a single column name
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- std(cols='*', tidy=True, compute=True)[source]
Return the standard deviation over the requested columns.
- Parameters
cols – “*”, column name or list of column names to be processed.
tidy – The result format. If True, return only the value when a single column is processed, or column names and values otherwise.
compute – Compute the final result. If False, return a delayed object.
- Returns
- stem_verbs(cols='*', stemmer: str = 'porter', language: str = 'english', output_cols=None) optimus.helpers.types.DataFrameType [source]
- Parameters
cols – “*”, column name or list of column names to be processed.
stemmer – snowball, porter, lancaster
language –
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- abstract static string_to_index(cols=None, output_cols=None) optimus.helpers.types.DataFrameType [source]
Encodes a string column of labels to a column of label indices.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- strip_html(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Remove HTML tags.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- sub(cols='*', output_col=None) optimus.helpers.types.DataFrameType [source]
Subtract two or more columns.
- Parameters
cols – ‘*’, list of columns names or a single column name
output_col – Single output column in case no value is passed
- Returns
Dataframe with the result of the arithmetic operation appended.
- sub_domain(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Returns the subdomain string from a url. From https://www.hi-optimus.com it returns ‘www’.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- sum(cols='*', tidy=True, compute=True)[source]
Return the sum of the values over the requested column.
- Parameters
cols – “*”, column name or list of column names to be processed.
tidy – The result format. If True, return only the value when a single column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
compute – Compute the final result. If False, return a delayed object.
- Returns
Sum of the values over each requested column.
- tan(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Apply the tangent function to a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the tangent of each element.
- tanh(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Apply the hyperbolic tangent function to a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Column containing the hyperbolic tangent of each element.
- time_between(cols='*', value=None, date_format=None, round=None, output_cols=None, func=None) optimus.helpers.types.DataFrameType [source]
Returns a TimeDelta of the units between two datetimes.
- Parameters
cols – “*”, column name or list of column names to be processed.
value –
date_format –
round –
output_cols – Column name or list of column names where the transformed data will be saved.
func – Custom function to pass to the apply, like self.F.days_between
- Returns
- title(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Capitalize the first letter of each word in a string.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
BaseDataFrame
- to_boolean(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Cast the elements inside a column or a list of columns to boolean.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols –
- Returns
- to_datetime(cols='*', format=None, output_cols=None, transform_format=True) optimus.helpers.types.DataFrameType [source]
Cast the elements inside a column or a list of columns to datetime.
- Parameters
cols – “*”, column name or list of column names to be processed.
format –
output_cols – Column name or list of column names where the transformed data will be saved.
transform_format –
- Returns
- to_float(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Cast the elements inside a column or a list of columns to float.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- to_integer(cols='*', default=0, output_cols=None) optimus.helpers.types.DataFrameType [source]
Cast the elements inside a column or a list of columns to integer.
- Parameters
cols – “*”, column name or list of column names to be processed.
default –
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- to_string(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Cast the elements inside a column or a list of columns to string.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols –
- Returns
- abstract static to_timestamp(cols, date_format=None, output_cols=None)[source]
- Parameters
cols –
date_format –
output_cols –
- Returns
- top_domain(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Returns the top domain string from a url. From ‘https://www.hi-optimus.com’ it returns ‘hi-optimus.com’.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- trim(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Remove leading and trailing characters.
Strip whitespaces (including newlines) or a set of specified characters from each string in the column from left and right sides.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- unique_values(cols='*', estimate=False, compute=True, tidy=True) list [source]
Return a list of unique values in a column.
- Parameters
cols – ‘*’, list of columns names or a single column name.
estimate –
compute – Compute the result or return a delayed function.
tidy – The result format. If True, return only the value when a single column is processed, or column names and values otherwise. If False, return the function name, the column name and the value.
- unnest(cols='*', separator=None, splits=2, index=None, output_cols=None, drop=False, mode='string') optimus.helpers.types.DataFrameType [source]
Split the column values (array or string) into different columns.
- Parameters
cols – Columns to be un-nested.
output_cols – Resulting one or multiple columns after the unnest operation, for example [(output_col_1_1, output_col_1_2), (output_col_2_1, output_col_2_2)].
separator – Char or regex.
splits – Number of column splits.
index – Return a specific index per column, for example [1, 2].
drop –
mode –
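How a `separator` and a `splits` count turn one string into several columns in `unnest` can be sketched as below. The sketch assumes a literal (non-regex) separator, keeps any remainder in the last piece and pads short rows with `None`; these choices, and the helper name `unnest_value`, are assumptions and may not match Optimus.

```python
def unnest_value(text: str, separator: str, splits: int = 2) -> list:
    """Split one string into exactly `splits` pieces (illustrative only)."""
    # maxsplit = splits - 1 leaves any remaining text in the final piece.
    parts = text.split(separator, splits - 1)
    # Pad with None so every row produces the same number of columns.
    return parts + [None] * (splits - len(parts))


print(unnest_value("a-b-c", "-", splits=2))
print(unnest_value("a-b-c", "-", splits=3))
print(unnest_value("a", "-", splits=2))
```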
- unset_data_type(cols='*')[source]
Unset user set data type.
- Parameters
cols – ‘*’, list of columns names or a single column name.
- Returns
- unset_date_format(cols='*')[source]
Unset user defined date format.
- Parameters
cols – ‘*’, list of columns names or a single column name.
- Returns
- upper(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Uppercase the specified columns.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
BaseDataFrame
- url_file(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Returns the file string from a url. From https://www.hi-optimus.com/index.html it returns ‘index.html’.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- url_fragment(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- url_path(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Returns the path string from a url.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- url_query(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Returns the query string from a url. From https://www.hi-optimus.com/?rollout=true it returns ‘rollout=true’.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- url_scheme(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Returns the scheme string from a url. From ‘https://www.hi-optimus.com’ it returns ‘https’.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
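The pieces returned by `port`, `sub_domain`, `top_domain` and the `url_*` methods correspond to standard URL components. The sketch below uses Python's `urllib.parse.urlsplit` to show which part of a URL each name refers to. It is not how Optimus extracts them; the helper name `url_parts` is hypothetical, and the port, path and fragment in the sample URL are made up for the example.

```python
from urllib.parse import urlsplit


def url_parts(url: str) -> dict:
    """Split a URL into the pieces named by the url-related methods (illustrative only)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Naive split on the first dot; suffixes such as "co.uk" need a
    # public-suffix list, which this sketch does not use.
    sub_domain, _, top_domain = host.partition(".")
    return {
        "scheme": parts.scheme,
        "sub_domain": sub_domain,
        "top_domain": top_domain,
        "port": parts.port,
        "path": parts.path,
        "file": parts.path.rsplit("/", 1)[-1],
        "query": parts.query,
        "fragment": parts.fragment,
    }


info = url_parts("https://www.hi-optimus.com:8080/docs/index.html?rollout=true#intro")
print(info)
```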
- var(cols='*', tidy=True, compute=True)[source]
Return unbiased variance over requested columns.
- Parameters
cols – “*”, column name or list of column names to be processed.
tidy – The result format. If True, return only the value when a single column is processed, or column names and values otherwise.
compute – Compute the final result. If False, return a delayed object.
- Returns
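"Unbiased" in the description of `var` refers to dividing by n - 1 rather than n. The difference can be seen with Python's standard `statistics` module; this is only an illustration of the two definitions, and the sample numbers are made up.

```python
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Unbiased (sample) variance: sum of squared deviations divided by n - 1.
sample_variance = statistics.variance(values)
# Population variance: the same sum divided by n.
population_variance = statistics.pvariance(values)

print(sample_variance, population_variance)
```

For these eight values the squared deviations from the mean (5.0) add up to 32, so the population variance is 32 / 8 and the unbiased variance is 32 / 7.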
- weekday(cols='*', format: Optional[str] = None, output_cols=None) optimus.helpers.types.DataFrameType [source]
Get the weekday from a date in a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
format –
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- word_count(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Count the number of words in a paragraph.
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- word_tokenize(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
- Parameters
cols – “*”, column name or list of column names to be processed.
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- year(cols='*', format: Optional[str] = None, output_cols=None) optimus.helpers.types.DataFrameType [source]
Get the year from a date in a column.
- Parameters
cols – “*”, column name or list of column names to be processed.
format – String format
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- years_between(cols='*', value=None, date_format=None, round=None, output_cols=None) optimus.helpers.types.DataFrameType [source]
Return the number of years between two dates.
- Parameters
cols – “*”, column name or list of column names to be processed.
value –
date_format –
round –
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
- z_score(cols='*', output_cols=None) optimus.helpers.types.DataFrameType [source]
Returns the z-score of the given columns.
- Parameters
cols – ‘*’, list of columns names or a single column name
output_cols – Column name or list of column names where the transformed data will be saved.
- Returns
Dataframe with the z-score of the given columns appended.
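For reference, a z-score is computed as (value - mean) / standard deviation. The sketch below uses the sample standard deviation from Python's `statistics` module; whether `z_score` in Optimus uses the sample or the population standard deviation is not stated here, so treat that choice, and the helper name `z_scores`, as assumptions.

```python
import statistics


def z_scores(values):
    """Return (value - mean) / std for each value (illustrative only)."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1); an assumption
    return [(value - mean) / std for value in values]


scores = z_scores([1.0, 2.0, 3.0])
print(scores)
```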