wwdata package

Submodules

wwdata.Class_HydroData module

Class_HydroData provides functionalities for handling data obtained in the context of (waste)water treatment.

Copyright (C) 2016 Chaim De Mulder

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

class wwdata.Class_HydroData.HydroData(data, timedata_column='index', data_type='WWTP', experiment_tag='No tag given', time_unit=None, units=[])[source]

Bases: object

timedata_column

str – name of the column containing the time data

data_type

str – type of data provided

experiment_tag

str – A tag identifying the experiment; can be a date or a code used by the producer/owner of the data.

time_unit

str – The time unit in which the time data is given

units

array – The units of the variables in the columns

absolute_to_relative(time_data='index', unit='d', inplace=True, save_abs=True, decimals=5)[source]

Converts a pandas series with datetime time values to relative time values in the given unit, starting from 0.

Parameters:
  • time_data (str) – name of the column containing the time data. If this is the index column, just give ‘index’ (also default)
  • unit (str) – unit to which to convert the time values (sec, min, hr or d)
Returns:

  • None if inplace is True
  • HydroData object if inplace is False
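The underlying conversion can be sketched in plain pandas/numpy (a minimal example with hypothetical data, not the wwdata implementation itself):

```python
import numpy as np
import pandas as pd

# Hypothetical series with a datetime index, sampled every 12 hours
index = pd.date_range("2016-01-01", periods=4, freq="12h")
series = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# Relative time in days ('d'), starting from 0, rounded to 5 decimals
relative = np.round(np.asarray((index - index[0]).total_seconds()) / 86400, 5)
# relative is now [0.0, 0.5, 1.0, 1.5]
```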

add_to_meta_valid(column_names)[source]

Adds (a) column(s) with the given column_name(s) to the self.meta_valid DataFrame, where all tags are set to ‘original’. This ensures that data which is already reliable can still be used further down the process (e.g. filling etc.)

Parameters:column_names (array) – array containing the names of the columns to add to the meta_valid dataframe
calc_daily_profile(column_name, arange, quantile=0.9, plot=False, plot_method='quantile', clear=False, only_checked=False)[source]

Calculates a typical daily profile based on data from the indicated consecutive days. Also saves this average day, along with standard deviation and lower and upper percentiles as given in the arguments. Plotting is possible.

Parameters:
  • column_name (str) – name of the column containing the data to calculate an average day for
  • arange (2-element array of ints) – contains the beginning and end day of the period to use for average day calculation
  • quantile (float between 0 and 1) – value to use for the calculation of the quantiles
  • plot (bool) – plot or not
  • plot_method (str) – method to use for plotting. Available: “quantile” or “stdev”
  • clear (bool) – whether or not to clear the key in the self.daily_profile dictionary that is already present
Returns:

creates a dictionary self.daily_profile containing information on the average day as calculated.

Return type:

None

calc_ratio(data_1, data_2, arange, only_checked=False)[source]

Given two datasets or -columns, calculates the average ratio between the first and the second dataset within the given range. The standard deviation of this ratio is also calculated.

Parameters:
  • data_1 (str) – name of the data column containing the data to be in the numerator of the ratio calculation
  • data_2 (str) – name of the data column containing the data to be in the denominator of the ratio calculation
  • arange (array of two values) – the range within which the ratio needs to be calculated
  • only_checked (bool) – if ‘True’, filtered values are excluded; defaults to ‘False’
Returns:

The average ratio of the first data column over the second one within the given range, together with its standard deviation
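The calculation amounts to a point-wise ratio followed by mean and standard deviation; a plain pandas sketch with hypothetical column names (not the wwdata implementation itself):

```python
import pandas as pd

# Hypothetical data columns; in wwdata these live inside the HydroData object
df = pd.DataFrame({"flow_A": [10.0, 12.0, 11.0, 9.0],
                   "flow_B": [5.0, 6.0, 5.5, 4.5]})

ratio = df["flow_A"] / df["flow_B"]   # point-wise ratio of numerator over denominator
avg, std = ratio.mean(), ratio.std()  # average ratio and its standard deviation
```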

calc_slopes(xdata, ydata, time_unit=None, slope_range=None)[source]

Calculates slopes for given xdata and ydata; if a time unit is given as an argument, the time values (xdata) will first be converted to this unit, which is then used to calculate the slopes.

Parameters:
  • xdata (str) – name of the column containing the xdata for slope calculation (e.g. time). If ‘index’, the index is used as xdata. If datetime objects, a time_unit is expected to calculate the slopes.
  • ydata (str) – name of the column containing the ydata for slope calculation
  • time_unit (str) – time unit to be used for the slope calculation (in case this is based on time); if None, slopes are simply calculated based on the values given !! This value has no impact if the xdata column is the index and is not a datetime type. If that is the case, it is assumed that the user knows the unit of the xdata !!
Returns:

pandas Series object containing the slopes calculated for the chosen variable

Return type:

pd.Series
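The slope between consecutive points is Δy/Δx; a minimal pandas sketch with hypothetical data (not the wwdata implementation itself):

```python
import pandas as pd

xdata = pd.Series([0.0, 1.0, 2.0, 3.0])   # e.g. relative time in days
ydata = pd.Series([0.0, 2.0, 4.0, 6.0])   # measured variable

# Slope between each point and its predecessor; the first element is NaN
slopes = ydata.diff() / xdata.diff()
```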

compare_ratio(data_1, data_2, arange, only_checked=False)[source]

Compares the average ratios of two datasets in multiple different ranges and returns the most reliable one, based on the standard deviation on the ratio values

Parameters:
  • data_1 (str) – name of the data column containing the data to be in the numerator of the ratio calculation
  • data_2 (str) – name of the data column containing the data to be in the denominator of the ratio calculation
  • arange (int) – the range (in days) for which the ratios need to be calculated and compared
  • only_checked (bool) – if ‘True’, filtered values are excluded; defaults to ‘False’
Returns:

The average ratio within the range that was found to be the most reliable one

drop_index_duplicates()[source]

drop rows with a duplicate index. Also updates the meta_valid dataframe

Note

It is assumed that the dropped rows contain the same data as their index-based duplicates, i.e. that no data is lost by using the function.

fill_index(arange, index_type='float')[source]

function to fill in missing index values

get_avg(name=None, only_checked=True)[source]

Gets the averages of all or certain columns in a dataframe

Parameters:name (array of str) – name(s) of the column(s) containing the data to be averaged; defaults to None and will calculate the average for every column
Returns:pandas dataframe, containing the averages of all or certain columns
Return type:pd.DataFrame
get_correlation(data_1, data_2, arange, zero_intercept=False, only_checked=False, plot=False)[source]

Calculates the linear regression coefficients that relate data_1 to data_2

Parameters:
  • data_1, data_2 (str) – names of the data columns containing the data between which the correlation will be calculated.
  • arange (array) – array containing the beginning and end value between which the correlation needs to be calculated
  • zero_intercept (bool) – indicates whether or not to assume a zero-intercept
  • only_checked (bool) – if ‘True’, filtered values are excluded from calculation and plotting; defaults to ‘False’. If a value in one column is filtered, the corresponding value in the second column also gets excluded!
Returns:

the linear regression coefficients of the correlation, as well as the r-squared value
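A standalone numpy sketch of the two regression variants (hypothetical noise-free data; the wwdata implementation may use a different fitting routine):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0          # an exact linear relation y = 2x + 1

# zero_intercept=False: ordinary linear regression y = a*x + b
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept
r_squared = 1 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()

# zero_intercept=True corresponds to fitting y = a*x only
a = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]
```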

get_highs(data_name, bound_value, arange, method='percentile', plot=False)[source]

Creates a dataframe with tags indicating which indices have data values higher than a certain value; for example, the definition/tagging of rain events.

Parameters:
  • data_name (str) – name of the column to execute the function on
  • bound_value (float) – the boundary value above which points will be tagged
  • arange (array of two values) – the range within which high values need to be tagged
  • method (str (value or percentile)) – when ‘percentile’, bound_value is interpreted as a percentile above which data points will be tagged; when ‘value’, bound_value is used directly to tag data points.
Returns:

Return type:

None
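Both tagging methods reduce to a simple comparison; a pandas sketch with hypothetical data (not the wwdata implementation itself):

```python
import pandas as pd

data = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])  # hypothetical data series

# method='percentile': tag everything above e.g. the 90th percentile
bound = data.quantile(0.9)
highs_percentile = data > bound

# method='value': tag everything above bound_value directly
highs_value = data > 50.0
```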

get_std(name=None, only_checked=True)[source]

Gets the standard deviations of all or certain columns in a dataframe

Parameters:name (array of str) – name(s) of the column(s) containing the data to calculate the standard deviation for; defaults to None and will calculate the standard deviation for every column
Returns:

pandas dataframe, containing the standard deviations of all or certain columns

Return type:

pd.DataFrame

head(n=5)[source]

piping pandas head function, see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html for documentation

index()[source]

piping pandas index function, see http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.Index.html for documentation

moving_average_filter(data_name, window, cutoff_frac, arange, clear=False, inplace=False, log_file=None, plot=False, final=False)[source]

Filters out the peaks/outliers in a dataset by comparing its values to a smoothened representation of the dataset (Moving Average Filtering). The filtered values are replaced by NaN values.

Parameters:
  • data_name (str) – name of the column containing the data that needs to be filtered
  • window (int) – the number of values from the dataset that are used to take the average at the current point.
  • cutoff_frac (float) – the cutoff value (as a fraction, 0-1) used to compare the data with the smoothened data: a relative deviation higher than this fraction drops the data point.
  • arange (array of two values) – the range within which the moving average filter needs to be applied
  • clear (bool) – if True, the tags added to datapoints before will be removed and put back to ‘original’.
  • inplace (bool) – indicates whether a new dataframe is created and returned or whether the operations are executed on the existing dataframe (nothing is returned)
  • log_file (str) – string containing the directory to a log file to be written out when using this function
  • plot (bool) – if true, a plot is made, comparing the original dataset with the new, filtered dataset
  • final (bool) – if true, the values are actually replaced with nan values (either inplace or in a new hp object)
Returns:

  • HydroData object (if inplace=False) – the object in which the filtered values have been replaced by NaN values
  • None (if inplace=True)
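A simplified sketch of the idea: compare each value to a rolling mean and tag large relative deviations (a centred window is one plausible smoothing choice; the actual wwdata implementation may differ in details):

```python
import pandas as pd

data = pd.Series([10.0, 10.2, 9.8, 30.0, 10.1, 9.9])  # one obvious outlier
window, cutoff_frac = 3, 0.5

smooth = data.rolling(window, center=True, min_periods=1).mean()
# Tag points whose relative deviation from the smoothened signal exceeds cutoff_frac
tagged = (data - smooth).abs() / smooth > cutoff_frac
# final=True: replace the tagged points with NaN values
filtered = data.where(~tagged)
```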

moving_slope_filter(xdata, data_name, cutoff, arange, time_unit=None, clear=False, inplace=False, log_file=None, plot=False, final=False)[source]

Filters out datapoints based on the difference between the slope in one point and the next (sudden changes like noise get filtered out), based on a given cut off value. Replaces the dropped values with NaN values.

Parameters:
  • xdata (str) – name of the column containing the xdata for slope calculation (e.g. time). If ‘index’, the index is used as xdata. If datetime objects, a time_unit is expected to calculate the slopes.
  • data_name (str) – name of the column containing the data that needs to be filtered
  • cutoff (int) – the cutoff value to compare the slopes with to apply the filtering.
  • arange (array of two values) – the range within which the moving slope filter needs to be applied
  • time_unit (str) – time unit to be used for the slope calculation (in case this is based on time); if None, slopes are calculated based on the values given
  • clear (bool) – if True, the tags added to datapoints before will be removed and put back to ‘original’.
  • inplace (bool) – indicates whether a new dataframe is created and returned or whether the operations are executed on the existing dataframe (nothing is returned)
  • log_file (str) – string containing the directory to a log file to be written out when using this function
  • plot (bool) – if true, a plot is made, comparing the original dataset with the new, filtered dataset
  • final (bool) – if true, the values are actually replaced with nan values (either inplace or in a new hp object)
Returns:

  • HydroData object (if inplace=False) – the object in which the filtered values have been replaced by NaN values
  • None (if inplace=True)

Creates

A new column in the self.meta_valid dataframe, containing a mask indicating what values are filtered

plot_analysed(data_name, time_range='default', only_checked=False)[source]

plots the values and their types (original, filtered, filled) of a given column in the given time range.

Parameters:
  • data_name (str) – name of the column containing the data to plot
  • time_range (array of two values) – the range within which the values are plotted; default is all
  • only_checked (bool) – if ‘True’, filtered values are excluded; defaults to ‘False’
Returns:

Return type:

Plot

replace(to_replace, value, inplace=False)[source]

piping pandas replace function, see http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.replace.html for documentation

savgol(data_name, window=55, polyorder=2, plot=False, inplace=False)[source]

Uses the scipy.signal Savitzky-Golay filter to smoothen the data of a column; the values are either replaced or a new dataframe is returned.

Parameters:
  • data_name (str) – name of the column containing the data that needs to be filtered
  • window (int) – the length of the filter window; defaults to 55
  • polyorder (int) – the order of the polynomial used to fit the samples. polyorder must be less than window; defaults to 2
  • plot (bool) – if true, a plot is made, comparing the original dataset with the new, filtered dataset
  • inplace (bool) – indicates whether a new dataframe is created and returned or whether the operations are executed on the existing dataframe (nothing is returned)
Returns:

  • HydroData object (if inplace=False)
  • None (if inplace=True)
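This pipes through to scipy.signal.savgol_filter; a minimal standalone sketch with hypothetical data (on a quadratic signal, a second-order fit reproduces the data exactly, which makes the behaviour easy to verify):

```python
import numpy as np
from scipy.signal import savgol_filter

x = np.arange(20, dtype=float)
y = x ** 2                      # quadratic signal

# polyorder=2 reproduces a quadratic exactly; on noisy data it smooths instead
smoothed = savgol_filter(y, window_length=5, polyorder=2)
```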

set_index(keys, key_is_time=False, drop=True, inplace=False, verify_integrity=False, save_prev_index=True)[source]

piping and extending pandas set_index function, see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html for documentation

Notes

key_is_time : bool
when true, the new index will be known as the time data from here on

(other arguments cfr pd.set_index)

Returns:
  • HydroData object (if inplace=False)
  • None (if inplace=True)
set_tag(tag)[source]

Sets the tag element of the HydroData object to the given tag

Returns:
Return type:None
set_time_unit(unit)[source]

Sets the time_unit element of the HydroData object to a given unit

Returns:
Return type:None
set_units(units)[source]

Set the units element of the HydroData object to a given dataframe

simple_moving_average(arange, window, data_name=None, inplace=False, plot=True)[source]

Calculate the Simple Moving Average of a dataseries from a dataframe, using a window within which the datavalues are averaged.

Parameters:
  • arange (array of two values) – the range within which the moving average needs to be calculated
  • window (int) – the number of values from the dataset that are used to take the average at the current point. Defaults to 10
  • data_name (str or array of str) – name of the column(s) containing the data that needs to be smoothened. If None, smoothened data is computed for the whole dataframe. Defaults to None
  • inplace (bool) – indicates whether a new dataframe is created and returned or whether the operations are executed on the existing dataframe (nothing is returned)
  • plot (bool) – if True, a plot is given for comparison between original and smooth data
Returns:

either a new object (inplace=False) or an adjusted object, containing the smoothened data values

Return type:

HydroData (or subclass) object

tag_doubles(data_name, bound, arange=None, clear=False, inplace=False, log_file=None, plot=False, final=False)[source]

Tags double values that occur consecutively in a measurement series. This is relevant when a sensor has failed and produces a constant signal. A band is provided within which the signal can vary and still be filtered out.

Parameters:
  • data_name (str) – column name of the column from which double values will be sought
  • bound (float) – boundary value of the band to use. When the difference between a point and the next one is smaller than the bound value, the latter datapoint is tagged as ‘filtered’.
  • arange (array of two values) – the range within which double values need to be tagged
  • clear (bool) – if True, the tags added to datapoints before will be removed and put back to ‘original’.
  • inplace (bool) – indicates whether a new dataframe is created and returned or whether the operations are executed on the existing dataframe (nothing is returned). (This argument only comes into play when the ‘final’ argument is True)
  • log_file (str) – string containing the directory to a log file to be written out when using this function
  • plot (bool) – whether or not to make a plot of the newly tagged data points
  • final (bool) – if true, the values are actually replaced with nan values (either inplace or in a new hp object)
Returns:

  • HydroData object (if inplace=False) – the dataframe from which the double values of ‘data’ are removed or replaced
  • None (if inplace=True)
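The tagging criterion boils down to a comparison of consecutive differences against the band; a pandas sketch with hypothetical data (not the wwdata implementation itself):

```python
import pandas as pd

signal = pd.Series([5.0, 5.0, 5.0, 7.0, 8.0, 8.0])  # sensor flatlines at 5 and 8
bound = 0.1

# A point is tagged when it differs from its predecessor by less than bound;
# the NaN that diff() produces for the first point compares as False
tagged = signal.diff().abs() < bound
```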

tag_extremes(data_name, arange=None, limit=0, method='below', clear=False, plot=False)[source]

Tags values above or below a given limit.

Parameters:
  • data_name (str) – name of the column containing the data to be tagged
  • arange (array of two values) – the range within which extreme values need to be tagged
  • limit (int/float) – limit below or above which values need to be tagged
  • method ('below' or 'above') – below tags all the values below the given limit, above tags the values above the limit
  • clear (bool) – if True, the tags added before will be removed and put back to ‘original’.
  • plot (bool) – whether or not to make a plot of the newly tagged data points
Returns:

Return type:

None;

tag_nan(data_name, arange=None, clear=False)[source]

adds a tag ‘filtered’ in self.meta_valid for every NaN value in the given column

Parameters:
  • data_name (str) – column name of the column to apply the function to
  • arange (array of two values) – the range within which nan values need to be tagged
  • clear (bool) – when true, resets the tags in meta_valid for the data in column data_name
Returns:

Return type:

None

tail(n=5)[source]

piping pandas tail function, see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html for documentation

to_datetime(time_column='index', time_format='%dd-%mm-%yy', unit='D')[source]

Piping and modifying pandas to_datetime function

Parameters:
  • time_column (str) – column name of the column where values need to be converted to datetime values. Default ‘index’ converts index values to datetime
  • time_format (str) – the format to use by to_datetime function to convert strings to datetime format
  • unit (str) – unit to use by to_datetime function to convert int or float values to datetime format
to_float(columns='all')[source]

convert values in given columns to float values

Parameters:columns (array of strings) – column names of the columns where values need to be converted to floats
write(filename, filepath='/Users/chaimdemulder/Documents/Work/github/wwdata/docs', method='all')[source]
Parameters:
  • filepath (str) – the path the output file should be saved to
  • filename (str) – the name of the output file
  • method (str (all,filtered,filled)) – depending on the method choice, different values will be written out: all values, only the filtered values or the filled values
  • for_WEST (bool) –
  • include_units (bool) –
Returns:

Return type:

None; writes an output file

wwdata.Class_HydroData.total_seconds(timedelta_value)[source]

wwdata.Class_LabExperimBased module

Class_LabExperimBased provides functionalities for data handling of data obtained in lab experiments in the field of (waste)water treatment. Copyright (C) 2016 Chaim De Mulder

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see http://www.gnu.org/licenses/.

class wwdata.Class_LabExperimBased.LabExperimBased(data, timedata_column='index', data_type='NAT', experiment_tag='No tag given', time_unit=None)[source]

Bases: wwdata.Class_HydroData.HydroData

Subclass of HydroData, expanding the functionalities with specific functions for data gathered in lab experiments.

timedata_column

str – name of the column containing the time data

data_type

str – type of the data provided

experiment_tag

str – A tag identifying the experiment; can be a date or a code used by the producer/owner of the data.

time_unit

str – The time unit in which the time data is given

units

array – The units of the variables in the columns

add_conc(column_name, x, y, new_name='default')[source]

calculates the concentration values of the given column and adds them as a new column to the DataFrame.

Parameters:
  • column_name (str) – column with values
  • x (int) –

  • y (int) –

  • new_name (str) – name of the new column, default to ‘column_name + mg/L’
calc_slope(columns, time_column='h')[source]

calculates the slope of the selected columns

Parameters:
  • columns (array of strings) – columns to calculate the slope for
  • time_column (str) – time used for calculation; default to ‘h’
check_ph(ph_column='pH', thresh=0.4)[source]

gives the maximal change in pH

Parameters:
  • ph_column (str) – column with pH-values, default to ‘pH’
  • thresh (float) – threshold value for warning; defaults to 0.4
hours(time_column='index')[source]

calculates the hours from the relative values

Parameters:time_column (string) – column containing the relative time values; default to index
in_out(columns)[source]

Calculates the difference between start and end values (start_values - end_values) for the given columns.

Parameters:columns (array of strings) –
plot(columns, time_column='index')[source]

plots the selected columns against the time column

Parameters:
  • columns (array of strings) – columns to plot
  • time_column (str) – column containing the time data to plot against; default to ‘index’
removal(columns)[source]

Calculates the total removal of nitrogen (1-(end_values/start_values))

Parameters:columns (array of strings) –
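The removal formula is a one-liner on the first and last measurements; a pandas sketch with a hypothetical column name (not the wwdata implementation itself):

```python
import pandas as pd

# Hypothetical nitrogen measurements over the course of an experiment
df = pd.DataFrame({"NH4": [50.0, 30.0, 5.0]})

# Total removal: 1 - (end_values / start_values)
removal = 1 - df["NH4"].iloc[-1] / df["NH4"].iloc[0]
```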

wwdata.Class_LabSensorBased module

Class_LabSensorBased provides functionalities for data handling of data obtained in lab experiments with online sensors in the field of (waste)water treatment. Copyright (C) 2016 Chaim De Mulder

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see http://www.gnu.org/licenses/.

class wwdata.Class_LabSensorBased.LabSensorBased(data, experiment_tag='None')[source]

Bases: wwdata.Class_HydroData.HydroData

Subclass of HydroData, expanding the functionalities with specific functions for data gathered in lab experiments

timedata_column

str – name of the column containing the time data

data_type

str – type of data provided

experiment_tag

str – A tag identifying the experiment; can be a date or a code used by the producer/owner of the data.

time_unit

str – The time unit in which the time data is given

units

array – The units of the variables in the columns

drop_peaks(data_name, cutoff, inplace=True, log_file=None)[source]

Filters out the peaks larger than a cut-off value in a dataseries

Parameters:
  • data_name (str) – the name of the column to use for the removal of peak values
  • cutoff (int) – cut off value to use for the removing of peaks; values with an absolute value larger than this cut off will be removed from the data
  • inplace (bool) – indicates whether a new dataframe is created and returned or whether the operations are executed on the existing dataframe (nothing is returned)
  • log_file (str) – string containing the directory to a log file to be written out when using this function
Returns:

  • LabSensorBased object (if inplace=False) – the dataframe from which the peak values of ‘data_name’ are removed
  • None (if inplace=True)
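The cut-off criterion described above can be sketched in plain pandas (hypothetical data; whether removed values are dropped or set to NaN is an implementation detail of wwdata, here they become NaN):

```python
import pandas as pd

data = pd.Series([1.0, -2.0, 50.0, 3.0])  # one peak above the cut-off
cutoff = 10

# Values with an absolute value larger than the cut-off are removed (NaN here)
cleaned = data.where(data.abs() <= cutoff)
```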

wwdata.Class_OnlineSensorBased module

Class_OnlineSensorBased provides functionalities for data handling of data obtained with online sensors in the field of (waste)water treatment. Copyright (C) 2016 Chaim De Mulder

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see http://www.gnu.org/licenses/.

class wwdata.Class_OnlineSensorBased.OnlineSensorBased(data, timedata_column='index', data_type='WWTP', experiment_tag='No tag given', time_unit=None)[source]

Bases: wwdata.Class_HydroData.HydroData

Subclass of HydroData, expanding the functionalities with specific functions for data gathered at full scale by continuous measurements

timedata_column

str – name of the column containing the time data

data_type

str – type of data provided

experiment_tag

str – A tag identifying the experiment; can be a date or a code used by the producer/owner of the data.

time_unit

str – The time unit in which the time data is given

units

array – The units of the variables in the columns

add_to_filled(column_names)[source]

Parameters:column_names (array) – names of the columns to add to the filled dataframe

calc_daily_average(column_name, arange, plot=False)[source]

calculates the daily average of values in the given column and returns them as a 2D-array, containing the days and the average values on the respective days. Plotting is possible.

Parameters:
  • column_name (str) – name of the column containing the data to calculate the average values for
  • arange (array of two values) – the range within which daily averages need to be calculated
  • plot (bool) – plot or not
Returns:

pandas dataframe, containing the daily means with standard deviations for the selected column

Return type:

pd.Dataframe
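The daily means and standard deviations amount to a groupby over calendar days; a pandas sketch with hypothetical hourly data (not the wwdata implementation itself):

```python
import pandas as pd

# Two days of hypothetical hourly data
index = pd.date_range("2016-01-01", periods=48, freq="h")
series = pd.Series(range(48), index=index, dtype=float)

# Daily mean and standard deviation, one row per calendar day
daily = series.groupby(series.index.date).agg(["mean", "std"])
```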

calc_total_proportional(Q_tot, Q, conc, new_name='new', unit='mg/l', filled=False)[source]

Calculates the total concentration of an incoming flow, based on the given total flow and the separate incoming flows and concentrations

Parameters:
  • Q_tot (str) – name of the column containing the total flow
  • Q (array of str) – names of the columns containing the separate flows
  • conc (array of str) – names of the columns containing the separate concentration values
  • new_name (str) – name of the column to be added
  • filled (bool) – if true, use self.filled to calculate proportions from

Note

!!Order of columns in Q and conc must match!!

Returns:
None; creates a hydropy object with an added column for the proportional concentration
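The flow-proportional total concentration is the flow-weighted sum of the separate concentrations; a pandas sketch with hypothetical column names (not the wwdata implementation itself):

```python
import pandas as pd

df = pd.DataFrame({
    "Q_tot": [100.0, 100.0],                   # total flow
    "Q1": [60.0, 40.0], "Q2": [40.0, 60.0],    # separate flows
    "c1": [10.0, 10.0], "c2": [20.0, 20.0],    # matching concentrations
})
Q, conc = ["Q1", "Q2"], ["c1", "c2"]           # order of Q and conc must match!

# Flow-proportional total concentration: sum(Q_i * c_i) / Q_tot
df["c_tot"] = sum(df[q] * df[c] for q, c in zip(Q, conc)) / df["Q_tot"]
```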
check_filling_error(nr_iterations, data_name, filling_function, test_data_range, nr_small_gaps=0, max_size_small_gaps=0, nr_large_gaps=0, max_size_large_gaps=0, **options)[source]

Uses the _calculate_filling_error function (refer to that docstring for more specific info) to calculate the error on the data points that are filled with a certain algorithm. Because _calculate_filling_error inserts random gaps, results differ every time it is used; check_filling_error averages this out.

Parameters:
  • nr_iterations (int) – The number of iterations to run for the calculation of the imputation error
  • data_name (string) – name of the column containing the data the filling reliability needs to be checked for.
  • filling_function (str) – the name of the filling function to be tested for reliability
  • test_data_range (array of two values) – an array containing the start and end point of the test data to be used. IMPORTANT: for testing filling with correlation, this range needs to include the range for correlation calculation and the filling range.
  • nr_small_gaps / nr_large_gaps (int) – the number of small/large gaps to create in the dataset for testing
  • max_size_small_gaps / max_size_large_gaps (int) – the maximum size of the gaps inserted in the data, expressed in data points
  • **options – Arguments for the filling function; refer to the relevant filling function to know what arguments to give

Note

When checking for the error on data filling, a period (arange argument) with mostly reliable data should be used. If for example large gaps are already present in the given data, this will heavily influence the returned error, as filled values will be compared with the values from the data gap.

Returns:adds the average filling error to the self.filling_error dataframe
Return type:None
drop_index_duplicates()[source]

drop rows with a duplicate index. Also updates the meta_valid, meta_filled and filled dataframes

Note

This operation assumes the dropped rows contain the same data and that therefore no data is lost.

fill_missing_correlation(to_fill, to_use, arange, corr_range, zero_intercept=False, filtered_only=True, plot=False, clear=False)[source]

Fills the missing values in a dataset (to_fill), based on the correlation this data shows when comparing to other data (to_use). This happens within the range given by arange.

Parameters:
  • to_fill (str) – name of the column with data to fill
  • to_use (str) – name of the column to use, in combination with the given ratio, to fill in some of the missing data
  • arange (array of two values) – the range within which missing/filtered values need to be replaced
  • corr_range (array of two values) – the range to use for the calculation of the correlation
  • filtered_only (boolean) – if True, fills only the datapoints labeled as filtered. If False, fills/replaces all datapoints in the given range
  • plot (bool) – whether or not to plot the new dataset
  • clear (bool) – whether or not to clear the previously filled values and start from the self.meta_valid dataset again for this particular dataseries.
Returns:

None; creates/updates self.filled, containing the adjusted dataset, and updates meta_filled with the correct labels.

fill_missing_daybefore(to_fill, arange, range_to_replace=[1, 4], filtered_only=True, plot=False, clear=False)[source]

Fills the missing values in a dataset (to_fill), based on the data values from the day before the range starts. These data values are based on the self.filled dataset and can therefore contain filled datapoints as well. This happens within the range given by arange. !! IMPORTANT !! This function will not work on datasets with non-equidistant data points!

Parameters:
  • to_fill (str) – name of the column with data to fill
  • arange (array of two values) – the range within which missing/filtered values need to be replaced
  • range_to_replace (array of two int/float values) – the minimum and maximum amount of time (i.e. min and max size of gaps in data) where missing datapoints can be replaced using this function, i.e. using values of the last day before measurements went bad.
  • filtered_only (boolean) – if True, fills only the datapoints labeled as filtered. If False, fills/replaces all datapoints in the given range
  • plot (bool) – whether or not to plot the new dataset
  • clear (bool) – whether or not to clear the previously filled values and start from the self.meta_valid dataset again for this particular dataseries.
Returns:

None; creates/updates self.filled, containing the adjusted dataset, and updates meta_filled with the correct labels.
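The day-before idea can be illustrated with a plain pandas shift; the equidistance requirement becomes obvious here, since the shift is expressed in a fixed number of data points per day (the data values are made up):

```python
import numpy as np
import pandas as pd

# Illustrative sketch, not the wwdata code: fill a gap with the values
# measured exactly one day earlier. Requires equidistant timestamps.
index = pd.date_range('2016-01-01', periods=8, freq='6h')  # 4 points per day
flow = pd.Series([10., 12., 14., 11., np.nan, np.nan, 13., 12.], index=index)

points_per_day = 4
# Shift the series one day forward and use it to fill the gaps
flow_filled = flow.fillna(flow.shift(points_per_day))
```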

fill_missing_interpolation(to_fill, range_, arange, method='index', plot=False, clear=False)[source]

Fills the missing values in a dataset (to_fill), using the specified interpolation algorithm (method). This happens only if the number of consecutive missing values is smaller than range_.

Parameters:
  • to_fill (str) – name of the column containing the data to be filled
  • range_ (int) – the maximum number of consecutive missing values that still allows interpolation to fill them in
  • arange (array of two values) – the range within which missing/filtered values need to be replaced
  • method (str) – interpolation method to be used by the .interpolate function. See pandas docstrings for more info
  • plot (bool) – whether or not to plot the new dataset
  • clear (bool) – whether or not to clear the previously filled values and start from the self.meta_valid dataset again for this particular dataseries.
Returns:

None; creates/updates self.filled, containing the adjusted dataset, and updates meta_filled with the correct labels.
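A similar gap-limited interpolation can be sketched with pandas directly. Note that pandas' limit parameter fills the first values of a longer gap rather than skipping it entirely, so this only approximates wwdata's behaviour:

```python
import numpy as np
import pandas as pd

# Illustrative sketch: linear interpolation over the index, capped so
# that at most `max_gap` consecutive missing values get filled.
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, 7.0])

max_gap = 2  # analogous to the range_ argument
s_filled = s.interpolate(method='index', limit=max_gap, limit_area='inside')
```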

fill_missing_model(to_fill, to_use, arange, filtered_only=True, unit='d', plot=False, clear=False)[source]

Fills the missing values in a dataset (to_fill), based on the modeled values given in to_use. This happens within the range given by arange.

Parameters:
  • to_fill (str) – name of the column with data to fill
  • to_use (pd.Series) – pandas series containing the modeled data with which the filtered data can be replaced
  • arange (array of two values) – the range within which missing/filtered values need to be replaced
  • filtered_only (boolean) – if True, fills only the datapoints labeled as filtered. If False, fills/replaces all datapoints in the given range
  • unit (str) – the unit in which the modeled values are given; datetime values will be converted to values with that unit. Possible: sec, min, hr, d
  • plot (bool) – whether or not to plot the new dataset
  • clear (bool) – whether or not to clear the previously filled values and start from the self.meta_valid dataset again for this particular dataseries.
Returns:

None; creates/updates self.filled, containing the adjusted dataset, and updates meta_filled with the correct labels.
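At its core this is a fill of measured gaps with modeled values defined on the same index, which pandas expresses very compactly (an illustrative sketch with made-up values, not the wwdata code):

```python
import numpy as np
import pandas as pd

# Illustrative sketch: replace missing measurements with the values of a
# modeled series on the same index.
measured = pd.Series([2.0, np.nan, np.nan, 5.0])
modeled = pd.Series([2.1, 3.0, 4.0, 4.9])  # e.g. simulation output

filled = measured.fillna(modeled)
```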

fill_missing_ratio(to_fill, to_use, ratio, arange, filtered_only=True, plot=False, clear=False)[source]

Fills the missing values in a dataset (to_fill), based on the ratio this data shows when compared with other data (to_use). This happens within the range given by arange.

Parameters:
  • to_fill (str) – name of the column with data to fill
  • to_use (str) – name of the column to use, in combination with the given ratio, to fill in some of the missing data
  • ratio (float) – ratio to multiply the to_use data with, to obtain values for filling in the to_fill data column
  • arange (array of two values) – the range within which missing/filtered values need to be replaced
  • filtered_only (boolean) – if True, fills only the datapoints labeled as filtered. If False, fills/replaces all datapoints in the given range
  • plot (bool) – whether or not to plot the new dataset
  • clear (bool) – whether or not to clear the previously filled values and start from the self.meta_valid dataset again for this particular dataseries.
Returns:

None; creates/updates self.filled, containing the adjusted dataset, and updates meta_filled with the correct labels.
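The ratio-based fill can be sketched in a few lines of pandas; the column names and ratio value below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative sketch of a ratio-based fill (not the wwdata code).
df = pd.DataFrame({
    'flow_A': [100., 110., 120., 130.],
    'flow_B': [50., np.nan, np.nan, 65.],
})

ratio = 0.5  # assumed flow_B / flow_A ratio
missing = df['flow_B'].isna()
df.loc[missing, 'flow_B'] = ratio * df.loc[missing, 'flow_A']
```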

fill_missing_standard(to_fill, arange, filtered_only=True, plot=False, clear=False)[source]

Fills the missing values in a dataset (to_fill), based on the average daily profile calculated by calc_daily_profile(). This happens within the range given by arange.

Parameters:
  • to_fill (str) – name of the column with data to fill
  • arange (array of two values) – the range within which missing/filtered values need to be replaced
  • filtered_only (boolean) – if True, fills only the datapoints labeled as filtered. If False, fills/replaces all datapoints in the given range
  • plot (bool) – whether or not to plot the new dataset
  • clear (bool) – whether or not to clear the previously filled values and start from the self.meta_valid dataset again for this particular dataseries.
Returns:

None; creates/updates self.filled, containing the adjusted dataset, and updates meta_filled with the correct labels.
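The average-daily-profile idea can be sketched with a pandas groupby over the time of day; this is a crude stand-in for calc_daily_profile(), with made-up data:

```python
import numpy as np
import pandas as pd

# Illustrative sketch: fill gaps with the average value measured at the
# same time of day in the rest of the dataset.
index = pd.date_range('2016-01-01', periods=12, freq='12h')
s = pd.Series([5., 9., 5., 9., np.nan, 9., 5., np.nan, 5., 9., 5., 9.],
              index=index)

profile = s.groupby(s.index.time).mean()  # average per time of day
fill_values = pd.Series([profile[t.time()] for t in s.index], index=s.index)
s_filled = s.fillna(fill_values)
```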

wwdata.Class_OnlineSensorBased.absolute_to_relative(series, start_date, unit='d', decimals=5)[source]

Converts a pandas series with datetime timevalues to relative timevalues in the given unit, starting from start_date

Parameters:
  • series (pd.Series) – series of datetime or comparable values
  • start_date (datetime) – the date against which the relative values are calculated
  • unit (str) – unit to which to convert the time values (sec, min, hr or d)
Returns:

pd.Series of relative time values in the given unit
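The conversion itself amounts to a timedelta divided by the length of the target unit, which can be sketched as follows (illustrative only, not the wwdata implementation):

```python
import pandas as pd

# Illustrative sketch: datetime values become floats relative to
# start_date, here expressed in days (unit='d').
series = pd.Series(pd.date_range('2016-01-01', periods=4, freq='6h'))
start_date = pd.Timestamp('2016-01-01')

relative = (series - start_date).dt.total_seconds() / (24 * 3600)
```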
wwdata.Class_OnlineSensorBased.drop_peaks(self, data_name, cutoff, inplace=True, log_file=None)[source]

Filters out the peaks larger than a cut-off value in a dataseries

Parameters:
  • data_name (str) – the name of the column to use for the removal of peak values
  • cutoff (int) – cut-off value to use for the removal of peaks; values with an absolute value larger than this cut-off will be removed from the data
  • inplace (bool) – indicates whether a new dataframe is created and returned or whether the operations are executed on the existing dataframe (nothing is returned)
  • log_file (str) – string containing the directory to a log file to be written out when using this function
Returns:

  • LabSensorBased object (if inplace=False) – the dataframe from which the peak values of ‘data_name’ are removed
  • None (if inplace=True)
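The filtering criterion can be sketched as a boolean mask on the absolute value (an illustrative sketch with made-up data, not the wwdata code):

```python
import pandas as pd

# Illustrative sketch of peak removal: drop values whose absolute value
# exceeds the cut-off.
s = pd.Series([1.2, 1.5, 40.0, 1.3, -35.0, 1.4])

cutoff = 10
s_filtered = s[s.abs() <= cutoff]
```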

wwdata.Class_OnlineSensorBased.find_nearest_time(value, df, column)[source]

Returns the (time) value in a dataframe column nearest to a given value

Parameters:
  • value (float) – time value to find the closest value for in ‘df’
  • df (pd.Dataframe) – dataframe to use
  • column (str) – column to check ‘value’ against
wwdata.Class_OnlineSensorBased.go_WEST(raw_data, time_data, WEST_name_conversion)[source]

Saves a WEST compatible file (influent or other inputs)

Parameters:
  • raw_data (str or pd DataFrame) –
  • time_data
  • WEST_name_conversion (pd DataFrame with column names: WEST, units and RAW) – dataframe containing three columns: the column names for the WEST-compatible file, the units to appear in the WEST-compatible file and the column names of the raw data file.
Returns:

None; saves the WEST-compatible file
wwdata.Class_OnlineSensorBased.total_seconds(timedelta_value)[source]
wwdata.Class_OnlineSensorBased.vlookup_day(value, df, column)[source]

Returns the dataframe index of a given value

wwdata.data_reading_functions module

data_reading_functions provides functionalities for data reading in the context of the wwdata package. Copyright (C) 2016 Chaim De Mulder

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see http://www.gnu.org/licenses/.

wwdata.data_reading_functions.find_and_replace(path, ext, replace)[source]

Finds the files with a certain extension in a directory and applies a find-replace action to those files. Removes the old files and produces files with a prefix stating the replacing value.

Parameters:
  • path (str) – the path name of the directory to apply the function to
  • ext (str) – the extension of the files to be searched (excel, text or csv)
  • replace (array of str) – the first value of replace is the string to be replaced by the second value of replace.
wwdata.data_reading_functions.join_files(path, files, ext='text', sep=', ', comment='#', encoding='utf8', decimal='.')[source]

Reads all files in a given directory, joins them and returns one pd.dataframe

Parameters:
  • path (str) – path to the folder that contains the files to be joined
  • files (list) – list of files to be joined, must be the same extension
  • ext (str) – extension of the files to read; possible: excel, text, csv
  • sep (str) – the separating element used in the files (e.g. ‘,’ for csv-files)
  • comment (str) – comment symbol used in the files
  • sort (array of bool and str) – if first element is true, apply the sort function to sort the data based on the tags in the column mentioned in the second element of the sort array
Returns:

pandas dataframe containing the concatenated files in the given directory

Return type:

pd.dataframe
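The read-and-concatenate behaviour can be sketched with plain pandas; here the files are created on the fly in a temporary directory so the example is self-contained:

```python
import os
import tempfile

import pandas as pd

# Illustrative sketch of what join_files does: read every file in a
# directory and concatenate the results into one dataframe.
tmpdir = tempfile.mkdtemp()
for name, content in [('a.csv', 'x,y\n1,2\n'), ('b.csv', 'x,y\n3,4\n')]:
    with open(os.path.join(tmpdir, name), 'w') as f:
        f.write(content)

frames = [pd.read_csv(os.path.join(tmpdir, f), sep=',', comment='#')
          for f in sorted(os.listdir(tmpdir))]
joined = pd.concat(frames, ignore_index=True)
```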

wwdata.data_reading_functions.list_files(path, ext)[source]

Returns a list of files in a certain folder (‘path’) with a certain extension (‘ext’)

Parameters:
  • path (str) – path to the folder containing the files to be listed
  • ext (str) – extension of the files to be listed; current options are ‘excel’,’text’ or ‘csv’
wwdata.data_reading_functions.read_mat(path)[source]

TO DO: Reads in .mat datafiles and returns them as pd.DataFrame (see http://stackoverflow.com/questions/24762122/read-matlab-data-file-into-python-need-to-export-to-csv)

wwdata.data_reading_functions.remove_empty_lines(path, ext)[source]

Removes the empty lines from files in a certain folder (‘path’) and with a certain extension (‘ext’)

Parameters:
  • path (str) – path to the folder containing the files in which empty lines need to be removed
  • ext (str) – extension of the files in which empty lines need to be removed; current options are ‘excel’,’text’ or ‘csv’
wwdata.data_reading_functions.sort_data(data, based_on, reset_index=[False, 'new_index_name'], convert_to_timestamp=[True, 'time_name', '%d.%m.%Y %H:%M:%S'])[source]

Sorts a dataset based on the values in one of its columns and splits it into separate dataframes, returned together in one dictionary

Parameters:
  • data (pd.dataframe) – the dataframe containing the data that needs to be sorted
  • based_on (str) – the name of the column that contains the names or values the sorting should be based on
  • reset_index ([bool,str]) – array indicating if the index of the sorted datasets should be reset to a new one; if first element is true, the second element is the title of the column to use as new index; default: False
Returns:

A dictionary of pandas dataframes with as labels those acquired from the based_on column

Return type:

dict
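The split-into-a-dictionary behaviour can be sketched with a pandas groupby (an illustrative sketch with made-up tags, not the wwdata implementation):

```python
import pandas as pd

# Illustrative sketch: one dataframe per unique tag in the based_on
# column, collected in a dictionary keyed by tag.
df = pd.DataFrame({'tag': ['A', 'B', 'A', 'B'],
                   'value': [1, 2, 3, 4]})

sorted_data = {tag: group for tag, group in df.groupby('tag')}
```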

wwdata.data_reading_functions.write_to_WEST(df, file_normal, file_west, units, filepath='/Users/chaimdemulder/Documents/Work/github/wwdata/docs', fillna=True)[source]

Writes a text file that is compatible with WEST. Adds the units as they are given in the ‘units’ argument.

Parameters:
  • df (pd.DataFrame) – the dataframe to write to WEST
  • file_normal (str) – name of the original file to write, not yet compatible with WEST
  • file_west (str) – name of the file that needs to be WEST compatible
  • units (array of strings) – array containing the units for the respective columns in df
  • filepath (str) – directory to save the files in; defaults to the current one
  • fillna (bool) – when True, replaces nan values with 0 values (this might avoid WEST problems later on).
Returns:

Return type:

None; writes files

wwdata.time_conversion_functions module

time_conversion_functions provides functionalities for converting certain types of time data to other types, in the context of the wwdata package. Copyright (C) 2016 Chaim De Mulder

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see http://www.gnu.org/licenses/.

wwdata.time_conversion_functions.get_absolute_time(value, date_type='WindowsDateSystem')[source]

Converts a coded time to the absolute date at which the experiment was conducted.

Parameters:
Returns:

python datetime object

Return type:

datetime.datetime
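The docstring does not spell out the conversion, but under the common convention that a Windows Date System serial number counts days since 1899-12-30 (valid for post-1900 dates, with fractions as time of day), it can be sketched as:

```python
import datetime

# Illustrative sketch of the assumed Windows Date System convention:
# a serial number counts days since 1899-12-30.
def serial_to_datetime(serial):
    origin = datetime.datetime(1899, 12, 30)
    return origin + datetime.timedelta(days=serial)
```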

wwdata.time_conversion_functions.make_datetime(array)[source]
Parameters:array (array with elements) – 0: year (yy), 1: month (mm), 2: day in month (dd), 3: hour (h or hh), 4: minutes (minmin)
wwdata.time_conversion_functions.make_month_day_array()[source]

Returns a dataframe containing two columns, one with the number of the month, one with the day of the month. Useful in creating datetime objects based on e.g. date serial numbers from the Window Date System (http://excelsemipro.com/2010/08/date-and-time-calculation-in-excel/)

Returns:dataframe with number of the month and number of the day of the month for a whole year
Return type:pd.DataFrame
wwdata.time_conversion_functions.timedelta_to_abs(timedelta, unit='d')[source]

timedelta : array of timedelta values

wwdata.time_conversion_functions.to_datetime_singlevalue(time)[source]

In case time data is given as strings, it needs to be in the right format to be converted to a datetime object, e.g. dd-mm-yyyy hh:mm:ss (two digits for each element). This function takes care of that, to a certain extent.

wwdata.time_conversion_functions.to_days(timedelta)[source]

timedelta : timedelta value

wwdata.time_conversion_functions.to_hours(timedelta)[source]

timedelta : timedelta value

wwdata.time_conversion_functions.to_minutes(timedelta)[source]

timedelta : timedelta value

wwdata.time_conversion_functions.to_seconds(timedelta)[source]

timedelta : timedelta value
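The conversions behind to_seconds, to_minutes, to_hours and to_days all derive from the total number of seconds in a timedelta, which can be sketched as (illustrative, not the wwdata code):

```python
import pandas as pd

# Illustrative sketch of the unit conversions: everything derives
# from total_seconds().
td = pd.Timedelta(days=1, hours=12)

seconds = td.total_seconds()
minutes = seconds / 60
hours = seconds / 3600
days = seconds / (24 * 3600)
```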

Module contents

Top-level package for wwdata.