Reducers

Question Reducer

This module provides functions to reduce the question task extracts from panoptes_aggregation.extractors.question_extractor.

panoptes_aggregation.reducers.question_reducer.question_reducer(data_list, pairs=False, track_user_ids=False, **kwargs)

Reduce a list of extracted questions into a “counter” dict

Parameters:
  • data_list (list) – A list of extractions created by panoptes_aggregation.extractors.question_extractor.question_extractor()

  • pairs (bool, optional) – Default False. How multiple choice questions are treated. When True the set of all choices is treated as a single answer

  • track_user_ids (bool, optional) – Default False. Set to True to also track the user_ids that gave each answer.

Returns:

reduction – A dictionary (formatted as a Counter) giving the vote count for each key. If track_user_ids is True it will also contain a list of user_ids for each answer given.

Return type:

dict
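As an illustration, the counting logic can be sketched with the standard library's Counter. The function name sketch_question_reducer and the extract shape {'answer': 1} are assumptions for this example, not the library's implementation:

```python
from collections import Counter

def sketch_question_reducer(data_list, pairs=False):
    """Sketch of the question reduction: tally votes across extracts.

    Each extract is assumed to be a dict mapping answer keys to 1.
    """
    counter = Counter()
    for extract in data_list:
        if pairs:
            # treat the full set of choices as a single answer
            counter['+'.join(sorted(extract))] += 1
        else:
            counter.update(extract)
    return dict(counter)

votes = sketch_question_reducer([{'yes': 1}, {'yes': 1}, {'no': 1}])
# votes == {'yes': 2, 'no': 1}
```

With pairs=True a multiple-choice extract such as {'a': 1, 'b': 1} is counted once under the joined key 'a+b' rather than once per choice.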


Question Consensus Reducer

This module provides functions to reduce the question task extracts from panoptes_aggregation.extractors.question_extractor.

panoptes_aggregation.reducers.question_consensus_reducer.question_consensus_reducer(data_list, pairs=False, **kwargs)

Reduce a list of extracted questions into a consensus description dict

Parameters:
  • data_list (list) – A list of extractions created by panoptes_aggregation.extractors.question_extractor.question_extractor()

  • pairs (bool, optional) – Default False. How multiple choice questions are treated. When True the set of all choices is treated as a single answer

Returns:

reduction – A dictionary with the following keys

  • most_likely : key with greatest number of classifications/votes

  • num_votes : vote count for most likely key

  • agreement : fraction of total votes held by most likely key.

Return type:

dict
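Given a vote-count dict like the one produced by question_reducer, the consensus values can be computed as sketched below (sketch_consensus is a hypothetical name for this example; agreement is the most-likely key's share of the total votes, as described above):

```python
from collections import Counter

def sketch_consensus(vote_counts):
    """Sketch of the consensus reduction from a vote-count dict."""
    total = sum(vote_counts.values())
    # most_common(1) returns the (key, count) pair with the most votes
    most_likely, num_votes = Counter(vote_counts).most_common(1)[0]
    return {
        'most_likely': most_likely,
        'num_votes': num_votes,
        'agreement': num_votes / total,
    }

sketch_consensus({'yes': 3, 'no': 1})
# {'most_likely': 'yes', 'num_votes': 3, 'agreement': 0.75}
```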


Slider Reducer

This module provides functions to reduce the slider task extracts from panoptes_aggregation.extractors.slider_extractor.

panoptes_aggregation.reducers.slider_reducer.process_data(data, pairs=False)

Process a list of extracted slider values into a list

Parameters:

data (list) – A list of extractions created by panoptes_aggregation.extractors.slider_extractor.slider_extractor()

Returns:

processed_data – A list of slider values, one for each extraction

Return type:

list

panoptes_aggregation.reducers.slider_reducer.slider_reducer(votes_list)

Reduce a list of slider values into a mean and median

Parameters:

votes_list (list) – A list of slider values from process_data()

Returns:

reduction – A dictionary giving the mean, median, and variance of the slider values

Return type:

dict
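The statistics involved can be sketched with the standard library; the function name and the output keys ('mean', 'median', 'var') are assumptions for this example, not necessarily the exact keys the reducer emits:

```python
import statistics

def sketch_slider_reducer(votes_list):
    """Sketch: reduce slider values to their mean, median, and variance."""
    return {
        'mean': statistics.mean(votes_list),
        'median': statistics.median(votes_list),
        'var': statistics.pvariance(votes_list),  # population variance
    }

stats = sketch_slider_reducer([1, 2, 3])
# stats['mean'] == 2, stats['median'] == 2
```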


Point Reducer

This module provides functions to cluster points extracted with panoptes_aggregation.extractors.point_extractor.

panoptes_aggregation.reducers.point_reducer.point_reducer(data_by_tool, **kwargs)

Cluster a list of points by tool using DBSCAN

This reducer is for use with panoptes_aggregation.extractors.point_extractor, which does not separate points by frame and does not support subtask reduction. Use panoptes_aggregation.extractors.point_extractor_by_frame and panoptes_aggregation.reducers.point_reducer_dbscan if there are multiple frames or subtasks.

Parameters:
  • data_by_tool (dict) – A dictionary returned by process_data()

  • kwargs – See DBSCAN

Returns:

reduction – A dictionary with the following keys

  • tool*_points_x : A list of x positions for all points drawn with tool*

  • tool*_points_y : A list of y positions for all points drawn with tool*

  • tool*_cluster_labels : A list of cluster labels for all points drawn with tool*

  • tool*_clusters_count : The number of points in each cluster found

  • tool*_clusters_x : The x position for each cluster found

  • tool*_clusters_y : The y position for each cluster found

  • tool*_clusters_var_x : The x variance of points in each cluster found

  • tool*_clusters_var_y : The y variance of points in each cluster found

  • tool*_clusters_var_x_y : The x-y covariance of points in each cluster found

Return type:

dict

panoptes_aggregation.reducers.point_reducer.process_data(data)

Process a list of extractions into lists of x and y sorted by tool.

Parameters:

data (list) – A list of extractions created by panoptes_aggregation.extractors.point_extractor.point_extractor()

Returns:

processed_data – A dictionary with each key being a tool and each value being a list of (x, y) tuples

Return type:

dict
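The per-cluster outputs (count, centroid, variance) can be sketched given points and DBSCAN-style labels, where a label of -1 marks noise. The function name, the use of sample variance, and requiring at least two points per cluster are assumptions of this sketch:

```python
import statistics
from collections import defaultdict

def sketch_cluster_summary(points_x, points_y, labels):
    """Sketch of the tool*_clusters_* outputs: count, centroid, and
    variance per cluster label. Label -1 (DBSCAN noise) is skipped.
    Assumes at least two points per cluster (statistics.variance
    needs n >= 2)."""
    groups = defaultdict(list)
    for x, y, label in zip(points_x, points_y, labels):
        if label != -1:
            groups[label].append((x, y))
    summary = {'clusters_count': [], 'clusters_x': [], 'clusters_y': [],
               'clusters_var_x': [], 'clusters_var_y': []}
    for label in sorted(groups):
        xs, ys = zip(*groups[label])
        summary['clusters_count'].append(len(xs))
        summary['clusters_x'].append(statistics.mean(xs))
        summary['clusters_y'].append(statistics.mean(ys))
        summary['clusters_var_x'].append(statistics.variance(xs))
        summary['clusters_var_y'].append(statistics.variance(ys))
    return summary

summary = sketch_cluster_summary(
    [0, 2, 10, 12, 99], [0, 0, 10, 10, 99], [0, 0, 1, 1, -1])
# the noise point (99, 99) is ignored; two clusters remain
```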


Point Reducer DBSCAN

This module provides functions to cluster points extracted with panoptes_aggregation.extractors.point_extractor.

panoptes_aggregation.reducers.point_reducer_dbscan.point_reducer_dbscan(data_by_tool, **kwargs)

Cluster a list of points by tool using DBSCAN

Parameters:
  • data_by_tool (dict) – A dictionary returned by process_data()

  • kwargs – See DBSCAN

Returns:

reduction – A dictionary with one key per subject frame. Each frame has the following keys

  • tool*_points_x : A list of x positions for all points drawn with tool*

  • tool*_points_y : A list of y positions for all points drawn with tool*

  • tool*_cluster_labels : A list of cluster labels for all points drawn with tool*

  • tool*_clusters_count : The number of points in each cluster found

  • tool*_clusters_x : The x position for each cluster found

  • tool*_clusters_y : The y position for each cluster found

  • tool*_clusters_var_x : The x variance of points in each cluster found

  • tool*_clusters_var_y : The y variance of points in each cluster found

  • tool*_clusters_var_x_y : The x-y covariance of points in each cluster found

Return type:

dict


Point Reducer HDBSCAN

This module provides functions to cluster points extracted with panoptes_aggregation.extractors.point_extractor.

panoptes_aggregation.reducers.point_reducer_hdbscan.point_reducer_hdbscan(data_by_tool, **kwargs)

Cluster a list of points by tool using HDBSCAN

Parameters:
  • data_by_tool (dict) – A dictionary returned by process_data()

  • kwargs – See HDBSCAN

Returns:

reduction – A dictionary with one key per subject frame. Each frame has the following keys

  • tool*_points_x : A list of x positions for all points drawn with tool*

  • tool*_points_y : A list of y positions for all points drawn with tool*

  • tool*_cluster_labels : A list of cluster labels for all points drawn with tool*

  • tool*_cluster_probabilities : A list of cluster probabilities for all points drawn with tool*

  • tool*_clusters_count : The number of points in each cluster found

  • tool*_clusters_x : The weighted x position for each cluster found

  • tool*_clusters_y : The weighted y position for each cluster found

  • tool*_clusters_var_x : The weighted x variance of points in each cluster found

  • tool*_clusters_var_y : The weighted y variance of points in each cluster found

  • tool*_clusters_var_x_y : The weighted x-y covariance of points in each cluster found

Return type:

dict


Rectangle Reducer

This module provides functions to cluster rectangles extracted with panoptes_aggregation.extractors.rectangle_extractor.

panoptes_aggregation.reducers.rectangle_reducer.process_data(data)

Process a list of extractions into lists of x and y sorted by frame and tool

Parameters:

data (list) – A list of extractions created by panoptes_aggregation.extractors.rectangle_extractor.rectangle_extractor()

Returns:

processed_data – A dictionary with one key per frame; each value is a dictionary keyed by tool, with a list of (x, y, width, height) tuples as the value

Return type:

dict

panoptes_aggregation.reducers.rectangle_reducer.rectangle_reducer(data_by_tool, **kwargs)

Cluster a list of rectangles by tool and frame

Parameters:
  • data_by_tool (dict) – A dictionary returned by process_data()

  • kwargs – See DBSCAN

Returns:

reduction – A dictionary with the following keys for each frame

  • tool*_rec_x : A list of x positions for all rectangles drawn with tool*

  • tool*_rec_y : A list of y positions for all rectangles drawn with tool*

  • tool*_rec_width : A list of width values for all rectangles drawn with tool*

  • tool*_rec_height : A list of height values for all rectangles drawn with tool*

  • tool*_cluster_labels : A list of cluster labels for all rectangles drawn with tool*

  • tool*_clusters_count : The number of points in each cluster found

  • tool*_clusters_x : The x position for each cluster found

  • tool*_clusters_y : The y position for each cluster found

  • tool*_clusters_width : The width value for each cluster found

  • tool*_clusters_height : The height value for each cluster found

Return type:

dict


Shape Reducer DBSCAN

This module provides functions to cluster shapes extracted with panoptes_aggregation.extractors.shape_extractor.

panoptes_aggregation.reducers.shape_reducer_dbscan.shape_reducer_dbscan(data_by_tool, **kwargs)

Cluster shapes by tool using DBSCAN

Parameters:
  • data_by_tool (dict) – A dictionary returned by process_data()

  • metric_type (str) – Either “euclidean” to use a euclidean metric in the N-dimension shape parameter space or “IoU” for the intersection of union metric based on shape overlap. The IoU metric can only be used with the following shapes:

    • rectangle

    • rotateRectangle

    • circle

    • ellipse

  • kwargs – See DBSCAN

Returns:

reduction – A dictionary with the following keys for each frame

  • tool*_<shape>_<param> : A list of all param values for the shapes drawn with tool*

  • tool*_cluster_labels : A list of cluster labels for all shapes drawn with tool*

  • tool*_clusters_count : The number of points in each cluster found

  • tool*_clusters_<param> : The param value for each cluster found

If the “IoU” metric type is used there is also

  • tool*_clusters_sigma : The standard deviation of the average shape under the IoU metric

Return type:

dict
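The “IoU” (intersection over union) idea can be illustrated for the simplest supported shape, an axis-aligned rectangle given as (x, y, width, height). This is a sketch only: the library's metric also covers rotated rectangles, circles, and ellipses, and the clustering distance is typically 1 - IoU.

```python
def rectangle_iou(a, b):
    """IoU of two axis-aligned rectangles given as (x, y, width, height).

    Returns 1.0 for identical rectangles and 0.0 for disjoint ones.
    """
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # overlap extent along each axis, clamped at zero
    overlap_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    overlap_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = overlap_w * overlap_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0

rectangle_iou((0, 0, 2, 2), (1, 0, 2, 2))  # half-shifted squares -> 1/3
```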


Shape Reducer OPTICS

This module provides functions to cluster shapes extracted with panoptes_aggregation.extractors.shape_extractor.

panoptes_aggregation.reducers.shape_reducer_optics.shape_reducer_optics(data_by_tool, **kwargs)

Cluster shapes by tool using OPTICS

Parameters:
  • data_by_tool (dict) – A dictionary returned by process_data()

  • metric_type (str) – Either “euclidean” to use a euclidean metric in the N-dimension shape parameter space or “IoU” for the intersection of union metric based on shape overlap. The IoU metric can only be used with the following shapes:

    • rectangle

    • rotateRectangle

    • circle

    • ellipse

  • kwargs – See OPTICS

Returns:

reduction – A dictionary with the following keys for each frame

  • tool*_<shape>_<param> : A list of all param values for the shapes drawn with tool*

  • tool*_cluster_labels : A list of cluster labels for all shapes drawn with tool*

  • tool*_clusters_count : The number of points in each cluster found

  • tool*_clusters_<param> : The param value for each cluster found

If the “IoU” metric type is used there is also

  • tool*_clusters_sigma : The standard deviation of the average shape under the IoU metric

Return type:

dict


Shape Reducer HDBSCAN

This module provides functions to cluster shapes extracted with panoptes_aggregation.extractors.shape_extractor.

panoptes_aggregation.reducers.shape_reducer_hdbscan.shape_reducer_hdbscan(data_by_tool, **kwargs)

Cluster shapes by tool using HDBSCAN

Parameters:
  • data_by_tool (dict) – A dictionary returned by process_data()

  • metric_type (str) – Either “euclidean” to use a euclidean metric in the N-dimension shape parameter space or “IoU” for the intersection of union metric based on shape overlap. The IoU metric can only be used with the following shapes:

    • rectangle

    • rotateRectangle

    • circle

    • ellipse

  • kwargs – See HDBSCAN

Returns:

reduction – A dictionary with the following keys for each frame

  • tool*_<shape>_<param> : A list of all param values for the shapes drawn with tool*

  • tool*_cluster_labels : A list of cluster labels for all shapes drawn with tool*

  • tool*_cluster_probabilities : A list of cluster probabilities for all shapes drawn with tool*

  • tool*_clusters_count : The number of points in each cluster found

  • tool*_clusters_<param> : The param value for each cluster found

If the “IoU” metric type is used there is also

  • tool*_clusters_sigma : The standard deviation of the average shape under the IoU metric

Return type:

dict


Polygon/Freehand Tool Reducer Using DBSCAN

This module provides functions to reduce the polygon extractions from both panoptes_aggregation.extractors.polygon_extractor and panoptes_aggregation.extractors.bezier_extractor using the algorithm DBSCAN.

All polygons are assumed to be closed. Any unclosed polygons will be closed.

panoptes_aggregation.reducers.polygon_reducer.polygon_reducer(data_by_tool, **kwargs_dbscan)

Cluster polygon/freehand/Bezier tools using DBSCAN.

There is a choice in how each cluster is averaged into a single shape, with the various choices listed below.

A custom “IoU” metric type is used to measure the distance between the polygons.

Parameters:
  • data_by_tool (dict) – A dictionary returned by process_data()

  • kwargs

    • See DBSCAN

    • average_type : Must be either “union”, which returns the union of the cluster, “intersection” which returns the intersection of the cluster, “last”, which returns the last polygon to be created in the cluster, or “median”, which returns the polygon with the minimum total distance to the other polygons. Defaults to “median”.

    • created_at : A list of when the classifications were made.

Returns:

reduction – A dictionary with the following keys for each frame, task and tool:

  • tool*_cluster_labels : A list of cluster labels for polygons provided for this frame and tool

  • tool*_clusters_count : The number of points in each cluster found for this frame and tool

  • tool*_clusters_x : A list of the x values of each cluster

  • tool*_clusters_y : A list of the y values of each cluster

  • tool*_consensus : A list of the overall consensus of each cluster. A value of 1 is perfect agreement, a value of 0 is complete disagreement. This is found by subtracting IoU_cluster_mean_distance from 1

Return type:

dict

panoptes_aggregation.reducers.polygon_reducer.process_data(data)

Process a list of extractions into a dictionary organized by frame, task, and tool.

This also closes and simplifies the polygons.

Parameters:

data (list) – A list of extractions created by panoptes_aggregation.extractors.polygon_extractor() or panoptes_aggregation.extractors.bezier_extractor()

Returns:

data_by_tool – A dictionary with one key for each frame of the subject and each tool used for the classification. The value for each key is a dictionary with two keys X and data. X is a 2D array with each row mapping to the data held in data. The first column contains row indices and the second column is an index assigned to each user. data is a list of dictionaries, which contains the polygon data to be reduced. It is of the form {‘polygon’: shapely.geometry.polygon.Polygon, ‘gold_standard’: bool}.

Return type:

dict


Polygon/Freehand Tool Reducer Using DBSCAN - Contours

This module is an extension of panoptes_aggregation.reducers.polygon_reducer to provide the contours of intersection/overlap. These can be used to estimate the cluster average and its uncertainty.

All polygons are assumed to be closed. Any unclosed polygons will be closed.

Note, this reduction is one cluster per row.

panoptes_aggregation.reducers.polygon_reducer_contours.polygon_reducer_contours(data_by_tool, **kwargs_dbscan)

Cluster polygon/freehand/Bezier tools using DBSCAN, then find the contours of each cluster.

The contours are defined by the overlap/intersection of the polygons in the cluster. Each contour is the union of at least as many overlapping polygons as its position in the list; e.g. the second contour is the largest area where at least two volunteers agree, the third where at least three agree, and so on.

A custom “IoU” metric type is used.

This reduction will take much longer than panoptes_aggregation.reducers.polygon_reducer. As it returns a list rather than a dictionary, this may cause issues with any subsequent data processing with Caesar.

The default method for finding the contours is slow but accurate. However, the algorithm time per cluster increases approximately exponentially with the number of polygons in the cluster. Therefore, for clusters with many polygons, a more efficient but less accurate rasterisation-based approach is used. This can be used instead of the default by setting the kwarg rasterisation to True.

Parameters:
  • data_by_tool (dict) – A dictionary returned by panoptes_aggregation.reducers.process_data

  • average_type (str) – Either “union”, which returns the union of the cluster, “intersection”, which returns the intersection of the cluster, “last”, which returns the last polygon to be annotated in the cluster, or “median”, which returns the polygon with minimal IoU distance to the other polygons of the cluster.

  • kwargs

    • rasterisation/rasterization : String/boolean. If True the contours are found using rasterisation, if False intersections are used. Defaults to ‘auto’, which uses rasterisation if there are more than 9 polygons in the cluster.

    • num_grid_points: An integer which defines the number of grid points per axis when rasterisation is True. A higher number results in more accuracy but also increases computational time. Defaults to 100.

    • smoothing: A string to choose the type of smoothing used for rasterisation (if used). If ‘minimal_sides’, the number of sides of the contour is minimised. If ‘rounded’, corners are rounded. If ‘no_smoothing’, no smoothing is done. Defaults to ‘minimal_sides’.

    • See DBSCAN

Returns:

reduction – A list of dictionaries. Each dictionary has the following keys for each frame, task and tool:

  • tool*_cluster_labels : A list of cluster labels for polygons provided. This is for all of the clusters for this frame and tool

  • tool*_cluster_label_for_contours : The index of the cluster whose contours are listed, corresponding to the labels in tool*_cluster_labels

  • tool*_number_of_contours : The number of contours of the cluster

  • tool*_contours_x : A list of the x values of each contour

  • tool*_contours_y : A list of the y values of each contour

  • tool*_consensus : A list of the overall consensus of each cluster. A value of 1 is perfect agreement, a value of 0 is complete disagreement. This is found by subtracting IoU_cluster_mean_distance from 1

Return type:

list


Survey Reducer

This module provides functions to reduce survey task extracts from panoptes_aggregation.extractors.survey_extractor.

panoptes_aggregation.reducers.survey_reducer.process_data(data)

Process a list of extracted survey data into a dictionary of sub-question answers organized by choice

Parameters:

data (list) – A list of extractions created by panoptes_aggregation.extractors.survey_extractor.survey_extractor()

Returns:

processed_data – A dictionary where the keys are the choice made and the values are a list of dicts containing Counters for each sub-question asked.

Return type:

dict

panoptes_aggregation.reducers.survey_reducer.survey_reducer(data_in)

Reduce the survey task answers as a list of dicts (one for each choice marked)

Parameters:

data_in (dict) – A dictionary created by process_data()

Returns:

reduction – A list that has one element per choice marked. Each element is a dict of the form

  • choice : The choice made

  • total_vote_count : The number of users that classified the subject

  • choice_count : The number of users that made this choice

  • answers_* : Counters for each answer to sub-question *

Return type:

list
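The shape of the survey reduction can be sketched as follows. The input shape (a dict mapping each choice to a list of per-classification dicts of Counters keyed by sub-question) and the function name are assumptions for this example:

```python
from collections import Counter

def sketch_survey_reducer(data_in, total_vote_count):
    """Sketch of the survey reduction: one output dict per choice.

    `total_vote_count` is the number of users who classified the subject.
    """
    reduction = []
    for choice, extract_list in data_in.items():
        answers = {}
        for extract in extract_list:
            # merge each sub-question's Counter into a running total
            for sub_question, counts in extract.items():
                key = 'answers_{0}'.format(sub_question)
                answers.setdefault(key, Counter()).update(counts)
        reduction.append({
            'choice': choice,
            'total_vote_count': total_vote_count,
            'choice_count': len(extract_list),
            **answers,
        })
    return reduction
```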


Polygon As Line Tool for Text Reducer

This module provides functions to reduce the polygon-text extractions from panoptes_aggregation.extractors.poly_line_text_extractor.

panoptes_aggregation.reducers.poly_line_text_reducer.poly_line_text_reducer(data_by_frame, **kwargs_dbscan)

Reduce the polygon-text answers as a list of lines of text.

Parameters:
  • data_by_frame (dict) – A dictionary returned by process_data()

  • kwargs

    • See DBSCAN

    • eps_slope : How close the angles of two lines need to be in order to be placed in the same angle cluster.

    • eps_line : How close vertically two lines need to be in order to be identified as the same line.

    • eps_word : How close horizontally the end points of a line need to be in order to be identified as a single point.

    • gutter_tol : How much neighboring columns can overlap horizontally and still be identified as multiple columns.

    • dot_freq : “line” if dots are drawn at the start and end point of a line, “word” if dots are drawn between each word. Note: “word” was proposed for a project but was never used and likely never will be. It will likely be deprecated in a future release.

    • min_samples : For all clustering stages this is how many points need to be close together for a cluster to be identified. Set this to 1 for all annotations to be kept

    • min_word_count : The minimum number of times a word must be identified for it to be kept in the consensus text.

    • low_consensus_threshold : The minimum consensus score allowed to be considered “done”

    • minimum_views : A value that is passed along to the front-end to set when lines should turn grey (has no effect on aggregation)

Returns:

reduction – A dictionary with one key for each frame of the subject, each with a list as its value. Each item of the list represents one transcribed line of text and is a dictionary with these keys:

  • clusters_x : the x position of each identified word

  • clusters_y : the y position of each identified word

  • clusters_text : A list of text at each cluster position

  • gutter_label : A label indicating what “gutter” cluster the line is from

  • line_slope: The slope of the line of text in degrees

  • slope_label : A label indicating what slope cluster the line is from

  • number_views : The number of users that transcribed the line of text

  • consensus_score : The average number of users whose text agreed for the line. Note: if consensus_score is the same as number_views, every user agreed with each other

  • low_consensus : True if the consensus_score is less than the threshold set by the low_consensus_threshold keyword

For the entire subject the following are also returned:

  • low_consensus_lines : The number of lines with low consensus

  • transcribed_lines : The total number of lines transcribed on the subject

Note: the image coordinate system has y increasing downward.

Return type:

dict

panoptes_aggregation.reducers.poly_line_text_reducer.process_data(data_list, process_by_line=False)

Process a list of extractions into a dictionary of loc and text organized by frame

Parameters:

data_list (list) – A list of extractions created by panoptes_aggregation.extractors.poly_line_text_extractor.poly_line_text_extractor()

Returns:

processed_data – A dictionary with keys for each frame of the subject and values being dictionaries with x, y, text, and slope keys. x, y, and text are list-of-lists, each inner list is from a single annotation, slope is the list of slopes (in deg) for each of these inner lists.

Return type:

dict


Text aggregation utilities

This module provides utility functions used in the polygon-as-line-text reducer code from panoptes_aggregation.reducers.poly_line_text_reducer.

panoptes_aggregation.reducers.text_utils.align_words(word_line, xy_line, text_line, kwargs_cluster, kwargs_dbscan)

A function that takes the annotations for one line of text, aligns the words, and finds the end-points for the line.

Parameters:
  • word_line (np.array) – An nx1 array with the x-position of each dot in the rotated coordinate frame.

  • xy_line (np.array) – An nx2 array with the non-rotated (x, y) positions of each dot.

  • text_line (np.array) – An nx1 array with the text for each dot.

  • gs_line (np.array) – An array of bools indicating if the annotation was made in gold standard mode

  • kwargs_cluster (dict) – A dictionary containing the eps_* and dot_freq keywords

  • kwargs_dbscan (dict) – A dictionary containing all the other DBSCAN keywords

Returns:

  • clusters_x (list) – A list with the start and end x-position of the line

  • clusters_y (list) – A list with the start and end y-position of the line

  • clusters_text (list) – A list-of-lists with the words transcribed at each dot cluster found. One list per cluster. Note: the empty strings that were added to each annotation are stripped before returning the words.

panoptes_aggregation.reducers.text_utils.angle_metric(t1, t2)

A metric for the distance between angles in the [-180, 180] range

Parameters:
  • t1 (float) – Theta one in degrees

  • t2 (float) – Theta two in degrees

Returns:

distance – The distance between the two input angles in degrees

Return type:

float

panoptes_aggregation.reducers.text_utils.avg_angle(theta)

A function that finds the average of an array of angles that are in the range [-180, 180].

Parameters:

theta (array) – An array of angles that are in the range [-180, 180] degrees

Returns:

average – The average angle

Return type:

float
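The two angle utilities above can be sketched in a few lines; the function names here are hypothetical. The key point of the averaging is that summing unit vectors avoids the wrap-around problem at ±180 degrees, where a plain arithmetic mean of [179, -179] would wrongly give 0:

```python
import math

def sketch_angle_metric(t1, t2):
    """Distance in degrees between two angles in the [-180, 180] range."""
    distance = abs(t1 - t2) % 360
    return min(distance, 360 - distance)

def sketch_avg_angle(theta):
    """Average an array of angles (degrees) by summing unit vectors."""
    x = sum(math.cos(math.radians(t)) for t in theta)
    y = sum(math.sin(math.radians(t)) for t in theta)
    return math.degrees(math.atan2(y, x))

sketch_angle_metric(170, -170)  # 20, not 340
```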

panoptes_aggregation.reducers.text_utils.cluster_by_gutter(x_slope, y_slope, text_slope, gs_slope, data_index_slope, ext_index_slope, kwargs_cluster, kwargs_dbscan)

A function to take the annotations for each frame of a subject and group them based on what side of the page gutter they are on.

Parameters:
  • x_slope (np.array) – A list-of-lists of the x values for each drawn dot. There is one item in the list for each annotation made by the user.

  • y_slope (np.array) – A list-of-lists of the y values for each drawn dot. There is one item in the list for each annotation made by the user.

  • text_slope (np.array) – A list-of-lists of the text for each drawn dot. There is one item in the list for each annotation made by the user.

  • gs_slope (np.array) – A list of bools indicating if the annotation was made in gold standard mode

  • data_index_slope (np.array) – A list of indices indicating what classification each annotation came from

  • ext_index_slope (np.array) – A list of extractor indices used to map the reduction to the extract

  • kwargs_cluster (dict) – A dictionary containing the eps_* and dot_freq keywords

  • kwargs_dbscan (dict) – A dictionary containing all the other DBSCAN keywords

Returns:

frame_gutter – A list of the resulting extractions, one item per line of text found.

Return type:

list

panoptes_aggregation.reducers.text_utils.cluster_by_line(xy_rotate, xy_gutter, text_gutter, annotation_labels, gs_gutter, data_index_gutter, ext_index_gutter, kwargs_cluster, kwargs_dbscan)

A function to take the annotations for one slope_label and cluster them based on perpendicular distance (e.g. lines of text).

Parameters:
  • xy_rotate (np.array) – An array of shape nx2 containing the (x, y) positions of each dot drawn in the rotate coordinate frame.

  • xy_gutter (np.array) – An array of shape nx2 containing the (x, y) positions for each dot drawn.

  • text_gutter (np.array) – An array of shape nx1 containing the text for each dot drawn. Note: each annotation has an empty string added to the end so this array has the same shape as xy_slope.

  • annotation_labels (np.array) – An array of shape nx1 containing a unique label indicating what annotation each position/text came from. This information is used to ensure one annotation does not span multiple lines.

  • gs_gutter (np.array) – An array of bools indicating if the annotation was made in gold standard mode

  • data_index_gutter (np.array) – An array of indices indicating what classification each annotation came from

  • ext_index_gutter (np.array) – A list of extractor indices used to map the reduction to the extract

  • kwargs_cluster (dict) – A dictionary containing the eps_*, and dot_freq keywords

  • kwargs_dbscan (dict) – A dictionary containing all the other DBSCAN keywords

Returns:

frame_lines – A list of reductions, one for each line. Each reduction is a dictionary containing the information for the line.

Return type:

list

panoptes_aggregation.reducers.text_utils.cluster_by_slope(x_frame, y_frame, text_frame, slope_frame, gs_frame, data_index_frame, ext_index_frame, kwargs_cluster, kwargs_dbscan)

A function to take the annotations for one gutter_label and cluster them based on what slope the transcription is.

Parameters:
  • x_frame (np.array) – A list-of-lists of the x values for each drawn dot. There is one item in the list for each annotation made by the user.

  • y_frame (np.array) – A list-of-lists of the y values for each drawn dot. There is one item in the list for each annotation made by the user.

  • text_frame (np.array) – A list-of-lists of the text for each drawn dot. There is one item in the list for each annotation made by the user. The inner text lists are padded with an empty string at the end so there is the same number of words as there are dots.

  • slope_frame (np.array) – A list of the slopes (in deg) for each annotation

  • gs_frame (np.array) – A list of bools indicating if the annotation was made in gold standard mode

  • data_index_frame (np.array) – A list of indices indicating what classification each annotation came from

  • ext_index_frame (np.array) – A list of extractor indices used to map the reduction to the extract

  • kwargs_cluster (dict) – A dictionary containing the eps_* and dot_freq keywords

  • kwargs_dbscan (dict) – A dictionary containing all the other DBSCAN keywords

Returns:

frame_slope – A list of the resulting extractions, one item per line of text found.

Return type:

list

panoptes_aggregation.reducers.text_utils.cluster_by_word(word_line, xy_line, text_line, annotation_labels, kwargs_cluster, kwargs_dbscan)

A function to take the annotations for one line of text and cluster them based on the words in the line.

Parameters:
  • word_line (np.array) – An nx1 array with the x-position of each dot in the rotated coordinate frame.

  • xy_line (np.array) – An nx2 array with the non-rotated (x, y) positions of each dot.

  • text_line (np.array) – An nx1 array with the text for each dot.

  • annotation_labels (np.array) – An nx1 array with a label indicating what annotation each word belongs to.

  • kwargs_cluster (dict) – A dictionary containing the eps_* and dot_freq keywords

  • kwargs_dbscan (dict) – A dictionary containing all the other DBSCAN keywords

Returns:

  • clusters_x (list) – A list with the x-position of each dot cluster found

  • clusters_y (list) – A list with the y-position of each dot cluster found

  • clusters_text (list) – A list-of-lists with the words transcribed at each dot cluster found. One list per cluster. Note: the empty strings that were added to each annotation are stripped before returning the words.

panoptes_aggregation.reducers.text_utils.consensus_score(clusters_text)

A function to take clustered text data and return the consensus score

Parameters:

clusters_text (list) – A list-of-lists with length equal to the number of words in a line of text and each inner list contains the transcriptions for each word.

Returns:

  • consensus_score (float) – A value indicating the average number of users that agree on the line of text.

  • consensus_text (str) – A string with the consensus sentence
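A plausible sketch of this score, assuming the input shape described above (one inner list of word transcriptions per word position): for each position, count how many transcriptions match the most common word, then average those counts. The function name is hypothetical.

```python
from collections import Counter

def sketch_consensus_score(clusters_text):
    """Sketch: average agreement on the most common word per position,
    plus the consensus sentence built from those most common words."""
    scores = [Counter(words).most_common(1)[0][1] for words in clusters_text]
    consensus_text = ' '.join(Counter(words).most_common(1)[0][0]
                              for words in clusters_text)
    return sum(scores) / len(scores), consensus_text

sketch_consensus_score([['the', 'the', 'thc'], ['cat', 'cat', 'cat']])
# (2.5, 'the cat')
```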

panoptes_aggregation.reducers.text_utils.gutter(lines_in, tol=0)

Cluster list of input line segments by what side of the page gutter they are on.

Parameters:
  • lines_in (list) – A list-of-lists containing one line segment per item. Each line segment should contain only the x-coordinate of each point on the line.

  • tol (float) – The tolerance used when deciding if two line segments overlap. Default 0.

Returns:

gutter_index – A numpy array containing the cluster label for each input line. This label indicates what side of the gutter(s) the input line segment is on.

Return type:

array

panoptes_aggregation.reducers.text_utils.overlap(x, y, tol=0)

Check if two line segments overlap

Parameters:
  • x (list) – A list with the start and end point of the first line segment

  • y (list) – A list with the start and end point of the second line segment

  • tol (float) – The tolerance to consider lines overlapping. Default 0; positive values indicate small overlaps are not considered, negative values indicate small gaps are not considered.

Returns:

overlap – True if the two line segments overlap, False otherwise

Return type:

bool
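The tolerance rule above can be sketched in a few lines (hypothetical helper, assuming each segment is given as a two-element list): the segments overlap when their shared length exceeds tol, so a positive tol discards small overlaps and a negative tol tolerates small gaps.

```python
def overlap_sketch(x, y, tol=0):
    """Sketch: segments overlap when their shared length exceeds tol.

    Positive tol ignores small overlaps; negative tol treats small
    gaps as overlapping."""
    shared = min(max(x), max(y)) - max(min(x), min(y))
    return shared > tol

print(overlap_sketch([0, 5], [4, 10]))          # True  (overlap of 1)
print(overlap_sketch([0, 5], [4, 10], tol=2))   # False (overlap too small)
print(overlap_sketch([0, 5], [6, 10], tol=-2))  # True  (gap of 1 tolerated)
```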

panoptes_aggregation.reducers.text_utils.sort_labels(db_labels, data, reducer=<function mean>, descending=False)

A function that takes the cluster labels for some data and returns the unique labels sorted by the reduced value of the original data.

Parameters:
  • db_labels (list) – A list of cluster labels, one label for each data point.

  • data (np.array) – The data the labels belong to

  • reducer (function (optional)) – The function used to combine the data for each label. Default: np.mean

  • descending (bool (optional)) – A flag indicating if the labels should be sorted in descending order. Default: False

Returns:

labels – A list of unique cluster labels sorted in either ascending or descending order.

Return type:

list
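The behavior can be sketched as follows (hypothetical helper, not the library's implementation): reduce the data for each unique label, then order the labels by the reduced values.

```python
import numpy as np

def sort_labels_sketch(db_labels, data, reducer=np.mean, descending=False):
    """Sketch: order the unique labels by the reduced value of their data."""
    labels = np.asarray(db_labels)
    unique = np.unique(labels)
    keys = [reducer(data[labels == lab]) for lab in unique]
    order = np.argsort(keys)
    if descending:
        order = order[::-1]
    return [int(lab) for lab in unique[order]]

labels = [0, 0, 1, 1, 2]
data = np.array([10.0, 12.0, 1.0, 3.0, 5.0])
print(sort_labels_sketch(labels, data))  # [1, 2, 0] (means 2.0, 5.0, 11.0)
```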

panoptes_aggregation.reducers.text_utils.tokenize(self, contents)

Tokenize only on space so angle bracket tags are not split


Shakespeares World Variants Reducer

This module provides a fuction to reduce the variants data from extracts.

panoptes_aggregation.reducers.sw_variant_reducer.sw_variant_reducer(extracts)

Reduce all variants for a subject into one list

Parameters:

extracts (list) – A list of extracts created by panoptes_aggregation.extractors.sw_variant_extractor.sw_variant_extractor()

Returns:

reduction – A dictionary with at most one key, variants, containing the list of all variants in the subject

Return type:

dict


panoptes_aggregation.reducers.dropdown_reducer.dropdown_reducer(votes_list)

Reduce a list-of-lists of Counter objects into one list of dicts

Parameters:

votes_list (list) – A list-of-lists of Counter objects from process_data()

Returns:

reduction – A dictionary with one key whose value contains a list of dictionaries (one for each dropdown in the task) giving the vote count for each key

Return type:

dict

panoptes_aggregation.reducers.dropdown_reducer.process_data(data)

Process a list of extracted dropdown answers into Counter objects

Parameters:

data (list) – A list of extractions created by panoptes_aggregation.extractors.dropdown_extractor.dropdown_extractor()

Returns:

process_data – A list-of-lists of Counter objects. There is one element of the outer list for each classification made, and one element of the inner list for each dropdown list in the task.

Return type:

list
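The two-step shape above can be sketched as follows (hypothetical helper names; the output key 'value' is an assumption for illustration): sum the Counters at each dropdown position across all classifications.

```python
from collections import Counter

def dropdown_reducer_sketch(votes_list):
    """Sketch: sum the Counters for each dropdown position across
    classifications. The 'value' key is an assumed output name."""
    n_dropdowns = len(votes_list[0])
    totals = [Counter() for _ in range(n_dropdowns)]
    for classification in votes_list:
        for idx, counter in enumerate(classification):
            totals[idx].update(counter)
    return {'value': [dict(t) for t in totals]}

# two classifications, each with two dropdowns
votes = [
    [Counter({'cat': 1}), Counter({'red': 1})],
    [Counter({'cat': 1}), Counter({'blue': 1})],
]
print(dropdown_reducer_sketch(votes))
# {'value': [{'cat': 2}, {'red': 1, 'blue': 1}]}
```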


TESS Column Reducer

This module provides functions to reduce the column task extracts for the TESS project. Extracts are from panoptes_aggregation.extractors.shape_extractor.

panoptes_aggregation.reducers.tess_reducer_column.process_data(data, **kwargs_extra_data)

Process a list of extractions into lists of x and y sorted by tool

Parameters:

data (list) – A list of extractions created by panoptes_aggregation.extractors.shape_extractor.shape_extractor()

Returns:

processed_data – A dictionary with two keys

  • data: An Nx2 numpy array containing the center and width of each column drawn

  • index: A list of length N indicating the extract index for each drawn column

Return type:

dict

panoptes_aggregation.reducers.tess_reducer_column.tess_reducer_column(data_by_tool, **kwargs)

Cluster TESS columns using DBSCAN

Parameters:

data_by_tool (dict) – A dictionary returned by process_data()

Returns:

reduction – A dictionary with the following keys

  • centers : A list with the center x position for all identified columns

  • widths : A list with the full width of all identified columns

  • counts : A list with the number of volunteers who identified each column

  • weighted_counts : A list with the weighted number of volunteers who identified each column

  • user_ids: A list of lists with the user_id for each volunteer who marked each column

  • max_weighted_counts: The largest likelihood of a transit for this subject

Return type:

dict


TESS Gold Standard Reducer

This module provides functions to reduce the gold standard task extracts for the TESS project.

panoptes_aggregation.reducers.tess_gold_standard_reducer.process_data(extracts)

Process the feedback extracts

Parameters:

extracts (list) – A list of extracts from Caesar’s pluck field extractor

Returns:

success – A list-of-lists, one list for each classification with booleans indicating the volunteer’s success at finding each gold standard transit in a subject.

Return type:

list

panoptes_aggregation.reducers.tess_gold_standard_reducer.tess_gold_standard_reducer(data)

Calculate the difficulty of a gold standard TESS subject

Parameters:

data (list) – The results of process_data()

Returns:

output – A dictionary with one key, difficulty, that is a list with the fraction of volunteers who successfully found each gold standard transit in a subject.

Return type:

dict


Utilities for polygon_reducer

This module provides utilities used to reduce the polygon extractions for panoptes_aggregation.reducers.polygon_reducer.

panoptes_aggregation.reducers.polygon_reducer_utils.IoU_cluster_mean_distance(distances_matrix)

The mean IoU_metric_polygon distance between the polygons of the cluster.

Parameters:

distances_matrix (numpy.ndarray) – A symmetric-square array, with the off-diagonal elements containing the IoU_metric_polygon distance between the cluster members. The diagonal elements are all zero. This is found using IoU_distance_matrix_of_cluster.

Returns:

distances_mean – The mean of the IoU_metric_polygon defined distance between the polygons of the cluster.

Return type:

float
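Since the diagonal of the distance matrix is all zeros, this reduces to averaging the off-diagonal entries. A minimal sketch (hypothetical helper, not the library's implementation):

```python
import numpy as np

def cluster_mean_distance_sketch(distances_matrix):
    """Sketch: mean of the off-diagonal entries of a symmetric
    square distance matrix."""
    d = np.asarray(distances_matrix, dtype=float)
    n = d.shape[0]
    if n < 2:
        return 0.0
    off_diagonal = d[~np.eye(n, dtype=bool)]
    return float(off_diagonal.mean())

m = np.array([[0.0, 0.2, 0.4],
              [0.2, 0.0, 0.6],
              [0.4, 0.6, 0.0]])
print(cluster_mean_distance_sketch(m))  # ≈ 0.4
```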

panoptes_aggregation.reducers.polygon_reducer_utils.IoU_distance_matrix_of_cluster(cdx, X, data)

Find distance matrix using IoU_metric_polygon for a cluster.

The cdx argument is used to define the cluster out of the full X and data data sets, which may also contain other polygons not in the cluster.

Parameters:
  • cdx (numpy.ndarray) – A 1D array of booleans, corresponding to the polygons in X and data which are in the cluster. True if in the cluster, False otherwise.

  • X (numpy.ndarray) – A 2D array with each row mapping to the data held in data. The first column contains row indices and the second column is an index assigned to each user.

  • data (list) – A list of dicts of the form {'polygon': shapely.geometry.polygon.Polygon, 'time': float, 'gold_standard': bool}. There is one element in this list for each member of the cluster.

Returns:

distances_matrix – A symmetric-square array, with the off-diagonal elements containing the IoU distance between the cluster members. The diagonal elements are all zero.

Return type:

numpy.ndarray

panoptes_aggregation.reducers.polygon_reducer_utils.IoU_metric_polygon(a, b, data_in=[])

Find the Intersection over Union (IoU) distance between two polygons. This is based on the Jaccard metric

To use this metric within the clustering code without having to precompute the full distance matrix a and b are index mappings to the data contained in data_in. a and b also contain the user information that is used to help prevent self-clustering. The polygons used to calculate the IoU distance are contained in data_in, along with the timestamp of creation.

Parameters:
  • a (list) – A two element list containing [index mapping to data, index mapping to user]

  • b (list) – A two element list containing [index mapping to data, index mapping to user]

  • data_in (list) – A list of dicts of the form {'polygon': shapely.geometry.polygon.Polygon, 'time': float, 'gold_standard': bool}. There is one element in this list for each classification made. The time should be a Unix timestamp float.

Returns:

distance – The IoU distance between the two polygons. 0 means the polygons are the same, 1 means the polygons don’t overlap, values in the middle mean partial overlap.

Return type:

float
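The distance itself is 1 - area(A∩B)/area(A∪B). The library computes it with shapely polygons; the sketch below (hypothetical helper) uses axis-aligned rectangles instead so it stays dependency-free while showing the same quantity.

```python
def iou_distance_rect(a, b):
    """Sketch of the IoU distance, 1 - area(A∩B)/area(A∪B), for
    axis-aligned rectangles given as (xmin, ymin, xmax, ymax).
    The real reducer uses shapely polygons."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return 1.0 - inter / union

print(iou_distance_rect((0, 0, 2, 2), (0, 0, 2, 2)))  # 0.0 (identical)
print(iou_distance_rect((0, 0, 2, 2), (3, 3, 5, 5)))  # 1.0 (disjoint)
print(iou_distance_rect((0, 0, 2, 2), (1, 0, 3, 2)))  # partial overlap
```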

panoptes_aggregation.reducers.polygon_reducer_utils.cluster_average_intersection(data, **kwargs)

Find the intersection of provided cluster data

Parameters:
  • data (list) – A list of dicts of the form {'polygon': shapely.geometry.polygon.Polygon, 'gold_standard': bool}. There is one element in this list for each classification made.

  • kwargs

    • created_at : A list of when the classifications were made. Not used in this average.

    • distance_matrix : A symmetric-square array, with the off-diagonal elements containing the IoU_metric_polygon distance between the cluster members. The diagonal elements are all zero. This is found using IoU_distance_matrix_of_cluster. Not used in this average.

Returns:

intersection_all – The shapely intersection of the shapely polygons in the cluster.

Return type:

shapely.geometry.polygon.Polygon

panoptes_aggregation.reducers.polygon_reducer_utils.cluster_average_intersection_contours(data, **kwargs)

Find contours of intersection as a list. Each item of the list will be the largest contour of i intersections, with the next item being the contour of i+1 intersections, etc. The intersection is where the polygons overlap. This is useful for plotting the uncertainty in the cluster.

The algorithm used is as follows. First find the largest simply-connected union polygon for the cluster and add it to the list intersection_contours. Next, every intersection of two polygons is found and made into a new shapely polygon. This makes a list of ‘level-2’ polygons, which may overlap. Then find the largest simply-connected union polygon of the level-2 polygons. This is the polygon of at least 2 intersections (i.e. the area where at least 2 volunteers agree). Add it to the list intersection_contours.

If there are multiple level-2 polygons that intersect, their intersections are found as a list. These are the level-3 polygons, as each is made from at least three intersections. Then find the largest simply-connected union polygon of the level-3 polygons. This is the polygon of at least 3 intersections (i.e. the area where at least 3 volunteers agree). Add it to the list intersection_contours.

Continue this process until either 10 iterations have been done, or only one unique intersection polygon remains.

Parameters:
  • data (list) – A list of dicts of the form {'polygon': shapely.geometry.polygon.Polygon, 'gold_standard': bool}. There is one element in this list for each classification made.

  • kwargs

    • created_at : A list of when the classifications were made. Not used in this average.

    • distance_matrix : A symmetric-square array, with the off-diagonal elements containing the IoU_metric_polygon distance between the cluster members. The diagonal elements are all zero. This is found using IoU_distance_matrix_of_cluster. Not used in this average.

Returns:

intersection_contours – List of shapely objects. Each shape at position i in the list is the largest simply-connected contour of at least i intersections.

Return type:

list

panoptes_aggregation.reducers.polygon_reducer_utils.cluster_average_intersection_contours_rasterisation(data, **kwargs)

Find contours of intersection as a list. Each item of the list will be the largest contour of i intersections, with the next item being the contour of i+1 intersections, etc. The intersection is where the polygons overlap. This is useful for plotting the uncertainty in the cluster.

This approach uses rasterisation to find the contours. A square grid, with the number of grid points along each of the two axis given by num_grid_points, is placed over the cluster. Then the number of polygon intersections in each grid square are counted. Contours are then made from this 2D surface of intersection counts.

This function has the advantage of being more efficient than cluster_average_intersection_contours when the number of polygons in the cluster is large (approximately when greater than 8). Equally if num_grid_points is small, say 10, then rasterisation is faster in most cases but gives poorer quality contours with increased risk of grid-spacing based artifacts.

The resulting contours are smoothed by default.

Parameters:
  • data (list) – A list of dicts of the form {'polygon': shapely.geometry.polygon.Polygon, 'gold_standard': bool}. There is one element in this list for each classification made.

  • kwargs

    • num_grid_points: The number of grid points per axis. A larger number means greater resolution, but takes longer. Default is 100.

    • smoothing: A string to choose the type of smoothing used. If ‘minimal_sides’, the number of sides of the contour is minimised. If ‘rounded’, corners are rounded. If ‘no_smoothing’, no smoothing is done. Defaults to ‘minimal_sides’.

    • created_at : A list of when the classifications were made. Not used in this average.

    • distance_matrix : A symmetric-square array, with the off-diagonal elements containing the IoU_metric_polygon distance between the cluster members. The diagonal elements are all zero. This is found using IoU_distance_matrix_of_cluster. Not used in this average.

Returns:

intersection_contours – List of shapely objects. Each shape at position i in the list is the largest simply-connected contour of at least i intersections.

Return type:

list

panoptes_aggregation.reducers.polygon_reducer_utils.cluster_average_last(data, **kwargs)

Find the last created polygon of provided cluster data

Parameters:
  • data (list) – A list of dicts of the form {'polygon': shapely.geometry.polygon.Polygon, 'gold_standard': bool}. There is one element in this list for each classification made.

  • kwargs

    • created_at : A list of when the classifications were made.

    • distance_matrix : A symmetric-square array, with the off-diagonal elements containing the IoU_metric_polygon distance between the cluster members. The diagonal elements are all zero. This is found using IoU_distance_matrix_of_cluster. Not used in this average.

Returns:

last – The last created shapely polygon in the cluster.

Return type:

shapely.geometry.polygon.Polygon

panoptes_aggregation.reducers.polygon_reducer_utils.cluster_average_median(data, **kwargs)

Find the ‘median’ of provided cluster data, i.e. the polygon of the cluster with the minimum total distance to the other polygons.

Parameters:
  • data (list) – A list of dicts of the form {'polygon': shapely.geometry.polygon.Polygon, 'gold_standard': bool}. There is one element in this list for each classification made.

  • kwargs

    • created_at : A list of when the classifications were made. Not used in this average.

    • distance_matrix : A symmetric-square array, with the off-diagonal elements containing the IoU_metric_polygon distance between the cluster members. The diagonal elements are all zero. This is found using IoU_distance_matrix_of_cluster.

Returns:

median – The ‘median’ polygon in the cluster.

Return type:

shapely.geometry.polygon.Polygon
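Given the precomputed distance matrix, picking the 'median' member reduces to an argmin over row sums. A minimal sketch (hypothetical helper returning the index of the median member rather than the polygon itself):

```python
import numpy as np

def cluster_median_index_sketch(distance_matrix):
    """Sketch: the 'median' member is the one whose total distance to
    all other members is smallest (argmin of the row sums)."""
    return int(np.asarray(distance_matrix).sum(axis=1).argmin())

m = np.array([[0.0, 0.1, 0.9],
              [0.1, 0.0, 0.8],
              [0.9, 0.8, 0.0]])
print(cluster_median_index_sketch(m))  # 1 (row sums: 1.0, 0.9, 1.7)
```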

panoptes_aggregation.reducers.polygon_reducer_utils.cluster_average_union(data, **kwargs)

Find the union of provided cluster data

Parameters:
  • data (list) – A list of dicts of the form {'polygon': shapely.geometry.polygon.Polygon, 'gold_standard': bool}. There is one element in this list for each classification made.

  • kwargs

    • created_at : A list of when the classifications were made. Not used in this average.

    • distance_matrix : A symmetric-square array, with the off-diagonal elements containing the IoU_metric_polygon distance between the cluster members. The diagonal elements are all zero. This is found using IoU_distance_matrix_of_cluster. Not used in this average.

Returns:

union_all – The shapely union of the shapely polygons in the cluster.

Return type:

shapely.geometry.polygon.Polygon


Utilities for optics_line_text_reducer

This module provides utilities used to reduce the polygon-text extractions for panoptes_aggregation.reducers.optics_line_text_reducer. It assumes that all extracts are full lines of text in the document.

panoptes_aggregation.reducers.optics_text_utils.cluster_of_one(X, data, user_ids, extract_index)

Create “clusters of one” out of the data passed in. Lines of text identified as noise are kept around as clusters of one so they can be displayed in the front-end to the next user.

Parameters:
  • X (list) – An nx2 list with each row containing [index mapping to data, index mapping to user]

  • data (list) – A list containing dictionaries with the original data that X maps to, of the form {‘x’: [start_x, end_x], ‘y’: [start_y, end_y], ‘text’: [‘text for line’], ‘gold_standard’: bool}.

  • user_ids (list) – A list of user_ids (The second column of X maps to this list)

  • extract_index (list) – A list of n values with the extract index for each of rows in X

Returns:

clusters – A list with n clusters each containing only one classification

Return type:

list

panoptes_aggregation.reducers.optics_text_utils.get_min_samples(N)

Get the min_samples attribute based on the number of users who have transcribed the subject. These values were found based on example data from ASM.

Parameters:

N (integer) – The number of users who have seen the subject

Returns:

min_samples – The value to use for the min_samples keyword in OPTICS

Return type:

integer

panoptes_aggregation.reducers.optics_text_utils.metric(a, b, data_in=[])

Calculate the distance between two drawn lines that have text associated with them. This distance is found by summing the Euclidean distance between the start points of each line, the Euclidean distance between the end points of each line, and the Levenshtein distance of the text for each line. The Levenshtein distance is done after stripping text tags and consolidating whitespace.

To use this metric within the clustering code without having to precompute the full distance matrix, a and b are index mappings to the data contained in data_in. a and b also contain the user information that is used to help prevent self-clustering.

Parameters:
  • a (list) – A two element list containing [index mapping to data, index mapping to user]

  • b (list) – A two element list containing [index mapping to data, index mapping to user]

  • data_in (list) – A list of dicts of the form {'x': [start_x, end_x], 'y': [start_y, end_y], 'text': ['text for line'], 'gold_standard': bool}. There is one element in this list for each classification made.

Returns:

distance – The distance between a and b

Return type:

float
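The three-term sum above can be sketched directly (hypothetical helpers; the sketch skips the tag-stripping and user-mapping machinery and takes the two line dicts themselves):

```python
from math import dist

def levenshtein(s, t):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

def line_metric_sketch(line_a, line_b):
    """Sketch: start-point distance + end-point distance + edit distance."""
    start = dist((line_a['x'][0], line_a['y'][0]),
                 (line_b['x'][0], line_b['y'][0]))
    end = dist((line_a['x'][1], line_a['y'][1]),
               (line_b['x'][1], line_b['y'][1]))
    return start + end + levenshtein(line_a['text'][0], line_b['text'][0])

a = {'x': [0, 10], 'y': [0, 0], 'text': ['hello world']}
b = {'x': [3, 10], 'y': [4, 0], 'text': ['hello word']}
print(line_metric_sketch(a, b))  # 6.0: starts 5 apart, ends equal, 1 edit
```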

panoptes_aggregation.reducers.optics_text_utils.order_lines(frame_in, angle_eps=30, gutter_eps=150)

Place the identified lines within a single frame in reading order

Parameters:
  • frame_in (list) – A list of identified transcribed lines (one frame from panoptes_aggregation.reducers.optics_line_text_reducer.optics_line_text_reducer)

  • angle_eps (float) – The DBSCAN eps value to use for the slope clustering

  • gutter_eps (float) – The DBSCAN eps value to use for the column clustering

Returns:

frame_ordered – The identified transcribed lines in reading order. The slope_label and gutter_label values are added to each line to indicate what cluster it belongs to.

Return type:

list

panoptes_aggregation.reducers.optics_text_utils.remove_user_duplication(labels_, core_distances_, users)

Make sure a user shows up in a cluster at most once. If a user does show up more than once in a cluster, keep the point with the smallest core distance and assign all others as noise (-1).

Parameters:
  • labels_ (numpy.array) – A list containing the cluster labels for each data point

  • core_distances_ (numpy.array) – A list of core distance for each data point

  • users (numpy.array) – A list of indices that map to users, one for each data point

Returns:

clean_labels_ – A list containing the new cluster labels.

Return type:

numpy.array
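The de-duplication rule can be sketched as follows (hypothetical helper, not the library's implementation): for each cluster, any user appearing more than once keeps only their lowest-core-distance point.

```python
import numpy as np

def remove_user_duplication_sketch(labels_, core_distances_, users):
    """Sketch: within each cluster, keep only a user's closest point
    and relabel that user's other points as noise (-1)."""
    clean = np.array(labels_)
    for label in np.unique(clean[clean > -1]):
        for user in np.unique(users[clean == label]):
            idx = np.where((clean == label) & (users == user))[0]
            if len(idx) > 1:
                keep = idx[np.argmin(core_distances_[idx])]
                clean[idx] = -1
                clean[keep] = label
    return clean

labels = np.array([0, 0, 0, 1])
core = np.array([0.5, 0.1, 0.3, 0.2])
users = np.array([7, 7, 8, 7])
print(remove_user_duplication_sketch(labels, core, users))  # [-1  0  0  1]
```

Here user 7 appears twice in cluster 0, so only their closer point (core distance 0.1) keeps the label.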

panoptes_aggregation.reducers.optics_text_utils.strip_tags(s)

Remove square bracket tags from text and consolidate whitespace

Parameters:

s (string) – The input string

Returns:

clean_s – The cleaned string

Return type:

string
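A minimal sketch of this cleaning (hypothetical helper; the exact tag grammar the library accepts may differ): drop `[tag]`-style markup with a regex, then collapse runs of whitespace.

```python
import re

def strip_tags_sketch(s):
    """Sketch: drop [tag]-style markup and consolidate whitespace."""
    no_tags = re.sub(r'\[[^\]]*\]', ' ', s)
    return ' '.join(no_tags.split())

result = strip_tags_sketch('Deare  [unclear]Sir[/unclear]  greetings')
print(result)  # Deare Sir greetings
```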


Line Tool with Text Subtask Reducer using OPTICS

This module provides functions to reduce the polygon-text extractions from panoptes_aggregation.extractors.poly_line_text_extractor using the density independent clustering algorithm OPTICS. It is assumed that all extracts are full lines of text in the document.

panoptes_aggregation.reducers.optics_line_text_reducer.optics_line_text_reducer(data_by_frame, **kwargs_optics)

Reduce the line-text extracts as a list of lines of text.

Parameters:
  • data_by_frame (dict) – A dictionary returned by process_data()

  • kwargs

    • See OPTICS

    • min_samples : The smallest number of transcribed lines needed to form a cluster. auto will set this value based on the number of volunteers who transcribed on a page within a subject.

    • xi : Determines the minimum steepness on the reachability plot that constitutes a cluster boundary.

    • angle_eps : How close the angle of two lines need to be in order to be placed in the same angle cluster. Note: This will only change the order of the lines.

    • gutter_eps : How close the x position of the start of two lines need to be in order to be placed in the same column cluster. Note: This will only change the order of the lines.

    • min_line_length : The minimum length a transcribed line of text needs to be in order to be used in the reduction.

    • low_consensus_threshold : The minimum consensus score allowed to be considered “done”.

    • minimum_views : A value that is passed along to the front-end to set when lines should turn grey (has no effect on aggregation)

Returns:

reduction – A dictionary with one key for each frame of the subject; each value is a list. Each item of the list represents one transcribed line of text and is a dictionary with these keys:

  • clusters_x : the x position of each identified word

  • clusters_y : the y position of each identified word

  • clusters_text : A list of lists containing the text at each cluster position. There is one list for each identified word, and each of those lists contains one item for each user who identified the cluster. If the user did not transcribe the word an empty string is used.

  • line_slope: The slope of the line of text in degrees

  • number_views : The number of users that transcribed the line of text

  • consensus_score : The average number of users whose text agreed for the line. Note: if consensus_score is the same as number_views, every user agreed with each other.

  • user_ids: List of panoptes user ids in the same order as clusters_text

  • gold_standard: List of bools indicating if a transcription was made in the front-end's gold standard mode

  • slope_label: integer indicating what slope cluster the line belongs to

  • gutter_label: integer indicating what gutter cluster (i.e. column) the line belongs to

  • low_consensus : True if the consensus_score is less than the threshold set by the low_consensus_threshold keyword

For the entire subject the following is also returned:

  • low_consensus_lines : The number of lines with low consensus

  • transcribed_lines : The total number of lines transcribed on the subject

Note: the image coordinate system has y increasing downward.

Return type:

dict

panoptes_aggregation.reducers.optics_line_text_reducer.process_data(data_list, min_line_length=0.0)

Process a list of extractions into a dictionary organized by frame

Parameters:

data_list (list) – A list of extractions created by panoptes_aggregation.extractors.poly_line_text_extractor.poly_line_text_extractor()

Returns:

processed_data – A dictionary with one key for each frame of the subject. The value for each key is a dictionary with two keys X and data. X is a 2D array with each row mapping to the data held in data. The first column contains row indices and the second column is an index assigned to each user. data is a list of dictionaries of the form {‘x’: [start_x, end_x], ‘y’: [start_y, end_y], ‘text’: [‘text for line’], ‘gold_standard’: bool}.

Return type:

dict


Text Tool Reducer

This module provides functions to reduce the panoptes text tool into an alignment table.

panoptes_aggregation.reducers.text_reducer.process_data(data_list)

Flatten list of extracts into a list of strings. Empty strings are not returned

panoptes_aggregation.reducers.text_reducer.text_reducer(data_in, **kwargs)

Reduce a list of text into an alignment table

Parameters:

data_in (list) – A list of strings to be aligned

Returns:

reduction – A dictionary with the following keys:

  • aligned_text: A list of lists containing the aligned text. There is one list for each identified word, and each of those lists contains one item for each user that entered text. If the user did not transcribe a word an empty string is used.

  • number_views: Number of volunteers who entered non-blank text

  • consensus_score: The average number of users whose text agreed. Note: if consensus_score is the same as number_views, every user agreed with each other

Return type:

dict


First N True Reducer

This module is designed to reduce boolean-valued extracts, e.g. panoptes_aggregation.extractors.all_tasks_empty_extractor. It returns True if and only if the first N extracts are True.

panoptes_aggregation.reducers.first_n_true_reducer.first_n_true_reducer(data_list, n=0, **kwargs)

Reduce a list of boolean values to a single boolean value.

Parameters:
  • data_list (list) – A list of dicts containing a “result” key which should correspond with a boolean value.

  • n (int) – The first n results in data_list must be True.

Returns:

reduction – reduction[“result”] is True if the first n results in data_list are True. Otherwise False.

Return type:

dict
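The rule can be sketched in a few lines (hypothetical helper; treating fewer than n available extracts as False is an assumption):

```python
def first_n_true_sketch(data_list, n=0, **kwargs):
    """Sketch: True only when the first n extracts all report result=True.

    Assumption: fewer than n extracts yields False."""
    if len(data_list) < n:
        return {'result': False}
    return {'result': all(d.get('result') is True for d in data_list[:n])}

extracts = [{'result': True}, {'result': True}, {'result': False}]
print(first_n_true_sketch(extracts, n=2))  # {'result': True}
print(first_n_true_sketch(extracts, n=3))  # {'result': False}
```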