Target Text Extractor

Introduction

Target Text Extractor is a software tool for finding and extracting specific parts of a text and performing a few additional operations on the result. It can be helpful in many situations, both in scholarly work and in everyday life. With Target Text Extractor, a biologist can extract all gene names from a genome annotation file, a programmer can extract text literals from a source code file, a soccer fan can find the results of a particular team in a listing of all matches, and so on.

The app comes in two versions (modes): simple and advanced. The simple mode is intended for the general public and requires no knowledge of regular expressions, a pattern-matching notation widely used in programming. The advanced mode requires some knowledge of regular expressions, at least in some situations. However, the concept of regular expressions is relatively simple and can be picked up quickly by an interested user without any previous programming experience. In the simple mode, all text entered in the "PRECEDED BY" or "FOLLOWED BY" fields is taken literally, exactly as it appears. In the advanced mode, the app tries to interpret the entered text as a regular expression, i.e. it looks for so-called special characters that define text patterns rather than literal text (e.g. "\d" stands for "any digit").
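The difference between the two interpretations can be illustrated with a short Python sketch. It is not the app's own code; it only assumes that the simple mode behaves like a literal search and the advanced mode like a regular expression search, and the sample text is made up for illustration.

```python
import re

# Illustrative input; the entered string is what a user might type into a field.
text = r"Use \d to match digits such as 4 or 2"
entered = r"\d"

# Literal interpretation (simple-mode style): special characters are escaped,
# so the two characters backslash + d are searched for as they appear.
literal_hits = re.findall(re.escape(entered), text)

# Regular expression interpretation (advanced-mode style): "\d" means "any digit".
regex_hits = re.findall(entered, text)

print(literal_hits)  # ['\\d']
print(regex_hits)    # ['4', '2']
```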

Simple Mode

In the simple mode, the user is expected to:

  • Paste a text into the "Input Text" field or open a plain text file containing the text.
  • Specify the desired target text by filling in one or more of the "PRECEDED BY" or "FOLLOWED BY" fields.
  • Optionally, narrow down the target text specification by checking specific groups of characters instead of "any characters" in the "TEXT TO EXTRACT CONSISTS OF" field (the "any characters" option must be unchecked first).
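The steps above can be sketched in Python as follows. This is only an approximation under stated assumptions: the group names in CHARACTER_GROUPS and the function extract_between are hypothetical and do not come from the app, and the app may define its character groups differently.

```python
import re

# Hypothetical character groups standing in for the checkboxes of the
# "TEXT TO EXTRACT CONSISTS OF" field; the real app may offer other groups.
CHARACTER_GROUPS = {"letters": "A-Za-z", "digits": "0-9", "spaces": " "}

def extract_between(line, preceded_by, followed_by, groups=None):
    """Return the text found between two literal strings in one line.

    With groups=None the target may consist of any characters; otherwise
    it may only contain characters from the selected groups.
    """
    if groups:
        body = "[" + "".join(CHARACTER_GROUPS[g] for g in groups) + "]*?"
    else:
        body = ".*?"
    pattern = re.escape(preceded_by) + "(" + body + ")" + re.escape(followed_by)
    return [m.group(1) for m in re.finditer(pattern, line)]

line = "id=123; id=abc; id=45;"
print(extract_between(line, "id=", ";"))                     # ['123', 'abc', '45']
print(extract_between(line, "id=", ";", groups=["digits"]))  # ['123', '45']
```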

The text extraction procedure is triggered by clicking the Search button. For the extraction to succeed, it is critical to set the options in the lower part of the Settings panel properly, i.e. whether the search is case-sensitive, whether the text used for the search is itself included in the results, and how the program should handle ambiguous and multiple matches. Ambiguous matches occur when the text defining either the start or the end of the match occurs more than once in one line. In such situations, the program needs to know whether it should extract the shortest or the longest possible match. Multiple matches are non-nested, non-overlapping occurrences of the text pattern defined by the "PRECEDED BY", "FOLLOWED BY" and "TEXT TO EXTRACT CONSISTS OF" fields. The program can extract either the first match only or all non-overlapping matches.
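The effect of these options can be approximated in Python with lazy and greedy quantifiers. The sketch below is an assumption about how such settings could behave, not the app's implementation; the function extract and its parameter names are hypothetical.

```python
import re

def extract(line, preceded_by, followed_by, shortest=True,
            include_delimiters=False, all_matches=True, case_sensitive=True):
    """Extract text between two literal delimiters in a single line."""
    before = re.escape(preceded_by)
    after = re.escape(followed_by)
    # A lazy quantifier stops at the first possible end (shortest match);
    # a greedy quantifier runs to the last possible end (longest match).
    body = ".*?" if shortest else ".*"
    if include_delimiters:
        pattern = f"({before}{body}{after})"
    else:
        pattern = f"{before}({body}){after}"
    flags = 0 if case_sensitive else re.IGNORECASE
    # finditer yields non-overlapping matches from left to right.
    matches = [m.group(1) for m in re.finditer(pattern, line, flags)]
    return matches if all_matches else matches[:1]

line = 'name="alpha" id="beta"'
print(extract(line, '"', '"'))                           # ['alpha', 'beta']
print(extract(line, '"', '"', shortest=False))           # ['alpha" id="beta']
print(extract(line, '"', '"', include_delimiters=True))  # ['"alpha"', '"beta"']
print(extract(line, '"', '"', all_matches=False))        # ['alpha']
```

Here the quotation mark occurs four times in the line, so the match is ambiguous: the shortest-match setting yields two separate results, while the longest-match setting yields a single result spanning from the first to the last quotation mark.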

Advanced Mode

In the advanced mode, the basic workflow is the same as in the simple mode, but there are more options, and by default all character strings entered in the Settings panel are interpreted as regular expressions. This behavior can be changed by checking the "auto-escape special characters" checkboxes located next to the "PRECEDED BY" and "FOLLOWED BY" fields; the character strings are then searched for exactly as they appear. The expected character composition of the target text has to be defined in the "TEXT TO EXTRACT" section, either by selecting one of the pre-defined options from the drop-down menu or by defining your own extraction pattern after selecting one of the last two options. In addition to the settings described in the Simple Mode section, the advanced mode lets you choose whether the lines in the results start with numbers corresponding to the order of the positive (matching) lines in the input, whether negative (non-matching) lines are reported, and whether whole negative lines are output instead of matches (an inverse search).
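A minimal Python sketch of these advanced options is given below. It rests on several assumptions: the function search_lines and its parameters are hypothetical, the numbers are taken to be the positions of the lines in the input, and only the first match per line is reported. The app itself may behave differently in each of these respects.

```python
import re

def search_lines(text, pattern, auto_escape=False, number_lines=False, inverse=False):
    """Search a text line by line and return matches or non-matching lines."""
    if auto_escape:
        # Treat the entered string literally instead of as a regular expression.
        pattern = re.escape(pattern)
    regex = re.compile(pattern)
    results = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = regex.search(line)
        if inverse:
            # Inverse search: output whole lines that do NOT match.
            hit = line if match is None else None
        else:
            hit = match.group(0) if match else None
        if hit is not None:
            results.append(f"{number}: {hit}" if number_lines else hit)
    return results

text = "gene1 abc\nnothing here\ngene22 xyz"
print(search_lines(text, r"gene\d+"))                     # ['gene1', 'gene22']
print(search_lines(text, r"gene\d+", number_lines=True))  # ['1: gene1', '3: gene22']
print(search_lines(text, r"gene\d+", inverse=True))       # ['nothing here']
print(search_lines(text, r"gene\d+", auto_escape=True))   # []
```

The last call returns nothing because, with auto-escaping, the literal characters "gene\d+" are searched for, and they do not occur in the sample text.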

Additional Operations

In addition to searching for and extracting the target text, the app offers further operations on the input and output texts. The input text can be broken into lines at occurrences of a certain character or sequence of characters. This is done with the Break Line After Each button after entering the dividing character(s) in the adjacent field. Because the text extraction procedure analyzes the text line by line, it may be necessary to break the input text into lines first. For example, a text paragraph can be broken into individual sentences using ". " (a dot followed by a space) or even into single words using " " (a space character). In the advanced mode, the "break after characters" are interpreted as a regular expression.
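Line breaking of this kind can be sketched in Python as shown below. The function break_after is hypothetical, and the sketch assumes that the dividing characters are kept at the end of each line, as the button name suggests; the app's exact handling is not confirmed here.

```python
import re

def break_after(text, separator, as_regex=False):
    """Insert a line break after every occurrence of the separator."""
    # In the literal case, special characters such as "." must be escaped.
    pattern = separator if as_regex else re.escape(separator)
    return re.sub(pattern, lambda m: m.group(0) + "\n", text)

# Literal separator: a dot followed by a space splits a paragraph into sentences.
print(break_after("One. Two. Three.", ". ").splitlines())
# ['One. ', 'Two. ', 'Three.']

# Regular expression separator: break after each run of digits.
print(break_after("a1b22c", r"\d+", as_regex=True).splitlines())
# ['a1', 'b22', 'c']
```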

The output text can be modified using the buttons located in the Output Text field header. Redundant identical lines can be removed with the Remove Duplicate Lines button. The Sort button sorts the output lines in alphabetical, descending order. To reverse the order, click the Reverse button. You can also put the lines in random order with the Shuffle button. The Select All button selects the entire content of the Output Text field (equivalent to clicking into the field and pressing Ctrl+A). The content of the Output Text field can be saved to a text file by clicking the Save to File button. Finally, the content of the Output Text field can be joined into a single line of text with the Join All in One Line button. Before clicking the button, enter the desired separator in the adjacent input field.
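For readers who want to reproduce these output operations outside the app, the following Python sketch shows rough equivalents. It is an illustration only: the sample lines and the ", " separator are made up, and the sort uses Python's default ordering, which may differ from the ordering and collation rules the app applies.

```python
import random

lines = ["pear", "apple", "pear", "fig"]

# Remove Duplicate Lines: keep the first occurrence of each line, preserving order.
unique = list(dict.fromkeys(lines))

# Sort: Python's default string ordering (an assumption, see the note above).
ordered = sorted(unique)

# Reverse: flip the current order.
reversed_lines = ordered[::-1]

# Shuffle: return the same lines in random order without changing the original list.
shuffled = random.sample(ordered, k=len(ordered))

# Join All in One Line: combine the lines using a chosen separator.
joined = ", ".join(ordered)

print(unique)          # ['pear', 'apple', 'fig']
print(ordered)         # ['apple', 'fig', 'pear']
print(reversed_lines)  # ['pear', 'fig', 'apple']
print(joined)          # apple, fig, pear
```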