Utils

PyDetex https://github.com/ppizarror/PyDetex

UTILS: Module that contains all utility methods and classes used in parsers and pipelines, covering tex, language, and low-level helpers.

class pydetex.utils.LangTexTextTags

Stores the tex tags for several commands.

get(lang, tag)

Retrieves a language tag value.

Parameters:
  • lang (str) – Language

  • tag (str) – Tag to retrieve

Return type:

str

Returns:

Value of the language’s tag

class pydetex.utils.ProgressBar(steps, size=15)

Basic progress bar implementation.

detail_times()

Print times.

Return type:

None

reset()

Reset the steps.

Return type:

None

update(status='', print_total_time=True)

Update the current status to a new step.

Parameters:
  • status (str) – Status text

  • print_total_time (bool) – Prints total computing time

Return type:

None

pydetex.utils.apply_tag_between_inside_char_command(s, symbols_char, tags)

Apply tag between symbols.

For example, if symbols are ($, $) and tag is [1,2,3,4]:

Input: This is a $formula$ and this is not.
Output: This is a 1$2formula3$4 and this is not
Parameters:
Return type:

str

Returns:

String with tags
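
The tagging described above can be illustrated with a minimal, self-contained sketch. It is not PyDetex's implementation: the helper name `tag_between_symbols` is hypothetical, it handles a single symbol pair, and it ignores escaped symbols.

```python
from typing import Tuple


def tag_between_symbols(s: str, open_sym: str, close_sym: str,
                        tags: Tuple[str, str, str, str]) -> str:
    """Wrap each open_sym...close_sym span with four tags (hypothetical helper)."""
    out = []
    i = 0
    while i < len(s):
        start = s.find(open_sym, i)
        if start == -1:
            break
        end = s.find(close_sym, start + len(open_sym))
        if end == -1:
            break  # unbalanced symbol: leave the rest untouched
        out.append(s[i:start])
        out.append(tags[0] + open_sym + tags[1])
        out.append(s[start + len(open_sym):end])
        out.append(tags[2] + close_sym + tags[3])
        i = end + len(close_sym)
    out.append(s[i:])
    return ''.join(out)


tagged = tag_between_symbols('This is a $formula$ and this is not.',
                             '$', '$', ('1', '2', '3', '4'))
print(tagged)  # This is a 1$2formula3$4 and this is not.
```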

pydetex.utils.apply_tag_tex_commands(s, tags)

Apply tag to tex command.

For example, if tag is [1,2,3,4,5]:

Input: This is a \formula{epic} and this is not
Output: This is a 1\formula2{3epic4}5 and this is not
Parameters:
Return type:

str

Returns:

Code with tags
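
As a rough illustration of the five-tag layout, the sketch below reproduces the example with a regular expression. It is an assumption-laden stand-in, not the library's code: `tag_commands` is a hypothetical name, and only commands with one brace argument and no nested braces are handled.

```python
import re
from typing import Tuple

# Matches \name{argument} where the argument contains no braces.
_COMMAND = re.compile(r'(\\[A-Za-z]+)\{([^{}]*)\}')


def tag_commands(s: str, tags: Tuple[str, str, str, str, str]) -> str:
    """Insert five tags around a command and its argument (hypothetical helper)."""
    t0, t1, t2, t3, t4 = tags

    def repl(m: 're.Match') -> str:
        # t0 \name t1 { t2 argument t3 } t4
        return t0 + m.group(1) + t1 + '{' + t2 + m.group(2) + t3 + '}' + t4

    return _COMMAND.sub(repl, s)


print(tag_commands(r'This is a \formula{epic} and this is not',
                   ('1', '2', '3', '4', '5')))
# This is a 1\formula2{3epic4}5 and this is not
```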

pydetex.utils.apply_tag_tex_commands_no_argv(s, tags)

Apply tag to tex command.

For example, if tag is [1,2]:

Input: This is a \formula and this is not.
Output: This is a 1\formula2 and this is not
Parameters:
Return type:

str

Returns:

Code with tags
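
The two-tag variant can be sketched the same way. This is only an illustration under stated assumptions: `tag_bare_commands` is a hypothetical name, and a command counts as having no arguments when it is not followed by `{` or `[`.

```python
import re
from typing import Tuple

# A command name that is not followed by more letters, '{' or '['.
_BARE_COMMAND = re.compile(r'(\\[A-Za-z]+)(?![A-Za-z\[{])')


def tag_bare_commands(s: str, tags: Tuple[str, str]) -> str:
    """Wrap argument-less commands with two tags (hypothetical helper)."""
    return _BARE_COMMAND.sub(lambda m: tags[0] + m.group(1) + tags[1], s)


print(tag_bare_commands(r'This is a \formula and this is not.', ('1', '2')))
# This is a 1\formula2 and this is not.
```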

pydetex.utils.check_repeated_words(s, lang, min_chars, window, stopwords, stemming, ignore=None, remove_tokens=None, font_tag_format='', font_param_format='', font_normal_format='', tag='repeated')

Check repeated words.

Parameters:
  • s (str) – Text

  • lang (str) – Language code

  • min_chars (int) – Minimum number of characters for a word to be accepted

  • window (int) – Number of consecutive words (window) to check for repeats

  • stopwords (bool) – Use stopwords

  • stemming (bool) – Use stemming

  • ignore (Optional[List[str]]) – Ignore a list of words

  • remove_tokens (Optional[List[str]]) – Tokens to remove before checking for repeats

  • font_tag_format (str) – Tag’s format

  • font_param_format (str) – Param’s format

  • font_normal_format (str) – Normal’s format

  • tag (str) – Tag’s name

Return type:

str

Returns:

Text with repeated words marked

pydetex.utils.complete_langs_dict(lang)

Completes a language dict. Assumes 'en' is the main language.

Parameters:

lang (Dict[str, Dict[str, str]]) – Language dict

Return type:

None
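
One plausible reading of "completes" is that entries missing from a language are filled in with the English ones. The sketch below shows that reading only; it is an assumption, and `complete_from_english` is a hypothetical name rather than the library function.

```python
from typing import Dict


def complete_from_english(langs: Dict[str, Dict[str, str]]) -> None:
    """Fill keys missing in each language with the 'en' value, in place (assumed behavior)."""
    base = langs['en']
    for code, entries in langs.items():
        if code == 'en':
            continue
        for key, value in base.items():
            entries.setdefault(key, value)


langs = {
    'en': {'hello': 'Hello', 'bye': 'Goodbye'},
    'es': {'hello': 'Hola'},
}
complete_from_english(langs)
print(langs['es'])  # {'hello': 'Hola', 'bye': 'Goodbye'}
```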

pydetex.utils.detect_language(s)

Detects the language of a string.

Parameters:

s (str) – String

Return type:

str

Returns:

Detected language

pydetex.utils.find_tex_command_char(s, symbols_char)

Find symbols command positions.

Example:

       00000000001111111111....
       01234567890123456789....
Input: This is a $formula$ and this is not.
Output: ((10, 11, 17, 18), ...)
Parameters:
  • s (str) – Latex string code

  • symbols_char (List[Tuple[str, str, bool]]) – Symbols to check [(initial, final, ignore escape), ...]

Return type:

Tuple[Tuple[int, int, int, int], ...]

Returns:

Positions
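
The four positions in the example are the opening symbol, the first character after it, the last character before the closing symbol, and the closing symbol. A minimal sketch that reproduces them follows. The names are hypothetical, and the meaning of the boolean flag (skip symbols preceded by a backslash) is an assumption, not taken from the library.

```python
from typing import List, Tuple


def _find_unescaped(s: str, sym: str, start: int, skip_escaped: bool) -> int:
    """Find sym from start, optionally skipping occurrences preceded by a backslash."""
    i = s.find(sym, start)
    while i > 0 and skip_escaped and s[i - 1] == '\\':
        i = s.find(sym, i + 1)
    return i


def find_symbol_positions(
        s: str, symbols: List[Tuple[str, str, bool]]
) -> Tuple[Tuple[int, int, int, int], ...]:
    """Return (open, content start, content end, close) for each symbol pair found."""
    found = []
    for open_sym, close_sym, skip_escaped in symbols:
        i = 0
        while True:
            start = _find_unescaped(s, open_sym, i, skip_escaped)
            if start == -1:
                break
            end = _find_unescaped(s, close_sym, start + len(open_sym), skip_escaped)
            if end == -1:
                break
            found.append((start, start + len(open_sym),
                          end - 1, end + len(close_sym) - 1))
            i = end + len(close_sym)
    return tuple(sorted(found))


print(find_symbol_positions('This is a $formula$ and this is not.', [('$', '$', True)]))
# ((10, 11, 17, 18),)
```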

pydetex.utils.find_tex_commands(s, offset=0)

Find all tex commands within a code.

         00000000001111111111222
         01234567890123456789012
                 a       b c  d
Example: This is \aCommand{nice}...
Output: ((8, 16, 18, 21), ...)
Parameters:
  • s (str) – Latex string code

  • offset (int) – Offset added to the positioning, useful when using recursive calling on substrings

Return type:

Tuple[Tuple[int, int, int, int, bool], ...]

Returns:

Tuple of found commands (a, b, c, d, command continues)
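
To make the four positions concrete, here is a simplified sketch that reproduces the example output with a regular expression. It is not the library's parser: `find_simple_commands` is a hypothetical name, nested braces and optional arguments are not handled, and the fifth "command continues" element is omitted.

```python
import re
from typing import Tuple

# \name{argument}, where the argument contains no braces.
_COMMAND = re.compile(r'\\([A-Za-z]+)\{([^{}]*)\}')


def find_simple_commands(s: str, offset: int = 0) -> Tuple[Tuple[int, int, int, int], ...]:
    """Return (a, b, c, d) per command: backslash, end of name, start and end of argument."""
    out = []
    for m in _COMMAND.finditer(s):
        a = m.start()         # position of the backslash
        b = m.end(1) - 1      # last character of the command name
        c = m.start(2)        # first character of the argument
        d = m.end(2) - 1      # last character of the argument
        out.append((a + offset, b + offset, c + offset, d + offset))
    return tuple(out)


print(find_simple_commands(r'This is \aCommand{nice}...'))
# ((8, 16, 18, 21),)
```

The `offset` argument simply shifts every position, which is what makes recursive calls on substrings report positions relative to the full string.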

pydetex.utils.find_tex_commands_noargv(s)

Find all tex commands with no arguments within a code.

         00000000001111111111222
         01234567890123456789012
                 x       x
Example: This is \aCommand ...
Output: ((8, 16), ...)
Parameters:

s (str) – Latex string code

Return type:

Tuple[Tuple[int, int], ...]

Returns:

Tuple of found commands

pydetex.utils.find_tex_environments(s)

Find all tex environments within a code.

Example:

         0000000000111111111122222222223333333333
         0123456789012345678901234567890123456789
                 a           b        c         d
Example: This is \begin{nice}[cmd]my...\end{nice}
Output: (('nice', 8, 20, 29, 39, 'parentenv', 0, -1), ...)

This method also returns the name of the parent environment, the depth of the environment, and the depth of the item environment (if itemizable).

Parameters:

s (str) – Latex string code

Return type:

Tuple[Tuple[str, int, int, int, int, str, int, int], ...]

Returns:

Tuple of found environments (env_name, a, b, c, d, parent_env_name, env_depth, env_item_depth)

pydetex.utils.format_number_d(n, c)

Formats a number by grouping its digits in thousands.

Parameters:
  • n (int) – Number

  • c (str) – Format char

Return type:

str

Returns:

Formatted number
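
Thousands grouping of this kind can be sketched with Python's built-in format specification. The helper name `format_thousands` is hypothetical, and treating the format char as the group separator is an assumption.

```python
def format_thousands(n: int, sep: str) -> str:
    """Group digits in thousands using sep as the separator (assumed meaning of c)."""
    # format(n, ',') yields e.g. '1,234,567'; swap the comma for the requested char.
    return format(n, ',').replace(',', sep)


print(format_thousands(1234567, '.'))  # 1.234.567
```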

pydetex.utils.get_diff_startend_word(original, new)

Return the difference of the word from start and end, for example:

original XXXwordYY
new         word
diff = (XXX, YY)
Parameters:
  • original (str) – Original word

  • new (str) – New word

Return type:

Tuple[str, str]

Returns:

Diff word
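
The example above amounts to locating the new word inside the original and returning what surrounds it. A minimal sketch follows, with a hypothetical name; raising `ValueError` when the word is not contained is an assumption of the sketch, not documented behavior.

```python
from typing import Tuple


def diff_start_end(original: str, new: str) -> Tuple[str, str]:
    """Return the text before and after the first occurrence of new in original."""
    pos = original.find(new)
    if pos == -1:
        # Assumption: the sketch rejects words that are not contained in original.
        raise ValueError('new word is not contained in the original word')
    return original[:pos], original[pos + len(new):]


print(diff_start_end('XXXwordYY', 'word'))  # ('XXX', 'YY')
```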

pydetex.utils.get_language_name(tag, lang='')

Returns a language name from its tag.

Parameters:
  • tag (str) – Language tag (ISO 639)

  • lang (str) – Target language (ISO 639). If not supported, will return the English name

Return type:

str

Returns:

Language name

pydetex.utils.get_local_path()
Return type:

str

Returns:

Returns the app local path

pydetex.utils.get_number_of_day()

Return the number of the day from the current year.

Return type:

int

Returns:

Day number
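
For reference, the day number within the year can be obtained from the standard library as sketched below. The helper name and its optional `today` parameter are hypothetical, added only so the example is deterministic.

```python
import datetime
from typing import Optional


def day_of_year(today: Optional[datetime.date] = None) -> int:
    """Return the 1-based day number within the year (hypothetical helper)."""
    if today is None:
        today = datetime.date.today()
    return today.timetuple().tm_yday


print(day_of_year(datetime.date(2024, 3, 1)))  # 61, since 2024 is a leap year
```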

pydetex.utils.get_tex_commands_args(s, pos=False)

Get all the arguments from a tex command. Each command argument has a boolean indicating whether it is optional.

Example: This is \aCommand[\label{}]{nice} and...
Output: (('aCommand', ('\label{}', True), ('nice', False)), ...)
Parameters:
  • s (str) – Latex string code

  • pos (bool) – Append the numerical position within the original string as the last element

Return type:

Tuple[Tuple[Union[str, Tuple[str, bool], Tuple[int, int]], ...], ...]

Returns:

Arguments

pydetex.utils.get_word_from_cursor(s, pos)

Return the word of a string at a given cursor position.

Parameters:
  • s (str) – String

  • pos (int) – Position to check the string

Return type:

Tuple[str, int, int]

Returns:

Word, position start, position end
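
A simple way to picture this lookup is to expand left and right from the cursor until whitespace is reached. The sketch below does exactly that. The name `word_at`, the whitespace-only delimiter, and the exclusive end position are all assumptions and may differ from the library.

```python
from typing import Tuple


def word_at(s: str, pos: int) -> Tuple[str, int, int]:
    """Return (word, start, end) for the whitespace-delimited word under pos."""
    if not 0 <= pos < len(s) or s[pos].isspace():
        return '', pos, pos  # cursor is outside the string or on whitespace
    start = pos
    while start > 0 and not s[start - 1].isspace():
        start -= 1
    end = pos
    while end < len(s) and not s[end].isspace():
        end += 1
    return s[start:end], start, end  # end is exclusive in this sketch


print(word_at('This is nice', 6))  # ('is', 5, 7)
```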

pydetex.utils.make_stemmer(lang)

Returns a stemmer.

Parameters:

lang (str) – Lang code

Return type:

Optional[SnowballStemmer]

Returns:

Stemmer or None if not available

pydetex.utils.open_file(f)

Opens a file and returns its content as a string.

Parameters:

f (str) – Filename

Return type:

str

Returns:

File content

pydetex.utils.split_tags(s, tags)

Split a string based on tags; each resulting segment is paired with its tag.

String format: [TAG1]new line[TAG2]this is[TAG1]very epic

Output: [('TAG1', 'new line'), ('TAG2', 'this is'), ('TAG1', 'very epic'), ...]

Parameters:
  • s (str) – String

  • tags (List[str]) – Tag list

Return type:

List[Tuple[str, str]]

Returns:

Split tags
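
The splitting shown in the example can be sketched as follows. This is an illustration only: `split_by_tags` is a hypothetical name, and it assumes the tag list holds bare names that appear in the text wrapped in square brackets.

```python
import re
from typing import List, Tuple


def split_by_tags(s: str, tags: List[str]) -> List[Tuple[str, str]]:
    """Split s at [TAG] markers and pair each segment with the tag that precedes it."""
    if not tags:
        return []
    # One alternation of the literal markers, captured so split() keeps them.
    pattern = re.compile('(' + '|'.join(re.escape('[' + t + ']') for t in tags) + ')')
    out = []
    current = None
    for part in pattern.split(s):
        if pattern.fullmatch(part):
            current = part[1:-1]  # strip the brackets to recover the tag name
        elif current is not None and part:
            out.append((current, part))
    return out


print(split_by_tags('[TAG1]new line[TAG2]this is[TAG1]very epic', ['TAG1', 'TAG2']))
# [('TAG1', 'new line'), ('TAG2', 'this is'), ('TAG1', 'very epic')]
```

Text that appears before the first marker has no tag and is dropped in this sketch.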

pydetex.utils.syntax_highlight(s)

Syntax highlighter.

Parameters:

s (str) – Latex string code

Return type:

str

Returns:

Code with format

pydetex.utils.tex_to_unicode(s)

Transforms tex code to unicode.

Parameters:

s (str) – Latex string code

Return type:

str

Returns:

Text in unicode

pydetex.utils.tokenize(s)

Tokenize a given word.

Parameters:

s (str) – Word

Return type:

str

Returns:

Tokenized word

pydetex.utils.validate_float(p)

Validate a float.

Parameters:

p (str) – Value

Return type:

bool

Returns:

True if float

pydetex.utils.validate_int(p)

Validate an integer.

Parameters:

p (str) – Value

Return type:

bool

Returns:

True if integer
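
String validators of this kind are commonly written by attempting the conversion and catching the failure. The sketch below shows that pattern with hypothetical names; the library's own rules (for example, how it treats empty or partial input) may differ.

```python
def is_int(p: str) -> bool:
    """Return True if p parses as an integer (sketch, not the library function)."""
    try:
        int(p)
        return True
    except ValueError:
        return False


def is_float(p: str) -> bool:
    """Return True if p parses as a float (sketch, not the library function)."""
    try:
        float(p)
        return True
    except ValueError:
        return False


print(is_int('42'), is_int('4.2'), is_float('4.2'), is_float('abc'))
# True False True False
```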