cookbase.parsers

Module contents

A package that includes different parsing tools used in the context of the Cookbase platform.

Submodules

cookbase.parsers.jsonfoodex

Parsing suite for the Cookbase platform that converts FoodEx2 data into JSON documents.

The main command, parsexml, translates FoodEx2 XML data losslessly into a collection of JSON documents. It can also filter out selected hierarchies, together with the ingredients that belong only to those hierarchies. Field contents are parsed into Python built-in types (str, int and bool). The original ordering and format are preserved, but the following particularities of the JSON mapping should be noted:

  • The JSON output represents the content of the root <catalogue> tag.
  • The <hierarchyGroups> tag is mapped into a JSON object that holds an array with the text from each contained <hierarchyGroup> tag.
  • The <hierarchyAssignment> tag is mapped into a JSON object whose key is the <hierarchyCode> tag content, and the value is a JSON document including all its data.
  • The <implicitAttribute> tag is mapped into a JSON object whose key is the <attributeCode> tag content, and the value is an array with the text from each contained <attributeValue> tag.
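As an illustration of the last rule, the following is a minimal sketch and not the module's actual code. It uses the standard-library xml.etree.ElementTree; the XML fragment, its values and the helper name map_implicit_attribute are hypothetical, and real FoodEx2 data may nest these tags differently.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical fragment: the tag names come from the description above,
# the code and values are invented for illustration only.
SAMPLE = """
<implicitAttribute>
  <attributeCode>exampleCode</attributeCode>
  <attributeValue>value one</attributeValue>
  <attributeValue>value two</attributeValue>
</implicitAttribute>
"""


def map_implicit_attribute(elem: ET.Element) -> dict:
    """Map one <implicitAttribute> element to {attributeCode: [values...]}."""
    code = elem.findtext("attributeCode")
    # iter() walks all descendants, so the nesting depth of
    # <attributeValue> does not matter for this sketch.
    values = [value.text for value in elem.iter("attributeValue")]
    return {code: values}


mapped = map_implicit_attribute(ET.fromstring(SAMPLE))
print(json.dumps(mapped))
```

The key of the resulting object is the attribute code and the value is the array of attribute values, kept in document order.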

The -d/--discardedhierarchies option takes a list of hierarchy codes to discard, together with the terms that relate only to those hierarchies. If the option is not used, all hierarchies not directly related to food preparation are discarded by default: botanic, pest, biomo, legis, feed, partcon, place, vetdrug, report, fpurpose, replev, targcon and feedAddExpo. To keep every hierarchy, pass -d/--discardedhierarchies with no hierarchy codes.

The -cb/--cookbase flag generates an identifier (_id) suitable for the Cookbase platform for each catalogue term.
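The behaviour of the two options can be reproduced with a small argparse sketch. This is an assumption about how such options could be declared, not the module's actual parser; the names build_parser and DEFAULT_DISCARDED and the program name are hypothetical.

```python
import argparse

# Default list taken from the description above.
DEFAULT_DISCARDED = [
    "botanic", "pest", "biomo", "legis", "feed", "partcon", "place",
    "vetdrug", "report", "fpurpose", "replev", "targcon", "feedAddExpo",
]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mimics the documented -d and -cb options."""
    parser = argparse.ArgumentParser(prog="parsexml-sketch")
    # nargs="*" lets -d appear with zero codes, which yields an empty list.
    parser.add_argument(
        "-d", "--discardedhierarchies",
        nargs="*",
        default=DEFAULT_DISCARDED,
        metavar="CODE",
        help="hierarchy codes to discard",
    )
    parser.add_argument(
        "-cb", "--cookbase",
        action="store_true",
        help="generate Cookbase-style _id values",
    )
    return parser


parser = build_parser()
omitted = parser.parse_args([])                       # default list applies
keep_all = parser.parse_args(["-d"])                  # nothing is discarded
custom = parser.parse_args(["-d", "feed", "pest", "-cb"])
```

With this declaration, omitting -d selects the default list, a bare -d produces an empty list (no hierarchy is discarded), and -d followed by codes discards only those codes.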

The hierarchize command builds a JSON document describing a hierarchy tree.

cookbase.parsers.jsonfoodex.hierarchize(args: argparse.Namespace) → None

Generates a JSON document describing a hierarchy tree.

Parameters:args (argparse.Namespace) – Command-line arguments
cookbase.parsers.jsonfoodex.parsexml(args: argparse.Namespace) → None

Parses FoodEx2 XML data into a collection of JSON documents.

Parameters:args (argparse.Namespace) – Command-line arguments

cookbase.parsers.termcode

A module for generating numeric identifiers from FoodEx2 term code strings and translating them back. A term code is a string of five alphanumeric characters, e.g. 'A111J'. Although most term codes start with the character A, this module does not require it.

cookbase.parsers.termcode.to_int(code: str) → int

Function generating a numeric identifier from a FoodEx2 term code.

Parameters:code (str) – FoodEx2 term code
Returns:A numeric translation of the term code
Return type:int
cookbase.parsers.termcode.to_str(code: int) → str

Function generating a FoodEx2 term code from its numeric representation.

Parameters:code (int) – A numeric identifier
Returns:A string translation in the form of a FoodEx2 term code
Return type:str
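
The encoding used by to_int and to_str is not described here. One plausible scheme, shown purely as an assumption, reads the five characters as a base-36 number (digits 0-9 followed by letters A-Z). The function names term_code_to_int and int_to_term_code are hypothetical and chosen to avoid implying they match the module's implementation.

```python
import string

ALPHABET = string.digits + string.ascii_uppercase  # 36 symbols: 0-9, A-Z
BASE = len(ALPHABET)
CODE_LENGTH = 5


def term_code_to_int(code: str) -> int:
    """Read a five-character term code as a base-36 number (assumed scheme)."""
    if len(code) != CODE_LENGTH or any(ch not in ALPHABET for ch in code):
        raise ValueError(f"not a valid term code: {code!r}")
    return int(code, BASE)


def int_to_term_code(number: int) -> str:
    """Inverse of term_code_to_int, left-padded with '0' to five characters."""
    if not 0 <= number < BASE ** CODE_LENGTH:
        raise ValueError(f"out of range for a term code: {number}")
    chars = []
    for _ in range(CODE_LENGTH):
        # Peel off the least significant base-36 digit each iteration.
        number, digit = divmod(number, BASE)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


round_trip = int_to_term_code(term_code_to_int("A111J"))
```

Under this scheme the mapping is a bijection between five-character codes and the integers from 0 to 36**5 - 1, so a code survives a round trip unchanged regardless of its first character.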

cookbase.parsers.utils

cookbase.parsers.utils.check_for_duplicate_keys(ordered_pairs: List[Tuple[Hashable, Any]]) → Dict[KT, VT]

Checks for duplicates on the keys of a JSON object.

The function is intended to be used as the object_pairs_hook argument of json.load().

Parameters:ordered_pairs (list[tuple[Hashable, Any]]) – A list of key-value pairs representing all the content of a JSON object
Returns:A dictionary containing the JSON document
Return type:dict[str, Any]
Raises:ValueError: There is at least one duplicate key in the JSON object.
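
The object_pairs_hook mechanism can be sketched as follows. The hook below is a stand-in written for illustration, not the module's own function; its name reject_duplicate_keys is hypothetical, and json.loads is used instead of json.load so that the example needs no file.

```python
import json
from typing import Any, Dict, Hashable, List, Tuple


def reject_duplicate_keys(
    ordered_pairs: List[Tuple[Hashable, Any]]
) -> Dict[Hashable, Any]:
    """Build a dict from key-value pairs, failing on the first repeated key."""
    result: Dict[Hashable, Any] = {}
    for key, value in ordered_pairs:
        if key in result:
            raise ValueError(f"duplicate key: {key!r}")
        result[key] = value
    return result


# The decoder calls the hook once for every JSON object, nested ones included.
clean = json.loads('{"a": 1, "b": {"c": 2}}', object_pairs_hook=reject_duplicate_keys)

try:
    json.loads('{"a": 1, "a": 2}', object_pairs_hook=reject_duplicate_keys)
    duplicate_detected = False
except ValueError:
    duplicate_detected = True
```

Without such a hook, the json module silently keeps the last value for a repeated key, which is why a check like this is useful when validating documents.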
cookbase.parsers.utils.parse_cbr(path: str) → Dict[str, Any]

Parses a Cookbase Recipe (CBR).

Parameters:path (str) – The path to the CBR document
Returns:A dictionary containing the parsed CBR
Return type:dict[str, Any]
cookbase.parsers.utils.populate_collection(collection_dir: str, object_type: str) → None

Bulk inserts CBDM objects into collections.

Parameters:
  • collection_dir (str) – The local path to the directory containing the objects to insert
  • object_type (str) – The type of object to insert into collection