ZeffClient YAML Example

In this example we will create a record builder that reads a YAML file for the information needed to create each record.

QuickStart

This quickstart downloads the example archive, extracts it, changes into the new directory, and runs a script in that directory that performs the rest of the example. When the script runs the zeff init command you will be asked some questions: enter the org_id and user_id that you received from Zeff, and for all other questions you may accept the defaults by pressing Enter.

Steps

  1. Download: zeffclient_example_yaml.tar.bz2

  2. Decompress: tar -xjf zeffclient_example_yaml.tar.bz2

  3. Change directory: cd zeffclient_example_yaml

  4. Run quickstart script: ./quickstart.sh

How it Works

Project Directory

The quickstart.sh script sets up a virtual environment in .venv inside the project directory and installs ZeffClient into it. The environment may be activated at any time with source .venv/bin/activate.

The steps taken to set up the directory are:

  1. python -m venv .venv

  2. source .venv/bin/activate

  3. pip install --upgrade pip

  4. python -m pip install ZeffClient

The main command to work with ZeffClient is zeff. To quickly see what options and subcommands are available use zeff --help.

Record Config Generator

The generator.HousePriceRecordGenerator in generator.py will yield, for each properties record in the properties.yml file, a URL that identifies the file and the id of that record. This example yields a URL, but the configuration is not limited to a URL and could be a string, file, etc.

For this particular example there is only one properties record in properties.yml and the URL returned is

file:///<root>/properties.yml?id=1395678

The <root> is the path to the example directory on your drive.
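
The shape of that URL can be illustrated with a minimal sketch that uses only the standard library. The function name iter_record_urls, the /root path, and the bedrooms field are hypothetical stand-ins, and the records are passed in as an already-parsed list; the actual generator.py may be organized differently.

```python
from typing import Iterable, Iterator, Mapping
from urllib.parse import urlencode, urlunsplit


def iter_record_urls(yaml_path: str, records: Iterable[Mapping]) -> Iterator[str]:
    """Yield one file URL per record, carrying the record id in the query."""
    for rec in records:
        query = urlencode({"id": rec["id"]})
        # An empty network location plus an absolute path gives the file:/// form.
        yield urlunsplit(("file", "", yaml_path, query, ""))


# Hypothetical stand-in for the parsed contents of properties.yml.
records = [{"id": 1395678, "bedrooms": 3}]
urls = list(iter_record_urls("/root/properties.yml", records))
print(urls[0])
```

Keeping the file location in the URL path and the record id in the query lets the record builder recover both from a single string.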

To test the generator by itself use the command ./generator.py or python generator.py.

Record Builder

The builder.HousePriceRecordBuilder in builder.py will take the configuration string given by the record config generator and will yield a record.

The file builder.py may be executed from the command line directly, and has a basic command line interface using argparse. This will aid you in writing and debugging your record builder, because you may work with a single record without needing to run the entire ZeffClient system.
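
As a rough illustration of such a command line interface, the sketch below parses a record configuration string and a model flag with argparse. The argument names record_config and --model are assumptions chosen to mirror the builder's parameters; the options that builder.py actually accepts are not shown here, so check python builder.py --help.

```python
import argparse


def parse_args(argv):
    """Parse the arguments for a single-record debugging run (hypothetical CLI)."""
    parser = argparse.ArgumentParser(
        description="Build a single record for debugging."
    )
    parser.add_argument(
        "record_config",
        help="configuration string, e.g. a URL yielded by the generator",
    )
    parser.add_argument(
        "--model",
        action="store_true",
        help="pass model=True to the builder instead of the default False",
    )
    return parser.parse_args(argv)


# Passing an explicit argument list keeps the example independent of sys.argv.
args = parse_args(["--model", "file:///root/properties.yml?id=1395678"])
print(args.model, args.record_config)
```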

The module uses the zeffclient.record.builder logger to indicate various stages of the record building process. You should also use this logger while building records for error reporting, warnings, information, and debugging.
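
A minimal sketch of using that logger from your own builder code follows. Only the logger name comes from this document; the handler, the format string, and the messages are illustrative assumptions, and the output is captured in memory so that it can be inspected.

```python
import io
import logging

# Logger name as used by the example builder module.
LOGGER = logging.getLogger("zeffclient.record.builder")

# Capture output in memory; a real script would usually log to stderr or a file.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
LOGGER.addHandler(handler)
LOGGER.setLevel(logging.DEBUG)

LOGGER.info("Begin building record from %s", "1395678")
LOGGER.warning("Record %s has no images", "1395678")

output = stream.getvalue()
print(output, end="")
```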

The file builder.py has a class HousePriceRecordBuilder where all the code to build a new record for house prices is contained. This class will create a callable object that takes a model flag and the configuration string yielded by the record config generator. It has three steps: create a new record, add structured data to the record, and add unstructured data to the record.

    def __call__(self, model: bool, record_config: str) -> Optional[Record]:
        """Build and return a record.

        :param model: Flag to indicate if the record builder is building
            records for training or for prediction. If model is true then
            it is for prediction, but if false then it is for training and
            any records not to be used for training should be filtered.

        :param record_config: Record configuration string created by
            the record configuration generator.
        """
        urlparts = urllib.parse.urlsplit(record_config)
        path = pathlib.Path(urlparts[2])
        id = urlparts[3].split("=")[1]
        LOGGER.info("Begin building ``HousePrice`` record from %s", id)
        record = Record(name=id)
        target = self.add_structured_data(record, path, id)
        if not model and not target:
            return None
        self.add_unstructured_data(record, path.parent, id)
        LOGGER.info("End building ``HousePrice`` record from %s", id)
        return record
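
The URL handling in the first lines of this method can be tried in isolation. The sketch below is a variation rather than the author's code: it uses the named attributes of the urlsplit result and parse_qs instead of index access and split("="), and the helper name parse_record_config and the /root path are hypothetical.

```python
import pathlib
import urllib.parse


def parse_record_config(record_config: str):
    """Split a record configuration URL into a file path and a record id."""
    urlparts = urllib.parse.urlsplit(record_config)
    # PurePosixPath keeps the example independent of the host platform.
    path = pathlib.PurePosixPath(urlparts.path)
    params = urllib.parse.parse_qs(urlparts.query)
    record_id = params["id"][0]  # always a string, never an int
    return path, record_id


path, record_id = parse_record_config("file:///root/properties.yml?id=1395678")
print(path, path.parent, repr(record_id))
```

The id recovered from the URL is a string. If properties.yml happened to store ids as integers (an assumption, since the file is not shown here), an equality test against this string would not match without a conversion.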

Adding structured data is done by loading properties.yml, selecting the record whose id matches, and then converting each field of that record (except id) into a structured data item.

    def add_structured_data(self, record, path, id):
        target_record = False
        row = None
        with open(path, "r") as ymlstream:
            row = [r for r in yaml.load(ymlstream, Loader=yaml.Loader) if r["id"] == id]
            if len(row) == 0:
                return
            row = row[0]

        # Process each field in the record except for `id` and
        # add it as a structured data to the record object
        for key in row.keys():
            if key == "id":
                continue
            value = row[key]

            # Is the column a continuous or category datatype
            if isinstance(value, (int, float)):
                dtype = DataType.CONTINUOUS
            else:
                dtype = DataType.CATEGORY

            # Is this a target field
            if key in ["estimate_mortgage"] and value is not None:
                target = Target.YES
                target_record = True
            else:
                target = Target.NO

            # Create the structured data item and add it to the record
            sd = StructuredData(name=key, value=value, data_type=dtype, target=target)
            sd.record = record

        return target_record
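
To experiment with the field classification without ZeffClient installed, the same decisions can be sketched with plain strings standing in for DataType and Target. The function classify_fields, the TARGET_FIELDS set, and the bedrooms and style fields are hypothetical; only id and estimate_mortgage appear in the listing above.

```python
# Field names treated as prediction targets, as in the listing above.
TARGET_FIELDS = {"estimate_mortgage"}


def classify_fields(row):
    """Return (name, data type, is_target) for every field except ``id``."""
    items = []
    for key, value in row.items():
        if key == "id":
            continue
        # Numbers are continuous; everything else is treated as a category.
        dtype = "CONTINUOUS" if isinstance(value, (int, float)) else "CATEGORY"
        is_target = key in TARGET_FIELDS and value is not None
        items.append((key, dtype, is_target))
    return items


# Hypothetical record; real field names come from properties.yml.
row = {"id": 1395678, "bedrooms": 3, "style": "ranch", "estimate_mortgage": 1250.0}
items = classify_fields(row)
has_target = any(is_target for _, _, is_target in items)
print(items, has_target)
```

One detail to keep in mind is that bool is a subclass of int in Python, so a true/false field would be classified as continuous by this isinstance test.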

Adding unstructured data is done by searching the images_<id> directory next to properties.yml for jpeg files and then creating an unstructured data item for each one.

    def add_unstructured_data(self, record, path, id):

        img_path = path / f"images_{id}"

        # Process each jpeg file in the image path, create an
        # unstructured data, and add that to the record data object.
        for p in img_path.glob("**/*.jpeg"):
            url = f"file://{p}"
            file_type = FileType.IMAGE
            group_by = "home_photo"
            ud = UnstructuredData(url, file_type, group_by=group_by)
            ud.record = record
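
The file discovery step can be tried on its own with a temporary directory. This sketch assumes nothing about ZeffClient: the helper image_urls and the file names are hypothetical, and it uses Path.as_uri() where the listing above formats the URL with an f-string.

```python
import pathlib
import tempfile


def image_urls(root: pathlib.Path, record_id: str):
    """Return sorted file URLs for every .jpeg under images_<record_id>."""
    img_path = root / f"images_{record_id}"
    # "**/*.jpeg" matches files in img_path itself and in any subdirectory.
    return sorted(p.as_uri() for p in img_path.glob("**/*.jpeg"))


with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp).resolve()  # as_uri() requires an absolute path
    img_dir = root / "images_1395678"
    (img_dir / "exterior").mkdir(parents=True)
    (img_dir / "front.jpeg").touch()
    (img_dir / "exterior" / "yard.jpeg").touch()
    (img_dir / "notes.jpg").touch()  # different extension, so not matched
    urls = image_urls(root, "1395678")

print(len(urls))
```

The pattern matches only the .jpeg extension, so images saved as .jpg are skipped unless they are renamed or the pattern is widened.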