ZeffClient CSV Example

In this example we will create a record builder that reads a CSV file for the information needed to create a record.

QuickStart

This quickstart will download the example archive, unpack it, change into the new directory, and then run a script in that directory that will do the rest of the example. When the script runs the zeff init command you will be asked some questions. You will need to enter the org_id and user_id that you received from Zeff; for all other questions you may accept the defaults by pressing Enter.

Steps

  1. Download: zeffclient_example_csv.tar.bz2

  2. Decompress: tar -xjf zeffclient_example_csv.tar.bz2

  3. Change directory: cd zeffclient_example_csv

  4. Run quickstart script: ./quickstart.sh

How it Works

Project Directory

The project directory has a virtual environment set up in .venv by the quickstart.sh script, with ZeffClient installed in it. You may activate this environment at any time with source .venv/bin/activate.

The steps taken to set up the directory are:

  1. python -m venv .venv

  2. source .venv/bin/activate

  3. pip install --upgrade pip

  4. python -m pip install ZeffClient

The main command to work with ZeffClient is zeff. To quickly see what options and subcommands are available use zeff --help.

Record Config Generator

The generator.HousePriceRecordGenerator in generator.py will yield, for each properties record in the properties.csv file, a URL that identifies the file and the id of that record. For this example a URL is returned, but the configuration is not limited to a URL and could be a string, file, etc.

For this particular example there is only one properties record in properties.csv and the URL returned is

file:///<root>/properties.csv?id=1395678

The <root> is the path to the example directory on your drive.

To test the generator by itself use the command ./generator.py or python generator.py.
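
As a rough illustration of how such a generator might work, the sketch below reads a CSV file and yields one file URL per row. The function name generate_record_configs is hypothetical and the real generator.HousePriceRecordGenerator may be structured differently; only the properties.csv file name, its id column, and the shape of the URL are taken from this example.

```python
import csv
import pathlib
import tempfile


def generate_record_configs(csv_path):
    """Yield one file URL per CSV row, with the row id in the query string.

    Hypothetical helper for illustration only; not part of ZeffClient.
    """
    csv_path = pathlib.Path(csv_path).resolve()
    with open(csv_path, "r", newline="") as csvfile:
        for row in csv.DictReader(csvfile):
            yield f"{csv_path.as_uri()}?id={row['id']}"


# Demonstration with a throwaway CSV file that holds a single row.
# The bedrooms column is made up for this sketch.
with tempfile.TemporaryDirectory() as tmp:
    sample = pathlib.Path(tmp) / "properties.csv"
    sample.write_text("id,bedrooms\n1395678,3\n")
    configs = list(generate_record_configs(sample))

print(configs)
```

Each yielded string has the same shape as the URL shown above, with the temporary directory taking the place of <root>.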

Record Builder

The builder.HousePriceRecordBuilder in builder.py will take the configuration string given by the record config generator and will yield a record.

The file builder.py may be executed from the command line directly, and has a basic command line interface using argparse. This will aid you in writing and debugging your record builder, because you may work with a single record without needing to run the entire ZeffClient system.

The module uses the zeffclient.record.builder logger to indicate various stages of the record building process. You should also use this logger while building records for error reporting, warnings, information, and debugging.
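
A minimal sketch of using that logger from your own builder code is shown here. The logger name comes from the paragraph above; the report_missing_row helper and its message are hypothetical.

```python
import logging

# Logger name as given in the text above.
LOGGER = logging.getLogger("zeffclient.record.builder")


def report_missing_row(record_id):
    """Hypothetical helper: warn that no CSV row matches the id."""
    LOGGER.warning("No properties row found for id %s", record_id)
    return None


# Send log messages to the console so they are visible when a
# builder is run on its own from the command line.
logging.basicConfig(level=logging.DEBUG)
report_missing_row("1395678")
```

Passing the id as a logging argument, rather than formatting it into the string first, defers the formatting until the message is actually emitted.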

The file builder.py has a class HousePriceRecordBuilder that contains all the code to build a new record for house prices. An instance of this class is a callable object that takes a model flag and the configuration string yielded by the record config generator. Building has three steps: create a new record, add structured data to the record, and add unstructured data to the record.

    def __call__(self, model: bool, record_config: str) -> Optional[Record]:
        """Build and return a record.

        :param model: Flag to indicate if the record builder is building
            records for training or for prediction. If model is true then
            it is for prediction, but if false then it is for training and
            any records not to be used for training should be filtered.

        :param record_config: Record configuration string created by
            the record configuration generator.
        """
        urlparts = urllib.parse.urlsplit(record_config)
        path = pathlib.Path(urlparts[2])
        id = urlparts[3].split("=")[1]
        LOGGER.info("Begin building ``HousePrice`` record from %s", id)
        record = Record(name=id)
        target = self.add_structured_data(record, path, id)
        if not model and not target:
            return None
        self.add_unstructured_data(record, path.parent, id)
        LOGGER.info("End building ``HousePrice`` record from %s", id)
        return record
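
To see what the URL parsing at the start of __call__ produces, the sketch below splits a configuration string of the form shown earlier. The directory /home/user/example stands in for <root> and is an assumption. Note that urllib.parse.parse_qs is used here as a slightly more tolerant alternative to splitting the query on "="; it is not what the example code does.

```python
import pathlib
import urllib.parse

# Assumed configuration string; /home/user/example stands in for <root>.
record_config = "file:///home/user/example/properties.csv?id=1395678"

urlparts = urllib.parse.urlsplit(record_config)

# Index 2 of the split result is the path and index 3 is the query.
path = pathlib.PurePosixPath(urlparts.path)
query = urllib.parse.parse_qs(urlparts.query)
record_id = query["id"][0]

print(path)         # /home/user/example/properties.csv
print(path.parent)  # /home/user/example
print(record_id)    # 1395678
```

The parent directory of the path is what the builder later searches for image files.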

Adding structured data is done by reading the properties.csv file, selecting the row whose id matches, and then converting each column of that row (except id) into a structured data item.

    def add_structured_data(self, record, path, id):
        target_record = False
        row = None
        with open(path, "r", newline="") as csvfile:
            row = [r for r in csv.DictReader(csvfile) if r["id"] == id]
            if len(row) == 0:
                # No row matches this id, so there is no target either.
                return False
            row = row[0]

        # Process each field in the record except for `id` and
        # add it as a structured data to the record object.
        for key in row.keys():
            if key == "id":
                continue
            value = row[key]

            # Is the column a continuous or category datatype?
            # csv.DictReader returns every value as a string, so try
            # a numeric conversion instead of checking the type.
            try:
                value = float(value)
                dtype = DataType.CONTINUOUS
            except (TypeError, ValueError):
                dtype = DataType.CATEGORY

            # Is this a target field? An empty cell means no target.
            if key in ["estimate_mortgage"] and value not in (None, ""):
                target = Target.YES
                target_record = True
            else:
                target = Target.NO

            # Create the structured data item and attach it to the
            # record object
            sd = StructuredData(name=key, value=value, data_type=dtype, target=target)
            sd.record = record

        return target_record
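
Because csv.DictReader returns every field as a string, it can help to look at what the row lookup yields before data types are assigned. The sketch below uses an in-memory CSV; the bedrooms and city columns and all values are made up (only id and estimate_mortgage appear in the example), and the find_row and is_number helpers are hypothetical.

```python
import csv
import io

# Made-up sample data; the second row has an empty target column.
SAMPLE = (
    "id,bedrooms,city,estimate_mortgage\n"
    "1395678,3,Springfield,1200.50\n"
    "42,2,Shelbyville,\n"
)


def find_row(csvfile, record_id):
    """Return the first row whose id column matches, or None."""
    for row in csv.DictReader(csvfile):
        if row["id"] == record_id:
            return row
    return None


def is_number(value):
    """Return True if the string can be read as a float."""
    try:
        float(value)
    except (TypeError, ValueError):
        return False
    return True


row = find_row(io.StringIO(SAMPLE), "1395678")
kinds = {
    key: "continuous" if is_number(value) else "category"
    for key, value in row.items()
    if key != "id"
}
print(kinds)
```

A row such as the one with id 42 has an empty string in estimate_mortgage, so it carries no usable target value.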

Adding unstructured data is done by searching the images_<id> directory next to the CSV file for JPEG files and then creating an unstructured data item for each file found.

    def add_unstructured_data(self, record, path, id):

        img_path = path / f"images_{id}"

        # Process each jpeg file in the image path, create an
        # unstructured data, and add it to the record object.
        for p in img_path.glob("**/*.jpeg"):
            url = f"file://{p}"
            file_type = FileType.IMAGE
            group_by = "home_photo"
            ud = UnstructuredData(url, file_type, group_by=group_by)
            ud.record = record
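
The directory layout this method expects can be tried out on its own. The sketch below builds a temporary images_<id> directory with empty placeholder files and collects a file URL for each match. The file names are made up, and Path.as_uri() is used in place of the f-string as one way to get a well-formed URL from an absolute path.

```python
import pathlib
import tempfile

record_id = "1395678"

with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp).resolve()
    img_path = root / f"images_{record_id}"

    # Made-up layout: one image at the top level, one in a subdirectory.
    (img_path / "kitchen").mkdir(parents=True)
    (img_path / "front.jpeg").touch()
    (img_path / "kitchen" / "sink.jpeg").touch()
    (img_path / "notes.txt").touch()  # skipped: not a .jpeg file

    # "**/*.jpeg" matches files in img_path and in any subdirectory.
    urls = sorted(p.as_uri() for p in img_path.glob("**/*.jpeg"))

print(urls)
```

The pattern only matches the .jpeg extension, so files named with .jpg would be skipped unless the pattern is widened.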