Items and Users Properties

Item and user metadata are called “properties”. For instance, an item might be represented as follows:

{
  "item_id": "57243",
  "category_id": 3,
  "genre": "drama",
  "tags": ["family", "sci-fi"],
  "price": 9.99,
  "summary": "An eccentric yet compassionate extraterrestrial Time Lord zips through time and space [...]",
  "poster": "https://www.themoviedb.org/tv/57243-doctor-who.jpg"
}

Why and How?

Using rich properties gives two advantages:

  • it improves the recommendations, especially for the item and user cold-start problems, where the algorithm relies only on properties (such as Semantic Graph Embedding from genres and tags, or Deep Content Extraction from text and images)

  • it enables your client to dynamically filter the recommendations to items satisfying certain criteria (such as a price below a threshold given at runtime, or a geo-location close to given coordinates; see Filtering on Item Property)

As in a SQL database, properties must be defined before you can insert items or users with these properties. The API does not automatically create a new property when it detects an unknown key in an API request. This choice surfaces development errors, such as a misspelled property key, as soon as they occur.
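The “define first, insert after” rule can be illustrated with a small local check. This is only a sketch of the behaviour described above, not the API client: the names declared_properties and check_item are hypothetical, and the value types shown are assumed spellings of the types listed under Property Types.

```python
# Hypothetical local model of "declare properties first, insert items after".
# Nothing here calls the real API; names and value types are illustrative.
declared_properties = {
    "category_id": "uint32",
    "genre": "unicode",
    "price": "float32",
}


def check_item(item, declared=declared_properties, id_key="item_id"):
    """Raise KeyError if the item uses a property that was never declared."""
    unknown = sorted(k for k in item if k != id_key and k not in declared)
    if unknown:
        raise KeyError(f"undeclared properties: {unknown}")
    return item


# An item using only declared properties passes the check.
check_item({"item_id": "57243", "genre": "drama", "price": 9.99})

# A misspelled key is rejected immediately instead of silently
# creating a new "gener" property.
try:
    check_item({"item_id": "57243", "gener": "drama"})
    rejected = None
except KeyError as err:
    rejected = str(err)
```

Rejecting the unknown key up front is what turns a typo into an immediate error rather than a catalog that quietly gains an unused property.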

In most use cases of the Crossing Minds API, you don’t need to upload and maintain your item catalog yourself. Instead, it is more common to leverage a CDP integration (like Segment, mParticle, or Shopify), or to share entire data files with your dedicated ML Engineer. Nevertheless, for some use cases it is preferable that you maintain all the properties by calling the API endpoints. See Uploading and Maintaining An Item Catalog below.

Property Types

Item properties can be of various types. The available value_type values are listed in the “Value Types” column of the following table:

| Domain  | Value Types              | Kind               | Filters | Examples                      | Example Values             |
|---------|--------------------------|--------------------|---------|-------------------------------|----------------------------|
| integer | int, int<NBITS>          | scalar             | =, <    | number of pages, year         | 0, -5, 12345               |
| integer | uint, uint<NBITS>        | categorical scalar | =, <    | category ID                   | 0, 5, 12345                |
| float   | float, float<NBITS>      | scalar             | <       | price                         | 3.14, 9.99                 |
| string  | unicode, unicode<NCHARS> | categorical        | =       | UTF8 genre name               | "science-fiction", "drama" |
| bytes   | bytes, bytes<NCHARS>     | categorical        | =       | ASCII tag name, encrypted tag | 0x5906d464                 |
| text    | text                     | long text          | ft      | review, synopsis              | "An eccentric yet..."      |
| url     | image_url                | image              |         | poster, screenshot            | "https://te.st/img.jpg"    |

Notes:

  • Domains with = in “Filters” support the eq, neq, in, notin operators in recommendation filters

  • Domains with < in “Filters” support the lt, gt operators in recommendation filters

  • Domains with both = and < in “Filters” also support the lte, gte operators in recommendation filters

  • Domains with ft in “Filters” support the ftsearch operator in recommendation filters

  • Categorical domains contribute to Semantic Graph Embedding

  • Scalar domains contribute to Shallow Content Extraction

  • Long text and image domains contribute to Deep Content Extraction

  • In integer domains, valid <NBITS> are 8, 16, 32 and 64

  • In float domains, valid <NBITS> are 32 and 64

  • In string and bytes domains, <NCHARS> encodes the maximum size. It defaults to the maximum of 255 chars
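The operator rules in the notes above can be condensed into a small lookup. The sketch below is a reading aid, assuming the “Filters” cell is passed as a comma-separated string; supported_operators is a hypothetical helper, not part of the API.

```python
def supported_operators(filters):
    """Map the markers of a "Filters" cell (e.g. "=, <") to operator names."""
    markers = {m.strip() for m in filters.split(",") if m.strip()}
    ops = []
    if "=" in markers:
        ops += ["eq", "neq", "in", "notin"]
    if "<" in markers:
        ops += ["lt", "gt"]
    if {"=", "<"} <= markers:
        # Only domains with both markers get the inclusive comparisons.
        ops += ["lte", "gte"]
    if "ft" in markers:
        ops += ["ftsearch"]
    return ops


integer_ops = supported_operators("=, <")  # integer domains
float_ops = supported_operators("<")       # float domain
text_ops = supported_operators("ft")       # text domain
image_ops = supported_operators("")        # url domain: no filters listed
```

Under this reading, a float property such as price can be filtered with lt and gt but not with eq or lte.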

Repeated Values

Any property may be “repeated”, meaning that a single item may have an array of many values for this property. This is typically the case with properties like “tags” or “genres”.

Properties with repeated=True also support recommendation filters. For most filter operators, an item with a repeated property satisfies a filter on this property if any of its repeated values satisfies the filter. See Filter Logic for the detailed logic of filtering on repeated values.
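The “any value matches” rule can be sketched as follows. This is a simplified model covering only eq, lt and gt; the function name satisfies is hypothetical, and the treatment of edge cases (such as an empty list, or operators that do not follow the “any” rule) is an assumption of this sketch. Filter Logic remains the reference.

```python
import operator

# Only three operators are modelled here, for illustration.
OPERATORS = {"eq": operator.eq, "lt": operator.lt, "gt": operator.gt}


def satisfies(value, op, operand, repeated=False):
    """Return True if a property value passes the filter (op, operand)."""
    test = OPERATORS[op]
    if repeated:
        # A repeated property passes if ANY of its values passes.
        return any(test(v, operand) for v in value)
    return test(value, operand)


item = {"price": 9.99, "tags": ["family", "sci-fi"]}

price_ok = satisfies(item["price"], "lt", 10.0)
tag_ok = satisfies(item["tags"], "eq", "sci-fi", repeated=True)
tag_missing = satisfies(item["tags"], "eq", "horror", repeated=True)
```

In this sketch an item with an empty list never passes, because there is no value to match.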

Array-Optimized Format (Optional)

In JSON, the repeated values are represented using a simple list.

When using a client that supports binary serialization of the data (not JSON), you can use an “array-optimized” format to represent the repeated values of many items. This format saves memory, CPU and network bandwidth, but it is more complicated. It requires separating the item properties into two groups:

  • items, a single array to represent the non-repeated values, with at least the item ID;

  • items_m2m, a mapping from property name to arrays representing the repeated values.

The arrays in items_m2m store a collection of 2-tuples for each of the many-to-many relations. The first element item_index is the (0-based) index of the item with respect to items. The second element value_id is the property value.

For instance, let’s take the following bulk of 4 items:

{
  "items": [
    {
      "item_id": "a",
      "price": 1.1,
      "tags": [],
      "genres": ["drama"]
    },
    {
      "item_id": "b",
      "price": 2.2,
      "tags": [1, 2, 3],
      "genres": ["drama", "comedy"]
    },
    {
      "item_id": "c",
      "price": 3.3,
      "tags": [1, 2],
      "genres": []
    },
    {
      "item_id": "d",
      "price": 4.4,
      "tags": [1],
      "genres": ["thriller", "romance"]
    }
  ]
}

The array-optimized format would be:

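items:

| id | price |
|----|-------|
| a  | 1.1   |
| b  | 2.2   |
| c  | 3.3   |
| d  | 4.4   |

items_m2m->tags:

| item_index | value_id |
|------------|----------|
| 1          | 1        |
| 1          | 2        |
| 1          | 3        |
| 2          | 1        |
| 2          | 2        |
| 3          | 1        |

items_m2m->genres:

| item_index | value_id |
|------------|----------|
| 0          | drama    |
| 1          | drama    |
| 1          | comedy   |
| 3          | thriller |
| 3          | romance  |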
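The conversion from the JSON-style list to the array-optimized layout can be sketched in a few lines. The helper to_array_optimized is hypothetical, and plain Python lists and tuples stand in for whatever array type your binary client expects; only the item_index / value_id pairing follows the description above.

```python
def to_array_optimized(items, repeated_keys):
    """Split items into non-repeated rows and (item_index, value_id) pairs."""
    flat = []
    m2m = {key: [] for key in repeated_keys}
    for index, item in enumerate(items):
        # Non-repeated values stay in a single row per item.
        flat.append({k: v for k, v in item.items() if k not in repeated_keys})
        # Each repeated value becomes one (0-based item index, value) pair.
        for key in repeated_keys:
            for value in item.get(key, []):
                m2m[key].append((index, value))
    return flat, m2m


items = [
    {"item_id": "a", "price": 1.1, "tags": [], "genres": ["drama"]},
    {"item_id": "b", "price": 2.2, "tags": [1, 2, 3], "genres": ["drama", "comedy"]},
    {"item_id": "c", "price": 3.3, "tags": [1, 2], "genres": []},
    {"item_id": "d", "price": 4.4, "tags": [1], "genres": ["thriller", "romance"]},
]

flat_items, items_m2m = to_array_optimized(items, ["tags", "genres"])
```

Items with an empty list, such as the tags of item a, simply contribute no pairs, which is why index 0 is absent from the tags relation.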

Uploading and Maintaining An Item Catalog

In rare use cases, a CDP integration or regularly shared data files are not enough to maintain your item catalog. In such cases, you can use the API endpoints to upload and maintain the property values.

This setup is needed when it is important that newly created items are recommended immediately, without waiting for the next catalog sync. This use case is fairly rare, as typically items can be created in advance, and controlling which items are recommended is simply achieved with filters.

Sending large amounts of data over HTTP can degrade performance. We suggest following these guidelines for better results:

  • use partial updates when possible, to update only what needs to be updated. A partial update is “shallow”, meaning you can send only a subset of the key/value property mapping. For repeated properties, however, you must still send the full list of values (which is what lets you “delete” previous values from the list)

  • use the bulk operations when possible, to send many updates at the same time. Do not pack a huge request into a single HTTP call, as it would time out. A good rule of thumb is to keep the latency of each HTTP call below one second, which corresponds to roughly a few hundred KB. Depending on the size of your values, this may mean anywhere from ~50 to ~500 items per bulk.

  • use the default wait_for_completion=true when developing, to get error messages synchronously. When moving to production, use wait_for_completion=false so that our backend moves the update to a background queue and immediately returns an empty response.
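The first two guidelines can be sketched locally as follows. Both helpers are hypothetical and make no API calls: shallow_update models how a partial update is assumed to merge into a stored item, and chunked splits a catalog into bulks. The default bulk size of 200 is an arbitrary value inside the ~50 to ~500 range mentioned above.

```python
def shallow_update(stored, partial):
    """Model a shallow partial update: top-level keys are replaced wholesale."""
    merged = dict(stored)
    # A repeated property is replaced by the full list that was sent,
    # so values missing from the new list are effectively deleted.
    merged.update(partial)
    return merged


def chunked(items, bulk_size=200):
    """Yield successive bulks of at most bulk_size items."""
    for start in range(0, len(items), bulk_size):
        yield items[start:start + bulk_size]


stored = {"item_id": "b", "price": 2.2, "tags": [1, 2, 3]}
updated = shallow_update(stored, {"tags": [1, 2]})  # drops tag 3, keeps price

catalog = [{"item_id": str(i)} for i in range(450)]
bulk_sizes = [len(bulk) for bulk in chunked(catalog, bulk_size=200)]
```

In practice you would tune bulk_size by measuring request latency against the one-second target rather than fixing it in advance.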