How to design a database for User Defined Fields?

If performance is the primary concern, I would go with #6… a table per UDF (really, this is a variant of #2). This answer is specifically tailored to this situation and to the data distribution and access patterns described.
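
For concreteness, here is a minimal sketch of what the table-per-UDF layout might look like. The names (Item, ItemWarrantyMonths, warranty_months) are hypothetical placeholders, not from the original question; the Pros below refer back to this sketch:

    -- Master table holding the core, always-present attributes.
    CREATE TABLE Item (
        item_id INT          NOT NULL PRIMARY KEY,
        name    VARCHAR(100) NOT NULL
        -- ...other core attributes...
    );

    -- One narrow table per UDF: only items that actually have a
    -- value get a row, and the column carries a real data type,
    -- a check constraint, and declarative referential integrity.
    CREATE TABLE ItemWarrantyMonths (
        item_id         INT      NOT NULL PRIMARY KEY
                                 REFERENCES Item (item_id),
        warranty_months SMALLINT NOT NULL
            CHECK (warranty_months BETWEEN 0 AND 120)
    );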

Pros:

  1. Because you indicate that some UDFs have values for only a small portion of the overall data set, a separate table would give you the best performance: that table will be only as large as it needs to be to support the UDF. The same holds true for the related indices.

  2. You also get a speed boost by limiting the amount of data that has to be processed for aggregations or other transformations. Splitting the data out into multiple tables lets you perform some of the aggregating and other statistical analysis on the UDF data, then join that result to the master table via foreign key to get the non-aggregated attributes (see the sketch after this list).

  3. You can use table/column names that reflect what the data actually is.

  4. You have complete control to use data types, check constraints, default values, etc. to define the data domains. Don’t underestimate the performance hit resulting from on-the-fly data type conversion. Such constraints also help RDBMS query optimizers develop more effective plans.

  5. Should you ever need to use foreign keys, built-in declarative referential integrity is rarely out-performed by trigger-based or application-level constraint enforcement.
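
As a sketch of the aggregate-then-join pattern from Pro #2 (still using the hypothetical tables above): the aggregation scans only the narrow UDF table, and the master table is joined afterward for the non-aggregated attributes.

    -- Items whose warranty exceeds the average; the AVG is computed
    -- over the small UDF table alone, then the result is joined back
    -- to the master table via the foreign key.
    SELECT i.item_id, i.name, w.warranty_months
    FROM ItemWarrantyMonths AS w
    JOIN Item AS i ON i.item_id = w.item_id
    WHERE w.warranty_months > (SELECT AVG(warranty_months)
                               FROM ItemWarrantyMonths);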

Cons:

  1. This could create a lot of tables. Enforcing schema separation and/or a naming convention would alleviate this.

  2. More application code is needed to manage the UDF definitions and their tables. I expect this is still less code than the original options 1, 3, & 4 would require.

Other Considerations:

  1. If there is anything about the nature of the data that would make it sensible for certain UDFs to be grouped, that should be encouraged. That way, those data elements can be combined into a single table. For example, let’s say you have UDFs for color, size, and cost. The tendency in the data is that most instances of this data look like

     'red', 'large', 45.03

     rather than

     NULL, 'medium', NULL

     In such a case, you won’t incur a noticeable speed penalty by combining the 3 columns in 1 table, because few values would be NULL and you avoid making 2 more tables, which is 2 fewer joins needed when you need to access all 3 columns (see the sketch after this list).

  2. If you hit a performance wall from a UDF that is heavily populated and frequently used, then it should be considered for inclusion in the master table.

  3. Logical table design can take you to a certain point, but when the record counts get truly massive, you should also start looking at what table partitioning options are provided by your RDBMS of choice (a sketch follows this list).
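
To illustrate consideration #1: correlated UDFs such as color, size, and cost can share one table, so fetching all three costs a single join instead of three. The table name ItemAppearance is a made-up example:

    -- One table for a group of UDFs that tend to be populated together;
    -- columns are nullable since any one value may still be absent.
    CREATE TABLE ItemAppearance (
        item_id INT NOT NULL PRIMARY KEY
                    REFERENCES Item (item_id),
        color   VARCHAR(20),
        size    VARCHAR(10),
        cost    DECIMAL(10,2)
    );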
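
And for consideration #3, a minimal sketch of declarative range partitioning, using PostgreSQL syntax as one example; the available options and syntax vary by RDBMS:

    -- A partitioned variant of the hypothetical UDF table, split into
    -- ranges on the key so each partition stays a manageable size.
    CREATE TABLE ItemWarrantyMonthsBig (
        item_id         INT      NOT NULL,
        warranty_months SMALLINT NOT NULL
    ) PARTITION BY RANGE (item_id);

    CREATE TABLE ItemWarrantyMonthsBig_p0
        PARTITION OF ItemWarrantyMonthsBig
        FOR VALUES FROM (1) TO (1000000);
    CREATE TABLE ItemWarrantyMonthsBig_p1
        PARTITION OF ItemWarrantyMonthsBig
        FOR VALUES FROM (1000000) TO (2000000);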
