We try to build most of our products on sanitized data. However, some work depends on real customer data so that we can confirm our extract, transform, and load (ETL) processes are working for a particular site.

Since access to the demo and testing sites is restricted to authorized users who have been through HIPAA training, this is currently acceptable. Our testing sites are typically not sanitized, which allows product owners and analysts to validate that the customer ETL process is working as required. The Sandbox environments for Demo and V&V have most identifiable information sanitized. However, since some dates, such as admission and discharge dates, are meaningful to how the applications work, they may not be sanitized.

One area where people sometimes forget to be diligent is taking screenshots for training or help documentation.

I wanted to get this post out there as a refresher on the data elements that we need to protect, and what may not be sanitized on our sandboxes.

Michael Marks did some research on this very topic for a white paper and he passed along some very good information.

You can find a lot of information on protected fields and de-identification at the HHS website under the Safe Harbor Guidance link.

Here is what they have listed as protected elements.

(2)(i) The following identifiers of the individual or of relatives, employers, or household members of the individual, are removed:

(A) Names

(B) All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geocodes, except for the initial three digits of the ZIP code if, according to the current publicly available data from the Bureau of the Census:

(1) The geographic unit formed by combining all ZIP codes with the same three initial digits contains more than 20,000 people; and

(2) The initial three digits of a ZIP code for all such geographic units containing 20,000 or fewer people is changed to 000

(C) All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older

(D) Telephone numbers

(E) Fax numbers

(F) Email addresses

(G) Social security numbers

(H) Medical record numbers

(I) Health plan beneficiary numbers

(J) Account numbers

(K) Certificate/license numbers

(L) Vehicle identifiers and serial numbers, including license plate numbers

(M) Device identifiers and serial numbers

(N) Web Universal Resource Locators (URLs)

(O) Internet Protocol (IP) addresses

(P) Biometric identifiers, including finger and voice prints

(Q) Full-face photographs and any comparable images

(R) Any other unique identifying number, characteristic, or code, except as permitted by paragraph (c) of this section [Paragraph (c) is presented below in the section “Re-identification”]
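
To make the ZIP code, date, and age rules in (B) and (C) a little more concrete, here is a minimal Python sketch of how those three generalizations might look. The function names and the RESTRICTED_ZIP3 set are my own assumptions for illustration, not part of any of our products, and the prefixes in that set are placeholder values; the real list has to come from current Census Bureau data.

```python
from datetime import date

# Placeholder values for illustration only. The real set of three-digit
# prefixes covering 20,000 or fewer people must come from current Census data.
RESTRICTED_ZIP3 = {"036", "059", "102"}


def generalize_zip(zip_code: str, restricted=RESTRICTED_ZIP3) -> str:
    """Keep only the first three digits, or 000 for sparsely populated areas."""
    prefix = zip_code.strip()[:3]
    return "000" if prefix in restricted else prefix


def generalize_date(d: date) -> int:
    """Drop month and day and keep only the year.

    Note: for people over 89 the year itself can reveal the age, so
    birth years for that group need the age rule below instead.
    """
    return d.year


def generalize_age(age: int) -> str:
    """Aggregate every age over 89 into a single '90+' category."""
    return "90+" if age >= 90 else str(age)


print(generalize_zip("90210"))             # first three digits only
print(generalize_date(date(2015, 3, 14)))  # year only
print(generalize_age(93))                  # aggregated category
```

This only covers three of the eighteen elements; names, record numbers, and the rest still need their own handling.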

 

When creating documentation, it is important to verify what has been sanitized for the product and what has not. Key items you may want to redact are admission and discharge dates. I also would not include any web URL in documentation, as it may reference an ID of some sort.
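
For the text parts of documentation (captions, pasted URLs, sample values), a quick automated check can catch some of these before a human review. The sketch below is only an illustration: the PATTERNS dictionary and the find_possible_identifiers function are hypothetical names, and the patterns would need tuning to the formats our products actually display. It does nothing for screenshots, which still need to be checked by eye.

```python
import re

# Hypothetical patterns for illustration; adjust to the formats actually in use.
PATTERNS = {
    "url": re.compile(r"https?://\S+"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b|\b\d{4}-\d{2}-\d{2}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_possible_identifiers(text: str) -> list[tuple[str, str]]:
    """Return (kind, matched_text) pairs for text that looks like a protected element."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((kind, match.group()))
    return hits


draft = "Admitted 03/14/2015, see https://example.com/patient/12345 for details."
for kind, value in find_possible_identifiers(draft):
    print(f"Review before publishing: {kind}: {value}")
```

A clean result from a check like this does not mean the text is safe; it only flags the obvious cases.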

We are taking steps toward making sure all protected elements are sanitized in the sandbox environments, but this is currently a manual process that has to be coded for each product. As part of our overall security infrastructure, we will be looking into tools to sanitize customer data as we get it, while keeping it as statistically consistent with production data as possible.
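
One common way to keep dates useful after sanitizing is to shift every date for a given patient by the same offset, so intervals such as length of stay are preserved. The sketch below shows the idea in Python. It is not what our products do today; the function names, the keyed-hash approach, and the 365-day window are all assumptions I picked for illustration. Whether shifted dates meet the de-identification standard is a compliance question this sketch does not answer.

```python
import hashlib
import hmac
from datetime import date, timedelta


def shift_for(patient_id: str, secret: bytes, max_days: int = 365) -> timedelta:
    """Derive a stable per-patient offset in the range [-max_days, +max_days]."""
    digest = hmac.new(secret, patient_id.encode("utf-8"), hashlib.sha256).digest()
    offset = int.from_bytes(digest[:4], "big") % (2 * max_days + 1) - max_days
    return timedelta(days=offset)


def shift_date(d: date, patient_id: str, secret: bytes) -> date:
    """Shift a date by the patient's offset; the same patient always gets the same shift."""
    return d + shift_for(patient_id, secret)


secret = b"replace-with-a-real-secret"  # hypothetical key, kept out of the sandbox
admitted = date(2015, 3, 1)
discharged = date(2015, 3, 8)
new_admitted = shift_date(admitted, "patient-001", secret)
new_discharged = shift_date(discharged, "patient-001", secret)
print(new_discharged - new_admitted)  # length of stay is unchanged
```

Because the offset comes from a keyed hash of the patient ID, reloading the same data produces the same shifted dates without storing a lookup table.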

 

Again, thank you to Michael Marks for the great information he passed along to me for this post.

 

What are you doing to obfuscate customer data in your environments for development? Is there specific software you recommend and what challenges did you face with it?

I would love to hear about your experiences in the comments below.