Talk:Data scrubbing

From Wikipedia, the free encyclopedia

Removed paragraphs

I removed the following two paragraphs because their topics are already covered by other pages on en.wiki. The second one may need to be rewritten and inserted somewhere in Sanitization (classified information), I think. - Mitchan (talk) 16:22, 14 October 2009 (UTC)

This is data cleansing:

Data scrubbing is the process of fixing or eliminating individual pieces of data that are incorrect, incomplete or duplicated before the data is passed to a data warehouse or another application.
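For anyone comparing this definition with the data cleansing article, here is a minimal sketch of what "fixing or eliminating" incorrect, incomplete or duplicated records could look like before a load step. The record layout (`name`, `email`), the function name `scrub`, and the rule that an email without `@` counts as incorrect are all assumptions made for illustration, not something the removed paragraph specifies.

```python
def scrub(records):
    """Return cleaned copies of records: normalized, complete, de-duplicated."""
    seen_emails = set()
    cleaned = []
    for rec in records:
        # Fix: normalize whitespace and case so equal values compare equal.
        name = (rec.get("name") or "").strip()
        email = (rec.get("email") or "").strip().lower()
        # Eliminate: incomplete (missing name) or incorrect (no "@") records.
        if not name or "@" not in email:
            continue
        # Eliminate: duplicates, keyed on the normalized email (an assumption).
        if email in seen_emails:
            continue
        seen_emails.add(email)
        cleaned.append({"name": name, "email": email})
    return cleaned


# Made-up sample data.
raw = [
    {"name": " Ada ", "email": "ADA@example.org"},
    {"name": "Ada", "email": "ada@example.org"},   # duplicate after normalizing
    {"name": "", "email": "x@example.org"},        # incomplete: no name
    {"name": "Bob", "email": None},                # incomplete: no email
    {"name": "Cy", "email": "cy@example.org"},
]
cleaned = scrub(raw)
```

Only the first Ada record and the Cy record survive; a real pipeline would likely log or quarantine the rejected rows rather than silently drop them.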

This is Sanitization (classified information):

Data scrubbing is the process of taking a data set with individually identifiable information and removing or altering the data in such a way that the usefulness of the data set is retained but identifying the individuals in it becomes nearly impossible. Scrubbing should follow a protocol developed to anticipate potential problems. For instance, a common problem with scrubbed data is failure to anticipate the presence of an additional data source, which increases the potential for re-identification. A simplistic example would be removing names from a list of hospital patients but not addresses: it would be relatively easy to re-identify the data using a simple phone book. Internet access to vast databases of information makes re-identification risks real. Medical research projects are an example of data that can be re-identified for improper use. The classic example is medical mapping data. While dots on a map are not very identifiable, single events without a specific address but within a specific geographical area can be re-identified. If a single dot for children exposed to malaria appears in a small town, it is entirely possible to narrow the case down using external data, such as credit card purchases of malaria medicine at a local pharmacy. While the child did not purchase the medicine, there is no other malaria victim in town, so assuming the purchaser is the parent is reasonable. Algorithmic data scrubbing must be handled carefully, because a poor choice of algorithm can lead to predictable re-identification. For example, performing an XOR on the SSN to create a new identifier would create a predictable series. Essentially the same limitations found in encryption are present in data scrubbing. Hashes, non-linear functions or randomness are preferred to simple obscuring through some translation function. There is no formal standard for data scrubbing and no standard protocol base.
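If this paragraph is rewritten for the sanitization article, the XOR-versus-hash point might be easier to follow with a small illustration. The sketch below is only one reading of it: the fixed mask, the sample SSNs, and the choice of HMAC-SHA256 with a random secret key are assumptions, not anything the paragraph prescribes. A keyed hash is used here because the SSN space is small enough that a plain unkeyed hash could be reversed by trying every value.

```python
import hashlib
import hmac
import secrets


def xor_pseudonym(ssn: int, mask: int) -> int:
    """Weak scheme from the paragraph: XOR the SSN with a fixed mask."""
    return ssn ^ mask


def hmac_pseudonym(ssn: int, key: bytes) -> str:
    """Keyed-hash alternative (an assumption): HMAC-SHA256 of the SSN."""
    return hmac.new(key, f"{ssn:09d}".encode(), hashlib.sha256).hexdigest()


MASK = 0x5A5A5A5                 # hypothetical fixed mask
ssns = [123456788, 123456789]    # made-up values, not real SSNs

xor_ids = [xor_pseudonym(s, MASK) for s in ssns]

# Anyone who learns a single (SSN, identifier) pair recovers the mask,
# and with it every other SSN in the scrubbed data set.
recovered_mask = ssns[0] ^ xor_ids[0]
recovered_ssn = xor_ids[1] ^ recovered_mask

# With a secret key, identifiers stay stable for joins inside the data set
# but cannot be inverted or predicted without the key.
key = secrets.token_bytes(32)
hmac_ids = [hmac_pseudonym(s, key) for s in ssns]
```

Here `recovered_ssn` equals the second SSN, which is the "predictable" weakness the paragraph describes. Neither scheme addresses the linkage problem in the hospital and malaria examples, where the remaining fields identify the person.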

External links modified

Hello fellow Wikipedians,

I have just modified one external link on Data scrubbing. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

When you have finished reviewing my changes, please set the checked parameter below to true or failed to let others know (documentation at {{Sourcecheck}}).

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 18 January 2022).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers. - InternetArchiveBot (Report bug) 08:10, 7 December 2016 (UTC)