[Vp-reproduce-subgroup] Data availability

Tue Jan 4 02:15:35 PST 2022

Reviewer says: “This section seems to outline limitations of
available data, but again makes no recommendations or proposed
solution to any of the issues raised. Is this the intention?
Most of the issues raised here reflect limitations of experimental
science or data privacy, which likely cannot be meaningfully
addressed by the modeling community.”

The relevant paragraph is (I think): "Data availability to
rationalize calibration and validation of models is crucial
but often not possible because of data sharing policy and privacy
(especially for individual human data). Moreover, undisclosed
data from industry sponsored clinical trials used in model
building and validation generally excludes many useful models
from any assessment by the scientific community."

There are some partial answers to this for personally identifiable
data. One is to develop ways to generate synthetic data that is
similar enough to the original data to be useful for calibrating
models but that does not contain information about any real
individual. This is not easy to do and is an active area of
research (especially for network-shaped data). We can simply
point out that more research on methods for this is needed.

For validation, where you want to query a database based on
model output to check that the output is consistent with what’s
in the database, the differential privacy literature might help.
It gives a way to put bounds on the information that leaks from
the database when answering queries, and those bounds can be
tuned to whatever is considered acceptable. Again, more research
is needed to adapt this idea to modeling needs.
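
To make the differential privacy point concrete, here is a minimal
sketch of the standard Laplace mechanism applied to a counting
query. It is only an illustration under stated assumptions: the
names (dp_count, laplace_noise), the toy "ages" data and the choice
of epsilon are all hypothetical and are not taken from the paper or
from any particular database.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # is Laplace-distributed with location 0 and scale `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(records, predicate, epsilon, rng):
    """Noisy answer to 'how many records satisfy predicate?'.

    A counting query has sensitivity 1 (adding or removing one record
    changes the count by at most 1), so Laplace noise with scale
    1/epsilon gives epsilon-differential privacy for this one query.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy data standing in for a protected database.
ages = [23, 35, 41, 52, 67, 70, 29, 58]

rng = random.Random(0)
# Smaller epsilon means more noise and a tighter bound on leakage.
noisy = dp_count(ages, lambda a: a >= 50, epsilon=0.5, rng=rng)
print(f"noisy count of records with age >= 50: {noisy:.2f}")
```

The tuning knob is epsilon: the modeler only ever sees the noisy
answer, and each further query spends more of the privacy budget,
which is part of what would need adapting for validation workflows.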

Would extending that paragraph along those lines work?

Cheers,
-w