[Vp-integration-subgroup] Another case study of model integration

Jacob Barhak jacob.barhak at gmail.com
Tue Mar 30 11:23:23 PDT 2021


Greetings Integration sub-group,

Below you will find another attempt to integrate a few models created by
Lucas Boettcher into a COVID-19 model.

For those interested in following the details, our correspondence is
included below in this thread to show the difficulties in integrating models.

I will attempt to summarize for those with little time to read through the
details.

Lucas had several models that we attempted to reuse:
- Recovery model and incubation model based on Singapore data
- Several mortality models - one based on CDC data
- An infectiousness model based on a previous version of
https://doi.org/10.1038/s41591-020-0869-5

So far, after roughly 2 weeks of correspondence we were able to:
1. Transmit the infectiousness profile and make sure I can implement it
properly - trace it back to data to make sure it is reusable. Note that in
this case we were using the same language - Python - and still transmission
of the formula was not straightforward, since the gamma distribution behind
it can be parameterized in different ways (shape and rate, or shape and scale).

2. Determine that Recovery / incubation models cannot be reused currently
since the data source that made the data available is not responding and
did not specify usage terms. I asked for assistance from this mailing list to
contact the entity responsible for the data in this message:
https://lists.simtk.org/pipermail/vp-integration-subgroup/2021-March/000043.html

If you can help, please respond.

3. The Mortality model was not fully defined and I will wait for
publication of the preprint - hopefully Lucas will transmit it to this
mailing list. However, I highly suggest people look at his paper that
discusses mortality - it shows some important aspects of counting numbers
and how confusing something as seemingly simple as reported mortality
numbers can be. You can find the paper here:
https://doi.org/10.1088/1478-3975/ab9e59
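
To illustrate the ambiguity mentioned in item 1, here is a minimal sketch
that compares several ways of writing the same gamma density, using the
shape a=8 and rate b=1.25/day from the correspondence below. The function
names are hypothetical and chosen only for this illustration; the last one
shows what happens if the rate is mistakenly passed as a scale.

```python
import math
from scipy.stats import gamma

a = 8      # shape parameter (n in the correspondence)
b = 1.25   # rate parameter in 1/day (lambda in the correspondence)

def explicit_pdf(x, a, b):
    """Shape/rate form written out by hand; valid for integer shape a."""
    return b**a * x**(a - 1) * math.exp(-b * x) / math.factorial(a - 1)

def rescaled_pdf(x, a, b):
    """Form used in the correspondence: rescale x by the rate."""
    return b * gamma.pdf(b * x, a)

def scale_pdf(x, a, b):
    """Same density through scipy's scale argument, with scale = 1 / rate."""
    return gamma.pdf(x, a, scale=1.0 / b)

def misread_pdf(x, a, b):
    """Deliberately wrong: treats the rate as if it were a scale."""
    return gamma.pdf(x, a, scale=b)

# x is the time in days since infection
for x in (3, 4):
    print(x, explicit_pdf(x, a, b), rescaled_pdf(x, a, b),
          scale_pdf(x, a, b), misread_pdf(x, a, b))
```

The first three columns should agree with each other, while the last one
differs, which is the kind of silent mismatch that exchanging a couple of
reference values between modelers can catch.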

For those interested in the fine details - please keep on reading the
correspondence below in reverse chronological order.

Feedback from subgroup members will be appreciated.

            Jacob




On Tue, Mar 30, 2021 at 8:21 AM LUCAS BOETTCHER <lucasb at g.ucla.edu> wrote:

> Hi Jacob
>
> Yes, please feel free to add our discussion to the mailing list.
>
> Best
>
> Lucas
>
> On Tue, Mar 30, 2021 at 10:37 AM Jacob Barhak <jacob.barhak at gmail.com>
> wrote:
>
>> Thanks Lucas,
>>
>> This is all good news. Since the recovery function is associated with
>> the Singapore data, we can hold off on it until we authenticate the data.
>>
>> The infectiousness curve you mentioned is based on an article stating
>> in its data availability section that there is no restriction on data
>> access - yet it would be nice to write a note to the authors about using
>> their data - it is good scholarship - and I noticed that those authors
>> actually correspond - look at the correction to their paper. So it would be
>> nice to write them an email indicating their data was useful. I think their
>> correction does not involve data change, so if you used their data, you
>> should be fine - yet it is worth another check.
>>
>> I will wait for your mortality preprint when it is available.
>>
>> I think the discussion in this thread is good enough to go public in the
>> mailing list as it seems to me now - so if you approve, I will add the
>> integration mailing list to the recipient list and summarize the
>> difficulties in integration we encountered. It is important people can see
>> with their own eyes the difficulties as they appear in practice. Hopefully
>> those cases will help support methods that will improve things in the long
>> run.
>>
>> I hope you still approve of this going public.
>>
>>           Jacob
>>
>>
>>
>> On Tue, Mar 30, 2021 at 2:16 AM LUCAS BOETTCHER <lucasb at g.ucla.edu>
>> wrote:
>>
>>> Hi Jacob
>>>
>>> Yes, I'll try to clarify some points below.
>>>
>>> On Mon, Mar 29, 2021 at 9:31 PM Jacob Barhak <jacob.barhak at gmail.com>
>>> wrote:
>>>
>>>> Thanks Lucas,
>>>>
>>>> You will have to bear with me. The amount of information you
>>>> transmitted is actually non trivial and as much as you tried to
>>>> communicate it clearly it is just too much condensed in one email. I
>>>> already got confused as it seems.
>>>>
>>>> Allow me to clarify with a few questions:
>>>>
>>>> 1) The Singapore data and the Python program you sent were for
>>>> recovery / incubation, and the program is based on the Singapore data - correct?
>>>>
>>> >> Yes, that's correct.
>>>
>>>
>>>> 2) The infectiousness curve we reconstructed is Eq (7) in your
>>>> mortality paper - What data did you fit it to? Is it also fitted to the
>>>> Singapore data?
>>>>
>>> >> We inferred this curve from the first (uncorrected) version of "He,
>>> X., Lau, E. H., Wu, P., Deng, X., Wang, J., Hao, X., ... & Leung, G. M.
>>> (2020). Temporal dynamics in viral shedding and transmissibility of
>>> COVID-19. *Nature medicine*, *26*(5), 672-675."
>>>
>>>>
>>>> 3) What is the equation for mortality I can use to plug in with other
>>>> mortality functions? I see Table 2 summarizing different formats to
>>>> calculate mortality, yet I need a more formal equation I can use that is a
>>>> function of parameters such as MortalityProbablityPDF(
>>>> TimeSinceInfectionInDays, AgeInYears).
>>>>
>>>> >> Our first mortality paper appeared when there was little knowledge
>>> about age and mortality characteristics. We proposed some functional forms,
>>> but I think that there are better estimates available now. We're about to
>>> finalize another manuscript with a more advanced temporal network model
>>> with age structure/age-dependent mortality and different communities. I
>>> will share the preprint with you as soon as possible.
>>>
>>>
>>>> If you used CDC data such as
>>>> https://www.cdc.gov/mmwr/volumes/69/wr/mm6912e2.htm?s_cid=mm6912e2_w
>>>> then there are no restrictions on use since US government data is
>>>> considered public domain in most cases - there are very rare cases where
>>>> the government provides a license since data was acquired from a 3rd party, yet
>>>> generally, in the US government publications have no copyright - in fact I
>>>> think it is similar in some other countries - yet I am not a lawyer - so it
>>>> is worth checking.
>>>>
>>> >> Ok, good to know.
>>>
>>>
>>>>
>>>> I must admit that I already got confused from the amount of information
>>>> with the infectiousness data and in my mind associated it with the
>>>> Singapore data - hopefully it is not associated and can be reused.
>>>>
>>>>                Jacob
>>>>
>>>> On Mon, Mar 29, 2021 at 1:27 AM LUCAS BOETTCHER <lucasb at g.ucla.edu>
>>>> wrote:
>>>>
>>>>> Hi Jacob
>>>>>
>>>>> Yes, let's proceed. The mortality datasets are taken from various
>>>>> statistical offices and the CDC.
>>>>>
>>>>> If you're mainly interested in US mortality statistics, we just have
>>>>> to contact the CDC and ask about these licencing issues.
>>>>>
>>>>> Best
>>>>>
>>>>> Lucas
>>>>>
>>>>>
>>>>> On Sunday, March 28, 2021, Jacob Barhak <jacob.barhak at gmail.com>
>>>>> wrote:
>>>>> > Hi Lucas,
>>>>> > It seems that data from the Singapore web site cannot be verified -
>>>>> I sent an email to the mailing list in hope someone has a contact in
>>>>> Singapore that can help with verifying the data and its usage terms.
>>>>> > I suggest we wait a bit more and if we still cannot move forward
>>>>> with that data, we can focus on other elements I can reuse from your paper
>>>>> towards integration. I already have several infectiousness curves, so we
>>>>> can perhaps focus on mortality if this is not connected to the Singapore
>>>>> data.
>>>>> > I hope this makes sense to you and moves us forward.
>>>>> >              Jacob
>>>>> >
>>>>> >
>>>>> > On Wed, Mar 17, 2021 at 11:49 AM LUCAS BOETTCHER <lucasb at g.ucla.edu>
>>>>> wrote:
>>>>> >>
>>>>> >> Thanks for your comments! I checked everything; responses are below.
>>>>> >>
>>>>> >> On Wed, Mar 17, 2021 at 12:51 PM Jacob Barhak <
>>>>> jacob.barhak at gmail.com> wrote:
>>>>> >>>
>>>>> >>> Thanks Lucas,
>>>>> >>> This is a good discussion since it shows more aspects of
>>>>> integration difficulties.
>>>>> >>> First thanks for being specific about the use of the gamma
>>>>> function to calculate infectiousness. Yet even with your clarifications, it
>>>>> looks a bit confusing to me and I want to verify that I am not misusing it.
>>>>> Therefore let me confirm with you that reimplementation is correct by
>>>>> giving two values of x:
>>>>> >>> >>> import scipy
>>>>> >>> >>> from scipy.stats import gamma
>>>>> >>> >>> a=8
>>>>> >>> >>> b=1.25
>>>>> >>> >>> x=3
>>>>> >>> >>> b*gamma.pdf(b*x, a)
>>>>> >>> 0.060826670304049466
>>>>> >>> >>> x=4
>>>>> >>> >>> b*gamma.pdf(b*x, a)
>>>>> >>> 0.13055607869631744
>>>>> >>>
>>>>> >>> And please confirm that x in that example is time in days from
>>>>> infection.
>>>>> >>
>>>>> >>
>>>>> >> >>>>>> Yes, I can confirm both. The numbers are correct and x is
>>>>> the time [days] from infection.
>>>>> >>
>>>>> >>>
>>>>> >>> If this is correct, then for my own purposes, I will need to get
>>>>> the probability of infection for each day from 0 to 18, so this
>>>>> should generate the following results:
>>>>> >>> >>> import numpy as np
>>>>> >>> >>> x= np.array(range(19))
>>>>> >>> >>> b*gamma.pdf(b*x, a)
>>>>> >>> array([0.00000000e+00, 3.38829695e-04, 1.24257706e-02,
>>>>> 6.08266703e-02,
>>>>> >>>        1.30556079e-01, 1.78360666e-01, 1.83104790e-01,
>>>>> 1.54333118e-01,
>>>>> >>>        1.12599032e-01, 7.35756677e-02, 4.40725870e-02,
>>>>> 2.46064656e-02,
>>>>> >>>        1.29628669e-02, 6.50380786e-03, 3.13034629e-03,
>>>>> 1.45367341e-03,
>>>>> >>>        6.54334483e-04, 2.86572345e-04, 1.22498638e-04])
>>>>> >>>
>>>>> >>> If this is a good enough approximation, then the question is what
>>>>> do the numbers I generate mean? I assume this is the infectiousness
>>>>> density that sums to 1 since:
>>>>> >>> >>> sum(b*gamma.pdf(b*x, a))
>>>>> >>> 0.9999137765146388
>>>>> >>>
>>>>> >>
>>>>> >> >>>>>> Right, this distribution is normalized to 1. If one wants to
>>>>> obtain an infection rate for a disease model one has to use the methods
>>>>> described in the mortality paper I forwarded you. Equation 17 connects the
>>>>> infectiousness distribution with S0*R0, so one can fix the pre-factor in
>>>>> Eq. 16 using a given S0*R0 (which can be estimated) and Eq. 17.
>>>>> >>
>>>>> >> https://doi.org/10.1088/1478-3975/ab9e59
>>>>> >>
>>>>> >>> As for the data. This is a typical example of ambiguity with
>>>>> regards to reuse. The team that produced the data did not specify a license
>>>>> yet made the data available. Typically for academic purposes such data is
>>>>> considered fair use. However, since I am a sole proprietor - a for profit
>>>>> organization, then I have to be selective and inquire if I can reuse this
>>>>> data. Options are that:
>>>>> >>> 1. The authors wanted to make this data public domain and
>>>>> therefore there is no copyright statement on the web site
>>>>> >>> 2. The authors neglected to put a copyright / license since they
>>>>> are overworked and this was not the most important thing on their mind -
>>>>> they want the data to be useful, yet have not considered implications of
>>>>> reuse.
>>>>> >>> 3. The authors considered the issues and decided to release this
>>>>> like this - this situation is problematic since it makes reuse terms unclear
>>>>> >>> I suspect that the answer is one of the first two options, yet I
>>>>> think that this can be clarified by contacting the web site authors listed
>>>>> as UPCODE ACADEMY - their web site is: https://www.upcodeacademy.com/
>>>>> >>> I located their email to be:
>>>>> >>> hello at upcodeacademy.com
>>>>> >>>
>>>>> >>> I think we should ask them to be explicit about the data and ask
>>>>> to release it under CC0 to clear all doubts. Since you plan to upload the
>>>>> data to GitHub, you would rather know the license beforehand to make sure you
>>>>> properly define the license on GitHub. However, I will be happy to
>>>>> communicate with them for you.
>>>>> >>
>>>>> >> >> Ok, it would be great if you could clarify the Singapore data
>>>>> license. For my projects, I would just upload the data and specify the
>>>>> source. In your case it will be better to clarify the license type.
>>>>> >> I will send you a GitHub repo link later.
>>>>> >>
>>>>> >>>
>>>>> >>> Once you are ready with your github and remove the zip file, we
>>>>> can add the integration subgroup mailing list to the recipient list and
>>>>> make this conversation public. It shows again the difficulties with
>>>>> integration and how much effort and communication there should be. This is
>>>>> excellent for the subgroup.
>>>>> >>
>>>>> >> >> Ok, perfect. Thanks!
>>>>> >>
>>>>> >>>
>>>>> >>>                 Jacob
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> On Wed, Mar 17, 2021 at 3:21 AM LUCAS BOETTCHER <lucasb at g.ucla.edu>
>>>>> wrote:
>>>>> >>>>
>>>>> >>>> Hi Jacob
>>>>> >>>> Thanks for your comments!
>>>>> >>>>
>>>>> >>>> I directly respond to your comments below.
>>>>> >>>>
>>>>> >>>> On Tue, Mar 16, 2021 at 11:45 PM Jacob Barhak <
>>>>> jacob.barhak at gmail.com> wrote:
>>>>> >>>>>
>>>>> >>>>> Many thanks Lucas,
>>>>> >>>>> This makes much more sense now.
>>>>> >>>>> However, just to show the subgroup that integration and
>>>>> reproducibility is still difficult, I want to show some ambiguity.
>>>>> >>>>
>>>>> >>>> >> Yes, I agree. Different definitions of certain distributions
>>>>> are confusing.
>>>>> >>>>
>>>>> >>>>>
>>>>> >>>>> The infectiousness curve you describe is a gamma distribution.
>>>>> There are two forms that it can be described by: 1) shape and rate, 2)
>>>>> shape and scale
>>>>> >>>>> https://en.wikipedia.org/wiki/Gamma_distribution
>>>>> >>>>> From your text I assume that n=8 is shape and lambda =1.25/day
>>>>> is a rate
>>>>> >>>>> So let me rewrite the function explicitly. Is the function I
>>>>> should use for infectiousness in day x:
>>>>> >>>>> f(x;a,b) = b^a*x^(a-1)*e^(-b*x) / (a-1)!
>>>>> >>>>> where a=8 and b=1.25 ?
>>>>> >>>>
>>>>> >>>> >> This is the correct representation (it's equation 7 in the
>>>>> mortality paper I shared).
>>>>> >>>>
>>>>> >>>>> If I need to implement it, do you think I can just use this
>>>>> python implementation?
>>>>> >>>>>
>>>>> https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gamma.html
>>>>> >>>>
>>>>> >>>> >> Yes, this one works, but one has to add the second parameter.
>>>>> >>>> Using your notation from above, I would use:
>>>>> >>>>
>>>>> >>>> from scipy.stats import gamma
>>>>> >>>> b*gamma.pdf(b*x, a)
>>>>> >>>>>
>>>>> >>>>> As for the updated zip file - I got it to work and I can see the
>>>>> plots - the incubation period plot is less interesting for me, yet the
>>>>> recovery histogram is helpful - I actually played with the number of bins
>>>>> to see the data.
>>>>> >>>>> However, I do have a few questions.
>>>>> >>>>> 1) You used Singapore data - does this data have some
>>>>> restrictions on use - meaning is there a license associated with it that
>>>>> will restrict reuse of this data for commercial purposes or redistribution
>>>>> of the data. You will have to check the terms of data usage with the origin
>>>>> - if there is a copyright symbol and no license indicating otherwise, it
>>>>> becomes a problem we need to discuss before going public. I checked the
>>>>> web site you quoted and did not see a copyright notice, nor did I see a way
>>>>> to download the data as CSV, so I assume you can communicate with the data
>>>>> source to clarify those details.
>>>>> >>>>
>>>>> >>>> >> The data is extracted from
>>>>> https://co.vid19.sg/singapore/cases/search (Now they have more than
>>>>> 6,000 tracked cases!). It's a really underestimated source of tracked Covid
>>>>> cases.
>>>>> >>>> I've never seen any copyright symbols or licenses and tried to
>>>>> contact some health officials from Singapore last year, but without
>>>>> success. If you find some contact details, we can ask them.
>>>>> >>>>
>>>>> >>>>>
>>>>> >>>>> 2) Assuming that there is no restriction on data, you should
>>>>> still specify a license on the code you created - I suggested we do
>>>>> this with the aim of releasing it under CC0, yet once we add the mailing list to
>>>>> this conversation, many people can access your zip file and we need to be
>>>>> clear on what one is allowed to do with each version.
>>>>> >>>>
>>>>> >>>> >> I would suggest that we first create a cleaned-up version of
>>>>> my plotting script and upload it to one of your or my GitHub repos. Then
>>>>> I'll remove the ZIP, so that others just use the clean GitHub version.
>>>>> >>>>
>>>>> >>>>>
>>>>> >>>>> If the Singapore data is already public domain and you are
>>>>> willing to release your code under CC0 - I can proceed and process your
>>>>> code and create a model I will publish for you on Github. Yet you have to
>>>>> decide if you want the zip file to become public so others can view it.
>>>>> >>>>
>>>>> >>>> >> Yes, CC0 is fine.
>>>>> >>>>
>>>>> >>>>>
>>>>> >>>>> I did not add the mailing list email since I want you to be ok
>>>>> with details before we go public. Once we clear those issues, we can make
>>>>> the conversation public. As you can see I am cautious before I make things
>>>>> public - one reason for cautiousness is to show the subgroup what is proper
>>>>> practice and how models and data should be checked for licenses.
>>>>> >>>>
>>>>> >>>> >> That's great! I think it's good to pay attention to those
>>>>> details.
>>>>> >>>>
>>>>> >>>>>
>>>>> >>>>> In any case, many thanks for this - this is progress.
>>>>> >>>>>            Jacob
>>>>> >>>>>
>>>>> >>>>> On Tue, Mar 16, 2021 at 2:11 AM LUCAS BOETTCHER <
>>>>> lucasb at g.ucla.edu> wrote:
>>>>> >>>>>>
>>>>> >>>>>> Hi Jacob
>>>>> >>>>>> Yes, I meant equation 16 not 18 in [1]. This equation describes
>>>>> the infectiousness \beta(\tau) as a function of the time since infection
>>>>> \tau. The distribution parameters are as specified in my previous email and
>>>>> also described in [1].
>>>>> >>>>>> I updated the ZIP:
>>>>> http://lucas-boettcher.info/downloads/singapore_.zip
>>>>> >>>>>> There is no need anymore to have Latex connected to python to
>>>>> run this script. I'll add a YML environment file next time.
>>>>> >>>>>> I am fine with releasing everything I shared under CC0; please
>>>>> feel free to add our discussion to the mailing list.
>>>>> >>>>>> Best
>>>>> >>>>>> Lucas
>>>>> >>>>>>
>>>>> >>>>>> ---
>>>>> >>>>>> [1] Böttcher, L., Xia, M., & Chou, T. (2020). Why case fatality
>>>>> ratios can be misleading: individual-and population-based mortality
>>>>> estimates and factors influencing them. Physical Biology, 17(6), 065003.
>>>>> >>>>>> On Sun, Mar 14, 2021 at 6:35 PM LUCAS BOETTCHER <
>>>>> lucasb at g.ucla.edu> wrote:
>>>>> >>>>>>>
>>>>> >>>>>>> Hi Jacob
>>>>> >>>>>>> In [1] (Eq. 18) we used the gamma distribution
>>>>> >>>>>>> \beta(\tau)=\beta_0 \rho(\tau;n,\lambda),
>>>>> >>>>>>> to describe an infectiousness profile estimate from [2]. Here,
>>>>> \tau is the time since infection, n=8 (shape parameter), and
>>>>> \lambda=1.25/day (rate parameter). The amplitude \beta_0 S_0 can be
>>>>> estimated using R_0 estimates (see [1]).
>>>>> >>>>>>> Incubation period and recovery time profiles (incl. data from
>>>>> https://co.vid19.sg/cases) are stored here:
>>>>> http://lucas-boettcher.info/downloads/singapore_.zip
>>>>> >>>>>>> (I'll remove the ZIP in a few weeks, but you can download and
>>>>> store the data somewhere else if it's helpful for your research.)
>>>>> >>>>>>>
>>>>> >>>>>>> And regarding the license issue, please let me know what would
>>>>> be best for your work. I am not sure if CC0 might be the best solution for
>>>>> you:
>>>>> >>>>>>>
>>>>> https://opensource.stackexchange.com/questions/133/how-could-using-code-released-under-cc0-infringe-on-the-authors-patents
>>>>> >>>>>>> Best
>>>>> >>>>>>> Lucas
>>>>> >>>>>>> ---
>>>>> >>>>>>> [1] Böttcher, L., Xia, M., & Chou, T. (2020). Why case
>>>>> fatality ratios can be misleading: individual-and population-based
>>>>> mortality estimates and factors influencing them. Physical Biology, 17(6),
>>>>> 065003.
>>>>> >>>>>>> [2] He, X., Lau, E. H., Wu, P., Deng, X., Wang, J., Hao, X.,
>>>>> ... & Leung, G. M. (2020). Temporal dynamics in viral shedding and
>>>>> transmissibility of COVID-19. Nature medicine, 26(5), 672-675.
>>>>
>>>>