[Vp-integration-subgroup] Another case study of model integration

Jacob Barhak jacob.barhak at gmail.com
Wed Apr 28 04:32:48 PDT 2021


Hi Lucas,

You may wish to compare your infectiousness model to the ones generated by
Will Hart and Robin Thompson. These are the closest among the models I
implemented and made available here:
https://github.com/Jacob-Barhak/COVID19Models/tree/main/COVID19_Infectiousness_Multi

If you download the html file alone to your machine, you should be able to
view all models.

Please note that I had issues with the umlaut 'o' character in your name
since I wanted to avoid Unicode issues, so I spelled your last name as
Bottcher - please let me know if you want it changed - I am sure you see
this problem a lot and may have a preference.

Hopefully you like the comparison to other potential models.

I will proceed with integrating this model into my ensemble.
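
For reference, here is a minimal sketch of how the confirmed profile could be turned into daily values, assuming the shape a=8 and rate b=1.25/day from the thread below. The names daily_density and daily_bin_mass and the 19-day horizon are illustrative choices, not taken from Lucas's code or from the published ensemble. The bin-mass variant is an alternative discretization added for comparison and was not confirmed in the correspondence.

```python
import numpy as np
from scipy.stats import gamma

A_SHAPE = 8      # shape parameter n quoted in the thread
B_RATE = 1.25    # rate parameter lambda (per day) quoted in the thread
N_DAYS = 19      # days 0..18, the horizon used in the thread


def daily_density(n_days=N_DAYS, a=A_SHAPE, b=B_RATE):
    """Density sampled at integer days, i.e. b*gamma.pdf(b*x, a)."""
    days = np.arange(n_days)
    return b * gamma.pdf(b * days, a)


def daily_bin_mass(n_days=N_DAYS, a=A_SHAPE, b=B_RATE):
    """Probability mass in each interval [d, d+1), from CDF differences.

    This is an assumed alternative, not the method used in the thread.
    """
    edges = np.arange(n_days + 1)
    return np.diff(gamma.cdf(edges, a, scale=1.0 / b))


density = daily_density()
mass = daily_bin_mass()
# Optional renormalization so the sampled values sum to exactly 1.
weights = density / density.sum()

print("sum of sampled density:", density.sum())
print("sum of bin masses:", mass.sum())
```

Sampling the density at integer days sums to slightly less than 1 over this horizon, so renormalizing or using bin masses are options if exact normalization matters.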

           Jacob



On Tue, Mar 30, 2021 at 1:23 PM Jacob Barhak <jacob.barhak at gmail.com> wrote:

> Greetings Integration sub-group,
>
> Below you will find another attempt to integrate a few models created by
> Lucas Boettcher into a COVID-19 model.
>
> Those interested in the details will find our correspondence in this
> thread, which shows the difficulties in integrating models.
>
> I will attempt to summarize for those with little time to follow the
> details.
>
> Lucas had several models that we attempted to reuse:
> - Recovery model and incubation model based on Singapore data
> - Several mortality models - one based on CDC data
> - An infectiousness model based on a previous version of
> https://doi.org/10.1038/s41591-020-0869-5
>
> So far, after roughly 2 weeks of correspondence we were able to:
> 1. transmit the infectiousness profile and make sure I can implement it
> properly - trace it back to data to make sure it is reusable. Note that in
> this case we were using the same language - Python - and still transmission
> of the formula was not straightforward since the function can be defined
> in different, ambiguous forms.
>
> 2. Determine that Recovery / incubation models cannot be reused currently
> since the source that made the data available is not responding and
> did not specify usage terms. I asked this mailing list for assistance in
> contacting the entity responsible for the data in this message:
> https://lists.simtk.org/pipermail/vp-integration-subgroup/2021-March/000043.html
>
> If you can help, please respond.
>
> 3. The Mortality model was not fully defined and I will wait for
> publication of the preprint - hopefully Lucas will transmit it to this
> mailing list. However, I highly suggest people look at his paper that
> discusses mortality - it shows some important aspects of counting numbers
> and how confusing something like reported mortality numbers can be. You can
> find the paper here:
> https://doi.org/10.1088/1478-3975/ab9e59
>
> For those interested in the fine details - please keep on reading the
> correspondence below in reverse chronological order.
>
> Feedback from subgroup members will be appreciated.
>
>             Jacob
>
>
>
>
> On Tue, Mar 30, 2021 at 8:21 AM LUCAS BOETTCHER <lucasb at g.ucla.edu> wrote:
>
>> Hi Jacob
>>
>> Yes, please feel free to add our discussion to the mailing list.
>>
>> Best
>>
>> Lucas
>>
>> On Tue, Mar 30, 2021 at 10:37 AM Jacob Barhak <jacob.barhak at gmail.com>
>> wrote:
>>
>>> Thanks Lucas,
>>>
>>> This is all good news. Since the recovery function is associated with
>>> the Singapore data, we can hold off on it until we authenticate the data.
>>>
>>> The infectiousness curve you mentioned is based on an article stating
>>> that there is no restriction on data access in the data availability
>>> section - yet it would be nice to write a note to the authors about using
>>> their data - it is good scholarship - and I noticed that those authors
>>> actually correspond - look at the correction to their paper. So it would be
>>> nice to write them an email indicating their data was useful. I think their
>>> correction does not involve data change, so if you used their data, you
>>> should be fine - yet it is worth another check.
>>>
>>> I will wait for your mortality preprint when it is available.
>>>
>>> I think the discussion in this thread is good enough to go public in the
>>> mailing list as it seems to me now - so if you approve, I will add the
>>> integration mailing list to the recipient list and summarize the
>>> difficulties in integration we encountered. It is important people can see
>>> with their own eyes the difficulties as they appear in practice. Hopefully
>>> those cases will help support methods that will improve things in the long
>>> run.
>>>
>>> I hope you still approve of this going public.
>>>
>>>           Jacob
>>>
>>>
>>>
>>> On Tue, Mar 30, 2021 at 2:16 AM LUCAS BOETTCHER <lucasb at g.ucla.edu>
>>> wrote:
>>>
>>>> Hi Jacob
>>>>
>>>> Yes, I'll try to clarify some points below.
>>>>
>>>> On Mon, Mar 29, 2021 at 9:31 PM Jacob Barhak <jacob.barhak at gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks Lucas,
>>>>>
>>>>> You will have to bear with me. The amount of information you
>>>>> transmitted is actually non trivial and as much as you tried to
>>>>> communicate it clearly it is just too much condensed in one email. I
>>>>> already got confused as it seems.
>>>>>
>>>>> Allow me to clarify with a few questions:
>>>>>
>>>>> 1) the Singapore data and the Python program you sent were for
>>>>> recovery / incubation, and the program is based on the Singapore data - correct?
>>>>>
>>>> >> Yes, that's correct.
>>>>
>>>>
>>>>> 2) The infectiousness curve we reconstructed is Eq (7) in your
>>>>> mortality paper - What data did you fit it to? Is it also fitted to the
>>>>> Singapore data?
>>>>>
>>>> >> We inferred this curve from the first (uncorrected) version of "He,
>>>> X., Lau, E. H., Wu, P., Deng, X., Wang, J., Hao, X., ... & Leung, G. M.
>>>> (2020). Temporal dynamics in viral shedding and transmissibility of
>>>> COVID-19. *Nature medicine*, *26*(5), 672-675."
>>>>
>>>>>
>>>>> 3) What is the equation for mortality I can use to plug in with other
>>>>> mortality functions? I see Table 2 summarizing different formats to
>>>>> calculate mortality, yet I need a more formal equation I can use that is a
>>>>> function of parameters such as MortalityProbablityPDF(
>>>>> TimeSinceInfectionInDays, AgeInYears).
>>>>>
>>>>> >> Our first mortality paper appeared when there was little knowledge
>>>> about age and mortality characteristics. We proposed some functional forms,
>>>> but I think that there are better estimates available now. We're about to
>>>> finalize another manuscript with a more advanced temporal network model
>>>> with age structure/age-dependent mortality and different communities. I
>>>> will share the preprint with you as soon as possible.
>>>>
>>>>
>>>>> If you used CDC data such as
>>>>> https://www.cdc.gov/mmwr/volumes/69/wr/mm6912e2.htm?s_cid=mm6912e2_w
>>>>> then there are no restrictions on use since US government data is
>>>>> considered public domain in most cases - there are very rare cases where
>>>>> the government provides a license since data was acquired from a 3rd party, yet
>>>>> generally, in the US government publications have no copyright - in fact I
>>>>> think it is similar in some other countries - yet I am not a lawyer - so it
>>>>> is worth checking.
>>>>>
>>>> >> Ok, good to know.
>>>>
>>>>
>>>>>
>>>>> I must admit that I already got confused from the amount of
>>>>> information with the infectiousness data and in my mind associated it with
>>>>> the Singapore data - hopefully it is not associated and can be reused.
>>>>>
>>>>>                Jacob
>>>>>
>>>>> On Mon, Mar 29, 2021 at 1:27 AM LUCAS BOETTCHER <lucasb at g.ucla.edu>
>>>>> wrote:
>>>>>
>>>>>> Hi Jacob
>>>>>>
>>>>>> Yes, let's proceed. The mortality datasets are taken from various
>>>>>> statistical offices and the CDC.
>>>>>>
>>>>>> If you're mainly interested in US mortality statistics, we just have
>>>>>> to contact the CDC and ask about these licencing issues.
>>>>>>
>>>>>> Best
>>>>>>
>>>>>> Lucas
>>>>>>
>>>>>>
>>>>>> On Sunday, March 28, 2021, Jacob Barhak <jacob.barhak at gmail.com>
>>>>>> wrote:
>>>>>> > Hi Lucas,
>>>>>> > It seems that data from the Singapore web site cannot be verified -
>>>>>> I sent an email to the mailing list in hope someone has a contact in
>>>>>> Singapore that can help with verifying the data and its usage terms.
>>>>>> > I suggest we wait a bit more and if we still cannot move forward
>>>>>> with that data, we can focus on other elements I can reuse from your paper
>>>>>> towards integration. I already have several infectiousness curves, so we
>>>>>> can perhaps focus on mortality if this is not connected to the Singapore
>>>>>> data.
>>>>>> > I hope this makes sense to you and moves us forward.
>>>>>> >              Jacob
>>>>>> >
>>>>>> >
>>>>>> > On Wed, Mar 17, 2021 at 11:49 AM LUCAS BOETTCHER <lucasb at g.ucla.edu>
>>>>>> wrote:
>>>>>> >>
>>>>>> >> Thanks for your comments! I checked everything; responses are
>>>>>> below.
>>>>>> >>
>>>>>> >> On Wed, Mar 17, 2021 at 12:51 PM Jacob Barhak <
>>>>>> jacob.barhak at gmail.com> wrote:
>>>>>> >>>
>>>>>> >>> Thanks Lucas,
>>>>>> >>> This is a good discussion since it shows more aspects of
>>>>>> integration difficulties.
>>>>>> >>> First thanks for being specific about the use of the gamma
>>>>>> function to calculate infectiousness. Yet even with your clarifications, it
>>>>>> looks a bit confusing to me and I want to verify that I am not misusing it.
>>>>>> Therefore let me confirm with you that reimplementation is correct by
>>>>>> giving two values of x:
>>>>>> >>> >>> import scipy
>>>>>> >>> >>> from scipy.stats import gamma
>>>>>> >>> >>> a=8
>>>>>> >>> >>> b=1.25
>>>>>> >>> >>> x=3
>>>>>> >>> >>> b*gamma.pdf(b*x, a)
>>>>>> >>> 0.060826670304049466
>>>>>> >>> >>> x=4
>>>>>> >>> >>> b*gamma.pdf(b*x, a)
>>>>>> >>> 0.13055607869631744
>>>>>> >>>
>>>>>> >>> And please confirm that x in that example is time in days from
>>>>>> infection.
>>>>>> >>
>>>>>> >>
>>>>>> >> >>>>>> Yes, I can confirm both. The numbers are correct and x is
>>>>>> the time [days] from infection.
>>>>>> >>
>>>>>> >>>
>>>>>> >>> If this is correct, then for my own purposes, I will need to get
>>>>>> the probability of infection for each day from 0 to 18, so this
>>>>>> should generate the following results:
>>>>>> >>> >>> import numpy as np
>>>>>> >>> >>> x= np.array(range(19))
>>>>>> >>> >>> b*gamma.pdf(b*x, a)
>>>>>> >>> array([0.00000000e+00, 3.38829695e-04, 1.24257706e-02,
>>>>>> 6.08266703e-02,
>>>>>> >>>        1.30556079e-01, 1.78360666e-01, 1.83104790e-01,
>>>>>> 1.54333118e-01,
>>>>>> >>>        1.12599032e-01, 7.35756677e-02, 4.40725870e-02,
>>>>>> 2.46064656e-02,
>>>>>> >>>        1.29628669e-02, 6.50380786e-03, 3.13034629e-03,
>>>>>> 1.45367341e-03,
>>>>>> >>>        6.54334483e-04, 2.86572345e-04, 1.22498638e-04])
>>>>>> >>>
>>>>>> >>> If this is a good enough approximation, then the question is what
>>>>>> do the numbers I generate mean? I assume this is the infectiousness
>>>>>> density that sums to 1 since:
>>>>>> >>> >>> sum(b*gamma.pdf(b*x, a))
>>>>>> >>> 0.9999137765146388
>>>>>> >>>
>>>>>> >>
>>>>>> >> >>>>>> Right, this distribution is normalized to 1. If one wants
>>>>>> to obtain an infection rate for a disease model one has to use the methods
>>>>>> described in the mortality paper I forwarded you. Equation 17 connects the
>>>>>> infectiousness distribution with S0*R0, so one can fix the pre-factor in
>>>>>> Eq. 16 using a given S0*R0 (which can be estimated) and Eq. 17.
>>>>>> >>
>>>>>> >> https://doi.org/10.1088/1478-3975/ab9e59
>>>>>> >>
>>>>>> >>> As for the data. This is a typical example of ambiguity with
>>>>>> regards to reuse. The team that produced the data did not specify a license
>>>>>> yet made the data available. Typically for academic purposes such data is
>>>>>> considered fair use. However, since I am a sole proprietor - a for profit
>>>>>> organization, then I have to be selective and inquire if I can reuse this
>>>>>> data. Options are that:
>>>>>> >>> 1. The authors wanted to make this data public domain and
>>>>>> therefore there is no copyright statement on the web site
>>>>>> >>> 2. The authors neglected to put a copyright / license since they
>>>>>> are overworked and this was not the most important thing on their mind -
>>>>>> they want the data to be useful, yet have not considered implications of
>>>>>> reuse.
>>>>>> >>> 3. The authors considered the issues and decided to release this
>>>>>> like this - this situation is problematic since it makes reuse terms unclear
>>>>>> >>> I suspect that the answer is one of the first two options, yet I
>>>>>> think that this can be clarified by contacting the web site authors listed
>>>>>> as UPCODE ACADEMY - their web site is: https://www.upcodeacademy.com/
>>>>>>
>>>>>> >>> I located their email to be:
>>>>>> >>> hello at upcodeacademy.com
>>>>>> >>>
>>>>>> >>> I think we should ask them to be explicit about the data and ask
>>>>>> to release it under CC0 to clear all doubts. Since you plan to upload the
>>>>>> data to GitHub, you would rather know the license beforehand to make sure you
>>>>>> properly define the license on GitHub. However, I will be happy to
>>>>>> communicate with them for you.
>>>>>> >>
>>>>>> >> >> Ok, it would be great if you could clarify the Singapore data
>>>>>> license. For my projects, I would just upload the data and specify the
>>>>>> source. In your case it will be better to clarify the license type.
>>>>>> >> I will send you a GitHub repo link later.
>>>>>> >>
>>>>>> >>>
>>>>>> >>> Once you are ready with your github and remove the zip file, we
>>>>>> can add the integration subgroup mailing list to the recipient list and
>>>>>> make this conversation public. It shows again the difficulties with
>>>>>> integration and how much effort and communication there should be. This is
>>>>>> excellent for the subgroup.
>>>>>> >>
>>>>>> >> >> Ok, perfect. Thanks!
>>>>>> >>
>>>>>> >>>
>>>>>> >>>                 Jacob
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> On Wed, Mar 17, 2021 at 3:21 AM LUCAS BOETTCHER <
>>>>>> lucasb at g.ucla.edu> wrote:
>>>>>> >>>>
>>>>>> >>>> Hi Jacob
>>>>>> >>>> Thanks for your comments!
>>>>>> >>>>
>>>>>> >>>> I directly respond to your comments below.
>>>>>> >>>>
>>>>>> >>>> On Tue, Mar 16, 2021 at 11:45 PM Jacob Barhak <
>>>>>> jacob.barhak at gmail.com> wrote:
>>>>>> >>>>>
>>>>>> >>>>> Many thanks Lucas,
>>>>>> >>>>> This makes much more sense now.
>>>>>> >>>>> However, just to show the subgroup that integration and
>>>>>> reproducibility are still difficult, I want to show some ambiguity.
>>>>>> >>>>
>>>>>> >>>> >> Yes, I agree. Different definitions of certain distributions
>>>>>> are confusing.
>>>>>> >>>>
>>>>>> >>>>>
>>>>>> >>>>> The infectiousness curve you describe is a gamma distribution.
>>>>>> There are two forms that it can be described by: 1) shape and rate, 2)
>>>>>> shape and scale
>>>>>> >>>>> https://en.wikipedia.org/wiki/Gamma_distribution
>>>>>> >>>>> From your text I assume that n=8 is shape and lambda =1.25/day
>>>>>> is a rate
>>>>>> >>>>> So let me rewrite the function explicitly. Is the function I
>>>>>> should use for infectiousness in day x:
>>>>>> >>>>> f(x;a,b) = b^a*x^(a-1)*e^(-b*x) / (a-1)!
>>>>>> >>>>> where a=8 and b=1.25 ?
>>>>>> >>>>
>>>>>> >>>> >> This is the correct representation (it's equation 7 in the
>>>>>> mortality paper I shared).
>>>>>> >>>>
>>>>>> >>>>> If I need to implement it, do you think I can just use this
>>>>>> python implementation?
>>>>>> >>>>>
>>>>>> https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gamma.html
>>>>>> >>>>
>>>>>> >>>> >> Yes, this one works, but one has to add the second parameter.
>>>>>> >>>> Using your notation from above, I would use:
>>>>>> >>>>
>>>>>> >>>> from scipy.stats import gamma
>>>>>> >>>> b*gamma.pdf(b*x, a)
>>>>>> >>>>>
>>>>>> >>>>> As for the updated zip file - I got it to work and I can see
>>>>>> the plots - the incubation period plot is less interesting for me, yet the
>>>>>> recovery histogram is helpful - I actually played with the number of bins
>>>>>> to see the data.
>>>>>> >>>>> However, I do have a few questions.
>>>>>> >>>>> 1) You used Singapore data - does this data have some
>>>>>> restrictions on use - meaning is there a license associated with it that
>>>>>> will restrict reuse of this data for commercial purposes or redistribution
>>>>>> of the data. You will have to check the terms of data usage with the origin
>>>>>> - if there is a copyright symbol and no license indicating otherwise, it
>>>>>> becomes a problem  we need to discuss before going public. I checked the
>>>>>> web site you quoted and did not see a copyright notice, nor did I see a way
>>>>>> to download the data as CSV. so I assume you can communicate with the data
>>>>>> source to clarify those details.
>>>>>> >>>>
>>>>>> >>>> >> The data is extracted from
>>>>>> https://co.vid19.sg/singapore/cases/search (Now they have more than
>>>>>> 6,000 tracked cases!). It's a really underestimated source of tracked Covid
>>>>>> cases.
>>>>>> >>>> I've never seen any copyright symbols or licenses and tried to
>>>>>> contact some health officials from Singapore last year, but without
>>>>>> success. If you find some contact details, we can ask them.
>>>>>> >>>>
>>>>>> >>>>>
>>>>>> >>>>> 2) Assuming that there is no restriction on data, you should
>>>>>> still specify license on the code you created - I suggested we are doing
>>>>>> this towards releasing this under CC0, yet once we add the mailing list to
>>>>>> this conversation, many people can access your zip file and we need to be
>>>>>> clear on what is allowed to do with each version.
>>>>>> >>>>
>>>>>> >>>> >> I would suggest that we first create a cleaned-up version of
>>>>>> my plotting script and upload it to one of your or my GitHub repos. Then
>>>>>> I'll remove the ZIP, so that others just use the clean GitHub version.
>>>>>> >>>>
>>>>>> >>>>>
>>>>>> >>>>> If the Singapore data is already public domain and you are
>>>>>> willing to release your code under CC0 - I can proceed and process your
>>>>>> code and create a model I will publish for you on Github. Yet you have to
>>>>>> decide if you want the zip file to become public so others can view it.
>>>>>> >>>>
>>>>>> >>>> >> Yes, CC0 is fine.
>>>>>> >>>>
>>>>>> >>>>>
>>>>>> >>>>> I did not add the mailing list email since I want you to be ok
>>>>>> with details before we go public. Once we clear those issues, we can make
>>>>>> the conversation public. As you can see I am cautious before I make things
>>>>>> public - one reason for cautiousness is to show the subgroup what is proper
>>>>>> practice and how models and data should be checked for licenses.
>>>>>> >>>>
>>>>>> >>>> >> That's great! I think it's good to pay attention to those
>>>>>> details.
>>>>>> >>>>
>>>>>> >>>>>
>>>>>> >>>>> In any case, many thanks for this - this is progress.
>>>>>> >>>>>            Jacob
>>>>>> >>>>>
>>>>>> >>>>> On Tue, Mar 16, 2021 at 2:11 AM LUCAS BOETTCHER <
>>>>>> lucasb at g.ucla.edu> wrote:
>>>>>> >>>>>>
>>>>>> >>>>>> Hi Jacob
>>>>>> >>>>>> Yes, I meant equation 16 not 18 in [1]. This equation
>>>>>> describes the infectiousness \beta(\tau) as a function of the time since
>>>>>> infection \tau. The distribution parameters are as specified in my previous
>>>>>> email and also described in [1].
>>>>>> >>>>>> I updated the ZIP:
>>>>>> http://lucas-boettcher.info/downloads/singapore_.zip
>>>>>> >>>>>> There is no need anymore to have Latex connected to python to
>>>>>> run this script. I'll add a YML environment file next time.
>>>>>> >>>>>> I am fine with releasing everything I shared under CC0; please
>>>>>> feel free to add our discussion to the mailing list.
>>>>>> >>>>>> Best
>>>>>> >>>>>> Lucas
>>>>>> >>>>>>
>>>>>> >>>>>> ---
>>>>>> >>>>>> [1] Böttcher, L., Xia, M., & Chou, T. (2020). Why case
>>>>>> fatality ratios can be misleading: individual-and population-based
>>>>>> mortality estimates and factors influencing them. Physical Biology, 17(6),
>>>>>> 065003.
>>>>>> >>>>>> On Sun, Mar 14, 2021 at 6:35 PM LUCAS BOETTCHER <
>>>>>> lucasb at g.ucla.edu> wrote:
>>>>>> >>>>>>>
>>>>>> >>>>>>> Hi Jacob
>>>>>> >>>>>>> In [1] (Eq. 18) we used the gamma distribution
>>>>>> >>>>>>> \beta(\tau)=\beta_0 \rho(\tau;n,\lambda),
>>>>>> >>>>>>> to describe an infectiousness profile estimate from [2].
>>>>>> Here, \tau is the time since infection, n=8 (shape parameter), and
>>>>>> \lambda=1.25/day (rate parameter). The amplitude \beta_0 S_0 can be
>>>>>> estimated using R_0 estimates (see [1]).
>>>>>> >>>>>>> Incubation period and recovery time profiles (incl. data from
>>>>>> https://co.vid19.sg/cases) are stored here:
>>>>>> http://lucas-boettcher.info/downloads/singapore_.zip
>>>>>> >>>>>>> (I'll remove the ZIP in a few weeks, but you can download and
>>>>>> store the data somewhere else if it's helpful for your research.)
>>>>>> >>>>>>>
>>>>>> >>>>>>> And regarding the license issue, please let me know what
>>>>>> would be best for your work. I am not sure if CC0 might be the best
>>>>>> solution for you:
>>>>>> >>>>>>>
>>>>>> https://opensource.stackexchange.com/questions/133/how-could-using-code-released-under-cc0-infringe-on-the-authors-patents
>>>>>> >>>>>>> Best
>>>>>> >>>>>>> Lucas
>>>>>> >>>>>>> ---
>>>>>> >>>>>>> [1] Böttcher, L., Xia, M., & Chou, T. (2020). Why case
>>>>>> fatality ratios can be misleading: individual-and population-based
>>>>>> mortality estimates and factors influencing them. Physical Biology, 17(6),
>>>>>> 065003.
>>>>>> >>>>>>> [2] He, X., Lau, E. H., Wu, P., Deng, X., Wang, J., Hao, X.,
>>>>>> ... & Leung, G. M. (2020). Temporal dynamics in viral shedding and
>>>>>> transmissibility of COVID-19. Nature medicine, 26(5), 672-675.
>>>>>
>>>>>
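
The parameterization ambiguity discussed in this thread can be checked numerically. Below is a minimal sketch, assuming n=8 is the shape and 1.25/day is the rate as confirmed above. The helper names are hypothetical and not part of any code exchanged in the correspondence. It compares the explicit formula with two equivalent scipy.stats.gamma calls, and shows that misreading 1.25 as a scale gives a different curve.

```python
import math
from scipy.stats import gamma

A_SHAPE = 8      # shape parameter n from the thread
B_RATE = 1.25    # rate parameter lambda (per day) from the thread


def explicit_pdf(x, a=A_SHAPE, b=B_RATE):
    """Shape/rate form for integer a: b^a * x^(a-1) * e^(-b*x) / (a-1)!"""
    return b**a * x**(a - 1) * math.exp(-b * x) / math.factorial(a - 1)


def rate_form(x, a=A_SHAPE, b=B_RATE):
    """The call confirmed in the thread: rescale x and the density by b."""
    return b * gamma.pdf(b * x, a)


def scale_form(x, a=A_SHAPE, b=B_RATE):
    """Equivalent call using scipy's scale keyword, with scale = 1/rate."""
    return gamma.pdf(x, a, scale=1.0 / b)


def misread_as_scale(x, a=A_SHAPE, b=B_RATE):
    """What one gets if 1.25 is wrongly treated as a scale instead of a rate."""
    return gamma.pdf(x, a, scale=b)


for x in (3, 4):
    print(x, explicit_pdf(x), rate_form(x), scale_form(x), misread_as_scale(x))
```

The first three columns should agree to floating-point precision, while the last one differs noticeably, which is why stating shape/rate versus shape/scale explicitly matters when transmitting a model.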