[Vp-reproduce-subgroup] Another case study of data
John Rice
john.rice at noboxes.org
Fri Apr 2 16:10:22 PDT 2021
I keep seeing disclaimers on the willingness of any government data sources to set error bands on their data. They don’t seem to know how/what different localities counted, how local data got to states, nor how states bundled and reported to the fed. And apparently, the counting and collecting “rules” kept changing, so they can’t reconstruct. Did see they reduced the mortality # by ? 9% ? after looking at a big sample of death reports etc. and finding them to be in error. But then some local counts were OK. Did note that by the last couple of months the tracking group at JHU seemed to be the most commonly cited source of counts.
John
Typed with two thumbs on my iPhone. (757) 318-0671
“Upon this gifted age, in its dark hour,
Rains from the sky a meteoric shower
Of facts . . . they lie unquestioned, uncombined.
Wisdom enough to leech us of our ill
Is daily spun; but there exists no loom
To weave it into fabric.”
–Edna St. Vincent Millay,
On Apr 2, 2021, at 17:56, Jacob Barhak <jacob.barhak at gmail.com> wrote:
Thanks Tingting,
The moderation messages you get are because the mailer system does not accept images, and the mailing list mailer sends those to moderation. I approve those regularly, yet it is a bother, so it is better to avoid using images with this older mailing system. Also avoid using attachments; it is OK to use links.
I was able to download the data from the first link you now sent when I click the download button and click on csv. Please give me some time to compare it with the data on the second link that I was able to download. I will try to reproduce your findings by the time we speak.
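A minimal sketch of the comparison discussed in this thread, assuming one source publishes only cumulative counts (as described for the usa fact data in the quoted message below) and the other publishes daily counts. The function names and all numbers here are hypothetical and made up for illustration; they are not taken from either data set.

```python
from datetime import date


def daily_from_cumulative(cumulative):
    """Turn a date-sorted list of (date, cumulative_count) into (date, daily_count).

    The first date has no previous value to subtract, so it is skipped.
    """
    daily = []
    for (_, prev), (day, curr) in zip(cumulative, cumulative[1:]):
        daily.append((day, curr - prev))
    return daily


def compare_sources(daily_a, daily_b):
    """Return {date: (a, b, a - b)} for dates present in both sources."""
    b_by_date = dict(daily_b)
    return {
        day: (a, b_by_date[day], a - b_by_date[day])
        for day, a in daily_a
        if day in b_by_date
    }


# Made-up numbers for illustration only; not real county data.
cumulative_a = [
    (date(2021, 3, 1), 100),
    (date(2021, 3, 2), 110),
    (date(2021, 3, 3), 125),
    (date(2021, 3, 4), 120),  # a downward revision gives a negative daily value
]
daily_b = [
    (date(2021, 3, 2), 12),
    (date(2021, 3, 3), 15),
    (date(2021, 3, 4), 3),
]

daily_a = daily_from_cumulative(cumulative_a)
for day, (a, b, diff) in sorted(compare_sources(daily_a, daily_b).items()):
    print(day.isoformat(), a, b, diff)
```

Differencing a cumulative series turns any later revision of the totals into a spike or a negative daily count, which may be one reason two sources that draw on the same reports can still disagree day by day.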
Being able to download the data is the first step. So I am making progress.
I hope we can get to the bottom of this quickly.
Jacob
On Fri, Apr 2, 2021 at 4:15 PM Tingting Tang <ttang2 at sdsu.edu> wrote:
> Hi, Jacob,
>
> Not sure if the last message went through or not due to attachments.
>
> I have just redownloaded the data from the CA open data portal (https://data.ca.gov/dataset/covid-19-time-series-metrics-by-county-and-state/resource/1be1e43c-b4b2-4002-afb6-340bbcc85bbf). The downloaded data is in this Google sheet:
> https://docs.google.com/spreadsheets/d/1WiPIloymqpe7QymsVP817f47MIr-_LFjc3r6tyPWOXE/edit?usp=sharing
>
> I also downloaded the cases data from usa fact (https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/). They only have cumulative data, so the daily case numbers are computed.
>
> Thanks,
> Tingting
>
>
>
> On Fri, Apr 2, 2021 at 12:29 PM Jacob Barhak <jacob.barhak at gmail.com> wrote:
>> Thanks Tingting,
>>
>> The resources in your first link cannot be downloaded:
>> https://data.ca.gov/dataset/covid-19-time-series-metrics-by-county-and-state
>>
>> There is a server error I encounter. If you can send a direct link to the spreadsheet data, it will help.
>>
>> I tried to verify your data and could not download the files. I was able to download a csv file using this direct link.
>> https://static.usafacts.org/public/data/covid-19/covid_confirmed_usafacts.csv?_ga=2.136905904.1065342744.1617373914-598050905.1617373914
>>
>> Unless I can download both files, I cannot verify the problem you encountered. Perhaps the data curators figured out the issue and are fixing it?
>>
>> Mistakes can happen, yet usually those are dealt with through proper announcements. I wonder if this is the case here.
>>
>> If the source of data is not responding when complaints are raised, this indicates a problem. Hopefully it is just temporary and things get resolved quickly by the time we talk.
>>
>> Regardless, thank you for pointing out difficulties you had. It is important more people realize the day to day difficulties modelers encounter.
>>
>> Jacob
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Apr 2, 2021 at 1:53 PM Tingting Tang <ttang2 at sdsu.edu> wrote:
>>> Hi, Jacob,
>>>
>>> For the three questions
>>>
>>>
>>> 1. Were those different infections / hospitalizations numbers?
>>>
>>> These are daily cases numbers from different websites.
>>> 2. Can you be specific and send the exact link to the data you used? I saw many links in your first link.
>>> For the CA data open portal data, the source website is https://data.ca.gov/dataset/covid-19-time-series-metrics-by-county-and-state
>>>
>>> For the usa fact data, it is https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/
>>>
>>>
>>> 3. Did you attempt to contact the sources of the data to figure out the reasons for discrepancies?
>>>
>>> I haven't contacted the source yet. As you mentioned, usa fact claims its data comes from local government, but states a possible discrepancy due to update frequency. My main concern is that the level of the discrepancy is surprising. In addition, similar behaviour exists in other sources as well. In particular, the local government website (https://www.icphd.org/health-information-and-resources/healthy-facts/covid-19/covid-19-data/) has some discrepancy within itself, as data is being updated. I have tried to contact them, but haven't got anything back so far. We can chat more on that as well.
>>>
>>> Thanks,
>>> Tingting
>>>
>>> On Fri, Apr 2, 2021 at 7:40 AM Jacob Barhak <jacob.barhak at gmail.com> wrote:
>>>> Thanks Tingting,
>>>>
>>>> Your email is about data consistency in another location, not necessarily about Singapore data - so I started another email thread.
>>>>
>>>> Just to clarify to the readers, you found 2 data sources with different numbers.
>>>>
>>>> Let us examine the issue here and I have a few questions:
>>>>
>>>> 1. Were those different infections / hospitalizations numbers?
>>>>
>>>> 2. Can you be specific and send the exact link to the data you used? I saw many links in your first link.
>>>>
>>>> 3. Did you attempt to contact the sources of the data to figure out the reasons for discrepancies?
>>>>
>>>> The USA facts website states:
>>>> "they may not reflect the exact numbers reported state and local government organizations"
>>>>
>>>> So perhaps you just stumbled on some data that will be fixed later.
>>>>
>>>> I am being cautious before jumping to conclusions. This has to be studied in more detail to reach conclusions. However, I see your point that the data consistency issue is confusing at the least.
>>>>
>>>> I will set up time to meet in private email.
>>>>
>>>> Thank you for drawing our attention to another case of potential data issues.
>>>>
>>>> Jacob,
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Apr 2, 2021 at 12:56 AM Tingting Tang <ttang2 at sdsu.edu> wrote:
>>>>> Hi, Jacob,
>>>>>
>>>>> I created this figure using the data from the websites I mentioned. They are numbers of new cases per day reported by these websites. I also noticed that different websites sometimes have different meanings for "daily new cases", which makes the matter even more confusing. The following website contains this image: https://www.notion.so/Two-websites-with-consistent-data-where-one-draw-from-the-other-2e54d94d9d474c36837cb48327963ba7
>>>>>
>>>>> I'd be happy to have a video chat sometime about the credibility of data.
>>>>>
>>>>> Thanks,
>>>>> Tingting
>>>>>
>>>>> On Thu, Apr 1, 2021 at 9:02 PM Jacob Barhak <jacob.barhak at gmail.com> wrote:
>>>>>> Hi Tingting,
>>>>>>
>>>>>> Did you create those plots?
>>>>>>
>>>>>> It would be very interesting to start another discussion topic at the credibility mailing list and see how many more people noticed differences between data sources.
>>>>>>
>>>>>> However, the mailing list will reject archiving images and large files - it's an old mailing list tool we are using.
>>>>>>
>>>>>> Nevertheless, if you have a link to this image stored elsewhere accessible like google drive, it would be nice to share your experience with the working group.
>>>>>>
>>>>>> I was looking at your plot and data sources and was wondering if you are showing hospitalisation data or diagnosed data?
>>>>>>
>>>>>> It seems that data needs interpretation - Lucas and I are working on this aspect - if you are interested you can join the effort - I am looking for experts to interpret data from a human perspective to add to models. If this interests you, let me know and we will schedule a video call so I can better explain.
>>>>>>
>>>>>> Meanwhile, thank you for your email and it will be nice if you share this with the entire group.
>>>>>>
>>>>>> Jacob
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 1, 2021 at 6:53 PM Tingting Tang <ttang2 at sdsu.edu> wrote:
>>>>>>> Hi, Jacob,
>>>>>>>
>>>>>>> This example prompts me to bring up the credibility of the data sources of some websites I have been watching. In particular, I have been checking the covid tracking data for Imperial County, CA, for over a month at different websites: local government (icphd.com), usa fact, 1point3acres.com, California open data portal (https://data.chhs.ca.gov/dataset/covid-19-hospital-data) etc.
>>>>>>>
>>>>>>> There seems to be quite a bit of inconsistency among these data sources in case reporting. A quick comparison between the California open data portal and the usa fact data, which claims to draw its data from the former, is shown below. You can ignore the labels, as they mark the loosening and tightening of the local government regulations.
>>>>>>>
>>>>>>> If you see fit I can provide more information to add this as another issue with data consistency and credibility as well.
>>>>>>>
>>>>>
>>>
>>>
>>> --
>>> Tingting Tang
>>> Assistant Professor
>>> San Diego State University Imperial Valley
>>> Office: FOBE 110
>>> Phone: 760-768-5531
>>> 720 Heber Ave
>>> Calexico, CA 92231
>
>
_______________________________________________
Vp-reproduce-subgroup mailing list
Vp-reproduce-subgroup at lists.simtk.org
https://lists.simtk.org/mailman/listinfo/vp-reproduce-subgroup