[Vp-integration-subgroup] Another case study of data

Jacob Barhak jacob.barhak at gmail.com
Fri Apr 2 12:29:04 PDT 2021


Thanks Tingting,

The resources in your first link cannot be downloaded:
https://data.ca.gov/dataset/covid-19-time-series-metrics-by-county-and-state

I encounter a server error when I try. If you can send a direct link to the
spreadsheet data, it will help.

I tried to verify your data but could not download the files from the
California portal. I was able to download a CSV file from USAFacts using
this direct link:
https://static.usafacts.org/public/data/covid-19/covid_confirmed_usafacts.csv?_ga=2.136905904.1065342744.1617373914-598050905.1617373914

Unless I can download both files, I cannot verify the problem you
encountered. Perhaps the data curators figured out the issue and are fixing
it?

Mistakes can happen, yet usually those are dealt with through proper
announcements. I wonder if this is the case here.

If the source of data is not responding when complaints are raised, this
indicates a problem. Hopefully it is just temporary, and things get
resolved by the time we talk.

Regardless, thank you for pointing out the difficulties you had. It is
important that more people realize the day-to-day difficulties modelers
encounter.

              Jacob







On Fri, Apr 2, 2021 at 1:53 PM Tingting Tang <ttang2 at sdsu.edu> wrote:

> Hi, Jacob,
>
> For the three questions
>
>
> 1. Were those different infections / hospitalizations numbers?
>
> These are daily case numbers from different websites.
> 2. Can you be specific and send the exact link to the data you used? I saw
> many links in your first link.
> For the CA data open portal data, the source website is
> https://data.ca.gov/dataset/covid-19-time-series-metrics-by-county-and-state
>
> For the USAFacts data, the source website is
> https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/
>
>
> 3. Did you attempt to contact the sources of the data to figure out the
> reasons for discrepancies?
>
> I haven't contacted the source yet. As you mentioned, USAFacts claims its
> data comes from local government, but states that discrepancies are possible
> due to update frequency. My main concern is that the level of the
> discrepancy is surprising.
> In addition, similar behaviour exists in other sources as well. In
> particular, at the local government website (
> https://www.icphd.org/health-information-and-resources/healthy-facts/covid-19/covid-19-data/)
> the data is inconsistent with itself as it is being updated. I have
> tried to contact them, but haven't heard back so far. We can chat
> more on that as well.
>
> Thanks,
> Tingting
>
> On Fri, Apr 2, 2021 at 7:40 AM Jacob Barhak <jacob.barhak at gmail.com>
> wrote:
>
>> Thanks Tingting,
>>
>> Your email is about data consistency in another location, not necessarily
>> about Singapore data - so I started another email thread.
>>
>> Just to clarify to the readers, you found 2 data sources with different
>> numbers.
>>
>> Let us examine the issue here and I have a few questions:
>>
>> 1. Were those different infections / hospitalizations numbers?
>>
>> 2. Can you be specific and send the exact link to the data you used? I
>> saw many links in your first link.
>>
>> 3. Did you attempt to contact the sources of the data to figure out the
>> reasons for discrepancies?
>>
>> The USA facts website states:
>> "they may not reflect the exact numbers reported state and local
>> government organizations"
>>
>> So perhaps you just stumbled on some data that will be fixed later.
>>
>> I am being cautious before jumping to conclusions. This has to be
>> studied in more detail to reach conclusions. However, I see your point that
>> the data consistency issue is confusing at the least.
>>
>> I will set up a time to meet by private email.
>>
>> Thank you for drawing our attention to another case of potential data
>> issues.
>>
>>            Jacob
>>
>>
>>
>>
>>
>> On Fri, Apr 2, 2021 at 12:56 AM Tingting Tang <ttang2 at sdsu.edu> wrote:
>>
>>> Hi, Jacob,
>>>
>>> I created this figure using the data from the websites I mentioned. They
>>> are numbers of new cases per day reported by these websites. I also noticed
>>> that different websites sometimes have different meanings for "daily new
>>> cases", which makes the matter even more confusing. The following website
>>> contains this image:
>>> https://www.notion.so/Two-websites-with-consistent-data-where-one-draw-from-the-other-2e54d94d9d474c36837cb48327963ba7
>>>
>>> I'd be happy to have a video chat sometime about the credibility of data.
>>>
>>> Thanks,
>>> Tingting
>>>
>>> On Thu, Apr 1, 2021 at 9:02 PM Jacob Barhak <jacob.barhak at gmail.com>
>>> wrote:
>>>
>>>> Hi Tingting,
>>>>
>>>> Did you create those plots?
>>>>
>>>> It would be very interesting to start another discussion topic at the
>>>> credibility mailing list and see how many more people noticed differences
>>>> between data sources.
>>>>
>>>> However, the mailing list will reject archiving images and large files -
>>>> it's an old mailing list tool we are using.
>>>>
>>>> Nevertheless, if you have a link to this image stored somewhere
>>>> accessible, like Google Drive, it would be nice to share your
>>>> experience with the working group.
>>>>
>>>> I was looking at your plot and data sources and was wondering if you
>>>> are showing hospitalisation data or diagnosed data?
>>>>
>>>> It seems that data needs interpretation - Lucas and I are working on
>>>> this aspect - if you are interested you can join the effort - I am
>>>> looking for experts to interpret data from a human perspective to add to
>>>> models. If this interests you, let me know and we will schedule a video
>>>> call so I can better explain.
>>>>
>>>> Meanwhile, thank you for your email and it will be nice if you share
>>>> this with the entire group.
>>>>
>>>>              Jacob
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Apr 1, 2021 at 6:53 PM Tingting Tang <ttang2 at sdsu.edu> wrote:
>>>>
>>>>> Hi, Jacob,
>>>>>
>>>>> This example prompts me to raise the credibility of the data sources of
>>>>> some websites I have been watching. In particular, I have been checking the
>>>>> COVID tracking data for Imperial County, CA, for over a month at different
>>>>> websites: local government (icphd.com), USAFacts, 1point3acres.com,
>>>>> California Open Data Portal (
>>>>> https://data.chhs.ca.gov/dataset/covid-19-hospital-data), etc.
>>>>>
>>>>> There seems to be quite a bit of inconsistency among these data sources
>>>>> in case reporting. A quick comparison between the California Open Data
>>>>> Portal data and the USAFacts data, which claims to draw from the former,
>>>>> is shown below. You can ignore the labels, as they mark the loosening
>>>>> and tightening of local government regulations.
>>>>>
>>>>> If you see fit I can provide more information to add this as another
>>>>> issue with data consistency and credibility as well.
>>>>>
>>>>>
>>>
>
> --
> Tingting Tang
> Assistant Professor
> San Diego State University Imperial Valley
> Office: FOBE 110
> Phone: 760-768-5531
> 720 Heber Ave
> Calexico, CA 92231
>