Monday, 11 December 2017

On the issue of data verification

My last couple of posts on data verification and LERCs have generated useful threads on the NFBR Facebook group. Writing these entries is a way of teasing out issues that I hope will get people thinking, and maybe get the shakers and movers in NFBR and the LERCs to think in a somewhat broader manner.

I'm sure there are plenty of people not directly involved in the LERC/NBN network of professionals who are unaware of the relationship between the two, and between them and the recording schemes. That is probably especially true for the many specialist recorders who are mainly interested in their subject area and not in what happens to their records (apart from making them available in as convenient a manner as possible - to them).

I recall in the old ISR days that Pete Kirby once totted up all of the recording schemes that he would have to contact in order to submit his records on a yearly basis. It ran into several dozen (Pete is one of the world's great polymaths). The same holds if you spread your wings and record across many counties - having to split data up and submit directly to each LERC is a chore that many don't want, or that they find a disincentive to submitting data. My own excuse for not submitting data in this way is that it is such a big job that I would rather place my data directly with the NBN - but of course it is unverified! Most of it will be OK but there will be glitches amongst families with which I am less familiar.

So that brings me to the issue of how do you verify data at a gross scale? Unless you go through every single record and voucher specimen you can never be quite sure whether the data are trustworthy. And, even when you do, mistakes will occur. So, one needs a screening process based on whether somebody is known to the major specialists in the local area or nationally.

I would split records from a single individual into a hierarchy:
  1. Groups with which they are most familiar and therefore least likely to make mistakes
  2. Groups that they look at intermittently and are more likely to make mistakes
  3. Groups that they rarely look at and probably only note if something has caught their eye (e.g. something charismatic and unusual).
At a gross scale the main problem area is likely to be those groups that they look at only intermittently - this is therefore where I would look to see whether the records fit with what I know about the species' biology, biogeography and phenology. In my own case, I would like to think my Syrphid records are reliable (ish), but I would definitely not accept my Tachinid records unless they are either of the very abundant and obvious species or supported by a voucher specimen. For families I rarely do much with, I hold fairly extensive voucher series that are being steadily munched by 'the beetle'.
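As a rough illustration of how that hierarchy might be applied when screening one person's records, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the tier constants, the `Record` class, the `needs_checking` function and the placeholder species names are hypothetical, and applying the same voucher-or-obvious-species rule to both the intermittent and the rarely-recorded groups is a simplification of my own.

```python
from dataclasses import dataclass

# Hypothetical familiarity tiers, mirroring the three-level hierarchy.
CORE, INTERMITTENT, RARE = 1, 2, 3


@dataclass
class Record:
    species: str
    family: str
    has_voucher: bool = False


def needs_checking(record, familiarity, obvious_species=frozenset()):
    """Return True if a record should be checked before it is accepted.

    familiarity maps a family name to the recorder's tier for that family;
    families not listed are treated as rarely recorded (an assumption).
    obvious_species is a set of abundant, unmistakable species.
    """
    tier = familiarity.get(record.family, RARE)
    if tier == CORE:
        # Groups the recorder knows best: accept at this gross-scale screen.
        return False
    # Outside the core groups, accept only vouchered or obvious records.
    return not (record.has_voucher or record.species in obvious_species)


# Placeholder names only; not a statement about any real species.
my_tiers = {"Syrphidae": CORE, "Tachinidae": INTERMITTENT}
records = [
    Record("syrphid species A", "Syrphidae"),
    Record("tachinid species B", "Tachinidae"),
    Record("tachinid species C", "Tachinidae", has_voucher=True),
]
to_check = [r.species for r in records if needs_checking(r, my_tiers)]
```

Records that come out of such a screen would still need a person to compare them against known biology, biogeography and phenology; the sketch only decides where to look first.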

So, where can the LERCs start when they get a set of diaries from a recorder?

I would start with 'what do we know about this person?' If they are a well-known and respected specialist then the chances are that their records are reliable and can be entered without too much concern provided they don't stray too far from the areas where they have known expertise.

But, if they are an unknown entity then you perhaps need to check further before entering records. If you don't know them, do the shakers and movers in the groups they mainly covered know them and have a feel for their reliability? For example, I have on several occasions happened across people I did not know but had heard of, in the field making snap identifications of species that cannot be identified without microscopic examination. They produce lots of data and a LERC might think they were a real expert, but in reality a great deal of their data is junk!

One of the critical issues with data is that the entries made by the recorder might have been fine at the time they made their record, but the taxonomy has moved on - and were they aware of it? If they worked in isolation, were unknown to other specialists and never corresponded with others then I would raise a big question mark. If their records were made before a major split and they died before that split occurred, you cannot be at all sure what they actually recorded. Some very simple early validation is therefore possible - are the data taxonomically up to date? If not, seek the advice of recording scheme organisers.
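That early validation step could be as simple as comparing each record's year against the year in which the name it uses was split. The sketch below assumes somebody has already compiled such a lookup (here called `split_years`, a hypothetical name, as are the function and the placeholder taxon names); it does not attempt to decide what was actually recorded, only which records to refer to the scheme organisers.

```python
def flag_pre_split_records(records, split_years):
    """Return the records made under a name before that name was split.

    records is a list of (name, year) pairs; split_years maps a name to
    the year of its split. Names missing from split_years are assumed
    to be taxonomically stable, which will not always be true.
    """
    flagged = []
    for name, year in records:
        split = split_years.get(name)
        if split is not None and year < split:
            flagged.append((name, year))
    return flagged


# Placeholder names and years, for illustration only.
split_years = {"species X": 1990}
diary = [("species X", 1975), ("species X", 2001), ("species Y", 1960)]
to_refer = flag_pre_split_records(diary, split_years)
```

Only the 1975 entry is flagged here: it predates the split, so the name could refer to any of the segregate species.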

So, the next point must be this: it is fine, and very necessary, to have volunteers entering data, but before those data are passed to the volunteer (or paid staff member) for entry, somebody needs to evaluate the likely reliability of the data and the problems that might be encountered. The older the data, the more likely they are to have glitches, either because the taxonomy has moved on, or because the keys were more challenging and open to misinterpretation. As such, these data should be flagged as requiring verification BEFORE being made publicly available.

Thus, we hit the usual problem - there is a vast backlog of data to assimilate, and a very small number of people who can make a reliable judgment as to the veracity of the records, so there is a need to establish screening and prioritisation. Digitising data from a national figure must take priority over the scrappy notebooks of somebody who was unknown, worked alone and may or may not have produced reliable records. If in doubt, contact the national schemes to see what they know about somebody and whether they can point to potential problem areas.
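A prioritisation of this kind can be expressed as a simple ordering of the backlog. The following is only a sketch under stated assumptions: the `Dataset` class, its two yes/no fields and the three priority levels are hypothetical stand-ins for what would in practice be a judgment made after talking to the national schemes.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    recorder: str
    known_nationally: bool = False
    known_locally: bool = False


def priority(dataset):
    """Lower number means digitise sooner."""
    if dataset.known_nationally:
        return 0
    if dataset.known_locally:
        return 1
    # Unknown recorder: park until the national schemes have been asked.
    return 2


def digitisation_queue(backlog):
    # sorted() is stable, so datasets of equal priority keep their order.
    return sorted(backlog, key=priority)


# Placeholder recorders, for illustration only.
backlog = [
    Dataset("unknown lone worker"),
    Dataset("national figure", known_nationally=True),
    Dataset("county stalwart", known_locally=True),
]
queue = [d.recorder for d in digitisation_queue(backlog)]
```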
