A critical appraisal of a new paper on “big data and the future of ecology”

By Joern Fischer

“Simply put, the era of data-intensive science is here. Those who step up to address major environmental challenges will leverage their expertise by leveraging their data. Those who do not run the risk of becoming scientifically irrelevant.”

Hampton et al., 2013, Frontiers in Ecology and the Environment (p. 158)

I just read Hampton et al.’s new paper in Frontiers in Ecology and the Environment, entitled “Big data and the future of ecology”. In a nutshell, the paper encourages ecologists to share their data more routinely. The underlying premise is that data sharing will lead to bigger and better (or at least additional) insights, because there are large numbers of small datasets that – if widely shared – could be combined for more effective quantitative analyses. Other disciplines, according to Hampton et al., are ahead of ecology in sharing their data – among ecologists, only geneticists share their data widely (partly because they have to), while many others don’t.

Several journals have now made it a requirement to share data (unless there are strong reasons why you can’t), e.g. Proceedings of the Royal Society London B and Journal of Applied Ecology. What’s going on here? Is this an obvious case – so much more could be gained if only we all had access to more data?

That, it seems, is what Hampton et al. genuinely believe. They suggest there are four things we ought to do:

  1. Organise and preserve our data for posterity, no matter how small the dataset, including appropriate meta-data.
  2. Share data through publicly accessible databases.
  3. Collaborate in networks where data are shared, e.g. to combine the insights of multiple case studies.
  4. Address issues of data management with students and junior researchers in your labs.

I immediately agree with points 3 and 4. One of my recent posts on this blog was about the PECS network, for example – which is exactly the kind of thing raised in point 3. It is a network of people who each do local-scale studies, but would like to see their findings synthesized in a useful way.

I kind of don’t have much of a problem with point 1, but I’m not terribly convinced about point 2. I see the following issues with a generic “you ought to share your data”:

  • I think there is a misunderstanding that “big data” is what is needed to solve today’s problems. From data, we need to get to information that is usable; from there to analysis and insight; and from there to wise societal decisions. I would argue that if there is one problem we DON’T have in our modern world, it’s a lack of data! I would argue the opposite in fact: that the ever-increasing availability of data is blinding us to the real problems. It looks as if additional data would somehow help – it’s an enticing prospect to have all this data! Wow! But as I argued in “Human behavior and sustainability” (also in Frontiers), a lack of data, information, or knowledge is not the problem for sustainability. We know well enough what we ought to be doing; we lack the means of putting our knowledge (based on information, based on data) into action.
  • I think there is a serious risk that data is misinterpreted if used by others who are NOT explicitly chosen collaborators in a network. This is not a matter of meta-data. It’s a matter of ecological field data coming from places, and being appropriately understood only if one understands the place. That is why Discussion sections of journal articles aren’t auto-generated once you have written the Results, but require (subjective!) expertise. Meta-analyses channel our focus towards questions that can be asked, not towards questions that must be asked. There is a real risk that we search for universal truths across study systems, at the price of glossing over local details that are fundamentally important. A simple example is what constitutes a “patch”. This is assessed differently in different parts of the world. Just using people’s data on “patches” could lead to serious misinterpretations about many things, including patch-size-effects (for example). I am critical of many existing meta-analyses for this reason already – having all data available to everyone, to my mind, will simply increase this trend away from deep, locally based ecological knowledge.
  • Following on from the previous point, what happened to the argument by Lindenmayer and Likens on losing the culture of ecology? Ecology is about places, just like geography and anthropology are. Good ecologists go into the field and learn about life there; they develop an ecological intuition, which is the only way to stop them from writing nonsense in their Discussion sections. I am deeply concerned that a trend towards yet more data will even further erode the field-based culture of ecology. Yet more PhD students will make their careers out of modeling, rather than going into the field.
  • Finally, this raises important ethical issues. Modeling experts will then “own” top journals like Ecology Letters, and (I hope not) Frontiers in Ecology and the Environment. But none of those will be field ecologists!… those, in the meantime, have to publish their work in “regional journals” (i.e. not widely read ones) because their stuff is less relevant. Basically, they had to spend months in the field for someone else to get a free ride out of it in a more esteemed journal.

I’m all for addressing big questions. I’m all for synthesis, though I believe much of the societally relevant stuff will be qualitative, not quantitative. I’m all for sharing data with the right people for the right reason – but I do not believe that universal sharing is a safe recipe for a better science of ecology, nor do I believe that a lack of data is in fact the primary problem we face today. And universal sharing does carry risks: data being used wrongly by others, and some people taking a free ride on the backs of field ecologists.

Big data? Sure, it can be a part of what ecology does, too. But I found that Hampton et al. were far too one-sided about this issue, essentially seeing no downsides or limitations.

Finally … (deep breath), this is an issue I may yet change my mind on. For now, I don’t buy the arguments put forward, but undoubtedly I will be confronted with this over the next few years again and again (say because I want to publish in one of the journals requiring data sharing!) … so who knows, I may yet change my mind. It’s worth putting the issue on the table, and Hampton et al. have done that nicely. As I said, some of their points and conclusions I agree with – but some I don’t, and so overall, I’m a lot less enthusiastic about big data than they are. According to the quote above, my skepticism towards big data will render me scientifically irrelevant in the near future… I can’t wait.

I’d be really interested in other people’s comments on this!

22 thoughts on “A critical appraisal of a new paper on “big data and the future of ecology”

  1. Hi Joern – I completely agree with you, thanks for this post and the nice way of putting the problem. I favor fieldwork, reflection, local identity (ecological, cultural), embeddedness in the social-ecological system and so on – as you and others do. This is even healthy (good to be in touch with people, nature…)! Large – peta, mega and beyond – datasets are welcome, we need them of course. But I am afraid that this will be a fashion. The goal is not to analyse and re-analyse data. The goal is to be wiser and help this world somehow. We will see what happens.

  2. Hi Joern, thanks for another interesting viewpoint. There was a really interesting debate on Twitter last week on the ethics of authorship on meta-analysis papers. The discussion covered a small part of the issues you raise but is interesting and relevant nonetheless. Brett Favaro did a great job collating all of the tweets and posting them into a couple of story boards on Storify. The discussion can be seen at this link (readers don’t have to be on twitter to view it): http://storify.com/brettfavaro/authorship-in-the-era-of-big-data Cheers Ian

  3. I think the issue of data sharing and its growing emphasis reflects the changing culture that you’ve mentioned in another way; moving from carefully designed or stratified experiments (including natural experiments, observational studies etc.) to dredging enormous datasets for trends. This move means that the intellectual property involved in experimental design is less appreciated/overlooked, particularly by people who don’t design or undertake experiments (particularly large-scale field experiments) themselves.

    • I always try to look to the deeper levels (which also means that my interpretations may be wrong or useless too): the growing emphasis may mirror a broader and increasingly real phenomenon. That is, the researcher population is going up while the possibility for fieldwork goes down. In many cases (e.g. the corncrake, black stork, big trees, yellow-bellied toad etc.) there are fewer individuals in so-called nature than ecologists to study them.

      It is like with predator and prey: researcher brains are data predators. We need data, information, to exist. Nature cannot support our growing population with information anymore. So the published data start to represent a new feeding substrate for us: we adapt to exploring them with new analytical mechanisms, new ethics, new norms etc. At the same time the last ‘pristine’ regions of the planet are ‘devoured’ by us (the data predators).

      The story seems to be not very different from any natural predator-prey system.

      If I am right, then there are deeper forces shaping our behaviour. And we cannot do much to stop it. That is why I said that my observations unfortunately are useless. But maybe they are interesting to think about over a morning coffee and then just forget?:))

      • We are now creating that ethical/methodological ground and framework to justify our existence as data re-analyzers. This situation will clearly be favoured by some people (e.g. those who have virtually no access to nature etc.). I will not favor it – many protected species are basically in my yard in Romania so I am (and want to be) ‘busy’ with studying them and I am grateful for being able to be a semi-classical field ecologist. Two opposing realities. Who will win? Is there any chance for a middle way in this growing imbalance?

      • but the question is, will you and should you share all your data with people, including ones you don’t know what they will do with it?

      • What deeper forces shape our behavior that we cannot do much to stop? Very little of our behavior consists of fixed-action-patterns; we have choices (insofar as one believes choices exist–http://iamj.blogspot.com/2011/03/placeholder-post-free-will-is-undefined.html) and “shaping” behavior is a far cry from “determining” behavior.

        Besides which, the analogy with predator-prey systems is interesting but, I think, not the best model. Predation is consumptive; data use probably is consumptive to some extent, but nowhere near the extent of predation. I deeply suspect this fundamentally changes the dynamics, even though similar issues may, at various points, adhere.

        “The price of metaphor is eternal vigilance” 😉

  4. It is telling that the paper is completely authored by scientists from the United States – a country that has had a long tradition of relatively “open” access to multiple data sets via either citizen science enterprises, or via governmental regulation (since the vast majority of budget for ecological work is through federal and state funding). Once >30 years of data has been accumulated in several facets of ecology, I suppose it gets easier to begin pushing for meta-data analyses. In most tropical and sub-tropical countries, we are rather far from understanding even very basic aspects (e.g. distribution of a species), let alone have information suitable for meta-analyses as suggested by Hampton et al. Given the enormous difficulties in funding for projects focusing on pure ecology and natural history, it is natural that most scientists here would not even consider putting up their data for free anywhere.

    I agree with Joern on the point of data not being amenable to analyses by just anyone. Context and ability to interpret correctly is very important, and a meta-data bank does very little to provide these additional, but crucial, aspects.

    The problem of ecologists becoming more comfortable with open-access software and vast data sets is sadly growing rapidly everywhere. Even in countries like India (where I work) that have very little information for most things, students now send out emails requesting all sorts of information for their dissertation work. This trend, combined with the reality that there is indeed a huge paucity of data in a large number of landscapes, is not greatly aided by calls for sharing data, and celebrating meta-data analyses.

    (The comments are not to be construed to mean that I am unaware of the benefits of meta-data analyses.)

    Thanks for a thoughtful blog Joern, and for flagging this subject/paper.

    Gopi Sundar.

  5. Ecological modelling does indeed have celebrity status at present. I have no problem with that per se. However, from my field experience in South Africa, many of these maps, richly layered in beautiful colours, may easily oversimplify a landscape. I believe oversimplification is a relic of pooling data out of context. You mentioned that the vagaries of sampling methods in ecological research are well known, and I agree that this alone is cause for concern.

    Maybe off the topic, I also experience that many land-users (for example farmers in production landscapes) are more interested in the 3-D organism, and how it responds in the landscape, than the 2-D map. However, they are complementary.

    Render field ecologists scientifically irrelevant? I will still sit and silently observe our rich natural heritage, and chances are someone in the area will be interested to learn more.

    Thank you, Joern, for a highly thought-provoking blog.

    • Thanks for the comment! I guess, to be fair to the original paper I criticised — … they didn’t say field ecology was “bad”, just that we should be sharing our data so that big modelling exercises become possible. And, well, yes, they did imply that that was the way forward … Anyway, thanks for your thoughts and the discussion!

      • hi Joern – I respond here to your question: ‘But the question is, will you and should you share all your data with people, including ones you don’t know what they will do with it?’

        It will depend. My current obligation is to publish papers: I am a researcher, and the paper is the outcome my funders want to see. In the past I had a contract that obliged me not to share data, though I could publish with an acknowledgement. So, I will not share those data. I have already shared data for a meta-analysis. In that specific case (the researcher being outside of Europe!) they told me exactly what they wanted to do with the data. I shared my data with them because I am indeed curious how ‘my landscapes’ look in a global context. It is nice. They have my paper so they can read my local wisdom from the discussion – and I hope they do that. One co-author was not happy with this data sharing but basically let me do what I wanted. If more had been unhappy, I would of course not have shared the data. If I did not know the person, I would probably not share my data (in my previous case, it was a landscape ecology VIP and I respected and respect his/her work very much).

        Regarding misinterpretation and the need for local feeling, embeddedness etc.: the referees, even for regular papers, would anyway be so far away from some local realities that they will look at fancy words, fancy framing and stats, and are unlikely to look at the shining local wisdom. In this way good-looking papers can be published which at local scale (i.e. according to local wisdom and knowledge) are perceived as useless, or as interesting but not explosive, not new.

        I just hope this big data thing will not be a fashion (I am afraid it will be – my concerns are coded in my comments above) because if it is, then an old story will repeat in a new dress, and nothing new will happen under the sun.

        Nice greetings,
        Tibi

  6. I’m delighted that our paper sparked such a lively conversation! I was in the field without much internet connectivity when it started. The concerns expressed about sharing data are not uncommon, and I empathize with most (but not all) of them. A few points…

    First, if you missed it, there was a paper by Cliff Duke and John Porter that makes some recommendations about co-authorship and ethics in data re-use – they posted an open access version here: http://www.vcrlter.virginia.edu/~jhp7e/reprints/bio201363610_ProfessionalBiologist_Duke.pdf

    It’s up to us as a scientific community to build a culture around data re-use, a culture that has well-established ethics that we can all live with.

    Second, I’ve been pretty outspoken over the years about the need for biologists to maintain a deep understanding of their natural history even as they improve their computing and quantitative skills. The most prolific “data synthesis scientists” I know are avid field biologists and experimentalists.

    And third, it’s true that this group of authors is North American…I agree that there are cultural and contextual differences that make the vision of open access more difficult to achieve in other regions of the world.

    Thanks, Joern, for opening this discussion! I really appreciate everyone’s thoughtful contributions in this thread!

  7. Hi Stephanie,

    Thanks very much for responding — and for doing so in such a graceful way. Exchanges like this are one of the main things that make blogging worthwhile!

    I had indeed missed the paper on the ethics of data sharing, so thanks for highlighting it. I will read it with great interest. (Turns out it’s the second important BioScience paper I learnt about in a week that I missed — but I’m learning … I have just created table of contents alerts for BioScience, which somehow had slipped my attention previously …)

    I do think there’s a little bit more to this though that needs a bit more thought. I sympathise with many of the arguments as to why one ought to share data — but I’m typically in the position of someone being asked for data, not of someone benefiting greatly from the resulting synthesis. Depending on your personal experiences with this issue, I guess biases can go either way (as so often).

    So, personally, my key concerns are that:
    – data ends up used in “bad ways” because people do not truly understand the context of the study system;
    – “big data” provides an illusion that we can now synthesise all kinds of wonderful things with great use to humanity, when really, retro-fitting data to questions that the studies were not originally designed for often does not work elegantly, and many problems have nothing to do with a lack of data;
    – unlike in medicine, we have few controlled experiments where strict “meta-analysis” is possible;
    – the signals are clear: if you model global food systems, pollination systems, you-name-it systems, you end up in Ecology Letters and Nature, whereas if you actually understand an ecosystem deeply, you end up in Biological Conservation at best. There is a complete mismatch in incentives, and uncritical data sharing will further exacerbate this trend — people will try to mine data wherever they can, whether it’s useful or not.

    In summary, I believe that not all things are meant to be of use for purposes they were not designed for. Categorically assuming that all things ought to be of potential use to someone, somewhere, is, to my mind, not useful because it distracts from many real-world problems (which are regional and unrelated to a lack of data), and exacerbates disincentives to do local-scale fieldwork.

    So, for now, I disagree, but I do agree that it is worthwhile to have a discussion about this — it’s a new issue, and for all of us this means our views may yet reach a stable equilibrium!

    Thanks again for your engagement and for sparking this discussion with your paper,

    Joern

  8. Joern, thanks for a well-written and provocative position. As you say in the comments, these kinds of discussions really let these issues come to life. I think you raise some important points that clearly resonate with others.

    Personally I feel like there is a bit of a false dichotomy being drawn. I don’t see the position of Stephanie and co-authors as proposing a particularly radical change but rather one that most ecologists from a half century ago would easily recognize – data is a part of the publication. Likewise most journals require this already. Somewhere along the way, when data got too long to fit in a table in a printed journal, we stopped providing it. Would it be okay if I left out a description of my modeling or statistical method for fear that others would steal or misuse it?

    I may have misunderstood some of your points, so to address them one-by-one:

    1. “We don’t need more data”: Surely you can’t believe that. You, Stephanie, and others all eloquently argue for the need for field biologists going out and collecting more data.

    2. “Misuse”: sure, but this is a risk of publishing too, isn’t it (see the lively debate in the pages of Nature over whether the H1N1 mutants paper could be published)? Or perhaps a more obvious example, a risk of publishing mathematical or statistical methods, whose misuse is widely bemoaned. I don’t see limiting access as the solution to misuse; rather, the benefits must outweigh the costs. Sure, restrict use and you’ll always decrease misuse too. But does misuse of the data you have in mind really justify the lost potential use? (It didn’t in the case of the deadly infectious virus…)

    3. “Ethics / Modelers dominating journals”. Of course field biologists perform interesting analyses; they don’t just collect data. Actually I think they are in a better position to capitalize on improved access to data than most modelers, by drawing on their familiarity with the context and challenges of real data. Peruse recent issues of Theoretical Ecology, Journal of Population Biology or Journal of Mathematical Biology and I think you’ll still find most modeling papers have only a cursory connection to data, if any, and instead proceed by elucidating logical arguments through mathematics. Meanwhile the analyses of most field researchers are far from trivial, and their insights go far beyond just providing data. So what do they have to lose?

    I see Hampton et al’s position as incremental to the scientific framework, just a matter of scale. Science is awesome, comes up with awesome stuff. But just as it would be silly and slow down science if we avoided distributing publications in a digital format native to the way we access and read information today, it is silly not to distribute data in an intelligent digital format that we actually access and use rather than in a manner that is a historical artefact. Sure, either way _works_, but the digital solution is faster and scales to more researchers. Yeah, I’m sure it has downsides – so does digital publishing replacing print journals. But I think it’s a pretty clear case that the benefits outweigh the costs.

    Sorry for the long reply. In short, I see this as a set of best practices that will make existing science run more smoothly. As datasets gain more recognition as independent entities from papers (because the data doesn’t fit in the paper, so it’s become a separate entity already), I think data producers are likely to benefit, as some evidence already shows (genome papers?). The same arguments could be made of software, which captures methodology that has outgrown the printed paper. But the basic idea would be familiar to Darwin or Galton or MacArthur.

    • Thanks Carl! Before getting bogged down in disagreements: we agree that discussing this is worthwhile. I think many of these disagreements arise not because one thing is right and one is wrong, but because people’s positions and experiences differ, and therefore so do the arguments that appear salient to them.

      It’s hard to respond to this stuff without everything getting longer and longer and more and more confused, so I’ll try to be precise.

      1. We don’t need more data. This would require, first, a definition of what the objective is. Mine is advancing sustainability. For that, I think we need a lot of insight, we need to close the knowing-doing gap, and we need synthesis and integration between disciplines and between science and the rest of the world. I value data mostly in a problem-solving context, less so in a discovery context. (I’m not saying here that discovery is worse, but my personal bias partly explains my position, I think.) What honestly concerns me is that we may get lost in an illusion that more data is the *primary* thing that we need to solve sustainability or conservation problems, for example, that we need a better evidence base, etc. I think that is just not true. Years ago, Ecology Letters rejected the review David Lindenmayer and I had written on fragmentation, partly because it contained no data synthesis. It was, instead, a qualitative synthesis, and in hindsight, apparently worthwhile enough for it to get cited once a week or so (now published in Gl Ecol and Biogeog). So, there’s nothing wrong with quantitative synthesis or meta-analysis, or evidence, but I take issue with this kind of analysis displacing everything else. Take Stephanie’s quote I used above, that people who don’t share data render themselves irrelevant. That to me is provocative, but entirely over the top. Big ideas and qualitative synthesis shape the world just as much as (or more than) big data does. So my issue is not with big data, but with over-emphasising big data at the expense of other activities.

      2. Misuse. I think this issue is real. As a field ecologist (not lab ecologist), I believe I can interpret my data well because I have field intuition, not just data. Others will have the data, not the intuition. Again, I do not say that nobody can ever make use of data they have not collected themselves. But to assume the opposite — that all data is useful and that no expert knowledge or intuition is needed to analyse it — misses huge amounts of what ecology is inevitably about.

      3. Given my belief in good ecologists needing to be grounded in field reality, I do take issue with more and more statistical or mathematical modelling getting grants, and getting published in “top” journals. It creates incentives that are unhelpful for the next generation of ecologists. There are now many people who are “good at R” without being particularly good at either ecology or statistics. This is not a direct consequence of big data, but again, over-emphasising how wonderful big data is will further exacerbate this trend.

      Don’t get me wrong: I believe in data sharing. Just not blindly, with anyone, for anything, in some kind of blind belief that this somehow adds tremendous value to what we do. I believe in qualitative synthesis just as much as in quantitative synthesis, and I believe in good regional-scale work just as much (and quite possibly more) as in continental scale data. All of this is because of my bias towards wanting to solve real-world problems; and I believe those can often be tackled at regional scales.

      Synthesis of multiple regional scale case studies, then, is something I strongly favour — the PECS network I mention above is doing exactly that.

      So, in summary: I’m not against data sharing; but I am against making this compulsory. I disagree with you that there are obvious benefits. I think through the shifting incentives provided by emphasising big data above other things, it is not at all clear that the net benefits for conservation/sustainability will outweigh the net costs.

      Last comment: I think this dichotomy, while it may be less extreme than it can sound when one discusses it, is actually related to the one I mention in the post on changing values in conservation science (Aug 22nd 2013). I have no problems with data, but I do not believe that we should reduce real-world problems to “data” because there is a lot more to it.

      And as with all things actually worth thinking about, there is no need to reach agreement on everything – but it’s good to learn about what might cause disagreements!

  9. Pingback: Going big with data in ecology | Jabberwocky Ecology | Weecology's Blog

  10. Very well argued, I totally agree! I am also worried about modern ecology’s move towards prioritising big data modelling.
