Provenance and Trust

With today being Columbus Day (observed), I started thinking about all the myths surrounding this famous Italian. For example, was Columbus really Italian? Did you know that he did not set out to prove that the world was round? One might pose the question, “How can you discover a country when there are people living there?” For that reason, many folks have started to celebrate Native Americans’ Day. Now I am not here to tell you that your favorite school teacher, or even your sainted mom, lied to you. I am sure they gave you the facts as they knew them. The point is that the source of a piece of information, and its integrity, needs to be scrutinized before the information is accepted as fact.

Trust Me, I’m from the Government

When bloggers fail to gain a level of trust through linking back to original sources, you should not trust a word that is said. Anyone can write a blog for a multitude of reasons. Why should you trust these anonymous people? Why should you trust me? How exactly are you handling valuable information that you encounter in a blog whose source you may not know or be able to trust?

In the world of blogging, consider George Siemens’ distinction between collective intelligence and connective intelligence. Collective intelligence is “a form of intelligence that emerges from the collaboration and competition of many individuals”. George defines connective intelligence as “individual creation of information, ideas, and concepts which are then shared with others, connected, and re-created and extended based on the interaction.” George goes on to state, “simply, collective means blending together. Connective means connecting while retaining the original (though others may build on it in their own spaces).” Put another way, “the collective presents a melting pot of ideas. The connective represents a mosaic of ideas.” Collective intelligence, provided there are enough people telling the truth and setting the record straight, will wash out incorrect information. Connective intelligence, by retaining the original thought and its source, provides a degree of provenance and trust.

Concepts and Terminologies

The issue of provenance and trust is something security has been grappling with since the beginning. Some folks may be unfamiliar with the term “provenance.” The National Science Foundation defined provenance as:

Provenance refers to the knowledge that enables a piece of data to be interpreted correctly. It is the essential ingredient that ensures that users of data (for whom the data may or may not have been originally intended) understand the background of the data. This includes elements such as, who (person) or what (process) created the data, where it came from, how it was transformed, the assumptions made in generating it, and the processes used to modify it.
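To make the definition a little more concrete, here is a minimal sketch in Python of what a provenance record could hold. The class names, field names, and the sensor example are all my own assumptions for illustration; they are not part of the NSF definition or of any standard schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Transformation:
    """One change applied to the data after it was created."""
    process: str       # what was done, e.g. a unit conversion
    performed_by: str  # the person or program that did it


@dataclass
class ProvenanceRecord:
    """Hypothetical record holding the elements named in the definition."""
    created_by: str  # who (person) or what (process) created the data
    origin: str      # where the data came from
    assumptions: List[str] = field(default_factory=list)
    transformations: List[Transformation] = field(default_factory=list)

    def record_transformation(self, process: str, performed_by: str) -> None:
        """Append a transformation so the history stays in order."""
        self.transformations.append(Transformation(process, performed_by))

    def history(self) -> List[str]:
        """Return a readable, ordered account of how the data got here."""
        lines = [f"created by {self.created_by} from {self.origin}"]
        for step in self.transformations:
            lines.append(f"{step.process} by {step.performed_by}")
        return lines


# Illustrative values only.
record = ProvenanceRecord(
    created_by="buoy-17 sensor firmware",
    origin="hypothetical ocean sensor array",
    assumptions=["sensor calibrated before deployment"],
)
record.record_transformation("convert Fahrenheit to Celsius", "ingest script")
print("\n".join(record.history()))
```

The record travels with the data, so a later reader can see who created it, what was assumed, and what was done to it, without having to ask the original author.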

Tim Mather, Chief Security Strategist for RSA Conference, posted “More on Data Integrity” where he explains, “For the vast majority of data, whether structured or unstructured, data lineage is sufficient. For scientific data, however, provenance is often required. For example, exactly how were the testing results of that new drug compound derived?” Tim goes on to make the point:

By now, after four years of SOx (for many companies in the United States), practitioners have a good understanding of data lineage – tracing relevant financial data through various applications within scope of the audit within the enterprise (or through 3rd parties’ SAS 70 Type II audits where required). This includes getting answers to such questions as where did the data originate? Where was it processed, stored, etc.? However, for other uses of data, “simple” data lineage is not good enough. Some data requires further knowledge of its provenance (e.g., scientific data).

Scientific Research

It is interesting to take a brief look at some of the work being done in the scientific community where reliably reproducible results should be of paramount importance. Massive experiments are being carried out using computer systems with thousands of processors producing enormous amounts of data. This data needs to be captured, transported, stored, accessed, visualized and interpreted to extract knowledge. Jon Udell has written a post, “Trident: A workflow system for doing data-intensive science with reproducible results,” which discusses Trident. Trident is a “system for authoring, running, and tracking the provenance of scientific workflows — that is, sequences of computational steps that bridge the gap between the data produced by the Neptune sensor array and the COVE visualization system.” Roger Barga, a principal architect with Microsoft’s Technical Computing Initiative, describes Trident’s provenance capabilities as:

Think about it in terms of art. For a given piece of art, we’re able to establish through authorities that it’s original, where it came from, and who’s had their hands on it through its lifetime. Provenance for a workflow result is the same thing. Minimally we want to be able to establish trust in a result. If you think about how that happens, it often starts by considering who wrote the workflow. So with Trident you can click on a result and interrogate the history of the workflow: who wrote it, who reviewed it, who revised it, when it first entered the system.

We do versioning as well, so you can look at an old result and know that it was created by an old version of the workflow. And then have the ability to run the new version on the old dataset to see if it makes a difference.

We capture execution provenance so you know exactly how your result was created. We capture provenance on the workflows themselves so you know who created them, and who’s touched them.

You might be thinking about creating a community, where you click on a workflow and can say: “OK, I trust that post-doc.”
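The ideas Roger describes (who wrote the workflow, which version produced a result, and rerunning a new version on an old dataset) can be sketched in a few lines of Python. To be clear, this is not Trident's API. The class names, the `run` helper, the sample readings, and the `-999.0` sentinel value are all assumptions of mine for illustration.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class WorkflowVersion:
    name: str
    version: int
    author: str
    steps: Callable[[List[float]], float]  # the computation itself


@dataclass(frozen=True)
class RunRecord:
    """Execution provenance: which workflow version ran on which input."""
    workflow: str
    version: int
    author: str
    input_digest: str
    result: float


def digest(data: List[float]) -> str:
    """Fingerprint the input so two runs can be tied to the same dataset."""
    return hashlib.sha256(json.dumps(data).encode("utf-8")).hexdigest()


def run(workflow: WorkflowVersion, data: List[float]) -> RunRecord:
    """Run the workflow and keep the provenance alongside the result."""
    return RunRecord(workflow.name, workflow.version, workflow.author,
                     digest(data), workflow.steps(data))


def mean_all(readings: List[float]) -> float:
    return sum(readings) / len(readings)


def mean_valid(readings: List[float]) -> float:
    # Version 2 drops a hypothetical "missing reading" sentinel first.
    valid = [r for r in readings if r != -999.0]
    return sum(valid) / len(valid)


v1 = WorkflowVersion("mean_temperature", 1, "post-doc A", mean_all)
v2 = WorkflowVersion("mean_temperature", 2, "post-doc B", mean_valid)

readings = [10.0, 12.0, -999.0, 14.0]
old_run = run(v1, readings)  # the "old result"
new_run = run(v2, readings)  # the new version rerun on the old dataset
print(old_run.version, old_run.author, old_run.result)
print(new_run.version, new_run.author, new_run.result)
print("same input:", old_run.input_digest == new_run.input_digest)
```

Because each result carries its workflow version, author, and input fingerprint, you can tell that the two numbers differ because the workflow changed, not because the data did.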

In the area of networks, Wenchao Zhou, Eric Cronin, and Boon Thau Loo wrote the paper “Provenance-aware Secure Networks.” The paper examines network accountability and forensic analysis as a means of “performing network diagnostics, identifying malicious nodes, enforcing trust management policies, and imposing diverse billing over the Internet.” The paper:

  1. Shows how network accountability and forensic analysis can be posed generally as data provenance computations and queries over distributed streams.
  2. Proposes a taxonomy of data provenance along multiple axes, and shows that they map naturally to different use cases in networks.
  3. Suggests techniques to efficiently compute and store network provenance, and provides an initial performance evaluation on the P2 declarative networking system with modifications to support authenticated communication and provenance.
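To get a feel for what posing forensic analysis as a provenance query might look like, here is a toy sketch in Python. It is not the paper's technique and has nothing to do with the P2 system. The `derivations` table, the fact names, and the assumption that derivations never form a cycle are all mine.

```python
from typing import Dict, List, Set, Tuple

Fact = Tuple[str, ...]

# Hypothetical derivation log: each derived fact maps to the facts it was
# computed from. Facts that never appear as keys are base facts, i.e. what
# a node observed directly.
derivations: Dict[Fact, List[Fact]] = {
    ("route", "a", "c"): [("link", "a", "b"), ("route", "b", "c")],
    ("route", "b", "c"): [("link", "b", "c")],
}


def base_facts(fact: Fact, log: Dict[Fact, List[Fact]]) -> Set[Fact]:
    """Trace a fact back to the base facts it ultimately depends on.

    Assumes the derivation log is acyclic.
    """
    if fact not in log:
        return {fact}
    found: Set[Fact] = set()
    for parent in log[fact]:
        found |= base_facts(parent, log)
    return found


# "Why does node a believe it can reach c?"
print(sorted(base_facts(("route", "a", "c"), derivations)))
```

If one of those base facts turns out to have come from a malicious node, the same log tells you which derived routes are tainted by it.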

New Architectures

Let us examine how provenance and trust relate to the relatively new IT architectures, such as cloud computing. Just for a little background, and because I really like the video, below are a few IT leaders at Web 2.0 Expo doing a great job discussing what they think cloud computing is:

Tim makes the point in relation to IT architecture:

First, computing resources needed for scientific purposes are often huge, and yet infrequently used. What company wants to maintain enormous computing capabilities only to have such used infrequently? That’s simply not cost efficient. So effectively ‘renting’ computing capabilities (e.g., from Amazon’s Elastic Computing Cloud – EC2) can be much more cost efficient. (Of course, this is the same usage model employed by national supercomputer centers for years – timesharing.)

The article “Data provenance in SOA: security, reliability, and integrity” adds some additional insight into provenance and security. The article states, “consider data provenance, which concerns security, reliability, and integrity of data as they are being routed in the system…In an SOA system, however, one also needs to consider origins and routes of data and their impact, i.e., data provenance.” Consider that SOA is just an architecture in which you basically operate much as you would in a distributed computing system. In the end, it is all about the data, so the same points apply to any distributed environment.

Let us return to the issue of trust. Multiple factors may affect the trustworthiness of data. The whole Internet is grappling with this idea and with how to assign trust scores to both data and data providers. The paper “An Approach to Evaluate Data Trustworthiness Based on Data Provenance” proposes a “data provenance trust model which takes into account various factors that may affect the trustworthiness and, based on these factors, assigns trust scores to both data and data providers. Such trust scores represent key information based on which data users may decide whether to use the data and for what purposes.”
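To illustrate the general idea of scoring both data and data providers, here is a deliberately simple sketch in Python. It is not the model from the paper. The equal weighting, the function names, and the corroboration counts are all assumptions of mine, and scores are taken to lie between 0 and 1.

```python
from typing import List


def item_score(source_trust: float, corroborations: int, contradictions: int) -> float:
    """Score one data item from its provider's trust and how often other
    sources agree with it. The 50/50 weighting is an arbitrary assumption."""
    total = corroborations + contradictions
    if total == 0:
        # Nothing to compare against, so fall back on the provider alone.
        return source_trust
    agreement = corroborations / total
    return 0.5 * source_trust + 0.5 * agreement


def provider_score(item_scores: List[float]) -> float:
    """Score a provider as the average score of the items it supplied."""
    return sum(item_scores) / len(item_scores)


# A provider we currently trust at 0.8 supplies two items.
first = item_score(0.8, corroborations=3, contradictions=1)
second = item_score(0.8, corroborations=0, contradictions=0)
print(first, second, provider_score([first, second]))
```

The feedback loop is the interesting part: a provider's score shapes how its data is scored, and how that data holds up shapes the provider's score the next time around.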

The article “On Homeland Security and the Semantic Web: A Provenance and Trust Aware Inference Framework” takes a different approach, one that attempts to discover and evaluate semantic associations among information provided by many different sources. The paper describes “how trust and provenance can be represented/obtained in the Semantic Web and then be used to evaluate trustworthiness of discovered semantic association and to make discovery process effective and efficient.”

Final Thoughts

In this post we have discussed the ideas of provenance and trust. Everything old is new again. New IT architectures were related to these basic ideas to demonstrate that no matter how cutting edge the IT ideas might be, everything gets back to the basic concept of trust. One cannot trust any information unless one knows who or what created the data, where it came from, how it was transformed, what assumptions were made in generating it, and what processes were used to modify it.

Walter Dykas, senior researcher at the Oak Ridge National Laboratory (ORNL), recently said to me:

Security comes down to protecting your infrastructure. To do so, you must:

  1. Enforce access rights at the lowest level possible.
  2. Secure the trust infrastructure.
  3. Implement assurance verification.

You can never have (1) without (2). If you have (2), (1) will follow. Finally, (3) is just watching the watchers. ‘Trust infrastructure’ is broad enough to cover technology and people. For example, an organization must have infrastructure for trusted communications and authorization, which again infers technology and people.

The Open Group’s Jericho Forum agrees with Walter (see June’s Security Roundtable podcast for a good discussion on the group). The Jericho Forum argues that traditional network boundaries are disappearing in favor of complex online interrelationships that require more innovative security approaches. Deb Radcliff, in her article “Information AND network protection: Finding the right mix,” explains how the group “advocates assigning priorities to data, focusing on the most critical areas, and applying secure communications and encryption around these classified resources.” Steven Bellovin, professor of computer science at Columbia University and co-creator of the Usenet online discussion system, summed it up this way: “We need to think about the problem in a different way because what we’re doing [with perimeter protections] isn’t working. What we need is a more data-centric architecture with strong protections around the important data because security holes in the perimeter are inevitable.”

It is like my dear old dad would say, “You are not going to win any games if you don’t have the fundamentals down.” Of course he was talking about football, but the same rules apply to IT. Ori Brafman and Rom Brafman, authors of “Sway: The Irresistible Pull of Irrational Behavior,” spoke at the Churchill Club with basically the same message. People, due to fear and other motives, move away from what they know are the fundamentals, with disastrous results. ZDNet has posted this very interesting discussion.

One needs to keep focused on the fundamentals of protecting one’s infrastructure. Otherwise any attempt to implement the latest architectures and technologies is doomed to failure. In today’s world we are all interconnected and regrettably folks can be quite hostile. The Native Americans after Columbus’ landing learned this lesson the hard way. With that said, have a great Columbus and Native Americans’ Day.

3 Responses to “Provenance and Trust”

  1. Ragib Hasan says:

    You might also be interested in this:

    Ragib Hasan, Radu Sion, and Marianne Winslett, “The Case of the Fake Picasso: Preventing History Forgery with Secure Provenance”, USENIX FAST 2009.

    (the preprint draft will be available later this month).

    We propose a model for provenance security (i.e. integrity, confidentiality, and privacy). We also create a proof of concept prototype that works for file systems, with very little overhead.

