The President's Message

Information and Cancer

by Ian Magrath

Information is defined as the degree of uncertainty which is resolved by the arrival of a signal. Robert Escarpit, 1978.

The planet Neptune, as seen by Voyager 2. Neptune was discovered as a result of deviations in the orbit of Uranus from the path predicted by Newton’s law of gravity - an example of scientific inference based on available information. Picture from NASA’s open source website.


Why the great English poet, Dryden, entitled his poem about the year 1666, “Annus Mirabilis” (Year of Wonders), might not be immediately apparent. Bubonic plague had been raging in London since 1665 and in September the Great Fire of London broke out. It was, though, something of a miracle that only 16 people were believed to have died in the conflagration. The fire also brought an end to the epidemic of plague through the destruction of the most overcrowded districts, providing, at the same time, an opportunity to rebuild the city. Dryden also made much of the final victory of the English fleet over the Dutch after narrowly avoiding total destruction. He made no mention, however, of the remarkable scientific discoveries, that same year, of Isaac Newton. Newton had been sent home from Cambridge University shortly after receiving his degree in 1665 as a precaution against the plague. Prior to his return to Cambridge in 1667, he laid the foundations of calculus, developed new concepts of light and color, and began to formulate his laws of motion and gravity. 1666 was indeed an annus mirabilis for physics. Not surprisingly, some 240 years were to pass before this feat was equaled. In 1905, Albert Einstein published four remarkable articles in the physics journal Annalen der Physik which laid the foundations for atomic theory, quantum mechanics and relativity, and set the course for much of 20th century physics. Einstein’s last article of the year showed the equivalence of mass and energy through a relativistic derivation of the famous equation E = mc², where E stands for energy, m, mass, and c, the speed of light in a vacuum.

This simple equation might well be held to support the view of the seventeenth century Dutch philosopher, Spinoza, who believed in the unity of all that exists and that God and nature are two names for the same reality or substance. His particular brand of pantheism appealed to Einstein who wrote: “I believe in Spinoza’s God who reveals Himself in the orderly harmony of what exists, not in a God who concerns himself with fates and actions of human beings.” We can, of course, directly perceive both energy and matter, but perception alone provides little or no insight into the laws of nature to which Einstein was referring. This can be achieved only by scientific inquiry, although such inquiry also has its limits. We are unlikely ever to understand, for example, the nature of the ultimate “substance” that can be expressed either as matter or energy, or why it exists at all. Remarkably, however, the best minds are able to convey profound truths through the use of mathematical symbols, models or theories, which serve to increase our understanding of natural phenomena. The validity of such representations of particular aspects of reality can only be determined through repeated observation and/or experiment coupled to analysis of the data collected. But the collection of information and its analysis, or the derivation of a mathematical equation, require the expenditure of energy, and storage in a material context is essential to avoid rapid dissipation. Matter, already a vast repository of information in its composition, form, temperature and stored kinetic and potential energy, can be used in forms ranging from clay tablets to optical discs to store additional information for subsequent retrieval - again, by the expenditure of energy. Information, then, appears inherent to all that exists; some have argued that all that exists is information - logos - manifested as matter and energy.

Redundancy and Symmetry
One of the central problems of information theory - the discipline concerned with quantifying data and minimizing the space/time required for its transmission and storage - is that of noise. Making oneself understood in a noisy restaurant or aircraft can be difficult, but it is possible to separate the message, or signal, from noise. This process is aided by the rules of syntax, which create expectations in the mind of the listener, and by the redundancy of language - the message can still be accurately received even if some of it is garbled since normal speech includes elements in excess of the minimal amount of information required for understanding. Similarly, in written text, words may be misspelt or letters omitted but the recipient can still discern the meaning in the message. Redundancy allows computer text files to be compressed by as much as 60% without loss of information - into “zip” files, for example, that take up less storage space. The removal of redundancy is also essential to the creation of secure codes, and, conversely, its presence is helpful in deciphering unknown languages and ensuring that a message is accurately conveyed. Visual communication also includes redundancy; familiar objects may be readily recognized even if partially obscured. Similarly, simple sketches convey a great deal of information and even “stick people” and “smiley faces” can be used to communicate ideas. In the context of perception, the brain can often fill in missing elements.
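
The link between redundancy and compression can be illustrated with a minimal sketch using Python's standard zlib module. The sample sentence and the choice of random bytes as a "no redundancy" input are assumptions made for illustration; a repeated sentence is an extreme case and compresses far more than ordinary text would.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller means more redundancy was removed)."""
    return len(zlib.compress(data, 9)) / len(data)


# Highly redundant input: the same sentence repeated many times.
redundant = b"the message can still be accurately received even if some of it is garbled. " * 200

# Input with essentially no redundancy: bytes from the operating system's random source.
random_bytes = os.urandom(len(redundant))

if __name__ == "__main__":
    print(f"redundant text: {compression_ratio(redundant):.3f}")
    print(f"random bytes:   {compression_ratio(random_bytes):.3f}")
    # Compression of this kind is lossless: the original is recovered exactly.
    assert zlib.decompress(zlib.compress(redundant)) == redundant
```

The redundant text shrinks to a small fraction of its size, while the random bytes do not shrink at all, which is the sense in which a message without redundancy cannot be compressed.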

Knowledge and understanding are intimately related to communication and it is apparent that the degree of certainty of a conclusion is increased, as with language and vision, by redundancy - i.e., by different types of supporting evidence. But at the same time, both knowledge and understanding are always, at some level, incomplete, although fortunately, perfect understanding is not necessary for the effective use of knowledge. Incomplete information can, of course, lead to incorrect conclusions or inappropriate generalizations, which may slow progress for centuries, but as information is progressively accumulated a firm foundation on which understanding can be built is created.

The building of knowledge and understanding is a continuous process that extends across millennia and profits from the development of new tools for accumulating, storing and manipulating information. A central element of this process, linked closely to auditory and visual perception, is the brain’s constant search for patterns and symmetries. This applies, for example, to our present understanding of the molecular genetic basis of carcinogenesis - an overall framework exists, but new pieces of the jigsaw puzzle are constantly being fitted into place. Broad conceptual outlines based on a considerable amount of information are often referred to as theories - for example, the theory of evolution and information theory. Belief in their validity is enhanced by their inherent symmetry, created by the supportive evidence, and defects in such patterns permit more focused research designed to fill gaps or resolve discrepancies. Incomplete mathematical symmetries, for example, led to the prediction of the existence of the planet Neptune and of several fundamental particles prior to their discovery.

To Be or Not To Be
Claude Shannon, an electrical engineer and mathematician, was greatly concerned with noise and redundancy. His Master’s thesis in electrical engineering, submitted to the Massachusetts Institute of Technology in 1937, demonstrated that Boolean algebra and binary arithmetic (a “base-two” number system in which there are only two symbols: yes/no, on/off, present/absent or 1/0) could be combined to greatly simplify the use of electromagnetic switches (relays) used to route telephone calls. Boolean algebra deals with the logical manipulation of information contained in sets of objects, numbers or entities; the Boolean operators AND, OR and NOT are familiar to those who frequently search for publications relevant to their specific interests in sets of articles contained in databases such as PubMed. Of perhaps even greater significance, Shannon recognized that arrangements of electrical relays could be used to solve problems in Boolean algebra, which at its core is binary in nature since it can be reduced to making decisions about whether or not specific criteria are met.
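
The way Boolean operators act on sets of articles can be sketched in a few lines. The tiny "database" below is made up for illustration and is not how PubMed itself is implemented; the function name `having` is likewise hypothetical.

```python
# Hypothetical mini-database: article id -> set of index terms (invented data).
articles = {
    1: {"lymphoma", "children", "chemotherapy"},
    2: {"lymphoma", "adults", "radiotherapy"},
    3: {"leukemia", "children", "chemotherapy"},
    4: {"lymphoma", "children", "radiotherapy"},
}


def having(term):
    """Return the ids of all articles indexed under the given term."""
    return {article_id for article_id, terms in articles.items() if term in terms}


everything = set(articles)

# lymphoma AND children: set intersection
both = having("lymphoma") & having("children")

# lymphoma OR leukemia: set union
either = having("lymphoma") | having("leukemia")

# (lymphoma AND children) NOT radiotherapy: set difference
refined = (having("lymphoma") & having("children")) - having("radiotherapy")

# NOT chemotherapy: complement within the whole database
no_chemo = everything - having("chemotherapy")

if __name__ == "__main__":
    print(both, either, refined, no_chemo)
```

Each query reduces to a yes/no decision per article about whether the criteria are met, which is the binary core of Boolean algebra described above.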

Just one year before Shannon submitted his thesis, Alan Turing had described a theoretical machine that was, in essence, a programmable computer. He conceived of a simple tape (a memory device), a set of symbols (and blanks) regularly spaced along the tape, a means of reading, writing or erasing symbols, and the ability to move the tape one space at a time to the left or right according to a “table” of instructions. The machine could also keep track of its state, i.e., how far it had progressed through the table of instructions. The notion of using an appropriate algorithm (a set of defined instructions) that could be run on a simple machine was the immediate forerunner of the computer program. To bring the theoretical Universal Turing Machine (one capable of running any algorithm) into the realm of reality, a means of representing symbols that could be translated into a language appropriate to mechanical or electrical devices - a machine language - was required. Although Shannon was not consciously pursuing this goal, his further work would provide just this.
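
The tape, symbols, state and table of instructions described above can be sketched as a very small simulator. This is a simplified illustration, not Turing's own formulation: the instruction table (which adds one to a binary number), the state names and the convention that a movement of 0 means "stay put before halting" are all assumptions chosen for brevity.

```python
from typing import Dict, Tuple

BLANK = " "

# Table of instructions: (state, symbol read) -> (symbol to write, head movement, next state).
Table = Dict[Tuple[str, str], Tuple[str, int, str]]

# Hypothetical table that adds one to a binary number, starting at its rightmost digit.
INCREMENT: Table = {
    ("carry", "1"): ("0", -1, "carry"),    # 1 plus carry gives 0; the carry moves one cell left
    ("carry", "0"): ("1", 0, "halt"),      # 0 plus carry gives 1; finished
    ("carry", BLANK): ("1", 0, "halt"),    # ran off the left end: write a new leading 1
}


def run(table: Table, tape_text: str, head: int, state: str, max_steps: int = 10_000) -> str:
    """Run the machine until it reaches the 'halt' state and return the tape contents."""
    tape = dict(enumerate(tape_text))      # cell position -> symbol; unwritten cells are blank
    steps = 0
    while state != "halt":
        if steps >= max_steps:
            raise RuntimeError("step limit reached without halting")
        symbol = tape.get(head, BLANK)
        write, move, state = table[(state, symbol)]
        tape[head] = write
        head += move
        steps += 1
    if not tape:
        return ""
    cells = range(min(tape), max(tape) + 1)
    return "".join(tape.get(i, BLANK) for i in cells).strip()


if __name__ == "__main__":
    print(run(INCREMENT, "1011", head=3, state="carry"))
    print(run(INCREMENT, "111", head=2, state="carry"))
```

Changing only the table changes what the machine computes, which is the sense in which the table plays the role of a program.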

After completing his PhD thesis Shannon joined Bell Laboratories, where he examined a question of considerable importance to the telephone/communications industry - how to ensure the highest possible information content in electrical signals transmitted in copper cables. A necessary prerequisite was a means of quantifying information. In his classic paper of 1948 entitled “A Mathematical Theory of Communication,” Shannon laid the foundations of information theory. Building on his earlier ideas, he recognized that all information could be encoded in the form of a series of binary digits, which he referred to as bits. A bit answers a single, appropriately framed question, with either a yes or a no. Series of bits expand the number of questions that can be answered in such a way. Shannon recognized that a question with N possible outcomes can be answered with a string of log2 N bits (log to the base two, since each bit was either a yes or a no). For example, three bits of information are required to distinguish among eight possibilities; N = 8 and log2 8 = 3 (8 = 2 x 2 x 2). Each string indicates “yes” to one of the eight possibilities and “no” to the remainder (Table 1). The alphabet can now be seen as a set of 26 possible outcomes requiring 5 bits per letter to ensure distinction among each. More characters or symbols can be included by adding a few extra bits. For example, in ASCII code, the American Standard Code for Information Interchange, each of the included 128 characters/symbols (27 being the characters of the alphabet and the space, and ten, the digits 0-9), is encoded by 7 bits (log2 128 = 7). Eight-bit standard codes (eight bits are referred to as a byte), which allow 256 characters to be encoded, have also been developed, as well as more universal codes, such as Unicode, that allow text and symbols from all of the world’s writing systems (scripts) to be represented and manipulated in digital format.

111 000
110 001
100 011
101 010
Table 1. The eight possible configurations of three bits.
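
The arithmetic behind these figures can be checked with a short sketch. The function name `bits_needed` is introduced here for illustration only; it returns log2 N rounded up to a whole number of bits, which is what a fixed-length code requires.

```python
from itertools import product


def bits_needed(n_outcomes: int) -> int:
    """Smallest whole number of bits able to distinguish n_outcomes possibilities.

    Equal to log2(n_outcomes) rounded up, computed exactly with integer arithmetic.
    """
    if n_outcomes < 1:
        raise ValueError("need at least one outcome")
    return (n_outcomes - 1).bit_length()


# Every configuration of three bits, as in Table 1.
three_bit_strings = ["".join(bits) for bits in product("01", repeat=3)]

if __name__ == "__main__":
    for n in (8, 26, 128, 256):
        print(n, "outcomes need", bits_needed(n), "bits")
    print(three_bit_strings)
```

For 26 letters the exact value of log2 26 is about 4.7, so a fixed-length code must round up to 5 bits, leaving six of the 32 possible five-bit strings unused.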


Binary codes also provide a universal means of measuring information content. Although the number of bits in a stream of symbols, such as may be found in a document or book, provides no more information than the total number of letters, bits can encode not only characters, but also the length and pitch of sound (as in speech or music), monochrome images and even color. Encoding a signal intensity for each pixel in each frame of a television or computer screen permits the digitization of images in which color is numerically represented in terms of the admixture of specific intensities of three colors, red, green and blue, abbreviated to RGB (Figure 1). This system also permits different monochrome intensities to be converted into color to improve visual impact.
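
As a rough illustration of how a color can be held as a number, the sketch below packs three 8-bit intensities into one 24-bit integer and maps a monochrome intensity onto a color. The packing order and the blue-to-red mapping are assumptions chosen for simplicity, not a description of any particular imaging system.

```python
def pack_rgb(red: int, green: int, blue: int) -> int:
    """Pack three intensities (each 0-255, i.e. 8 bits) into a single 24-bit number."""
    for value in (red, green, blue):
        if not 0 <= value <= 255:
            raise ValueError("each intensity must lie between 0 and 255")
    return (red << 16) | (green << 8) | blue


def unpack_rgb(value: int) -> tuple:
    """Recover the three 8-bit intensities from a packed 24-bit number."""
    return (value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF


def intensity_to_color(intensity: int) -> tuple:
    """Hypothetical false-color mapping: low intensities appear blue, high ones red."""
    if not 0 <= intensity <= 255:
        raise ValueError("intensity must lie between 0 and 255")
    return (intensity, 0, 255 - intensity)


if __name__ == "__main__":
    print(hex(pack_rgb(255, 0, 0)))
    print(unpack_rgb(pack_rgb(12, 34, 56)))
    print(intensity_to_color(200))
```

With 8 bits per channel, 24 bits per pixel can distinguish 256 x 256 x 256 (about 16.7 million) colors.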

It is important to recognize that information theory does not deal with meaning. Rather, it provides a method of quantifying message-carrying capacity. The size of an electronic file can be expressed as the number of bytes, or in thousand-fold increments of bytes, i.e., kilobytes (KB), megabytes (MB), gigabytes (GB), terabytes (TB), etc. Technically, since we are dealing with a binary system, the closest power of two to 1000 is 2¹⁰, i.e., 10 doublings, such that a kilobyte is sometimes considered to be 1024 rather than 1000 bytes and a megabyte either a million bytes or 2²⁰ bytes, the equivalent of 1024². Which system is used makes, in practice, little difference. The speed of transmission of digitized information can also be readily expressed as the number of bits, most often kilobits (kb) or megabits (Mb), transmitted per second. Interestingly, in order to assure the integrity of digitized information and a minimum of mistakes generated by hardware, a certain amount of redundancy has been incorporated into computer programs and transmitted messages. Examples include parity bits, which are binary digits that indicate whether the sum of the 1’s in a stream of bits should be odd or even, and checksums, in which the bits are added up at intervals and the resultant sums stored and used to ensure that no changes have occurred in that part of the message during data storage or transmission.
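
Parity bits and checksums can be sketched as follows. The even-parity convention and the one-byte (modulo 256) checksum are simple illustrative choices; real storage and transmission systems generally use stronger schemes.

```python
def even_parity_bit(bits: str) -> str:
    """Return the extra bit that makes the total number of 1's even."""
    return str(bits.count("1") % 2)


def add_parity(bits: str) -> str:
    """Append an even-parity bit to a string of 0's and 1's."""
    return bits + even_parity_bit(bits)


def parity_ok(word: str) -> bool:
    """True if the word (data plus parity bit) still contains an even number of 1's."""
    return word.count("1") % 2 == 0


def checksum(data: bytes) -> int:
    """Sum of all byte values, kept to a single byte (modulo 256)."""
    return sum(data) % 256


if __name__ == "__main__":
    sent = add_parity("1011001")
    corrupted = "0" + sent[1:]          # the first bit has been flipped in transit
    print(sent, parity_ok(sent))
    print(corrupted, parity_ok(corrupted))
    print(checksum(b"INCTR"))
```

A single parity bit detects any one flipped bit but cannot say which bit changed, and two flipped bits cancel out; this is the price of adding only one bit of redundancy.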

Shannon Entropy
Essentially singlehandedly, Shannon had developed a means of measuring information mathematically. He also realized that the amount of information that could be carried in digital form (i.e., in bits) is inversely proportional to the amount of redundancy. In an infinite series of 1’s, for example, almost all of the message is redundant; the next digit is 100% predictable. Such a message could be very simply encoded as “always 1’s.” Conversely, a completely random stream of bits has no redundancy - it can only be represented by the actual sequence of bits since each successive bit has an equal chance of being a 1 or a 0. Although neither a completely predictable nor a completely random sequence of bits would convey any information, it is apparent that the capacity to carry information is closely related to the probability of predicting the next element in a message - and increases with increasing uncertainty.
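
The relationship between predictability and information-carrying capacity can be made concrete with a small calculation. The sketch below estimates uncertainty, in bits per symbol, from how often each symbol occurs in a message. It is a simplification: it looks only at symbol frequencies and ignores order, so a strictly alternating sequence such as 0101... is scored as maximally uncertain even though it is predictable.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(message: str) -> float:
    """Average uncertainty per symbol, estimated from symbol frequencies alone."""
    total = len(message)
    counts = Counter(message)
    # Each symbol with relative frequency p contributes p * log2(1/p).
    return sum((c / total) * math.log2(total / c) for c in counts.values())


if __name__ == "__main__":
    print(entropy_bits_per_symbol("11111111"))   # fully predictable
    print(entropy_bits_per_symbol("01101001"))   # 0 and 1 equally frequent
    print(entropy_bits_per_symbol("00000001"))   # mostly predictable
```

A stream of identical digits scores zero bits per symbol, a stream in which 0 and 1 are equally frequent scores one bit per symbol, and anything in between falls between those limits.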

Figure 1. Numerical expression of color via red, green and blue (RGB) addition. Each of the colors available here (each a point on the above colored area, which has 255 points both vertically and horizontally) is specified by three numbers. Red is 255,0,0, green, 0,255,0 and blue, 0,0,255. Black is 0,0,0 and white 255,255,255. All other colors are various admixtures of RGB, i.e., three numbers between 0 and 255. A similar technique is involved in the quantification of intensity, e.g., in imaging studies or scientific experiments. Image from Wikimedia Commons; author, Marc Mongenet.


Remarkably, the mathematical expression for this “uncertainty” (log N) - in practical terms, the shortest sequence of bits required to transmit one message (or character) among all possible messages (or characters) - is identical in form to the statistical expression for entropy, the quantity at the heart of Rudolf Clausius’s 1865 statement of the second law of thermodynamics. Clausius introduced the term entropy (from the Greek, en: inside; trope: turn or change), to mean the energy in a system that cannot be converted into work. He showed that in an isolated system not yet at equilibrium entropy will always increase over time, achieving a maximum value at equilibrium (when there is also maximal uncertainty about the original state of the system). In effect, the law refers to an initially heterogeneous system, with respect to its distribution of energy, moving towards maximal homogeneity. A simple example is a glass of water in a perfectly insulated container to which a piece of ice has been added. Equilibrium is reached when the ice has completely melted and the temperature of the water (its macrostate) has become uniform. Of course, the kinetic energy of each of the molecules (equivalent to its motion), which comprise the microstates of this system, is not precisely the same. Rather, the molecules follow a distribution curve with respect to energy, the majority clustering about the mean. Thus, the entropy of a system describes the dispersal of its energy content in statistical terms, and is equal to k log W, where k is Boltzmann’s constant and W corresponds to the number of different microstates in the system. The mathematician, John von Neumann, who made major contributions to the development of the electronic digital computer, recognized the similarity of the mathematical statement of Shannon’s uncertainty function (in the binary system, k = 1) to the statistical mechanical basis of thermodynamics and suggested that Shannon’s function should also be called entropy.
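
The parallel between the two expressions can be set out side by side. The symbols S and H are the conventional names for thermodynamic and information entropy and are introduced here for clarity; they do not appear in the text above. The third line is the standard general form of Shannon's function for messages that are not equally likely, included only to show that it reduces to log N in the equally likely case.

```latex
\begin{align*}
S &= k \log W
  && \text{thermodynamic entropy: $k$ is Boltzmann's constant, $W$ the number of microstates} \\
H &= \log_2 N
  && \text{Shannon's uncertainty in bits: $N$ equally likely messages, $k = 1$} \\
H &= -\sum_{i=1}^{N} p_i \log_2 p_i
  && \text{general form: equals $\log_2 N$ when every $p_i = 1/N$}
\end{align*}
```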

The second law of thermodynamics is of profound significance, since it expresses the process whereby, in open systems, mechanical work can be derived from energy. The model used by Clausius was a simple heat engine - a system in which the energy input, derived from combusted fuel, creates a difference in temperature between two reservoirs (containing water), causing molecules (steam) to flow from the hot to the cold reservoir - movement that can be readily converted into mechanical work. This, however, is an inherently inefficient process: more than 60% of the energy released from the fuel cannot be used to perform work but is dissipated, thereby increasing the entropy of the environment. The heat engine also provides a model for life itself. Energy used by living organisms, most of which derives ultimately from the sun, serves to maintain them (via work) far from equilibrium, and to allow for reproduction and evolution (and in human societies, for the creation of institutional structures). As with the heat engine, however, more energy is dissipated as heat than is available for work, such that the entropy of the environment, and eventually, of the overarching closed system, the universe, is increased—in conformity with the second law of thermodynamics. The notion of information entropy is also applicable to biological systems. For example, increasing the amount of information carried by the DNA of living organisms - increasing complexity in evolutionary terms, or biodiversity in the context of an ecosystem - equates to more information carrying capacity. And although information encoded in DNA is not binary in nature, information stored in the human brain probably is; ultimately, it rests upon whether individual neurons discharge or not. Nature also relies upon the security created by redundancy at many levels, ranging from molecular pathways to entire ecosystems. Ecosystems that contain more species are better able to withstand adverse environmental events. 
Life-creating energy, then, can also be seen to increase information entropy and the principle of the equivalence of energy and information is upheld.

Although initially controversial, the relationship between thermodynamic and information entropy is now much more widely (but not universally) accepted. Indeed, it has been suggested that thermodynamic entropy is a particular application of inference and information theory. Inference, or inductive reasoning, is simply the process of drawing a conclusion on the basis of the evidence, i.e., the available information. Microstates in thermodynamic systems comprise, in essence, the information which determines the macrostate (inference) of the system. Both thermodynamic and information entropy are based upon probabilities, such that the precision of the macrostate, or conclusion, is greater when the number of microstates, or bits of information, is high. Population-based cancer registration, for example, is generally undertaken only for populations in excess of a million to ensure that calculated incidence and mortality rates are accurate, while controlled clinical trials must include sufficient numbers of randomly selected patients for any difference observed to achieve statistical significance. Inference, combined with imagination, is a primary element of the scientific method and as such is essential to effective health interventions including cancer control. Information from which no inference can be made is of no scientific value, although this may simply reflect an insufficiency of information that can be overcome in the course of time.

Figure 2. Superimposed images from computerized tomography (CT), a radiographic technique in which anatomical structure is digitized, permitting two- or three-dimensional reconstructed images, and PET scanning, in which functional imaging is based on uptake of radioactively labeled glucose. The intensity of uptake is digitally converted into color. Image reproduced by kind permission of Jorge Carasquillo.


Digital Developments
Shannon and Turing, between them, had provided the mathematical basis for the development of the digital computer and microchips. These have greatly extended our ability to store and manipulate information of all kinds and have also provided insights into human cognition. Along with telephony and radio communications, the invention of the Internet in the 1970s (a system of linked computer networks using a common protocol for the transmission of signals) and the World Wide Web (www) two decades later, which links documents via hypertext, i.e., by “clicking” on highlighted text, have permitted everyone with access to a suitably equipped computer, no matter where they are, to communicate almost instantaneously with anybody else in the world via sound, text and images and thereby to have historically unprecedented access to, and the ability to share, stored information. A semantic web, in which communication will more closely resemble human language, is presently under development. In the fields of medicine in general, and cancer in particular, digital electronics is essential to, or enhances the functionality of, a broad array of equipment for the investigation, monitoring and treatment of disease as well as storing, analyzing and disseminating (e.g., via the Web) data required for routine patient care or research (Figure 2). Computers have helped promote increased communication efficiency through the need to standardize case report forms in research and standardized reporting formats for the results of investigations, both of which reduce redundancy. Unwanted redundancy, or noise, abounds elsewhere - e.g., in the form of gratuitous computer printouts and computer spam and viruses. Computers and appropriate software, however, effectively used, can reduce errors and treatment toxicity, e.g., by the use of computerized prescribing or by improving the therapeutic ratio of radiation therapy (conformal planning and beam intensity modulation). 
In the context of research, the accumulation, analysis and use of evidence pertinent to hypotheses framed, leads to ever more efficient interventions, although many factors, including personal integrity and volition, technical skills and discipline as well as political and commercial considerations, influence the efficiency of this process. Of critical importance is the accuracy of stored data elements. Here, too, computerization can aid quality assurance. For example, electronic databases usually include a set of business rules that ensure that specific data elements fall within a predetermined range, or are consistent with values elsewhere in the case record.
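
The kind of business rule mentioned above might look like the following sketch. The record fields, the permitted hemoglobin range and the function name `validate` are hypothetical, chosen only to show one range rule and one consistency rule; they are not taken from any actual case report form.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class CaseRecord:
    # Hypothetical data elements, for illustration only.
    patient_id: str
    date_of_birth: date
    date_of_diagnosis: date
    hemoglobin_g_dl: float


def validate(record: CaseRecord) -> List[str]:
    """Return a list of rule violations; an empty list means the record passed."""
    problems = []
    # Range rule: the value must fall within a predetermined (here, invented) range.
    if not 2.0 <= record.hemoglobin_g_dl <= 25.0:
        problems.append("hemoglobin outside permitted range 2.0-25.0 g/dL")
    # Consistency rule: the value must agree with values elsewhere in the record.
    if record.date_of_diagnosis < record.date_of_birth:
        problems.append("date of diagnosis precedes date of birth")
    return problems


if __name__ == "__main__":
    good = CaseRecord("A001", date(2001, 5, 4), date(2007, 9, 12), 11.2)
    bad = CaseRecord("A002", date(2005, 1, 1), date(2004, 6, 30), 112.0)
    print(validate(good))
    print(validate(bad))
```

Rules of this kind catch slips such as a misplaced decimal point at the moment of data entry, when they are easiest to correct.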

Digital Disparities
While high-income countries move towards ever greater access to information as a consequence of ever faster speeds of data transmission and ever smaller storage spaces for ever larger quantities of information (several gigabytes can be stored on a key ring), low and middle income countries, barely able to summon the energy requirements for the creation and maintenance of dissipative societal and institutional structures, have much more limited access to information - whether digital or not - and therefore, to education and training. The Web is likely to play a major role in alleviating this disparity as computers become smaller, cheaper and more portable, network connectivity improves and open source (i.e., created by the community for the community) software, goods and services, coupled to electronic commons, i.e., public information sources with relaxed copyright provisions, continuously expand (Table 2). Web-based education, training and communication will help to overcome obstacles created by limited human resources and the high cost of transportation (financially and environmentally) and telephony. Free access to the latest biomedical information via online journals will play a role in continuing education, while the ability to contribute directly to such information will encourage participation and a sense of ownership. The establishment of international and national networks will greatly enhance the sharing of "domain-specific" information, such as that needed to improve health and control cancer, creating electronic communities of practice through which information can be translated into action able to reach the most physically and financially isolated communities.

Resource | Comment | Website
Open Source | Education re: open source software | http://www.opensource.org
UNESCO | Information and communication site | http://unesco.org; http://portal.unesco.org/ci/en
Development Gateway Foundation | Portal for world wide sharing of knowledge and information | http://topics.developmentgateway.org
Open University (UK) | Resources for students and educators | http://www.open.ac.uk/openlearn/home.php
MIT OpenCourseWare | All MIT courses online | http://ocw.mit.edu/index.html
Johns Hopkins School of Public Health OpenCourseWare | Free courseware in Public Health | http://ocw.jhsph.edu
Wikipedia (on line encyclopedia) | Users contribute or modify articles | http://wikipedia.org
Wikimedia Commons | Free content images and sound | http://commons.wikimedia.org/wiki/Main_Page
eGranary digital library | Provides digital educational resources for institutions lacking Internet access | http://www.widernet.org/digitallibrary
World Health Organization | Information on global health | http://www.who.int/en
UICC | Information on global cancer control | http://www.uicc.org
IARC | Epidemiologically focused resources | http://www.iarc.fr/ENG/Databases/index.php
US National Library of Medicine | Health information and databases | http://www.nlm.nih.gov/hinfo.html
National Cancer Institute (USA) | Comprehensive cancer information | http://www.cancer.gov
Alliance for Cervical Cancer Prevention | Comprehensive information on the early detection of cervical cancer | http://www.alliance-cxca.org
Table 2. Selected list of open resources available on the World Wide Web.


1948 might well be added to 1666 and 1905 as an annus mirabilis for science, for this was the year in which Shannon laid the foundations of information theory, leading to new analytical approaches and insights relevant to disciplines ranging from biology to astronomy. Even the advances made in Einstein’s annus mirabilis can be recast in terms of information theory, considered by some to be the third pillar of 20th century physics, alongside relativity and quantum mechanics. At a social level, a more equitable global distribution of information would do much to reduce the world’s present inequities and heighten the sense of a global community. Whether or not this will be accomplished depends largely upon how much energy - human and otherwise - is put into the effort.

Copyright © 2008 The International Network For Cancer Treatment and Research