Build a "Next Generation"
National Biological Information Infrastructure
"... more and more people realize that information is a treasure that must be shared to be valuable ... our Administration will soon propose using ... technology to create a global network of environmental information."
Albert Gore, Jr., 21 March 1994
"With all of everyone’s work online, we will have the opportunity ... to let everyone use everyone else’s intellectual effort. ... The challenge for librarians and computer scientists is to let us find the information we want in other people’s work..."
Michael E. Lesk (1997), http://community.bellcore.com/lesk/ksg97/ksg.html
The economic prosperity and, indeed, the fate of human societies are inextricably linked to the natural world. Because of this, information about biodiversity and ecosystems is vital to a wide range of scientific, educational, commercial, and governmental uses. Unfortunately, most of this information exists in forms that are not easily used. From traditional, paper-based libraries to scattered databases and physical specimens preserved in natural history collections throughout the world, our record of biodiversity and ecosystem resources is uncoordinated and isolated. It is, de facto, inaccessible. There exists no comprehensive technological or organizational framework that allows this information to be readily accessed or used effectively by scientists, resource managers, policy makers, or other potential client communities. "We have ... vast mountains of data that never enter a single human mind as a thought. ... Perhaps this sort of data should be called ‘exformation’ instead of information ..." (Albert Gore, Jr., Earth in the Balance, pp. 200-201).
However, significant increases in computation and communications capabilities in recent years have opened up previously unimagined possibilities in the field of information technology, and these trends will continue for the foreseeable future. It is clear that abundant, easily accessible, analyzed, and synthesized information that can and does "enter the human mind as a thought" will be essential for managing our biodiversity and ecosystem resources. Thus, research and development are needed to harness new information technologies that can help turn ecological "exformation" into "information."
We need computer science, library and information science, and communications technology research (hereafter abbreviated as CS/IT) to produce mechanisms that can, for example, efficiently search through terabytes of Mission to Planet Earth satellite data and other biodiversity and ecosystems datasets, make correlations among data from disparate sources, compile those data in new ways, analyze and synthesize them, and present the resulting information in an understandable and usable manner. At present, we are far from being able to perform these actions on any but the most minor scale. However, the technology exists to make very rapid progress in these areas, if the attention of the CS/IT community is focused on the biodiversity and ecosystems information domain.
Focus research on biodiversity and ecosystems information to promote use of that information in management decisions, in education and research, and by the public.
Knowledge about biodiversity and ecosystems, even though incomplete, is a vast and complex information domain. The complexity arises from two sources. The first of these is the underlying biological complexity of the organisms and ecosystems themselves. There are millions of species, each of which is highly variable across individual organisms and populations. These species each have complex chemistries, physiologies, developmental cycles, and behaviors, all resulting from more than three billion years of evolution. There are hundreds if not thousands of ecosystems, each comprising complex interactions among large numbers of species, and between those species and multiple abiotic factors.
The second source of complexity in biodiversity and ecosystems information is sociologically generated. The sociological complexity includes problems of communication and coordination: between agencies, between divergent interests, and across groups of people from different regions, different backgrounds (academia, industry, government), and different views and requirements. The kinds of data humans have collected about organisms and their relationships vary in precision, accuracy, and numerous other ways. Biodiversity data types include not only text and numerical measurements, but also images, sound, and video. The range of other data types with which scientists and other users will want to mesh their biodiversity databases is also very broad: geographical, meteorological, geological, chemical, physical, etc. Further, the manner and mechanisms that have been employed in biodiversity data collection and storage are almost as varied as the natural world the datasets document. Therefore, analysis of the work practices involved in building these datasets is one among several CS/IT research priorities.
All this variability constitutes a unique set of challenges to information management. These challenges greatly exceed those of managing gene or protein sequence data (and that domain is challenging in its own right). In addition to the complexity of the data, the sheer mass of data accumulated by satellite imagery of the Earth (terabytes per year are captured by Landsat alone) presents additional information management challenges. These challenges must be met so that we can exploit what we do know, and expand that knowledge in appropriate and planned directions through research, to increase our ability to live sustainably in this biological world.
Various research activities are being conducted that are increasing our ability to manage biological information:
• Geographic Information Systems (GIS) are expanding the ability of some agencies to conduct their activities more responsibly and making it possible for industry to choose sites for new installations more intelligently.
• The National Spatial Data Infrastructure has contributed to progress in dealing with geographic, geological, and satellite datasets.
• Research conducted as part of the Digital Libraries projects has begun to benefit certain information domains.
• The High-Performance Computing and Communications initiative has greatly benefited certain computation-intensive engineering and science areas.
• All of science has benefited from the Internet; those benefits will increase with the development of the "next generation" Internet, or Internet-2.
Given the importance of and need for biodiversity and ecosystems data to be turned into information so that it can be comprehended and applied to achieve a sustainable future, this Panel recommends that the attention of a number of governmental research and research funding activities be directed toward the special needs of biodiversity and ecosystems data:
• The Digital Libraries Initiative of the NSF, DARPA, and NASA should call for research specifically focused on the biodiversity and ecosystems information domain in all future Requests for Proposals. Current Digital Libraries projects are working on some of the techniques needed (automatic indexing, sophisticated mapping, brokering routines, etc.), but the developments are not focused on biodiversity and ecosystems information, which has its own unique characteristics.
• The Knowledge and Distributed Intelligence and the Life in Earth’s Environment initiatives of the NSF should call for CS/IT and appropriate associated biological and sociological research specifically focused on the biodiversity and ecosystems information domain in all future Requests for Proposals.
• The NSTC Committee on Technology should focus on the biodiversity and ecosystems information domain within a number of its stated R&D areas, particularly: 1) addressing problems of greater complexity (in the High End Computing and Computation Program Area); 2) advanced network architectures for disseminating environmental data (Large Scale Networking Program Area); 3) extraction and analytical tools for correlating and manipulating distributed information, advanced group authoring tools, and scalable infrastructures to enable collaborative environments (Human Centered Systems Program Area); and 4) graduate and postdoctoral training and R&D grants (Education, Training and Human Resources Program Area).
The problem of excess data will get steadily worse if means are not devised to analyze and synthesize those data quickly and effectively to turn them into usable and useful information that can be brought to bear in decision-making, policy formulation, directing future research, and so on. Computers were invented to assist humans in tedious computational tasks, which the conversion of satellite data into useful information surely is. One reason that we have unused data is that we are collecting them while we still do not have efficient means to convert them into comprehensible information. What person could be expected to absorb and "understand" terabytes of satellite data by brainpower alone, without the assistance of computers? The CS/IT research endeavor advocated here will reap great rewards by inventing better means to make the conversion from data to useful information. Much of the talent needed for this work is employed in the private sector, and so public-private partnerships that involve software and hardware designers and biologists will be needed to accomplish the task.
The investments that have been made in acquiring data are large ($1 billion per year on Mission to Planet Earth is only one example). The full potential of those investments will not be realized if new tools for putting the data to use are not devised. Unused data are not worth the initial investment made in gathering them. Failure to develop the technologies to manipulate, combine, and correlate the biodiversity and ecosystems data we have available from all sources will have adverse effects on our ability to predict and prevent degradation of our natural capital.
Federal computing, information, and communications programs invest in critical, long-term R&D that advances computing, information, and communications in the United States. These investments to date have enabled government agencies to fulfill their missions more effectively and better understand and manage the physical environment. They have also contributed much to US economic competitiveness. It is our contention that future investments by the government’s computing, information, and communications programs that are overseen by the NSTC Committee on Technology should be concentrated in the area of biodiversity and ecosystems information. As has happened in other areas, this Federal investment will enable agencies to manage the biological environment in better ways, and will very likely spin off new technologies that can be exploited by the private sector to benefit the US economy.
The first of these investments should be made in the next round of competition for research awards. Progress in the development of the needed technologies can be measured by increases in the ability of agencies to utilize data they already have or are now collecting, in the creation of private-sector jobs and businesses that are directly related to biodiversity and ecosystems information management, and in research that is more clearly focused because proper data management has illuminated both what is already known and what remains to be discovered.
Design and construct the "next generation"
National Biological Information Infrastructure (NBII-2).
The CS/IT research described above will contribute to progress in managing biodiversity and ecosystems information. The productivity of individual research groups, driven by their own curiosity, ingenuity, and creativity, has served this country well in myriad fields of science and the development of technology. Yet there are important issues in the management and processing of biodiversity and ecosystems information that must be addressed in a much more coordinated and concerted way than has been attempted to date.
The value of raw data is typically predicated on our ability to extract higher-order understanding from those data. Traditionally, humans have done the task of analysis: one or more analysts become familiar with the data and, with the help of statistical or other techniques, provide summaries and generate results. Scientists, in effect, generate the "correct" queries to ask and even act as sophisticated query processors. Such an approach, however, rapidly breaks down as the volume and dimensionality (depth and complexity) of the data increase. What person could be expected to "understand" millions of cases, each having hundreds of attributes? This is the same question asked about satellite data above: human brainpower requires sophisticated assistance from computers to complete these sorts of tasks. The current National Biological Information Infrastructure (NBII) is in its infancy, and cannot provide the sophisticated services that will enable the simultaneous querying and analysis of multiple, huge datasets. Yet it will become more and more necessary to manipulate data in this way as good stewardship of biodiversity and ecosystems grows increasingly important.
The overarching goal of the "next generation" National Biological Information Infrastructure, or NBII-2, would be to become, in effect, a fully digitally accessible, distributed, interactive research library system. The NBII-2 would provide an organizing framework from which scientists could extract useful information (new knowledge) from the aggregate mass of information generated by various data-gathering activities. It would do this by harnessing the power of computers to do the underlying queries, correlation, and other processing activities that at present require a human mind. It would make analysis and synthesis of vast amounts of data from multiple datasets much more accessible to a variety of users. It would also serve management and policy, education, recreation, and the needs of industry by presenting data to each user in a manner tailored to that user’s needs and skill level.
We envision the NBII-2 as a distributed facility that would be something considerably different from a "data center," something considerably more functional than a traditional library, something considerably more encompassing than a typical research institute. It would be all of these things, and at the same time none of them. Unlike a data center, the objective would not be the collection of all datasets on a given topic into one storage facility, but rather the automatic discovery, indexing, and linking of those datasets. Unlike a traditional library, which stores and preserves information in its original form, this special library would not only keep the original form but also update the form of storage and upgrade information content. Unlike a typical research institute, this facility would provide services to research going on elsewhere; its own staff would conduct both CS/IT and biodiversity and ecosystems research; and the facility would offer "library" storage and access to diverse constituencies.
The core of the NBII-2 would be a "research library system" comprising five regional nodes, sited at appropriate institutions (national laboratories, universities, etc.) and connected to each other and to the nearest telecommunications providers by the highest-bandwidth network available. In addition, the NBII-2 would comprise every desktop PC or minicomputer that stores and serves biodiversity and ecosystems data via the Internet. The providers of information would have complete control over their own data, but at the same time have the opportunity to benefit from (and the right to refuse) the data indexing, cleansing, and long-term storage services of the system as a whole. The NBII-2 would thus provide:
• A common focus for independent research efforts, and a global, neutral context for sharing information among those efforts;
• An accrete-only, no-delete facility from which all information would be available online, twenty-four hours a day, seven days a week, in a variety of formats appropriate to a given user;
• A facility that would serve the needs of (and eventually be supported by partnership among) government, the private sector, education, and individuals;
• An organized framework for collaboration among Federal, regional, state, and local organizations in the public and private sectors that would provide improved programmatic efficiencies and economies of scale through better coordination of efforts;
• A commodity-based infrastructure that utilizes readily available, off-the-shelf hardware and software and the research outputs of the Digital Libraries initiative where possible;
• An electronic facility where scientists could "publish" biodiversity and ecosystems information for cataloging, automatic indexing, access, analysis, and dissemination;
• A place where intensive work is conducted on how people use these large databases, and how they might better use them, including improvement of interface design (human-computer interaction);
• A mechanism for development of organizational and educational infrastructure that will support sharing, use, and coordination of these massive datasets;
• A mirroring and/or backup facility that would provide content storage resources, registration of datasets, and "curation" of datasets (including migration, cleansing, indexing, etc.);
• An applied biodiversity and ecosystems informatics research facility that would develop new technologies and offer training in informatics;
• A facility that would provide high-end computation and communications to researchers at diverse institutions.
This facility would not be a purely technical and technological construct, but rather would also encompass complex sociological, legal, and economic issues in its research purview. These might include intellectual property rights management, public access to the scholarly and cultural record, and the characteristics of evolving systems of scholarly and mass communications in the networked information environment. The human dimensions of the interaction with computers, networks, and information will be a particularly important area of research as systems are designed for the greatest flexibility and usefulness to people.
The needs that the research nodes of the NBII-2 must address are many. A small subset of those needs includes:
• Workable data-cleaning methods that automatically correct input and other types of errors in databases;
• Strategies for sampling and selecting data;
• Algorithms for classification, clustering, dependency analysis, and change and deviation detection that scale to large databases;
• Visualization techniques that scale to large and multiple databases;
• Metadata-encoding routines that will make data mining meaningful when multiple, distributed sources are searched;
• Methods for improving connectivity of databases, integrating data-mining tools, and developing ever better synthetic technologies;
• Ongoing, formative evaluation, detailed user studies, and quick feedback among domain experts, users, developers, and researchers.
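The first of these needs, automated data cleaning, can be illustrated with a minimal sketch. The field names ("lat", "lon", "species") and the correction rules below are hypothetical examples, not a description of any existing NBII system; the point is that such rules can be applied record by record, so files of any size can be processed in a single streaming pass.

```python
def clean_record(rec):
    """Return a corrected copy of one occurrence record, or None if unusable."""
    out = dict(rec)
    lat, lon = out.get("lat"), out.get("lon")
    if lat is None or lon is None:
        return None
    # One common entry error: latitude and longitude transposed.
    if abs(lat) > 90 and abs(lon) <= 90:
        out["lat"], out["lon"] = lon, lat
    if not (-90 <= out["lat"] <= 90 and -180 <= out["lon"] <= 180):
        return None  # coordinates cannot be repaired automatically
    # Standardize free-text names to "Genus species" capitalization.
    name = " ".join(out.get("species", "").split())
    out["species"] = name[:1].upper() + name[1:].lower() if name else None
    return out if out["species"] else None

def clean_stream(records):
    """Clean records one at a time; memory use is independent of file size."""
    for rec in records:
        cleaned = clean_record(rec)
        if cleaned is not None:
            yield cleaned

raw = [
    {"species": "chenopodium  ALBUM", "lat": 95.0, "lon": 38.5},  # transposed
    {"species": "salmo trutta", "lat": 47.1, "lon": -112.9},
    {"species": "", "lat": 10.0, "lon": 10.0},                    # no name
]
cleaned = list(clean_stream(raw))
print(cleaned)
```

In a real node, the rule set would be far larger and itself a research product, but the streaming structure is what lets such methods "scale to large databases."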
Box 8: Why do we need an NBII-2?
Biodiversity is complex; ecosystems are complex. The questions we need to ask in order to manage and conserve biodiversity and ecosystems therefore require answers composed of information from many sources. As described in the text, our current ability to combine data from many sources is neither very good nor very rapid: a human being usually has to perform the tasks of correlation, analysis, and synthesis of data drawn painstakingly from individual datasets, one at a time. The NBII of today only has the capability to point a user toward single datasets, one at a time, that might (or might not) contain data that are relevant to the user’s question. If the dataset does appear useful, the human must construct a query in a manner structured by the requirements of the particular application that manages the dataset (which, likely as not, is somewhat arcane). The human must then collate results of this query with those of other queries (which may be very difficult because of differences in structure among datasets), perform the analyses, and prepare the results for presentation. What we need is an organizing framework that will allow that same human to construct a query in everyday language, and automatically obtain exactly the information needed from all datasets available on the Internet. These data would be automatically filtered, tested for quality, and presented in correlated, combined, and analyzed form, ready for the human mind to perform only higher-order interpretation. With tools such as these, we will begin to be able to "mine" the information we already have to generate new insights and understanding. At present, the task of "data mining" in the biodiversity and ecosystems information domain is so tedious as to be unrewarding, despite our very great need for the insights it has the potential to yield.
Box 9: Why do we need an NBII-2? Scenario 1
An agricultural researcher has just isolated and characterized a gene in a species of Chenopodium that enables the plant to tolerate high-salt soil. To find out about other characteristics of the habitat within which that gene evolved, the researcher uses NBII-2 to link to physical data on the habitat (temperature and rainfall regimes, range of soil salinity, acidity, texture and other characteristics, elevation and degree of slope and exposure to sunlight, etc.), biological information about other plants with which this Chenopodium occurs in nature, data on animals that are associated with it, and its phylogenetic relationship to other species of Chenopodium, about which the same details are gathered. Linkages among these ecological and systematic databases, and between them and others that contain gene sequence information, enable the researcher to determine that the gene she has isolated tolerates a wider range of environmental variables than do equivalents in other species that have been tested (although this analysis also points out additional species that it would be worthwhile to test). The gene from this species is selected as a primary candidate for insertion by transgenic techniques into forage and browse plants to generate strains that will tolerate high-salt soils in regions that currently support sheep and cattle but which are becoming more and more arid (and their soils saltier) because of global climate change.
Box 10: Why do we need an NBII-2? Scenario 2
On an inspection of a watershed area, a resource manager finds larval fish of a type with which he is unfamiliar. Returning to the office, the manager accesses an online fish-identification program. Quickly finding that there are several alien species represented in the sample he took, he then obtains information on the native ranges of these species, their life-history characteristics, reproductive requirements and rates, physiological tolerances, ecological preferences, and natural predators and parasites from databases held by natural history museums around the world. He is able to ascertain that only one of the alien species is likely to survive and spread in this particular watershed. Online, he is also able to access datasets that describe measures taken against this species in other resource management areas, and the results of those measures. By asking the system to correlate and combine data on the environmental characteristics of the fish’s native range that have been measured by satellite passes for the past 20 years, as well as the environmental characteristics of the other areas into which it had been introduced, he is able to ascertain which of the management strategies is most likely to work in the situation he faces. Not only does the manager obtain this information in a single afternoon, but he is able to put the results to work immediately, before populations of the invading fish species can grow out of control. The form and results of the manager’s queries are also stored to enable an even faster response time when the same or a related species is discovered in another watershed.
Box 11: Why do we need an NBII-2? Scenario 3
A community is in conflict over selection of areas for preservation as wild lands in the face of intense pressures for development. The areas available have differing characteristics and differing sets of endangered species that they support. The NBII-2 is used to access information about each area that includes vegetation types, spatial area required to support the species that occur there, optimal habitat for the most endangered species, and the physical parameters of the habitats in each of the areas. In addition, information on the characteristics and needs of each of the species is drawn from natural history museums around the world. Maps of the area are downloaded from the US Geological Survey, and other geographic information data layers are obtained from an archive across the country. Also, the NBII-2 even provides access to software developed in other countries specifically for the purpose of analyzing these multiple data types. The analyses conducted on these datasets using this software provide visually understandable maps of the areas that, if preserved, would conserve the greatest biodiversity, and of those areas that would be less desirable as preserves. Conservation biologists then make information-based predictions about success of species maintenance given differing decisions. On the basis of the sound scientific information and analysis delivered by the NBII-2, the conflict is resolved and the community enjoys the benefits of being stewards of natural capital as well as the benefits of economic growth.
If all the species of the world were discovered, cataloged, and described in books with one specimen per page, they would take up nearly six kilometers of shelving. This is about what you would find in a medium-size public library. The total volume of biodiversity and ecosystems data that exist in this country has not been calculated, probably because it is so extensive as to be extremely difficult to measure.
Of course, the complete record of biodiversity and ecosystems is orders of magnitude greater than this and exists in media types far more complex than paper. Biodiversity and ecosystem information exists in scores of institutional and individual databases and in hundreds of laboratory and personal field journals scattered throughout the country. In addition, the use of satellite data, spatial information, geographic information, simulation, and visualization techniques is proliferating (NASA currently holds at least 36 terabytes of data that are directly relevant to biodiversity and ecosystems), along with an increasing use of two- and three-dimensional images, full-motion video, and sound.
The natural history museums of this country contain at least 750 million specimens that comprise a 150- to 200-year historical record of biodiversity. Some of the information associated with these collections has been translated into electronic form, but most remains to be captured digitally. There are many datasets that have been digitized, but are in outdated formats that need to be ported into newer systems. There are also datasets that are accessible but of questionable, or at least undescribed, quality. There are researchers generating valuable data who do not know how to make those data available to a wide variety of users. And data once available online can still be lost to the community when their originator dies or retires (our society has yet to create a system that will keep data alive and usable once the originator is no longer able to do so). For these reasons, we lose the results of a great deal of biodiversity and ecological research that more than likely cannot be repeated.
Potentially useful and critically important information abounds, but it is virtually impossible to use it in practical ways. The sheer quantity and diversity of information require an organizing framework on a national scale. This national framework must also contribute to the Global Information Infrastructure, by making possible the full and open sharing of information among nations.
The term "data mining" has been used in the database community to describe large-scale, synthetic activities that attempt to derive new knowledge from old information. In fact, data mining is only part of a larger process of knowledge discovery that includes the large-scale, interactive storage of information (known by the unintentionally uninspiring term "data warehousing"), cataloging, cleaning, preprocessing, transformation, and reduction of data, as well as the generation and use of models, evaluation and interpretation, and finally consolidation and use of the newly extracted knowledge. Data mining is only one step in an iterative and interactive process that will become ever more critical if we are to derive full benefit from our biodiversity and ecosystems resources.
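The staging of that larger knowledge-discovery process can be sketched schematically. The stage functions below are hypothetical toys (the "mining" step is reduced to a threshold test), but they show data mining as one link in a chain that runs from selection through interpretation:

```python
def select(raw):
    """Sampling/selection: drop unusable entries."""
    return [r for r in raw if r is not None]

def preprocess(rows):
    """Cleaning and normalization: coerce mixed inputs to numbers."""
    return [float(r) for r in rows]

def transform(rows):
    """Reduction, e.g. collapsing raw observations to a summary value."""
    return sum(rows) / len(rows)

def mine(value):
    """The 'data mining' step proper: fit or match a simple model."""
    return {"mean": value, "flag": value > 10}

def interpret(model):
    """Evaluation and consolidation, normally done with the analyst."""
    return "elevated" if model["flag"] else "normal"

# The real process is iterative; a single forward pass is shown for clarity.
result = interpret(mine(transform(preprocess(select([12, None, "9", 15])))))
print(result)
```

Each stage in a production system would be a substantial subsystem in its own right; the composition, not the stage bodies, is what the text describes.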
New approaches, techniques, and solutions must be developed in order to translate data from outmoded media into usable formats, and to enable the analysis of large biodiversity and ecosystems databases. Faced with massive datasets, traditional approaches in database management, statistics, pattern recognition, and visualization collapse. For example, a statistical analysis package assumes that all the data to be analyzed can be loaded into memory and then manipulated. What happens when the dataset does not fit into main memory? What happens if the database is on a remote server and will never permit a naive scan of the data? What happens if queries for stratified samples are impossible because data fields in the database being accessed are not indexed so the appropriate data can be located? What if the database is structured with only sparse relations among tables, or if the dataset can only be accessed through a hierarchical set of fields?
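One established answer to the first of these questions, offered here only as an illustration, is single-pass streaming computation: statistics are updated as each value arrives, so memory use stays constant no matter how large the dataset. Welford’s online algorithm for mean and variance is a standard example:

```python
def running_stats(values):
    """Yield (count, mean, population variance) after each value; O(1) memory.

    Welford's online algorithm: each value is seen exactly once, so the
    input can be a generator reading from a file or network stream.
    """
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)   # accumulates sum of squared deviations
        yield n, mean, (m2 / n)

# Consume the stream, keeping only the final summary.
for n, mean, var in running_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]):
    pass
print(n, mean, var)
```

The same pattern (accumulate a small fixed-size summary per value) generalizes to histograms, sketches, and reservoir samples, which is why it matters for remote databases that permit only sequential access.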
Furthermore, problems often are not restricted to issues of scalability of storage or access. For example, what if a user of a large data repository does not know how to specify the desired query? It is not clear that a Structured Query Language statement (or even a program) can be written to retrieve the information needed to answer such a query as "show me the list of gene sequences for which voucher specimens exist in natural history collections and for which we also know the physiology and ecological associates of those species." Many of the interesting questions that users of biodiversity and ecosystems information would like to ask are of this type; the data needed to answer them must come from multiple sources that will be inherently different in structure. Software applications that provide more natural interfaces between humans and databases than are currently available are also needed. For example, data-mining algorithms could be devised that "learn" by matching user-constructed models, so that the algorithm would identify and retrieve database records by matching a model rather than a structured query. This would eliminate the current requirement that the user adapt to the machine’s needs rather than the other way around.
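The model-matching idea can be made concrete with a small sketch. Here the user supplies an exemplar record rather than a structured query, and the system ranks stored records by how closely they match it; the records, field names, and scoring rule are all hypothetical stand-ins for far richer learned models:

```python
def similarity(model, record):
    """Fraction of the exemplar's fields that the record matches exactly."""
    hits = sum(1 for k, v in model.items() if record.get(k) == v)
    return hits / len(model)

def retrieve(model, records, threshold=0.5):
    """Return records at least `threshold` similar to the exemplar, best first."""
    scored = [(similarity(model, r), r) for r in records]
    return [r for s, r in sorted(scored, key=lambda p: -p[0]) if s >= threshold]

records = [
    {"habitat": "saline", "clade": "Chenopodium", "sequenced": True},
    {"habitat": "saline", "clade": "Atriplex", "sequenced": False},
    {"habitat": "alpine", "clade": "Poa", "sequenced": True},
]
exemplar = {"habitat": "saline", "sequenced": True}
matches = retrieve(exemplar, records)
print(matches)
```

Note that the user never names the query language, the schema, or the join structure; the burden of mapping the exemplar onto heterogeneous sources falls on the system, which is precisely the inversion the text calls for.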
The major research and infrastructure requirements of the digitally accessible, distributed, interactive research library are several:
The library will of necessity place extensive and challenging demands on network hardware infrastructure services, as well as on those services relating to authentication, integrity, and security, including determining characteristics and rights associated with users. We need both a fuller implementation of current technologies, such as digital signatures and public-key infrastructure for managing cryptographic key distribution, and a consideration of tools and services in a broader context related to library use. For example, the library system may have to identify whether a user is a member of an organization that has some set of access rights to an information resource. As a national and international enterprise that serves a very large range of users, the library must be designed to detect and adapt to variable degrees of connectivity of individual resources that are accessible through networks.
A fully digital, interactive library system requires substantial computational and storage resources, both in servers and in a distributed computational environment. Little is known about the precise scope of the necessary resources, and so experimentation will be needed to determine it. Many existing information retrieval techniques are extremely intensive in both their computational and their input-output demands as they evaluate, structure, and compare large databases that exist within a distributed environment. In many areas that are critical to digital libraries, such as knowledge representation and resource description, or summarization and navigation, even the basic algorithms and approaches are not yet well defined, which makes it difficult to project computational requirements. It does appear likely, however, that many operations of digital libraries will be computationally intensive, for example distributed database searching, resource discovery, automatic classification and summarization, and graphical approaches to presenting large amounts of information, because digital library applications call for the aggregation of large numbers of autonomously managed resources and their presentation to the user as a coherent whole.
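The final point, aggregating autonomously managed resources into a coherent whole, can be sketched for the simplest case: merging ranked result lists from sources that each score documents on their own incompatible scale. The per-source max normalization used here is one simple assumption among many possible merging strategies.

```python
def merge_results(source_lists):
    # Each autonomous source returns (doc_id, raw_score) pairs on its
    # own scale. Normalize each source's scores to [0, 1] against that
    # source's best hit, then keep the best normalized score per document.
    best = {}
    for results in source_lists:
        if not results:
            continue
        top = max(score for _, score in results)
        for doc, score in results:
            norm = score / top if top else 0.0
            best[doc] = max(best.get(doc, 0.0), norm)
    # Present one coherent ranking to the user.
    return sorted(best.items(), key=lambda kv: -kv[1])

merged = merge_results([
    [("d1", 12.0), ("d2", 6.0)],   # source A scores in the tens
    [("d2", 0.9), ("d3", 0.3)],    # source B scores below one
])
```

Production systems use more sophisticated score calibration, but the structural problem, reconciling independently managed rankings, is the same.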
Even though the library system we are proposing here would not set out to accrue datasets or to become a repository for all biodiversity data (after all, NASA and other agencies have their own storage facilities, and various data providers will want to retain control over their own data), massive storage capabilities on disc, tape, optical, or other future technology (e.g., holography) will still be required. As research is conducted to devise new ways to manipulate huge datasets, such datasets will have to be sought out, copied from their original sources, and stored for use in the research. And, in serving its long-term curation function, the library will accumulate substantial amounts of data for which it will be responsible. The nodes will need to mirror datasets of other nodes or other sites (for redundancy, to ensure data persistence), and this function will also require storage capacity.
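Mirroring for data persistence can be illustrated with a minimal sketch, assuming each node keeps a manifest of checksums so that a copy corrupted in transit or in storage can be detected. The class name, dataset names, and contents below are hypothetical.

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class MirrorNode:
    def __init__(self):
        self.store = {}     # dataset name -> bytes held locally
        self.manifest = {}  # dataset name -> checksum from the source node

    def replicate(self, name, data, expected):
        # Refuse to store a copy whose checksum does not match the source
        # manifest: corruption in transit must not propagate to mirrors.
        if checksum(data) != expected:
            raise ValueError(f"checksum mismatch for {name!r}")
        self.store[name] = data
        self.manifest[name] = expected

    def audit(self):
        # Periodic integrity check of locally held copies; returns the
        # names of datasets that must be re-fetched from another mirror.
        return [n for n, d in self.store.items()
                if checksum(d) != self.manifest[n]]
```

With several nodes auditing themselves and re-fetching damaged datasets from peers, no single hardware failure loses data, which is the persistence guarantee the text calls for.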
Information management: Major advances are needed in methods for knowledge representation and interchange, database management and federation, navigation, modeling, and data-driven simulation; in effective approaches to describing large, complex, networked information resources; and in techniques to support networked information discovery and retrieval in extremely large-scale distributed systems. In addition to near-term operational solutions, new approaches are also needed to longer-term issues such as the preservation of digital information across generations of storage, processing, and representation technology. Traditional information science skills such as thesaurus construction and complex indexing are currently being transformed by the challenge of making sense of the data on the World Wide Web and other large information sources. We need to preserve and support the knowledge of library and information science researchers, and help scale up the skills of knowledge organization and information retrieval.
Data mining, indexing, statistical and visualization tools: The library system will use as well as develop tools for its various functions. Wherever possible, tools will be adopted and adapted from other arenas, such as defense, intelligence, and industry. A reciprocal relationship among partners in these developments will provide the most rapid progress and best results.
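The thesaurus construction and indexing skills mentioned above combine naturally: a controlled vocabulary expands a query before it is run against an inverted index, so a search for one name also finds documents using a synonym. The thesaurus entries and documents below are invented for illustration.

```python
from collections import defaultdict

# Toy controlled vocabulary: a query term maps to its synonym set.
THESAURUS = {"puma": {"puma", "cougar"}}

def build_index(docs):
    # Classic inverted index: term -> set of document identifiers.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    # Expand the query through the thesaurus, then union the postings.
    terms = THESAURUS.get(query.lower(), {query.lower()})
    return set().union(*(index.get(t, set()) for t in terms))

docs = {
    "d1": "cougar sightings in the Sierra Nevada",
    "d2": "puma physiology field notes",
    "d3": "oak woodland vegetation survey",
}
```

For biodiversity data the thesaurus would hold taxonomic synonymies, which is precisely the kind of knowledge-organization skill the text argues must be preserved and scaled up.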
• Research Issues:
Many of the research issues to be taken up by the researchers who work at the virtual library system have been mentioned in the discussion above. Among the most important issues are content-based analysis, data integration, automatic indexing on multiple levels (of content within databases; of content and quality of databases across disciplines and networks; of compilations of data made in the process of research; etc.), and data cleansing. The latter is a process that at present is extremely tedious, time- and labor-intensive, inefficient, and often ineffectual. Much of the current expenditure on databases is consumed by the salaries of people who do data entry and data verification. Automatic means of carrying out these tasks are a priority if we are to be able to utilize our biodiversity and ecosystems information to protect our natural capital.
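A small sketch shows the kind of data cleansing being done by hand today that the text wants automated: canonicalizing messy field entries and removing duplicate records. The field names, catalog numbers, and duplicate rule (same species plus same catalog number) are illustrative assumptions, not an established standard.

```python
import re

def normalize(record):
    # Canonicalize a raw collection record: trim whitespace, collapse
    # internal runs of spaces, and lower-case the species name.
    species = re.sub(r"\s+", " ", record["species"].strip()).lower()
    return {**record, "species": species}

def dedupe(records):
    # Assume two records with the same species and catalog number are
    # duplicate entries of the same specimen; keep the first seen.
    seen, clean = set(), []
    for rec in map(normalize, records):
        key = (rec["species"], rec["catalog"])
        if key not in seen:
            seen.add(key)
            clean.append(rec)
    return clean

raw = [
    {"species": "Puma  concolor ", "catalog": "MVZ-114"},
    {"species": "puma concolor",   "catalog": "MVZ-114"},
    {"species": "Ursus arctos",    "catalog": "MVZ-220"},
]
```

Real specimen data also needs misspelling correction and taxonomic synonym resolution, which is where the research effort the text calls for would concentrate.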
As Vice President Gore said, we have excess data that are unused. Yet we have paid substantial sums to collect those data, and, if they are analyzed and synthesized properly, they can contribute much to our understanding of biodiversity and ecosystems. Our national natural capital is too critically important for us to fail to devote the time and energy required to learn to use it sustainably. To develop the means to do that, we need to have knowledge and understanding of biodiversity and ecosystems; to develop that knowledge and understanding, we must mine the data that we have, and that we are generating, for correlations that will identify pattern and process. We must prevent what Mr. Gore referred to as "data rot" and "information pollution" by putting the data to use. To do that effectively, we must employ the tools and technologies that are making data mining possible. If we do not build the fully digitally accessible, interactive research library of biodiversity and ecosystems information, we will lose the opportunity to realize the fullest returns on our data-gathering investments and to optimize the returns from our natural capital.
We recommend that an appropriate avenue be found for further planning and implementation of the library system. The planning panel should include knowledgeable individuals from government, the private sector, and academia. It should further develop the interactive research library concept and design a plan whereby sites would be proposed and chosen and the work carried out. A request for proposals will be needed, and a means of selecting the most meritorious among these. Many government agencies will of necessity be involved in this process, and all should contribute expertise where needed, but we recommend that the NSF take the overall lead in the process, supported by the NSTC Committee on Technology (CIT), with participation from agencies that hold biodiversity and ecosystems information but which are not members of the CIT.
Each of the regional nodes that will form the core of the digitally accessible, interactive research library system will require an annual operating budget of at least $8 million. Supporting five or six such nodes (the number we regard as adequate to the task) and the high-speed connections among them will therefore require a minimum of $40 million per year, an amount that represents a mere fraction of the funds spent government-wide each year to collect data (conservatively estimated at $500 million), data that may or may not be used or useful because the techniques and tools to put them to optimal use have yet to be developed. As with the Internet itself and other computer and information technologies, the Federal government plays a "kickoff" or "jumpstart" role in the institution of a new infrastructure. Gradually, support and operation of that infrastructure should shift to other partners, just as has happened with the Internet, although in this case there will have to be at least a modicum of permanent Federal support (for the maintenance of the government's own data, for instance).
The planning and request-for-proposals process should be conducted within one year. Merit review and selection of sites should be complete within the following six months. The staffing of the sites and initial coordination of research and outreach activities should take no more than a year after initial funding is provided. The "lifetime" of any one facility should probably not be guaranteed for more than five years, but the system must be considered a long-term activity, so that data access is guaranteed in perpetuity. Evaluation of the sites and of the system should be regular and rigorous, although the milestones whereby success can be measured will be the incremental improvements in ease of use of the system by policy-makers, scientists, householders, and even school children. In addition, an increasing number of public-private partnerships that fund the research and other operations will indicate the usefulness of accessible, integrated information to commercial as well as governmental concerns.