Cyberinfrastructure and the Sciences at Liberal Arts Colleges
Introduction
The
technical nature of scientific research led to the establishment of
early computing infrastructure, and today the sciences are still
pushing the envelope with new developments in cyberinfrastructure.
Education in the sciences poses different challenges, as faculty must
develop new curricula that incorporate and educate students about the
use of cyberinfrastructure resources. To be integral to both
science research and education, cyberinfrastructure at liberal arts
institutions needs to provide a combination of computing and human
resources. Computing resources are a necessary first element, but
without the organizational infrastructure to support and educate faculty and students alike, computing facilities will have only
a limited impact.
A complete local
cyberinfrastructure picture, even at a small
college, is quite large and includes resources like email, library databases and on-line information sources, to
name just a few. Rather than trying to cover such a broad range, this article
will focus on the specific hardware and human resources that are key to a
successful cyberinfrastructure in the sciences at liberal arts institutions. I
will also touch on how groups of institutions might pool resources,
since the
demands posed by the complete set of hardware and technical staff may be larger than a
single institution alone can manage. I should point out that many of these features
are applicable to both large and small universities, but I will emphasize those
elements that are of particular relevance to liberal arts institutions. Most
of this discussion is based on experiences at Wesleyan University over the
past several years, as well as plans for the future of our current
facilities.
A brief history of computing infrastructure
Computing needs in the sciences have
changed dramatically over the years. When computers first became an integral
element of scientific research, the hardware needed was physically very large
and very expensive. This was the "mainframe" computer and, because of the cost and size, these machines were generally maintained as a central
resource. Additionally, since this was a relatively new and technically
demanding resource, it was used primarily for research rather than education
activities.
The
desktop PC revolution started with the IBM PC/AT in 1984 and led to the
presence of a computer on nearly every desk by the mid-1990s. The
ubiquity of desktop computing initiated tremendous change to both the
infrastructure and uses of computational resources. The affordability and
relative power of new desktops made mainframe-style computing largely
obsolete. A computer on every desktop turned users into amateur computer
administrators. The wide availability of PCs also meant that students grew up
with computers and felt comfortable using them as part of their
education. As a result, college courses
on programming and scientific computing, as well as general use of computers
in the classroom, became far more common.
Eventually, commodity computer hardware
became so cheap that scientists could afford to buy many computers to
expand their research. Better yet, they found ways to link computers
together to form inexpensive supercomputers, called clusters or "Beowulf" clusters, built from cheap, off-the-shelf
components. Quickly, the size of these do-it-yourself clusters grew very
large, and companies naturally saw an opportunity to manufacture and sell them
ready-made. People no longer needed detailed technical knowledge of how to
assemble these large facilities; they could simply buy them.
This
widespread availability of cluster resources has brought
cyberinfrastructure needs full circle. The increasing size, cooling
needs, and complexity of maintaining a large computing cluster have
meant that faculty now look to information technology (IT) services to
house and maintain cluster facilities. Maintaining a single large
cluster for university-wide usage is more cost effective than maintaining several smaller
clusters
and reduces administrative overhead. Ironically, we seem to have
returned to something resembling the mainframe model. At the same time,
the more recently developed desktop support remains critical. As
technology continues to progress, we will doubtless shift paradigms
again, but the central cluster would appear to be the dominant approach
for at least the next five years.
Hardware resources
The cluster is the central piece
of hardware--but what makes up the cluster? How large a cluster is needed?
Before we can address the question of size, we should outline the key
elements. This becomes somewhat technical, so some readers may wish to skip
the next five paragraphs.
First, there is the raw computing power
of the processors to consider. This part of the story has become more
confusing with the recent advent of multiple core processors. In short, a
single processor may have 2, 4 or, soon, 8 processing cores, each of which is
effectively an independent processor. This does not necessarily mean it can do
a task faster, but it can perform multiple tasks simultaneously. Today, I
think of the core as the fundamental unit to count, since a single processor
may have several cores, and a single "node" (physically, one computer) may
have several processors. For example, at Wesleyan, we recently installed a
36-node cluster, each node having 2 processors and each processor having 4
cores. So while a 36-node cluster may not sound like much, it has packed into
it 288 computing cores.
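As a minimal sketch of this counting, the short Python snippet below multiplies out the three levels of the hierarchy. The function name `total_cores` is my own illustrative choice, not part of any cluster software; the figures are those of the Wesleyan cluster described above.

```python
def total_cores(nodes: int, processors_per_node: int, cores_per_processor: int) -> int:
    """Count cores as nodes x processors per node x cores per processor."""
    return nodes * processors_per_node * cores_per_processor


# Figures quoted in the text: 36 nodes, 2 processors each, 4 cores per processor.
wesleyan_cores = total_cores(nodes=36, processors_per_node=2, cores_per_processor=4)
print(wesleyan_cores)  # 288
```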
This high density of computing cores has
several advantages: it decreases the footprint of the cluster; decreases
cooling needs; and decreases the number of required connections. For the
moment, let's focus on connectivity. The speed of connections between
computers is glacial in comparison to the speed of the processors. For
example, a 2-GHz processor does one operation every 0.5 nanoseconds. To get an
idea of how small an amount of time this is, consider that light travels just about 6
inches in this time. The typical latency--the time lost to initiate a
transmission--of a wired ethernet connection is in the range of 0.1-1
milliseconds, or roughly 200,000 to 2 million clock cycles of the processor. Hence, if a
processor is forced to wait for information coming over a network, it may
spend a tremendous number of cycles twiddling its thumbs, just due to latency.
Add the time for the message to transmit, and the problem becomes even worse.
Multiple cores may help limit the number of nodes, and therefore reduce the number of
connections, but the connectivity problem is still unavoidable. So what to
do?
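To make the scale of the problem concrete, the rough Python sketch below works through the arithmetic using only the figures given above: a 2-GHz clock and a latency of 0.1 to 1 milliseconds. The function names are illustrative assumptions of mine, and the calculation ignores transmission time entirely, so it understates the real cost of waiting.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre
METRES_PER_INCH = 0.0254              # exact, by definition of the inch


def cycle_time_ns(clock_mhz: int) -> float:
    """Duration of one clock cycle in nanoseconds."""
    return 1000 / clock_mhz


def light_travel_inches(duration_ns: float) -> float:
    """Distance light covers in a vacuum during duration_ns nanoseconds."""
    return SPEED_OF_LIGHT_M_PER_S * duration_ns * 1e-9 / METRES_PER_INCH


def cycles_lost_to_latency(latency_us: int, clock_mhz: int) -> int:
    """Clock cycles that elapse while waiting out a network latency.

    Microseconds times megahertz gives a plain cycle count, so the
    arithmetic stays in exact integers.
    """
    return latency_us * clock_mhz


CLOCK_MHZ = 2000  # the 2-GHz processor from the text

print(cycle_time_ns(CLOCK_MHZ))                 # 0.5
print(round(light_travel_inches(0.5), 1))       # 5.9
print(cycles_lost_to_latency(100, CLOCK_MHZ))   # 200000  (0.1 ms latency)
print(cycles_lost_to_latency(1000, CLOCK_MHZ))  # 2000000 (1 ms latency)
```

Even at the low end of the latency range, a core that must wait on the network idles for a couple of hundred thousand cycles per message.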
The answer depends on the intended usage
of the cluster. In many cases, users want to run many independent,
single-process (serial) tasks. In this case, communication between the various pieces
is relatively unimportant, since the vast majority of the activity is
independent. Ordinary gigabit ethernet should suffice in this situation and
is quite cheap. If the usage is expected to include parallel applications,
where many cores work together to solve a single problem faster, it may be
necessary to consider more expensive solutions. However, given that it is easy
to purchase nodes containing 8 cores in a single box, these expensive and
often proprietary solutions are only needed for rather large parallel
applications, of which there are relatively few.
All this processing power is useless, however,
without a place to store the information. This is most commonly achieved by
hard disks that are bundled together in some form, though for the sake of simplicity, they appear to the
end user as a single large disk. These bundles of disks can
easily achieve storage sizes of tens to hundreds of terabytes, a terabyte being
1000 gigabytes. The ability to store such large amounts of information is
particularly important with the emergence in the last decade of informatics
technologies, which rely on data-mining of very large data sets.
The last, and sometimes the greatest,
challenge is housing and cooling the cluster. Even with the high density of
computing cores, these machines can be large and require substantial
cooling. A dedicated machine room with supplemental air conditioning is
needed, typically maintained by an IT services organization. Fortunately, most
IT organizations already have such a facility, and with the decreasing size of
administrative university servers, it is likely that space
can be found without major building modifications. However, do not be
surprised if additional power or further boosting of cooling is needed. The
involvement of the IT organization is critical to the success of
the infrastructure. Accordingly,
it is important that IT services and technically-inclined faculty
cultivate a good working relationship in order to communicate
effectively about research and education needs.
OK, but how big?
Given
these general physical specifications for the key piece of hardware,
the question remains, how big a cluster? Obviously the answer depends
on the institution, but I estimate 3 or 4 processing cores for each
science faculty member. An alternate and perhaps more accurate way to
estimate is to consider how many faculty members are already heavy
computational users and already support their own facilities. I would
budget about 50 cores for each such faculty member, though it is wise
to more carefully estimate local usage. Part of the beauty of a shared
facility is that unused computing time that might be lost on an
individual faculty member's facility can be shared by the community,
reducing the total size of the cluster necessary to fulfill peak needs.
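The two rules of thumb above can be compared side by side with a small Python sketch. The per-person figures (3 or 4 cores per science faculty member, about 50 cores per heavy user) are my estimates from the text; the function names and the example college of 80 science faculty with 6 heavy computational users are purely hypothetical and should be replaced with local numbers.

```python
def cores_by_headcount(science_faculty: int) -> tuple:
    """Low and high estimates at 3 and 4 cores per science faculty member."""
    return 3 * science_faculty, 4 * science_faculty


def cores_by_heavy_users(heavy_users: int, cores_each: int = 50) -> int:
    """Estimate at roughly 50 cores per existing heavy computational user."""
    return heavy_users * cores_each


# Hypothetical college: 80 science faculty, 6 of them heavy computational users.
low, high = cores_by_headcount(80)
heavy = cores_by_heavy_users(6)
print(low, high)  # 240 320
print(heavy)      # 300
```

When the two estimates land in the same range, as they do for these made-up numbers, that is a reasonable sign the cluster size is about right; a large gap suggests looking more carefully at local usage.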
Software needs tend to be specialized according to the intended uses, but it is important to budget funds for various software
needs, such as compilers and special purpose applications. The Linux operating
system is commonly used on these clusters and helps to keep down
software costs since it is an open source system. For many scientific
computing users, Linux is also the preferred environment
regardless of cost.
The cluster itself is of limited use without the human resources--that is, the technical staff--to back it up. At a minimum, a dedicated systems administrator is needed to ensure the smooth operation of the facility. Ideally, the administrator can also serve as a technical contact for researchers to assist in the optimal use of the cluster facility. However, to make the facility widely accessible and reap the full benefit for the larger university community, a more substantial technical support staff is needed.
The human element: resource accessibility
The presence of a substantial cluster is
an excellent first step, but without additional outreach, the facility is
unlikely to benefit anyone other than the expert users who were previously
using their own local resources. Outreach is key and can take a number of
forms.
First, faculty who are expert in
the use of these computer facilities need to spearhead courses that introduce students
to the use and benefits of a large cluster. This will help build a pool of
competent users who can spread their knowledge beyond the scope of
the course. This effort requires little extra initiative and is common at
both liberal arts and larger universities.
Second, it is particularly important in a liberal arts environment to develop and sustain a broad effort to help non-expert
faculty take advantage of this resource for both research and educational
purposes. Otherwise, the use of these computers will likely remain limited to the existing expert
faculty and the students whom they train.
Outreach
across the sciences can also take the form of a cross-disciplinary
organization. At Wesleyan, we established a Scientific Computing and
Informatics Center, with the goal of both facilitating the use of
high-performance computing and supporting course initiatives that
use computational resources. The center is directed by a dedicated
coordinator, who is
not burdened with the
technical duties of the systems administrator, and is assisted by trained
student tutors.
The
first goal of the center, facilitating cluster use, is primarily
research-oriented. That is, the center serves as a resource where
faculty and students can seek assistance or advice on a range of
issues--from simple tasks like accessing the resources to complex
problems like optimization or debugging complex codes. In addition, the
center offers regular tutorials on the more common issues, making
broader contact across the institution.
The second goal--educational outreach--is
particularly important for liberal arts institutions. Educational
outreach deals with all aspects of computational activities in the curriculum,
not just cluster-based activities. For example, if a faculty member wishes to
make use of computational software, the center staff will offer training to
the students in the course, thereby leaving class time to focus on
content. The center staff will also be available for follow-up assistance as
the need arises. This eliminates the problem of trying to add or include
training for computational resources in existing courses.
But efforts should not stop at this level.
While we are still in the early stages of our experiment at Wesleyan, I
believe that such a support organization will not have a significant impact if
it simply exists as a passive
resource. The center must actively
seek out resistant faculty and
demonstrate through both group discussions and one-on-one interactions how computational
resources can enhance their teaching activities.
To maintain the long-term vitality of this kind of center, it is important to sustain a group of trained and
motivated student tutors. To do this, we have chosen to offer students summer
fellowships to work on computationally demanding
research projects with faculty. Some of these
students then serve as tutors during the academic year. Combined with this
summer program are regular lecture and tutorial activities. These tutorials
may also be expanded to reach beyond the bounds of the university to other
institutions as workshop activities.
Cross-institutional collaboration
Sometimes,
all of these goals can be met by a single
institution. But even if this is possible, there are still benefits to looking
outside the institution. And for smaller institutions, pooling
resources may be the only way to develop an effective
cyberinfrastructure.
While high-speed networks now make it technically possible to establish
inter-institutional efforts across the country, it is important to be able to
gather together a critical mass of core users who can easily interact with
each other. In my own experience, this
happens more easily when the users are
relatively nearby, say no more than 100 miles apart. Such proximity means that
institutions can share not only the hardware resources over the network, but
also the technical support staff. Of course, day-to-day activity is limited to
interaction within an institution or virtual communications between
institutions, but frequent and regular person-to-person interaction can be
established at modest distances.
Balancing individual institutional
priorities in such a collaboration is obviously a delicate process, but I
envision that the institution with the most developed IT services can house
and maintain the primary shared hardware resource, thereby reducing the
administrative needs across several institutions. Adequate access to
facilities can be guaranteed by taking advantage of the fact that most states
maintain high-speed networks dedicated for educational usage. In addition,
there are many connections between these state networks, such as the New
England Regional Network. Personal interactions can be facilitated by regular
user group meetings where users can share their questions and concerns with an
audience that extends beyond their institution. In addition, new electronic
sharing tools, such as wikis and blogs, can help foster more direct virtual
communications.
Summary
To have a successful cyberinfrastructure
in the sciences, it is essential to develop both hardware and human resources.
Personal support and outreach to faculty and students are crucial if the
benefits of the infrastructure are to serve a wider clientele. For liberal
arts institutions, the presence of state-of-the-art infrastructure helps them
to compete with larger institutions, both in terms of research and in
attracting students interested in technology. At the same time, emphasizing
outreach is of special importance to achieve the educational goals that make
liberal arts institutions attractive to students.
Acknowledgments
I wish to thank Ganesan Ravishanker
(Associate Vice President for Information Technology at Wesleyan University)
and David Green for
their assistance preparing this article.
How to cite this work
Francis Starr. "Cyberinfrastructure and the Sciences at Liberal Arts Colleges." Academic Commons Issue Name (Spring 2008): 12 October 2008. <http://www.academiccommons.org/>.