
Fault Management
Otto Wittner, Carsten J. E. Hölper, Bjarne E. Helvik
Department of Telematics, NTNU.
E-mail: Otto.Wittner@item.ntnu.no
Abstract
Recently, mobile agent technology has been recognised as a potential tool for realising distributed network fault management. The autonomy and mobility of such agents can help ensure the robustness of the management system. A mobile agent depends on a suitable environment consisting of a set of mobile agent systems (MAS) of compatible system types. In this paper we examine what failure semantics are desirable for services provided by a MAS when the MAS is assumed to be part of a network fault management system. Based on a general failure class model we define a new failure subclass. The subclass identifies the usefulness of certain response failure semantics in MAS-based systems in contrast to traditional client-server systems. At the end of the paper, results from an evaluation project examining state-of-the-art MASes are presented.
Keywords: Mobile agent system, network management, fault management, failure semantics.
1 Introduction
Traditionally, network fault management (NFM), and network management in general, has been implemented based on static architectures [1,7] with a high degree of centralisation. Fault information is collected from different network elements (NEs) and funneled towards one or a few central NFM units where filtering, correlation and other forms of analysis are performed. Due to the increasing number of NEs able to report alarm messages electronically, vast amounts of failure information must be transported to the central units. Thus an extra load is put upon the network used by the NFM system, which is often a subnet of the network being managed. In some cases this extra load may cause positive feedback and aggravate the failure situation being reported. By distributing the NFM system throughout the network, better load balancing is possible.
Centralised architectures are vulnerable to failures unless extra precautions are taken. The fact that the whole system depends upon a few nodes in the architecture makes it a candidate for catastrophes. A single node crash can cause a full system breakdown. For a NFM system this is not desirable. Robustness to faults is essential if faults are to be handled properly. By moving to a more distributed architecture, dependability can be improved.
Recently, mobile agent technology [3,2] has been recognised as a potential tool for implementing distributed NFM systems [10,2,18]. In a distributed architecture the NFM system is divided into smaller units capable of performing work while located at different places in the network environment. Autonomy is essential for the small units to ensure robustness of the total system. Being able to migrate between NEs can be important to cope with the dynamics of the network environment. Both autonomy and mobility are fundamental attributes of mobile agents.
Figure 1: NFM using mobile agents
For its existence, a mobile agent depends on a suitable environment. We choose to view such an environment as a collection of mobile agent systems (MAS) of compatible system types. Our view conforms with OMG's MASIF standard [20,9]. Figure 1 illustrates a network environment where a NFM system is managing NEs using mobile agents.
A MAS can be seen as a service provider for mobile agents visiting or wanting to visit the NE where the MAS resides. Table 1 shows a list of typical services. Our list has emerged from the idea of a MAS-based NFM system, but resembles work done by other researchers, e.g. [4] proposes a list of facilities that should be supported by a persistent MAS, and [20] summarises the set of necessary agent system functions the MASIF standard addresses.
Our objective in this paper is to examine what failure semantics are desirable for the services provided by a MAS when assuming the MAS to be part of a NFM system. We also test how state-of-the-art MASes available today conform with our proposed semantics.
The rest of the paper is organised as follows. Section 2 reviews failure classes and introduces our new subclass, section 3 examines the failure semantics of a MAS from a service viewpoint, and section 4 presents results from an evaluation of selected MAS implementations. In section 5 we conclude and indicate future research tasks.
Category | Service
Basic | Receives/transmits and (un)marshals agents
" | Executes agent instructions
Communication | Encodes/decodes and sends/receives messages
" | Provides address and name information
" | Provides shared information areas
Persistence | Stores snapshots of agent state information
" | Enables grouping of agent actions into atomic transactions
Security | Provides mechanisms for secure authentication and access control for MASes and agents
" | Provides misc. encryption tools for ensuring information security
Other | Provides multipurpose database service
" | Provides access to management functions for MAP and NE
Table 1: A summary of possible services provided by a mobile agent system, categorised and grouped appropriately.
2 Failure Classes
The behavior a server exhibits when it is not able to respond properly (as given by specifications) to a service request is of great importance from a fault management perspective. Such behavior is by definition a failure [14] and can be classified into three main classes: timing, response and omission failures [6].
A timing failure occurs when the response is correct but untimely, a response failure occurs when the response is timely but incorrect, and an omission failure occurs when no response is given to a request at all. In principle omission failures are a subclass of timing failures and/or response failures, i.e. an infinitely delayed response or a blank, undetectable value returned. Table 2 lists the main classes and some subclasses taken from [6].
When a MAS is the service provider and the application domain is NFM systems, we argue that one additional failure class is particularly relevant.
A bounce failure occurs when a server forwards an object to a different location than initially specified. Selection of the new location can be random or follow some given strategy.
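The failure classes discussed in this section can be illustrated with a small classifier. This is only a sketch under stated assumptions: the `Observation` record, its field names and the `classify` function are hypothetical and not part of any MAS, and the order in which the checks are applied when several conditions hold at once is our own choice, not something prescribed by [6].

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureClass(Enum):
    NONE = "no failure"
    OMISSION = "omission"
    TIMING = "timing"
    RESPONSE = "response"
    BOUNCE = "bounce (subclass of response)"


@dataclass
class Observation:
    """What a client saw after one service request (hypothetical record)."""
    expected_value: object
    returned_value: Optional[object]   # None means nothing came back
    latency: Optional[float]           # seconds; None if nothing came back
    earliest: float                    # earliest acceptable response time
    deadline: float                    # latest acceptable response time
    requested_location: Optional[str] = None   # set only for forwarding requests
    actual_location: Optional[str] = None


def classify(obs: Observation) -> FailureClass:
    # Omission: no response is given to the request at all.
    if obs.returned_value is None or obs.latency is None:
        return FailureClass.OMISSION
    # Bounce: the object was forwarded to a location other than the one requested.
    if (obs.requested_location is not None
            and obs.actual_location != obs.requested_location):
        return FailureClass.BOUNCE
    # Response: an incorrect value was returned (assumption: checked before timing).
    if obs.returned_value != obs.expected_value:
        return FailureClass.RESPONSE
    # Timing: the value is correct but arrived too early or too late.
    if not (obs.earliest <= obs.latency <= obs.deadline):
        return FailureClass.TIMING
    return FailureClass.NONE
```

For example, a migration request for a hypothetical host "mas-b" that is acknowledged in time but lands the agent on "mas-c" is classified as a bounce failure.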
The new class is not distinct from the classes described above, but rather a subclass of response failures. E.g. if the object is an agent and the agent has requested migration to a certain MAS but is bounced off to a different MAS, the agent experiences a response failure (wrong location) to its migration request.
Failure Class | Description
Omission | No response returned
Crash
Timing | Response returned untimely
Early timing
" | Response too late
-
" | Incorrect value returned
State transition failure
Recent work on applying principles of collective behavior to problems of the network management domain has shown promising results [5,19,17,18]. The idea is based on using a high number of small and simple agents. By letting each of these agents move around and perform simple operations, a powerful collective behavior emerges, which in turn makes the group of agents capable of performing complex tasks.
We argue that in a NFM system based on simple mobile agents and principles of collective behavior, our new failure class, bounce failures, shows its significance. In the following two sections we look at some of the MAS services from Table 1 and describe why bounce failure semantics for these services can be advantageous.
3.2.1 Basic Services and Bounce Failure Semantics
If one of the basic services in a MAS fails, i.e. migration or execution, the agent in question will normally have severe difficulties continuing its mission, especially if the agent is simple and lacks smart handlers for emergency situations. Self-termination can quickly become the only option.
A simple case is when no contact with the destination MAS can be established during a migration operation. Omission failure semantics may seem desirable since they enable the agent to reselect its destination and re-run the migration operation. But a simple agent might not have enough knowledge to select an alternative destination. If the only other option is self-termination, bounce failure semantics can give such an agent a chance to continue its mission. Assuming that the agent is bounced to a random location, it would detect its new incorrect location and possibly retry migration to the desired destination.
Another case is when an agent reaches its destination in an apparently successful migration operation, but fails to execute due to incompatibilities or limitations in the destination MAS. Omission failure semantics for the execution service would imply termination of the agent. If migration and execution are grouped into one transaction [15], the fault would appear as a migration omission failure. In both cases bounce failure semantics would give the agent an opportunity to continue its mission, provided that it eventually arrives at a MAS where it is able to execute successfully.
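The migration case above can be sketched as a toy simulation. Everything here is an assumption made for illustration: the host names, the reachability maps and the functions `migrate` and `simple_agent` are hypothetical and do not correspond to the API of any MAS evaluated in this paper. The agent is deliberately simple: it knows only its desired destination and terminates itself if a migration request yields nothing.

```python
import random
from typing import Dict, List, Optional


def migrate(dest: str, up: Dict[str, bool], semantics: str,
            rng: random.Random) -> Optional[str]:
    """Return the host where the agent ends up, or None if the request fails.

    `up` maps hypothetical host names to reachability. With "omission"
    semantics an unreachable destination yields no result. With "bounce"
    semantics the agent is forwarded to a random reachable host instead.
    """
    if up.get(dest, False):
        return dest
    if semantics == "bounce":
        candidates = sorted(h for h, ok in up.items() if ok)
        return rng.choice(candidates) if candidates else None
    return None


def simple_agent(start: str, dest: str, up_schedule: List[Dict[str, bool]],
                 semantics: str, seed: int = 0) -> List[Optional[str]]:
    """Trace of locations visited; a trailing None marks self-termination.

    `up_schedule` holds one reachability map per migration attempt.
    """
    rng = random.Random(seed)
    trace: List[Optional[str]] = [start]
    for up in up_schedule:
        here = migrate(dest, up, semantics, rng)
        if here is None:
            trace.append(None)    # no alternative known: self-termination
            return trace
        trace.append(here)
        if here == dest:          # mission can continue at the destination
            return trace
        # Bounced: the agent detects the wrong location and simply retries.
    return trace
```

If the destination is unreachable on the first attempt and reachable on the second, the agent terminates under omission semantics but reaches its destination after one detour under bounce semantics.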
From a NFM system perspective there are several reasons why effort should be put into helping agents escape faulty MASes.
If a certain faulty MAS just “swallows” agents, i.e. the execution service has omission semantics, the NFM system will have little difficulty detecting the fault but will have difficulty locating the faulty MAS. None of the agents experiencing the fault will be alive and able to report.
Failures which cause partitioning of a network are difficult to manage. A faulty NE acting as a gateway between two network segments can typically cause such partitioning (Figure 1). If the NE provides MAS services (in addition to gateway services), execution of agents visiting the NE may fail due to the faulty state of the NE. Assuming our mobile agent based NFM system lacks an agent factory facility in one of the segments, distributing agents out into both segments will be impossible if the MAS services of the NE have omission failure semantics. Getting one or a few agents across the border could enable the NFM system to establish an alternative path between the two network segments and initiate rerouting of traffic.
3.2.2 Enhancing Services and Bounce Failure Semantics
A mobile agent will often be able to proceed with its work even if one of the enhancing services fails to respond properly, e.g. the agent can migrate to the next MAS on its itinerary or to its home MAS for error reporting. Thus omission failure semantics will be desirable for these services in most cases.
Several of the enhancing services provide what could be called a pure information service, i.e. the user requests information and the service responds with an information package. Further, the information in question is often location independent, i.e. the information requested does not have a direct relation to where the information source is situated. Examples of such services are the directory, authentication and database services. Some of the primitives within a checkpoint service, e.g. “fetch snapshot”, and within a management service, e.g. “get load of network segment”, can also be classified as information requests with some degree of location independence.
Requests for location independent information include a destination address, but can be redirected to a different address and still result in a correct response. Often a redirection will be transparent to the requester, except for additional delays (e.g. a chain of standby/restoration servers [13]). Redirection overhead can be avoided by informing the requester of the situation and letting it update its initial destination address. The latter kind of non-transparent request-redirection scenario can be viewed as a bounce failure, and the service can be claimed to have bounce failure semantics.
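The difference between transparent and non-transparent redirection can be sketched as follows. The `Server` and `Requester` types, the server names and the stored key are hypothetical and serve only as an illustration. They are not taken from [13] or from any of the MASes discussed here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Server:
    """Hypothetical information server; `forward_to` names its standby."""
    name: str
    data: Dict[str, str] = field(default_factory=dict)
    serving: bool = True
    forward_to: Optional[str] = None


def resolve(servers: Dict[str, Server], address: str) -> Tuple[Server, int]:
    """Follow the forwarding chain; return the answering server and hop count."""
    hops = 0
    server = servers[address]
    while not server.serving:
        # Stop on a dead end, or when the chain is longer than the server set
        # (which can only happen if the forwarding chain contains a cycle).
        if server.forward_to is None or hops >= len(servers):
            raise LookupError(f"no serving replica reachable from {address}")
        server = servers[server.forward_to]
        hops += 1
    return server, hops


class Requester:
    def __init__(self, address: str) -> None:
        self.address = address    # initial destination address

    def ask_transparent(self, servers: Dict[str, Server], key: str) -> Tuple[str, int]:
        # Redirection is hidden: the same detour is paid on every request.
        server, hops = resolve(servers, self.address)
        return server.data[key], hops

    def ask_with_bounce(self, servers: Dict[str, Server], key: str) -> Tuple[str, int]:
        # Redirection is reported: the requester updates its destination address.
        server, hops = resolve(servers, self.address)
        self.address = server.name
        return server.data[key], hops
```

In this sketch the transparent requester pays the extra hop on every request, while the requester that is told about the bounce pays it only once.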
3.2.3 Disadvantages of Bounce Failure Semantics
Bounce failures belong to the class of response failures, which are normally undesirable due to their nondeterministic behavior.
Nondeterministic bounce failures will occur if a bounce strategy with a stochastic component is chosen. In such a case a sufficient level of autonomy is required for the agents, enabling them to tolerate unpredictable migration. This can be a challenge if the agents are required to be small and simple.
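To make the distinction concrete, the sketch below contrasts a deterministic bounce strategy with a stochastic one. Both strategies, the ring ordering and the host names are assumptions introduced for illustration only. The paper does not prescribe any particular strategy.

```python
import random
from typing import Callable, Sequence

# A bounce strategy maps (requested host, known hosts) to the host actually used.
BounceStrategy = Callable[[str, Sequence[str]], str]


def ring_bounce(requested: str, hosts: Sequence[str]) -> str:
    """Deterministic: forward to the host that follows `requested` in a fixed ring."""
    i = hosts.index(requested)
    return hosts[(i + 1) % len(hosts)]


def make_random_bounce(rng: random.Random) -> BounceStrategy:
    """Stochastic: forward to any host other than the requested one."""
    def random_bounce(requested: str, hosts: Sequence[str]) -> str:
        return rng.choice([h for h in hosts if h != requested])
    return random_bounce
```

With the ring strategy an agent could in principle predict where it will land after a bounce. With the random strategy it cannot, so it has to inspect its location after every migration.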
Agents are executable units, and moving executable units from host to host challenges the security system on the hosts as well as the security measures implemented in the agents. Introducing nondeterministic movement does not make things easier. A robust security system is required, which in turn can result in larger and more complex agents.
4 Evaluating Implemented MASes
A great number of MASes are available today, and many of the popular ones are in the public domain. In our evaluation project we selected four freely available MASes. Two are developed by commercial institutions and two by academic institutions. Table 3 gives an overview of services provided by the MASes, with key words indicating functionality for each service. More information can be found in [16,11,8,12].
Services/MAS | Aglets 1.1 | Mole 3.0
Migration | Weak, caching of classes, static rule set for class loading, both dispatch and retract | Weak, classes provided by code server or source location
 | Java Virtual Machine | Tool Command Language interpreter (modified)
Messaging | Synchronous (now), asynchronous (future), class of message-body for messages sent between MASes is required to be present in both MASes | Session oriented, synchronous, asynchronous
 | Common interface (Namespace) to several dir-services, Voyager's default dir-service configurable to be persistent (file) or non-persistent | N/A
Blackboard | N/A | N/A
 | Snapshot, external storage required, reload after termination provided
Transaction | N/A
 | Two groups: Native with full access, foreign with restricted access. | PGP based signing, access control list
Cryptographic | N/A | N/A | N/A | N/A
Management | API for implementation of agent and server monitoring tools | Resource management through Master Control Process (scheduler).
Table 3: Services provided by four selected mobile agent systems (N/A = “not available”).
Test Case | Service Evaluated | Description
Mi | Migration | Migration acknowledgement interrupted
Ex | Execution | MAS flooded with agents
Me | Messaging (asynchronous) | Message acknowledgement interrupted
MeCorr | Messaging (asynchronous) |
Table 4: Test case descriptions.
4.1 Test Environment and Results
A group of three interconnected PCs, all running Linux 2.0, constituted our test environment. Most of the tests were performed by interrupting or altering the traffic streams flowing between MASes located on different hosts. Table 5 shows the observed failure semantics of MAS services for the different test cases. Brief descriptions of the test cases are given in Table 4. More complete documentation of test cases and results can be found at the author's web site (http://www.item.ntnu.no/~ottow). The observed semantics are strongly related to each test case and should only be considered as an indication of expected failure semantics for the relevant services.
None of the MASes evaluated provide mechanisms for negotiation of failure semantics.
5 Conclusion and Future Work
Research indicates that a network fault management (NFM) system can gain efficiency and dependability by using mobile agent technology as a mechanism for distribution. The autonomy and mobility of mobile agents are valuable for managing dynamic network configurations in a robust manner. The robustness will depend strongly on the failure behavior of the mobile agent systems (MASes) which provide the execution environment for the agents. Thus being aware of what failure semantics the MASes have at the service level is important.
In traditional NFM architectures response failure semantics for services are generally not desirable since they result in a need for more complex error handling in clients. But in a mobile agent based NFM system some types of response failures can prove to be valuable. In this paper we have defined a new failure subclass, bounce failures, which captures some of these types, and explained why we consider the subclass to be of importance. We have also evaluated a set of state-of-the-art MASes and observed omission failure semantics to be the most common behavior, but still with response failure semantics exhibited in some cases. None of the observed response failures could be classified as bounce failures.
For future work there are several topics requiring attention. As our test results indicate, more MAS development work is required if MAS services are to provide QoS primitives with negotiable service failure semantics. A more thorough look (by means of analysis and simulation) at the advantages and disadvantages gained by introducing our proposed class of failure semantics is also required. And a lot of work still remains on agent design/development with focus on how agents can manage specific fault situations by awareness of service failure semantics and by collective behavior.
Test case/MAS | Aglets 1.1 | Mole 3.0
Mi | Omission, exception thrown (long default timeout) | Omission, exception thrown | Omission, exception thrown (long default timeout) | Omission, exception thrown
Ex | Response (value) | Omission, crash | Omission, exception thrown | Omission, exception thrown
Me | Omission, exception thrown (long default timeout) | Omission, no exception | No failure for oneway message (no mesg. ack. sent) | Omission, exception thrown
MeCorr | Omission/Response, exception thrown on omission | Omission, no exception
Table 5: Observed service failure semantics for four state-of-the-art mobile agent systems (N/A = no results available).
References
[1] ISO/IEC 10040 (1998-10). Information Technology - Open Systems Interconnection - Systems management overview. International Electrotechnical Commission, 1998.
[2] T. White, A. Bieszczad, B. Pagurek. Mobile Agents for Network Management. IEEE Communications Surveys, 1(1):2–9, Fourth Quarter 1998.
[3] V. A. Pham, A. Karmouch. Mobile Software Agents: An Overview. IEEE Communications Magazine, 36(7):26–37, July 1998.
[4] M. M. Silva, A. R. Silva. Insisting on Persistent Mobile Agent Systems. In G. Goos, J. Hartmanis, J. Leeuwen, editors, Lecture Notes in Computer Science (1st International Workshop on Mobile Agents MA’97/ISADS’97), volume 1219, pages 174–185. Springer-Verlag, 1997.
[5] Gianni Di Caro and Marco Dorigo. AntNet: Distributed Stigmergetic Control for Communications Networks. Journal of Artificial Intelligence Research, 9:317–365, Dec 1998.
[6] Flaviu Cristian. Understanding Fault-Tolerant Distributed Systems. Communications of the ACM, 34(2):56–78, Feb 1991.
[7] J. D. Case, M. Fedor, M. L. Schoffstall, C. Davin. RFC 1157: Simple Network Management Protocol (SNMP). IETF, April 1990.
[8] Dep. of Computer Science, Dartmouth College. D’Agents. http://agent.cs.dartmouth.edu/.
[9] IBM Corporation, GMD FOKUS. Joint Submission: Mobile Agent System Interoperability Facilities Specification. OMG TC Document, orbos/97-10-05, Nov 1997.
[10] German S. Goldszmidt. Distributed Management by Delegation. PhD thesis, Columbia University, 1996.
[11] IBM. Aglets Software Development Kit. http://www.trl.ibm.co.jp/aglets/.
[12] University of Stuttgart IPVR. The Home of the Mole. http://www.informatik.uni-stuttgart.de/ipvr/vs/projekte/mole.html.
[13] D. Johansen, K. Marzullo, F. B. Schneider, K. Jacobsen. NAP: Practical Fault-Tolerance for Itinerant Computations. Technical report, Department of Computer Science, University of Tromsø, October 1998.
[14] R. E. McDermott, R. J. Mikulak, M. R. Beauregard. The Basics of FMEA. ISBN 0-527-76320-9, 1996.
[15] K. Rothermel, M. Strasse. A Fault-Tolerant Protocol for Providing the Exactly-Once Property of Mobile Agents. In Proceedings of the Seventeenth IEEE Symposium on Reliable Distributed Systems, pages 100–108, 1998.
[16] Objectspace. Voyager Overview. http://www.objectspace.com/products/vgrOverview.htm.
[17] T. White, B. Pagurek, Franz Oppacher. Connection Management using Adaptive Mobile Agents. In Proceedings of 1998 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA’98), 1998.
[18] T. White, A. Bieszczad, B. Pagurek. Distributed Fault Location in Networks Using Mobile Agents. In Proceedings of the 3rd International Workshop on Agents in Telecommunication Applications IATA’98, Paris, France, July 1998.
[19] T. White, B. Pagurek. Towards Multi-swarm Problem Solving in Networks. In Proceedings of the 3rd International Conference on Multi-agent Systems (ICMAS’98), July 1998.
[20] D. Milojicic, M. Breugst, I. Busse, J. Campbell, S. Covaci, B. Friedman, K. Kosaka, D. Lange, K. Ono, M. Oshima, C. Tham, S. Virdhagriswaran and J. White. MASIF - The OMG Mobile Agent System Interoperability Facility. Personal Technologies, 2:117–129. Springer Verlag, 1998.
