Virtual CICS User Group | November 2023
CICS and Recovery
Andy Wright
IBM Master Inventor
[00:00:00] – Amanda Hendley
So welcome everyone. My name is Amanda Hendley. I’m with Planet Mainframe, and I am your co-host today for this virtual user group for CICS. I’m excited to have you here. We’re going to have a great session and as always, before we get started, we’ve got a little bit of an agenda. So, a quick introduction for the Virtual User Groups. If you’re new, we meet every other month on this Tuesday, so plan to join us again in January for our next session, which we’ll talk about at the end of this session. Today we’re going to have our presentation, we’ll have some time for Q&A, then we will talk about some articles and news and, like I mentioned, we’ll talk about our next session. So before we go too far, I do want to take a moment to thank our partners that enable this and all of our user groups to happen. Our CICS sponsors are Broadcom Mainframe Software, IntelliMagic, DataKinetics, and Planet Mainframe. All of these sponsors have some really great resources for you to check out. So I encourage you to go to their user groups, check out their webpages and especially their blogs because there’s a lot of great information there. I’m going to introduce you to our presenter. So Andy Wright is a Master Inventor and he has an MSc in Software Engineering from the University of Oxford.
[00:01:49] – Amanda Hendley
He has 35 years’ experience with the CICS family of products and their associated tools, and he has written a lot of articles on IBM software. He’s presented at Impact Conference, Nordic Guide Share and many other technical conferences the world over. And he’s also developed ITSO Redbooks, education classes and CICS customer health checks. So Andy, thank you. We’re lucky to have such a great wealth of information. Thanks for joining us.
[00:02:22] – Andy Wright
Thank you very much Amanda and hello to everybody out there. Good morning, good afternoon, or good evening depending on where you are. I just noticed my virtual dial-up was playing up, so if I do drop I might come back again in about 30 seconds, but hopefully we’ll be okay. Seems to have settled down now. So yeah, as Amanda says, I’m giving a presentation today and I’m going to be talking about CICS and the ideas behind recovery, what recovery means and how it works in a CICS environment. A bit of background. I work in CICS Level 3 in the change team at IBM in Hursley. I’ve been there 35 years now working on CICS, going back to CICS 1.6.1. So quite a few releases have come along and I’ve worked on all of those releases since about 1988, specializing in certain areas: databases, recovery, files, logging, journaling, those sorts of things. The sort of core of CICS. So we’ll make a start then.
[00:03:35] – Andy Wright
So, just a few notices and disclaimers to go through at the beginning.
[00:03:41] – Andy Wright
I thought you might like a very brief bit of background about where I work. I work for IBM in the Hursley Laboratories, which is in a village called Hursley near the city of Winchester in Hampshire, so the south of England, and it’s a big IBM software development lab. So products like CICS, MQ and other products have been developed there over the years and supported there globally for our customers. The old house in the photograph was built back in the 1700s. It was an old manor house. It was used as a hospital in the First World War and it was used during the Second World War, when the Spitfire airplane was being developed, for example. IBM moved in in the 50s. They’ve built lots of newer buildings around the site. So that photograph was my two sons playing football on the lawn probably about 15-20 years ago now. They’re both bigger than me now, so it was quite some time ago. And just a few sights there of what you can see around where I work.
[00:04:45] – Andy Wright
It’s a nice place to work, very pretty location and a very enjoyable atmosphere.
[00:04:51] – Andy Wright
So hopefully we all know what CICS is. But to set the scene about why we need recovery, you have to remember that all these releases of CICS that have come along over the years have proven to be very successful for a number of reasons. One is that data integrity is built into the product, and when you’re acting as middleware, running customer applications and communicating between them and the operating system, it’s very important that you can guarantee that data integrity. So CICS is built on recoverability, ensuring we can undo changes that need to be undone and commit changes that we want to go ahead. And our applications doing all that work in CICS can be written in Java, C, Assembler, COBOL and PL/I. And in the newer releases we support newer interfaces: JSON, web services, SOAP. All the things that have come along in the last five to ten years, CICS has embraced as means of getting work into the CICS system.
[00:05:55] – Andy Wright
So I wanted to start off and talk about recovery, but it’s a bit of a difficult topic because wherever you start you’re kind of building on something that you may already have to know. So you could start with CICS Recovery and our Recovery Manager component. Or we’ll talk about these things called ACID properties, Syncpointing and two phase commit, and other quite obscure terms like Implicit Forget, Last Agent, Single Updater. They come along as well. There’s something called Shunting. What’s that all about? For example, we support Backout. Lots of object-oriented classes now used in CICS for this code, units of work, log streams, and you can see that it all starts to build up and build up and build up, and there’s lots and lots of terms that people all have to kind of understand, but there’s no logical approach to it often to make sense, and it can be confusing for people. So we’ll take it nice and slowly and work our way through the mist.
[00:06:56] – Andy Wright
So I thought I’d start off with a quote. Now, it’s quite dry, but don’t worry, it’s going to get a lot more exciting in a bit. This is taken from a book that was written getting on for 40 years ago now, and it’s kind of a Bible or reference material for how to write a transaction processor. And these three guys who wrote it, they stated that the goal for a system like CICS and other transaction processors is to ensure that the work’s done atomically, meaning everything that’s sharing data doesn’t interfere with each other. And if there’s a problem and you terminate or fail, then your changes are undone. If you complete normally they’re all committed. So it’s all one thing or the other. And these are the basic principles that everything in recovery processing has to be built on.
[00:07:52] – Andy Wright
And we talk about these things and we refer to them as ACID properties. So we’ll look at those four to understand what they mean. So the A is Atomic, and it’s really saying that what you do, all the work that you do, we’re going to treat it all as one operation. So if you do it and you make some changes, then all those changes are going to be committed. They’re all going to happen, or none of them are going to happen. You can’t have half of them done and the other half not done. So we treat all the recoverable operations in a bit of work as either all happening or all not happening. It’s an atomic set of changes. Consistency means that when you start and when you end your work, your data is in a consistent state. So if you think about moving money between banking accounts, a checking account or a deposit account, if a certain amount of money comes out of one, it has to go into the other, and everything is nicely consistent at the end. You can’t have half the work done over here, but the equivalent work not done over there. Everything is a consistent view when you’ve ended, much like when you started. Isolation: very important. What we’re saying is, while you’re doing those changes that you want to commit, or those recoverable operations to different resources, well, we don’t want anyone else to see them. If you’re updating a database or a file with information that might end up being backed out, then until you’ve committed it, we don’t want anyone else to be able to see those proposed changes because they’re in flight, they’re uncommitted, they’re work that we don’t guarantee to be necessarily valid. So we want that invisibility for intermediate operations that our transactions do. And then there’s Durable. And that really means that if you complete, then if you restart your system, system information is preserved. So this data that we change is persistent. 
If you update a file or a database, those changes will survive across a system restart, even if the system fails, for example. And that’s what the ACID properties are. And transaction processors have to build on those to be able to support data integrity.
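The bank-transfer example can be made concrete with a small sketch. This is a toy Python model, not CICS code: the `transfer` function, the account names and the `InsufficientFunds` exception are all assumptions made for illustration. It shows atomicity (both updates or neither), consistency (the total is unchanged) and isolation (changes are built in a private copy until commit). Durability is not modelled.

```python
class InsufficientFunds(Exception):
    """Raised when a transfer would overdraw the source account."""


def transfer(accounts, source, target, amount):
    """Move money atomically: both updates are applied, or neither is."""
    working = dict(accounts)              # uncommitted changes, invisible to others
    working[source] -= amount
    if working[source] < 0:
        raise InsufficientFunds(source)   # `accounts` has not been touched
    working[target] += amount
    accounts.update(working)              # commit: publish both changes together


accounts = {"checking": 100, "deposit": 50}
transfer(accounts, "checking", "deposit", 30)       # commits
try:
    transfer(accounts, "checking", "deposit", 500)  # fails, nothing changes
except InsufficientFunds:
    pass
```

After both calls the balances are 70 and 80: the first transfer committed as a whole and the second left no partial update behind.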
[00:10:13] – Andy Wright
So then people say, well, that’s all very interesting, but what is a transaction? And this is where it starts to get complicated, because in the industry, the word transaction has a special meaning. It means this recoverable set of changes that you’re making. So you might update a file, you might update a database, you might insert something into a hierarchical tree, you might read some stuff off a queue, and all those recoverable changes are all grouped together as something called a transaction. And that’s what’s being performed by a Transaction Processor. And like I say, Transaction Processors have to meet these ACID requirements. So all that work is being committed or all that work is being backed out.
[00:10:58] – Andy Wright
And in the industry, and in that book that I mentioned at the beginning, these groups of work, well, that’s known as a transaction, but we don’t call them transactions in CICS, we call them units of work, which has caused confusion over the years, but there’s a reason for that. So in our language in CICS, a group of recoverable changes is known as a unit of work. Other products sometimes refer to them as units of recovery. I think WebSphere does that, for example. So if that’s what a unit of work is, what’s a transaction in CICS? Well, in CICS, it’s just the environment for your applications that are performing the work. You run a transaction, and that transaction does various executions of programs and drives business logic. And these transactions in CICS, they could have one unit of work or they could have several. So a transaction is a higher level thing. In a CICS environment, it’s not the same as a unit of work. Very important to remember that. And in CICS, the Transaction Manager component handles transactions, and the Recovery Manager component handles units of work and recovery. Just to confuse things even more, you also get people referring to tasks. In CICS, tasks and transactions often get used interchangeably. And for the purposes of our talk today, that’s absolutely fine. A task is a similar thing. It’s a collection of work being done in an application environment. Yeah, tasks and transactions effectively are the same thing from a user’s point of view.
[00:12:35] – Andy Wright
So these units of work then, that happen in CICS, they contain all those recoverable operations that a task wants to perform in CICS as one logical set of work. So you might want to update some VSAM files, for example, or update some databases, Db2 or other sequential databases. You might read data off queues, MQ, temporary storage inside CICS, or transient data. You might be using IMS and DL/I. Lots of different resources can be defined as being recoverable in CICS, and if you change them or update them as part of your program, they’re all part of the unit of work. And at the end you want to commit those changes. When your program’s done what it needs to do, maybe it’s handled moving of data around between bank accounts, like I said before, or committed a purchase on the Internet, or other things that CICS is used for to process transactions. So when a task wants to commit those changes, it’s got two choices. It can just end and give control back to CICS and go away. And when it does that, when we leave the application for the final time, there’s what we call an implicit end-of-task SYNCPOINT, a commit that happens almost like a safety net at the end of your transaction finishing. But you might want to divide your transaction up into multiple units of work. It’s not uncommon. Customers do that. And you can commit work as you go along by issuing an EXEC CICS SYNCPOINT command, a very simple piece of API. And it tells CICS to commit all the work that unit of work has done so far and kick off a new unit of work to carry on where the previous one finished. So you have a choice of implicitly committing your work when the task ends, or explicitly committing it during the course of the programs by issuing your EXEC CICS SYNCPOINT commands.
[00:14:45] – Andy Wright
So let’s have a stroll through a unit of work then. Let’s think about this from a time point of view. So that’s the past and that’s the future. And you could imagine that you initiate some work in your CICS region. Now that could be work entered from a terminal, an old green screen terminal for example, or work coming in off the web. Could be work initiated from a web service, it could be work coming in from an MQ trigger. It doesn’t really matter. Some work has come into the CICS region and we’ve started a transaction to provide the environment to do that work in. And so as part of the transaction we’ll create a unit of work and that will live as part of the transaction through until the task ends or you issue a SYNCPOINT command. Now this task, this could be running a program that could do anything. It could do updates to files, databases, it could invoke other programs. Some things it could do are not recoverable. They’re just ordinary work happening in the system. Some of them are recoverable. You could have files defined to CICS as recoverable, like I say, or databases or queues. And so things that you do in the application are part of the unit of work that need to be committed at recovery time. So eventually you get to an EXEC CICS SYNCPOINT command. Now it sounds like a nice simple thing, a nice straightforward, easy command, but it’s actually a huge piece of work inside CICS to prepare everything that you’ve done and then hopefully commit everything that you’ve done as well. And if you think about the events up to that point, we say everything from the beginning of the unit of work to when you issue that SYNCPOINT, you’re in flight. That means you’re running, you’re making your changes, but you haven’t got as far as wanting to commit them yet. When you enter the SYNCPOINT, we use this method, which is known in the industry as two-phase commit. And there are two phases. 
The first one is the preparation of what you’ve done to make sure that everything’s okay to go ahead and commit it, and then hopefully it is. And then the second phase is the commit phase, when everything is set in stone. And once that’s done and the SYNCPOINT ends, then all that work is committed, so it can’t be undone now, unless you explicitly reverse it yourself with some sort of compensation logic that you have to write yourself. As far as CICS is concerned, that’s all committed work. So remember the Isolation component of ACID properties. That’s saying no one can see that work until it’s committed. Well, once you’ve committed it, other people can see it. They can see what you wrote to that file that was recoverable, for example.
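The relationship between a task, its units of work, and the two ways of committing can be sketched as a toy Python model. The `Task` class and its method names are hypothetical and stand in for the ideas only. In a real application the explicit commit is the EXEC CICS SYNCPOINT command, and the implicit one happens when the task ends.

```python
class Task:
    """Toy model of a task that runs one or more units of work."""

    def __init__(self, committed):
        self.committed = committed   # data visible to everyone
        self.in_flight = {}          # changes made by the current unit of work
        self.uow_count = 1           # a unit of work exists from task start

    def update(self, key, value):
        self.in_flight[key] = value  # recoverable change, not yet visible

    def syncpoint(self):
        """Explicit commit: publish the changes and start a new unit of work."""
        self.committed.update(self.in_flight)
        self.in_flight = {}
        self.uow_count += 1

    def end(self):
        """Implicit end-of-task syncpoint: commit whatever is still in flight."""
        self.committed.update(self.in_flight)
        self.in_flight = {}


store = {}
task = Task(store)
task.update("order", "placed")
visible_mid_flight = dict(store)     # still empty: nothing committed yet
task.syncpoint()                     # first unit of work commits
task.update("invoice", "sent")
task.end()                           # second unit of work commits at task end
```

The snapshot taken before the first syncpoint is empty, which is the isolation point made above: in-flight changes are not visible until they are committed.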
[00:17:40] – Andy Wright
So I like to compare a SYNCPOINT with a wedding because I think that’s quite a useful analogy. And my son’s getting married soon, so it’s quite close to my heart at the moment. So you can kind of compare it with a wedding. It does break down a bit, but we’ll go with it and we’ll see how we go. So when you get together, that’s like the task starting and the unit of work beginning, and you spend some time together. Well, okay, that’s the unit of work doing those recoverable operations that your program wants to update files with or change databases or read queues. And then you get to the point when you want to make that finalized. So in the marriage case, you make the decision you’re going to get married. Well, you issue that EXEC CICS SYNCPOINT command that we mentioned, and we enter that SYNCPOINT processing in CICS. Now, there are two phases to a SYNCPOINT, the prepare phase and the commit phase. And in the prepare phase, all the work inside CICS is checking that it can go ahead and be committed. Well, that’s like in a wedding when someone could stand up and object to the wedding going ahead, and assuming that no one does object, then the wedding goes ahead. You write the details in the book and you go ahead and you go through the marriage, and in a SYNCPOINT, you log the fact that you’re going ahead and committing, and that’s done for you by CICS. And then all those recoverable changes are committed. They’re set in stone in phase two of the SYNCPOINT. Or there could be a problem: someone may have put their hand up at the back of the wedding and objected, or some other problem has come to light. And in a SYNCPOINT, that’s a bit like part of the environment, some component within CICS or some other system that you’ve been using, like IMS or Db2 or WebSphere, has said, I can’t commit, there’s been a problem, in which case we have to back out. So we log the fact we’re backing out, and then we back it out and undo the work. 
And again, remember, at the end of that you’re either completely committed or you’re completely undone. And the resources are either all updated into a new state or they’re all put back to how they were before your transaction started. So I find that’s quite a useful way of understanding the events in a SYNCPOINT – if you compare it to a wedding.
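The prepare, log and commit-or-backout sequence in the wedding analogy can be sketched as a toy coordinator. This is a minimal Python illustration, not how CICS Recovery Manager is implemented. The `Participant` class, its `vote` flag and the string log entries are assumptions for the example.

```python
class Participant:
    """A resource owner taking part in the syncpoint (file, queue, database...)."""

    def __init__(self, name, vote=True):
        self.name, self.vote, self.state = name, vote, "in flight"

    def prepare(self):
        return self.vote             # "does anyone object?"

    def commit(self):
        self.state = "committed"

    def backout(self):
        self.state = "backed out"


def syncpoint(participants, log):
    """Toy two-phase commit: returns 'committed' or 'backed out'."""
    # Phase 1 (prepare): every participant must vote yes.
    if all(p.prepare() for p in participants):
        log.append("commit")         # record the decision before acting on it
        for p in participants:       # Phase 2: make the changes permanent
            p.commit()
        return "committed"
    log.append("backout")            # someone objected: undo everything
    for p in participants:
        p.backout()
    return "backed out"


log = []
happy = [Participant("file"), Participant("queue")]
unhappy = [Participant("file"), Participant("database", vote=False)]
first = syncpoint(happy, log)
second = syncpoint(unhappy, log)
```

In the second call the file participant was willing to commit but is still backed out, because the outcome is all or nothing across every participant.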
[00:20:04] – Andy Wright
It’s quite useful and you can stretch the comparison a bit. So the person who conducts the wedding, I don’t know who that will be necessarily. Over in the UK, it’s normally either a religious person, or it could be a civil wedding. Elsewhere, whoever’s conducting it, same idea. They’re doing the officiating of the wedding, and that’s the CICS Recovery Manager component. And all that paperwork in the wedding, the certificate, the signing of the books to confirm that you’re legally married, well, that’s the data being written to our logs by the CICS Log Manager. So our Recovery Manager manages all these units of work. It looks at them all, it manages their status and it progresses them through the run of CICS. And its job is to coordinate all those recoverable changes made by local Resource Managers. So that’s things like, in CICS, file control, or temporary storage, or transient data, or other components that are part of CICS that you might want to be changing. But Recovery Manager also coordinates work with other systems. You could have multiple CICS regions communicating with each other, or you could be talking to IMS and DL/I, or you could be talking to Db2, or WAS, or MQ or other remote systems. And they’re also part of the distributed communication that takes place, which Recovery Manager has to coordinate. And if there’s a problem, if you have to restart your system, or if your application crashes, then Recovery Manager has to use data on our system log to rebuild things to a committed state once more on a restart. And so the Log Manager is the other component that handles all the log data. So if it’s called to write data, it will write data to our logs. And we have two logs in CICS for system work, DFHLOG and DFHSHUNT. But you can also write data to user logs for other purposes that we’ll see in a moment. 
And the Log Manager, it maps these logs that you’re using to an underlying infrastructure to hold the data, and that uses log streams, which are part of the z/OS logger subsystem. So that’s the two halves of CICS working together to manage recovery.
[00:22:38] – Andy Wright
So I’ve talked about logs and log streams. Well, all our recovery data gets written to DFHLOG. That’s our primary log, our primary log stream, and everything that gets written to it is there purely to allow CICS to recover and undo things if it needs to. So if there’s not a problem, then we don’t have much need to read the log apart from on a nice restart. It’s more there to be considered as an insurance policy if you have to undo things. There’s a secondary log stream which we call DFHSHUNT, and that’s provided for long duration data, things that hang around for quite some time. So that could be transactions that fail during SYNCPOINT and their units of work get shunted. Hence one of the reasons why it’s called DFHSHUNT. But it can also be for data that we have to hang on to for a long time, say for a long-running transaction that doesn’t SYNCPOINT very often, for example. So there are these two log streams, and those logs we map to the log streams managed by the z/OS system logger. And the log streams, they can be either on a coupling facility or on DASD. And some of our customers have CF log streams, some have DASD log streams. CICS doesn’t care, doesn’t matter to us. It’s all handled by the z/OS logger. So we don’t mind which media is being used. It doesn’t make any difference to us. But there are other logs and journals that are used within CICS, and they’re what are known as general logs or user logs. And they’re not for backout, they’re not for recovery of resources or undoing changes that need to be backed out. So the general logs are for other more bespoke purposes. So file control in CICS has this old mechanism called Autojournaling, where you could journal information about files when you were changing them. You could preserve what the file had in it before and after a change, or you could log data about changes to files. 
And then more recently, file control also supported Forward Recovery, which is the ability to recover data sets after some physical problem, say a head crash, in the days when disks used to fail in that way. So Forward Recovery journals, again, are part of CICS file control, and they hold after images, that’s data as it looks after you’ve changed it, as opposed to the before images that we write to DFHLOG and DFHSHUNT to undo things. CICS also supports something called Replication logging, and that’s for replicating changes to VSAM files to a disaster recovery site, so that you can have a remote version of a file somewhere distant from your primary site. And the log data can be used there to rebuild resources remotely. And there are other reasons why some customers use general logs. You can log the input and output screens of terminals, for example, for auditing purposes and tracking of what data has been sent to and from screens. And you can have security audit logs as well. So there’s a number of reasons, and these can all be mapped to different general logs and therefore to different log streams and written out by CICS in the same way as we do to DFHLOG and DFHSHUNT.
[00:26:17] – Andy Wright
So I’ve mentioned backwards and forwards recovery. Well, those system logs, DFHLOG and DFHSHUNT, they’re only used for backward recovery. So it’s before images, it’s what something looked like before an application changed it. Because if you fail, we need to read that data back to undo it and restore those recoverable resources to their previously committed state before the application that has had a problem updated them. There’s also this thing called forward recovery, and CICSVR is an example of a product that was provided to perform that, and that takes after images, what things look like after they were changed for VSAM records, as I mentioned earlier. Typically used if you want to ensure data sets can be recovered after there’s been some hardware problem and you’ve lost the data after it’s been committed. And those general logs I mentioned, well, they often can be used for non recovery purposes at all. They can be used purely for auditing or security, like I say, or keeping track of changes that were made to non-recoverable resources. So there’s a variety of reasons why you might end up wanting to log data, and it can be for backward recovery or forward recovery, or just general recording of information.
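The difference between before images (backward recovery) and after images (forward recovery) can be shown with a small Python sketch. It is a toy model with hypothetical function names and plain lists standing in for the logs. It says nothing about the real record formats used by CICS or by a forward recovery product, and it assumes stored values are never `None`.

```python
def update(dataset, key, new_value, system_log, forward_log):
    """Change a record, logging a before image and an after image."""
    system_log.append((key, dataset.get(key)))   # before image (None = no record)
    dataset[key] = new_value
    forward_log.append((key, new_value))         # after image


def back_out(dataset, system_log):
    """Backward recovery: apply before images, newest first."""
    while system_log:
        key, before = system_log.pop()
        if before is None:
            dataset.pop(key, None)               # record did not exist before
        else:
            dataset[key] = before


def forward_recover(backup, forward_log):
    """Forward recovery: replay after images onto a backup copy."""
    restored = dict(backup)
    for key, after in forward_log:
        restored[key] = after
    return restored


backup = {"A": 1}
live = dict(backup)
system_log, forward_log = [], []
update(live, "A", 2, system_log, forward_log)
update(live, "B", 9, system_log, forward_log)

rebuilt = forward_recover(backup, forward_log)   # as if the live copy were lost
back_out(live, system_log)                       # as if the task had failed
```

Replaying the after images onto the backup reproduces the updated data, while applying the before images in reverse order returns the live copy to its original state.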
[00:27:36] – Andy Wright
Now, depending on how you start your CICS region, we use the log data in different ways as part of our system startup, and there are three main ways you can start a CICS region. The most basic is what’s called an INITIAL start, and that’s a completely fresh run of CICS. So you’re starting it INITIAL, you’re saying, I don’t care about anything that’s happened before, I’m going to start completely fresh. So any work you did before that was on the log gets deleted and it’s a completely fresh run. You tend to do that to ensure that everything’s tidied up and you’ve got a nice environment. And some customers always INITIAL start. Some customers very rarely INITIAL start. But that’s the basic startup type. Then there’s a thing called a COLD start. Now, a COLD start many, many years ago was what an INITIAL start is today. But a COLD start today is not quite as COLD as it used to be. It’s not quite as Arctic, as it were, as an INITIAL start. On a COLD restart, we don’t care about local things that we’ve done. So if you’ve got changes to local resources, we don’t care about those. Everything’s picked up from the CSD, from your RDO groups, as it would be for an INITIAL start. But we do preserve a little bit of information on the log about work that was happening in the previous run that had remote obligations, so work that was communicating with other systems, for example. And we do that because when we then reconnect with those systems, we need to have that information to answer questions from them about how far that work progressed and whether that needs to be undone or committed. So a COLD start is much like an INITIAL start, but there is some data preserved. But normally, customers would run their CICS region and have a controlled shutdown, and then they do an AUTO restart, which means they would bring it up to how it was before. And that’s what we will call a warm AUTO restart. So you shut CICS down in a controlled manner. 
You AUTO restart it, we detect it was shut down in a controlled manner, and everything comes up looking like it did before the shutdown. So things that were open are reopened, things that were closed are left closed, and things look as they did before the shutdown. But you could imagine that your CICS region had a failure. There was a power failure, an outage, or some serious logical problem, maybe something happened that couldn’t be recovered, and the region can terminate. And then an AUTO restart would say, ah, there wasn’t a controlled shutdown, I need to do an EMERGENCY restart. And that means bringing CICS up, much like a warm restart. But then any work that was being done in the system, all those units of work that we talked about that hadn’t got as far as committing, but have made changes to resources, we’ve got to back them out. So EMERGENCY restart will undo anything that was in flight that it finds on the log as part of the restart processing. So all those different types of restart, they all make use of our system log in different ways.
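The start types just described can be summarised as a small decision function. This Python sketch only restates the talk. The function name, the dictionary keys and the descriptive strings are made up for illustration and do not correspond to CICS internals or messages.

```python
def restart_actions(start_type, last_shutdown_was_controlled=True):
    """Summarise how each start type treats the previous run's log data."""
    if start_type == "INITIAL":
        # Completely fresh run: nothing from the previous run is kept.
        return {"mode": "initial", "kept": "nothing", "back_out_in_flight": False}
    if start_type == "COLD":
        # Local state is discarded, but remote obligations are remembered.
        return {"mode": "cold", "kept": "remote obligations only",
                "back_out_in_flight": False}
    if start_type == "AUTO":
        if last_shutdown_was_controlled:
            # Warm restart: everything comes up as it was before shutdown.
            return {"mode": "warm", "kept": "previous state",
                    "back_out_in_flight": False}
        # Emergency restart: like warm, then undo in-flight units of work.
        return {"mode": "emergency", "kept": "previous state",
                "back_out_in_flight": True}
    raise ValueError(f"unknown start type: {start_type}")


after_failure = restart_actions("AUTO", last_shutdown_was_controlled=False)
after_shutdown = restart_actions("AUTO")
```

The only case that backs out in-flight work is an AUTO restart after an uncontrolled termination, which is the emergency restart.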
[00:31:03] – Andy Wright
Now, the way they do that is by reading data on the log, and we write data to our system log in blocks of records. So these are blocks of z/OS logger log data, and we use the z/OS logger to write out blocks of data to the log streams. But within these blocks we have information about all the units of work that are in the system at the time, and we have their records scattered throughout the blocks in what we call chains of data. And these chains are associated with units of work. So when you have to read the log back, when CICS starts up, if it’s an EMERGENCY restart, we have to read the log back sequentially, one before the other, to work out where we are. If it’s just because a transaction has abended and we go into dynamic transaction backout to undo one task in the region, then the Recovery Manager can read back just one chain rather than all the chains in the system, so it’s much more efficient. And this data on the log is maintained and coordinated between the Log Manager and Recovery Manager. So as a user you don’t need to worry about the format of it. It’s all handled for you by CICS.
[00:32:17] – Andy Wright
So just to give you an idea visually of what I’m talking about, again, we’re going from the past on the left to the present, and blocks of log records could have data for multiple different transactions. Each one’s got its own unit of work, and each block may or may not contain data for different transactions. It may contain data for the same transaction more than once if it’s done several updates, for example, or done work that needs more than one log record. And maintained within the information that we log are pointers, so we can read back a particular transaction’s unit of work data if we need to. Or we could read back all the system data on the log if we need to. And that’s all done automatically for you within CICS recovery processing.
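The idea of per-unit-of-work chains threaded through a shared log can be sketched with back pointers. This is a toy Python model. The `SystemLog` class and its record layout are assumptions for illustration and are not the real CICS or z/OS logger format.

```python
class SystemLog:
    """Shared log in which each unit of work's records form a backward chain."""

    def __init__(self):
        self.records = []      # (uow_id, data, index of that uow's previous record)
        self.chain_head = {}   # uow_id -> index of its newest record

    def write(self, uow_id, data):
        previous = self.chain_head.get(uow_id)   # None for the first record
        self.records.append((uow_id, data, previous))
        self.chain_head[uow_id] = len(self.records) - 1

    def read_chain_backwards(self, uow_id):
        """Yield one unit of work's records, newest first, skipping all others."""
        index = self.chain_head.get(uow_id)
        while index is not None:
            _, data, index = self.records[index]
            yield data


log = SystemLog()
log.write("U1", "a1")
log.write("U2", "b1")
log.write("U1", "a2")
chain_u1 = list(log.read_chain_backwards("U1"))
```

Reading one chain touches only that unit of work's records, which is why backing out a single task is cheaper than reading the whole log sequentially.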
[00:33:11] – Andy Wright
So why would we back something out then? Well, I mentioned dynamic transaction backout, and that’s saying I want to undo a transaction that’s had a problem. And you may have had an ABEND, you may have issued an ABEND, or you may have had some ABEND thrown at you when you’re running your task in CICS. Or you may have issued another command, an EXEC CICS SYNCPOINT ROLLBACK command, and that ROLLBACK adverb, it means don’t commit everything forward, don’t make everything set in stone. Undo everything I’ve done. Because you might decide you need to undo the work you’ve done up to this point. So whether it’s an ABEND or whether it’s a SYNCPOINT ROLLBACK, then control will get into CICS recovery and our Recovery Manager will call the Log Manager and it will read back that UOW’s log data. And then remember that we’re getting data off DFHLOG and potentially DFHSHUNT. And that’s all before images. It’s all information to undo things. It’s the insurance policy to undo work. So if it’s a file update, for example, we’ll give that information to CICS File Control to undo that particular piece of work to a file. If it’s a temporary storage queue, we might call the temporary storage component to undo that change. Same with transient data, for example. It could be remote obligations, it could be work to IMS or MQ or WAS. And depending on what we’ve logged or what we need to undo, we will ensure that the appropriate client is called to undo that work.
[00:35:00] – Andy Wright
That’s known as dynamic transaction backout. And like I say, it runs down one of those chains. It’s optimized to just undo one piece of work in the system.
[00:35:11] – Andy Wright
Now, if you think about our system log, there’s an awful lot of data on there, and it’s always rolling forwards. We’re always writing more and more records. Now, we don’t ever really want to read that data unless we have to do some backout work. And if everything’s working well, we don’t need to read it. And those transactions inside CICS, they’re going to end. They’ll have syncpointed and committed their work. They’ll have gone away. And all that log data they wrote is totally irrelevant after that because that work got committed. So if we didn’t do anything about it, then the log data will build up on those log streams and they become unmanageable.
[00:35:55] – Andy Wright
So periodically, CICS has to tidy up log data on its system logs, DFHLOG and DFHSHUNT. And what I mean by that is it has to delete big ranges of data, tranches of data on that log that it knows it’s got no interest in anymore. So what I’ve tried to show here is visually seven different tasks running in CICS. So seven units of work, and they’ve all written some log data to the log stream, and I’ve drawn them as color bars on the system. And what you can see is at this particular point, when we look at the log, the oldest data is that data for that yellow set of records. There’s nothing in the system that goes back beyond that. So we don’t care about any data on the log stream before that point. So periodically, CICS will tell the z/OS logger to throw away a range of log data from the log stream. And we call it “trimming the log”. It’s the history point of the log stream and it’s the oldest chain history point on the log. And so we would call the z/OS logger and it would delete everything up to that dotted red line. And that’s done when CICS does an activity key point, which is something that happens periodically in a run of CICS, and we’ll look at activity key points a bit later on. But it’s a bit of housekeeping and it’s an opportunity to do some tidy up work, such as deleting old log data we don’t care about anymore.
[00:37:48] – Andy Wright
Now if you imagine time moves on and other transactions have ended and what have you, and another key point comes along and we get to this point. At this point the red, the yellow, the blue and the orange transactions have all ended, because we’re now at the point of now in time. And the only two transactions that are still in CICS now are the green and the black ones. And so we say, okay, right, so the oldest data we care about is this black unit of work’s log data. And now we can trim the log to the new red dotted line position and throw away that data to the left because we’re never going to need to undo work done by the red or the blue or the yellow or the light blue or the orange transactions because they’ve all committed at that point in time. So every key point, ideally we do this and we throw away old log data at the back end of the log streams while we’re busy writing new data to the front of the log streams. It’s like a wave, if you like, rippling through the log stream. I think Americans call it the wave at sporting stadiums. I don’t know. Anyway, it’s kind of moving down the log stream, doing that over time.
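The trimming rule described above can be sketched in a few lines of Python. This is only an illustrative model, not CICS code: the `UnitOfWork` class, the `trim_point` function and the numeric log positions are all assumptions made up for the example. The idea it shows is that the trim point is the oldest log record belonging to any unit of work that is still active.

```python
from dataclasses import dataclass


@dataclass
class UnitOfWork:
    name: str            # colour used on the slide, purely a label here
    first_record: int    # hypothetical position of this UOW's oldest log record
    active: bool = True  # False once the UOW has committed or backed out


def trim_point(uows, log_end):
    """Return the position before which log data is no longer needed.

    Data older than the oldest record of any still-active unit of work
    can never be needed for backout, so it can be trimmed.
    """
    active_starts = [u.first_record for u in uows if u.active]
    # With nothing active, everything written so far can be trimmed.
    return min(active_starts) if active_starts else log_end


uows = [
    UnitOfWork("yellow", 10),
    UnitOfWork("red", 25),
    UnitOfWork("black", 40),
    UnitOfWork("green", 55),
]

# First keypoint: yellow holds the oldest data on the log.
first_trim = trim_point(uows, log_end=100)

# Later keypoint: yellow and red have ended, black is now the oldest.
for u in uows:
    if u.name in ("yellow", "red"):
        u.active = False
second_trim = trim_point(uows, log_end=100)
```

The trim point only ever moves forward as the oldest unit of work ends, which is the "wave" moving down the log stream.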
[00:39:13] – Andy Wright
So we talked about two phase commit. A SYNCPOINT, or a two phase commit, is driven when applications issue an EXEC CICS SYNCPOINT command or the transaction ends, that implicit SYNCPOINT that I mentioned. Or you could issue an EXEC DLI terminate command, for example. There’s other options, but it doesn’t really matter. Something’s caused CICS to decide it needs to commit the work that’s been done by that unit of work. And so we go into that two phase commit processing, and it’s in two halves. The “Prepare” half, you know, “anyone object to the wedding?” and the “Commit” half, where you’re going ahead and getting married. And within the Prepare phase there are two bits that we have to do. We have to ask all the local resources, are you happy to go ahead? Now, these are things like File Control inside CICS, or Temporary Storage or Terminal Control or Transient Data, and they ought to be ready to go ahead, because if there’s been a serious problem up to now, you’ve probably had an ABEND and you’ve already been backed out by now. So if you get to SYNCPOINT, all the local resources should be able to go ahead normally, although they can still fail for some reason or other, but normally they’d be OK. So we’ll ask them all if we’re all right to go ahead and they’ll all normally say yes, and only then do we ask all the other systems that we’re talking to. So you could have had a front-end CICS region in the old days connected to VTAM (and it may well still be), or it could be a socket owning region where work comes in from TCP/IP, or a web owning region, or work coming in from other regions. And that could be communicating to a number of AORs (Application Owning Regions), running your applications, and they could be communicating with Backend Regions, File Owning Regions or Database Owning Regions, or Queue Owning Regions or other CICS systems if you’re DPLing from one CICS to another.
So it could be quite a complicated web of remote systems and we have to ask each one, are you ready for us to commit? And they all have to vote yes to be able to allow us to carry on. So we’re saying, “Have they all voted yes?” If one comes back and says, “Yes, I’m happy to commit,” then it has to be able to commit later on when we tell it to. It has basically said, I’m definitely able to commit. And at that point these remote systems, they go what’s known as In Doubt. They enter this In Doubt phase, waiting and twiddling their thumbs to know what they need to do. Is someone else going to complain down the line and everyone get told to back out, for example, or will no one have a problem? And so we tell them all to go ahead and commit. So those other regions are In Doubt, but the CICS doing the SYNCPOINT, the coordinator, is never In Doubt. It knows what it’s doing at any one time based on these votes, and assuming that all the local systems are happy, the local resources haven’t got a problem, all those remote connected regions, other CICS systems, other Resource Managers like IMS and Db2 and WAS and MQ, assuming that they all vote yes, then we say, right, we’re going ahead and we log the fact on our DFHLOG that we’re committing. And that’s the Holy Grail that’s saying I’m committing. From that point on, this unit of work has to commit. And then we go into the commit phase and we go and we call all those remote systems to commit and then we call the local resources to commit. And that’s what happens. That’s the two phase commit. And it’s being done by a system like CICS, which is a Resource Manager, a transaction processor. And a Resource Manager has to have two things. It has to have a log that it manages, and it has to manage its own locks to be able to ensure things are done in a coordinated manner.
So regions like CICS, IMS, Db2, WAS, they are all Resource Managers because they have their own recovery logs and they manage their own locking. Other systems, things like VSAM RLS for example, will do their own locking, but CICS manages the logging for them. So some Resource Managers are true Resource Managers that do their own locking and logging and other ones are handled as part of CICS’s own preparation work.
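The order of events in the two phase commit described above can be sketched as follows. This is a minimal Python model under stated assumptions, not how CICS implements it: the `Participant` class, the `two_phase_commit` function and the `"COMMIT"`/`"BACKOUT"` log entries are hypothetical names invented for illustration. It shows local resources being asked to prepare before remote systems, the decision being logged before anyone is told to act, and remotes being called before locals in the second phase.

```python
class Participant:
    """A hypothetical local resource or remote system taking part in a syncpoint."""

    def __init__(self, name, vote=True):
        self.name = name
        self.vote = vote          # True means it will vote yes to prepare
        self.state = "in-flight"

    def prepare(self):
        if self.vote:
            # A remote system that votes yes is In Doubt from here on.
            self.state = "prepared"
        return self.vote

    def commit(self):
        self.state = "committed"

    def backout(self):
        self.state = "backed out"


def two_phase_commit(local_resources, remote_systems, log):
    # Phase 1 (prepare): local resources are asked first, then the remotes.
    for participant in local_resources + remote_systems:
        if not participant.prepare():
            # Any "no" vote means everyone backs out; the decision is logged first.
            log.append("BACKOUT")
            for p in remote_systems + local_resources:
                p.backout()
            return "backed out"

    # Everyone voted yes: log the decision, after which the UOW must commit.
    log.append("COMMIT")

    # Phase 2 (commit): remote systems are called first, then local resources.
    for p in remote_systems + local_resources:
        p.commit()
    return "committed"


log = []
locals_ = [Participant("file control"), Participant("temporary storage")]
remotes = [Participant("AOR"), Participant("Db2")]
outcome = two_phase_commit(locals_, remotes, log)
```

The coordinator in this sketch is never In Doubt: it either has all the votes and logs the commit, or it has seen a "no" and logs the backout.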
[00:43:55] – Andy Wright
Now a backout, which is a commit backwards, again, you can get those, like I say, if you do a SYNCPOINT rollback command, or if you have an ABEND, which is more likely, and that’s a single phase process, you don’t prepare to back out. You’re just called by Recovery Manager to commit all those remote systems. And then you’re called by Recovery Manager to commit all the local resources, the files and the Temporary Storage Queues and the Transient Data Queues and what not. And you’re told to commit them backwards to undo their work. You don’t need to prepare to back out. Many years ago, when I joined the team, one of the wise old developers said to me that any Resource Manager worth its salt should always default to backing out. That should be the agreed position from all these regions that are connected together. If there’s a problem, we agree, we’ll all back it out if we can. And much like committing forwards, we log the fact that we’re committing backwards in the same way. So everything happens in CICS that matters for recovery, we will log on our DFHLOG stream.
[00:45:05] – Andy Wright
Now I mentioned shunting earlier. Now shunting is quite clever. It’s something that was introduced a few years ago now, and it was introduced because of a problem with that In Doubt window that I mentioned, that period when you can go In Doubt, when you’ve been asked to prepare and you’ve said, “Yes, I’m prepared,” and you’re sat there waiting to be told to commit, and that could take a long time. And for example, what used to happen was somebody might put a JCB through a communication link and all the regions can’t communicate with each other anymore. So what do you do? You’re In Doubt. You don’t know whether you were going to be told to commit forwards or that you might back out. You might default to back out. But what if other regions have started committing forwards? It’s a problem. You don’t know what to do. So as part of a big piece of improvement to CICS several years ago, our Recovery Manager domain was introduced, and that supports the ability to shunt units of work. And if you fail at a key moment, if you’re in that In Doubt window when the comms failure happens, or there’s some problem when you’re committing or backing out, we can say, “Right. I’ll take that unit of work that has a problem and I’ll shunt it. I’ll put it to one side and I’ll wait till whatever caused the problem gets resolved.” So someone comes out and splices the fiber optic link together again, or someone restarts the database that you’re communicating with, or the machine gets turned on again on a remote site. When that happens, we can unshunt the unit of work and carry on committing or backing out, or find out which way the In Doubt failure ended and commit or back out. When we shunt, we release all the things that don’t matter, like terminals and user programs and your working storage. We ABEND the transaction because none of that really matters anymore. All we need is that unit of work, the log data, and locks on the resources that matter so no one else can touch them.
You can’t manipulate data that’s been associated with a shunted unit of work. It’s locked. It’s in a long term lock until CICS is told it can unshunt and have another go at completing that unit of work. And because shunted data tends to hang around for a longer time than normal transactions that come and go in a fraction of a second, it could be there for seconds or minutes or hours or days potentially. And if we don’t move it from DFHLOG to DFHSHUNT, that data on DFHLOG will grow and grow like we saw earlier. So there’s special code in Recovery Manager that says, right, this old data, I want it to hang around for a while. I’ll move it to DFHSHUNT, which means it’s going to be there for a while and we can carry on using DFHLOG optimally for short-term bits of work. Now, unfortunately, although it’s named DFHSHUNT, it’s not just for shunted units of work and their log data. It’s a bit of a misleading name, but we’re kind of stuck with it now. It really means anything that hangs around for a long time. So transactions that don’t SYNCPOINT and update hundreds of thousands of records, for example, they can potentially have their data moved to shunt just because they take a long while to complete, and we don’t want them to stop us tidying up DFHLOG.
[00:48:36] – Andy Wright
So if you think about a task running in CICS, well, while it’s busy doing its application code, it’s in flight, and then normally what happens? It will do a SYNCPOINT and it will commit, everything works and the transaction will end. But it could be in flight and it could have an ABEND or be told to do a rollback, and we’ll go and back out and that will work, and then we’ll end. So that’s the normal path around the outside of the baseball diamond, if you like. But you could be on a remote system that’s been told to prepare and is waiting to be told what to do by the remote coordinator. So you’re In Doubt, and while you’re In Doubt, you’re not sure what to do. And normally, because you don’t normally get communication failures, after a while you’re told to go ahead and commit, or you’re told to back out, and then again, we end. They’re the normal paths. But the paths where you might end up having a problem that needs us to shunt the unit of work are if there’s a communication failure while you’re In Doubt, or if there’s a failure while you’re committing or backing out, which, don’t forget, you have to do, because you’re committed to backing out or committed to committing forwards if you’ve said you’re going to do so. If there’s been a failure doing that, we have to shunt the unit of work until that can be resolved as well. Those three bottom circles are the cases when we would take a unit of work and we would shunt it. It all happens automatically. It’s all done automatically for you as part of CICS processing.
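The paths around the diagram described above can be modelled as a small state machine. The sketch below is a simplification written for illustration only: the state names, event names and the `run` function are assumptions, not CICS terminology or internals. It captures the three cases in which a unit of work is shunted (a failure while In Doubt, while committing, or while backing out) and the fact that an unshunted unit of work resumes from where it was.

```python
# Normal, failure-free transitions: (current state, event) -> next state.
NORMAL = {
    ("in-flight", "syncpoint"): "committing",
    ("in-flight", "abend"): "backing-out",
    ("in-flight", "prepare"): "in-doubt",        # told to prepare by a remote coordinator
    ("in-doubt", "told-commit"): "committing",
    ("in-doubt", "told-backout"): "backing-out",
    ("committing", "ok"): "ended",
    ("backing-out", "ok"): "ended",
}

# The three states from which a failure causes the unit of work to be shunted.
SHUNTABLE = {"in-doubt", "committing", "backing-out"}


def run(events):
    """Drive one unit of work through a list of events and return its final state."""
    state, shunted_from = "in-flight", None
    for event in events:
        if event == "failure" and state in SHUNTABLE:
            # Put the unit of work to one side, remembering where it was.
            state, shunted_from = "shunted", state
        elif event == "resolved" and state == "shunted":
            # Unshunt: carry on from the state the failure interrupted.
            state, shunted_from = shunted_from, None
        else:
            state = NORMAL[(state, event)]
    return state


normal_path = run(["syncpoint", "ok"])
stuck = run(["prepare", "failure"])
recovered = run(["prepare", "failure", "resolved", "told-commit", "ok"])
```

An event that is not valid for the current state raises a `KeyError` in this sketch, which is enough for a toy model.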
[00:50:11] – Andy Wright
Now, there’s a few terms that get mentioned from time to time, and these aren’t specific to CICS, they’re industry-wide terms to do with two phase commit, and it’s important to understand what they are. If you’re, for example, discussing a problem on a case with IBM support personnel, you may have an interest in understanding what these things are, if it relates to the problem, or if you want to understand what your system is doing from a performance point of view. So we have to tell every remote system to prepare in a SYNCPOINT, like I said earlier, but we can optimize it a bit. And the last one, the final one, we could just tell it to commit directly. We don’t tell it to prepare. It’s a direct one phase commit, and that’s what we call the Last Agent. It’s the final one in the collection of remote systems that we deal with in an application. And CICS takes the coordination role it’s got, and it passes that on to that remote system. And that remote system is now the coordinator. And if it commits, great, it tells us it committed, and we then commit. If it backs out, it tells us and we back out. And us, the region that did that, that passed that role on is now In Doubt with respect to this new coordinator. That response that comes back makes us the coordinator again, and then we then do what the other system tells us to do. If it committed, we tell all the other remote systems to commit. If it backed out, we tell all the other remote systems to back out. Now, why would you do that? What’s the point of Last Agent? It sounds overly complicated. Well, if you didn’t do it, you’d tell everyone to prepare, they’d all say they’re prepared, you’d log it, and then you tell everyone to commit. And that’s not an efficient use of network communications. You don’t need to do that in the last connection case, which is why we have this optimization of Last Agent. And you could have daisy-chained regions where CICS one calls CICS two, which calls CICS three. 
And that coordination role, the coordination hat, can get passed all the way along the daisy chain. So that’s quite common. And you see Last Agent used often in two phase commit implementations.
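The Last Agent optimization described above can be sketched like this. It is a toy Python model under assumptions: the `Remote` class and `syncpoint_with_last_agent` function are hypothetical, and the real flows between regions are more involved. What it shows is that every remote system except the last is asked to prepare, the last one is simply told to commit, and whatever it reports back decides the outcome for everyone else.

```python
class Remote:
    """A hypothetical remote system; `calls` records what it was asked to do."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.calls = []

    def prepare(self):
        self.calls.append("prepare")
        return self.can_commit

    def commit(self):
        self.calls.append("commit")
        # When used as the last agent, the return value reports whether it
        # committed (True) or backed out instead (False).
        return self.can_commit

    def backout(self):
        self.calls.append("backout")


def syncpoint_with_last_agent(remotes):
    """Coordinate a syncpoint, treating the final remote as the Last Agent.

    Assumes at least one remote system is involved.
    """
    *others, last_agent = remotes

    # Everyone except the last agent goes through a normal prepare.
    if not all(r.prepare() for r in others):
        for r in remotes:
            r.backout()
        return "backed out"

    # Hand the coordinator role to the last agent: no prepare, just commit.
    # Until it answers, the caller is In Doubt with respect to it.
    committed = last_agent.commit()

    # Its answer makes the caller the coordinator again.
    for r in others:
        if committed:
            r.commit()
        else:
            r.backout()
    return "committed" if committed else "backed out"


a, b, c = Remote("A"), Remote("B"), Remote("C")
result = syncpoint_with_last_agent([a, b, c])
```

The saving is visible in `c.calls`: the last agent sees a single commit request rather than a prepare followed by a commit.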
[00:52:28] – Andy Wright
There’s something called single updater, which you may have heard of as well. And this is a very optimized form of Last Agent. And that’s saying, I’m going to make you the Last Agent, I’m going to give you the coordination role. And by the way, I’ve done nothing recoverable here at all. There’s been no logging here. I don’t really care whether you commit. I don’t really care whether you back out, it doesn’t matter to me because I’ve got nothing to do. I’ll just make a note of what you did. So that means that we’re not In Doubt. We’re not In Doubt with what the new coordinator decides to do. We don’t care one way or the other. So CICS supports single updater and we can optimize SYNCPOINTs even more than with a Last Agent optimization, if we decide that we can do that, depending on who we’re talking to and what’s been done in the CICS region.
[00:53:18] – Andy Wright
Something else you might hear about is implicit forget. And this is another optimization, because when we tell everyone to commit, we send a flow to commit, and they could all come back and say, “Yes, I’ve committed.”, “Yes, I’ve committed.”, “Yes, I’ve committed.” And then we could throw our unit of work away, but that could be slow, it could delay things, and there’s more flows than we need, so we don’t always do that. So for some of our connections, MRO, for example, with IRC, to remote CICS regions on the same LPAR, we use this implicit forget mechanism and we retain the unit of work, but we decouple from the comms layer so that session to that remote system could get reused. And we could hang on to the unit of work in the background. And suddenly that session will get used again by some more work. And when it does and we find out about it, we go “Ah well, I know an earlier unit of work was using that, and I was waiting to hear back that it committed when I told it to, and someone else is using it now, so it must have done, it must have successfully committed. So I can now forget it.” Now this can cause some interesting diagnostic side effects. If you’ve got an environment, say you’ve got a market open on a Monday morning, that’s the busiest time for your CICS regions, and then the work dies down and you don’t get another peak of work until the next Monday. Then when the peak is happening, there could be a session which doesn’t get reused for a week potentially. And all that means is that we have to hang on to that unit of work internally until we hear back when that session gets reused. It doesn’t matter, it’s not a recovery issue, it’s not a data integrity issue. It just means there’s some state within CICS that hangs around until we hear back when that session gets reused, which like I say, could be very quickly, or it could be some time.
And some customers have said they’ve used inquiry products, they’ve inquired on the resources in their CICS region, and they’ve noticed these old units of work waiting on a Forget flow. And that’s the reason they’re there, so that we can know for sure that the earlier unit of work has committed.
[00:55:39] – Andy Wright
Right. So I mentioned key pointing. So we’re getting towards the end now. We’re getting down to the nitty gritty, if you like. Now, key pointing is something that we do periodically within CICS, and it’s driven by this thing called AKPFREQ, which is a CICS System Initialization Parameter (a SIT parameter), and you set that to a number or you default to a number. And what that means is every time we’ve reached that number of log writes to DFHLOG, we’ll kick off an internal CICS system transaction called CSKP. Now, you can’t type in CSKP, you can’t make it run yourself. It only happens when we want it to. And that will do housekeeping for the region. And part of what it does is log data to DFHLOG about what the system looks like at the moment. If you’re doing a warm shutdown, we’ll log the environment as we shut down so we can recover CICS on a warm restart based off that, for example. But something else it will do, it will trim DFHLOG and DFHSHUNT like we saw earlier with those colored lines, and throw away data we don’t care about anymore. And it will do that trimming by calling the z/OS logger to delete ranges of data on those log streams. Now this is important because if you’re not keypointing properly, or if the keypoints are not successfully trimming those log streams, you can run into performance problems or you can have other issues building up in your region. So it’s important to be aware of these key points and what they’re doing and the important messages to look out for in the job log and the message user data. You’ve got DFHRM0205, which is saying you’ve done a key point. That’s every time we do one of these CSKP transactions, you’ll see that. So depending on how often you see those, that’s how often CICS is having the chance to do housekeeping. Now, it doesn’t matter if you do key pointing more or less often, there’s no real hard and fast correct value. But the important thing to be aware of is that if you keypoint more often, we’re trimming the log more often.
But you’re using up more resources to run these transactions more often. Whereas if you keypoint less often, we trim the log less frequently, but you’ve got less CPU being used because you’ve got fewer CSKP tasks being attached in the system, and it can take a bit longer to read the log back on a restart. So there’s trade-offs depending on what you set your AKPFREQ value to. But when it runs, if we can delete log data, you’ll get this DFHLG0743 message saying, “Yeah, I could trim DFHLOG, or I could trim DFHSHUNT,” and that’s good. If every key point, which is a DFHRM0205 message, is followed by a DFHLG0743, that means it’s successfully able to trim the log. Sometimes we can’t trim the log. So if you’ve got work in the system that’s not ending and these key points are coming along, it could be a long running transaction doing batch style work, hundreds of thousands of updates to files or databases, which is not ideal, not what CICS’s programming model was intended for. But people do do it. They might port batch work into CICS, for example, and until that work completes, we can’t trim the log, because the unit of work is still there, its log chain is still there. We might need it to do a backout. So you might see a number of DFHRM0205s without a DFHLG0743. But what we do is we tell you that we couldn’t trim the log, the DFHLOG, by putting out these DFHLG0760 messages, and a few of those doesn’t matter. But if you get 1, 2, 3, 4, 5 and this number is going up and up and up, the trimnum in the insert of the message, and it’s never going away, we’re always unable to trim the log, that typically means there’s something in your system that shouldn’t be. There could be something that’s looping, for example, and not ending, or it could be some huge piece of work, like I say, some batch work that maybe is not optimized to run in the region. So these are important messages to be aware of.
Look out for those, and look out for multiple repeated DFHLG0760s on keypoints, because they mean we haven’t been able to trim the log properly.
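The check described above, that each DFHRM0205 keypoint message is followed by a DFHLG0743 trim message, lends itself to a simple scan of the job log. The Python below is a hedged sketch: the function names are made up, and the sample lines are invented placeholders that contain only the message identifiers mentioned in the talk (real messages carry a date, time, applid and full text, none of which is reproduced here). It also assumes the trim message for a keypoint appears before the next keypoint message.

```python
def trim_results(lines):
    """Return one boolean per keypoint: True if a trim message followed it."""
    results = []
    for line in lines:
        if "DFHRM0205" in line:
            # A new keypoint; assume no trim until a DFHLG0743 is seen.
            results.append(False)
        elif "DFHLG0743" in line and results:
            results[-1] = True
    return results


def longest_untrimmed_run(results):
    """Length of the longest run of consecutive keypoints with no trim."""
    longest = current = 0
    for trimmed in results:
        current = 0 if trimmed else current + 1
        longest = max(longest, current)
    return longest


# Placeholder lines: identifiers only, not real message text.
sample = [
    "DFHRM0205 <keypoint taken>",
    "DFHLG0743 <log trimmed>",
    "DFHRM0205 <keypoint taken>",
    "DFHLG0760 <log not trimmed>",
    "DFHRM0205 <keypoint taken>",
    "DFHLG0760 <log not trimmed>",
    "DFHRM0205 <keypoint taken>",
    "DFHLG0743 <log trimmed>",
]

results = trim_results(sample)
worst = longest_untrimmed_run(results)
```

A small value for the longest run is harmless; a run that keeps growing is the signal to go looking for a looping task or a very long-running unit of work.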
[01:00:14] – Andy Wright
Now that log trimming is done, because we want to tidy up unwanted log data. And we do it by calling the z/OS logger subsystem to delete those ranges off DFHLOG and DFHSHUNT. Now, these deletes, they’re logical deletes, they’re not actually deleting the data off the log stream. They’re telling the logger, “We don’t care about that data anymore.” But the logger itself has got its own housekeeping. And when you define a log stream, you define a number of parameters to it. Two of them are the HIGHOFFLOAD and the LOWOFFLOAD percentage parameters. And these are telling the z/OS logger to do housekeeping work. This has nothing to do with CICS. Now, this is all down at the z/OS logger. So when the log stream builds up and reaches that HIGHOFFLOAD percentage, say it’s 85% of the log stream is full, for example, then the z/OS logger will kick off housekeeping for it to tidy things up, and then it will start physically deleting those log blocks that we’ve told it to logically delete and throwing them away.
[01:01:27] – Andy Wright
And it will do that until one of two things happens. You get down to the LOWOFFLOAD percentage, which, great, you’ve done what you need to do, you’ve tidied it up and everything’s fine, or you get down to a point above the LOWOFFLOAD percentage, and you still have data that we can’t throw away in the logger because we haven’t told it to delete it logically. So then the z/OS logger will do offload I/O, and it will move that data from its primary storage to secondary storage, which is offload data sets, offload DASD. So all this offload housekeeping in the logger is a good thing. It’s what the logger wants to do to tidy up periodically. But what it wants to do is to tidy up by physically deleting logically deleted log data down to LOWOFFLOAD. That’s its objective. If it can’t do that, if it has to do this offloading to DASD, then you’ve got I/O to write that data out to DASD. And that’s not so good. So much like you can have experts who tune CICS regions for performance, the z/OS logger can be tuned for performance, and your log streams can be defined to perform better or to ensure that offloading is more optimally handled, depending on these parameters, like HIGHOFFLOAD and LOWOFFLOAD. And you can look at the SMF88 data that the z/OS logger generates. Much like you can look at CICS SMF110 data to see how CICS is performing, you can do a similar thing with the SMF88 z/OS logger data. So the important message, then: check for those messages. Are we not key pointing often enough, because that could stop us from reaching LOWOFFLOAD? Or are we not trimming because, say, something’s looping in your region, so we’re not able to trim the log and logically delete log blocks, which means that the z/OS logger can’t physically delete them when it gets to HIGHOFFLOAD? These are all things to look for if there’s an issue with the performance of the logger.
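The offload behaviour described above comes down to a little arithmetic, sketched here in Python. This is a simplified model with assumed names and units (`offload`, blocks counted as plain integers, thresholds as whole percentages), not the z/OS logger’s actual algorithm. It shows the two outcomes: either enough logically deleted data exists to reach LOWOFFLOAD by physical deletion alone, or the remainder has to be moved to offload data sets.

```python
def offload(capacity, used, logically_deleted, high_pct=85, low_pct=40):
    """Model one offload cycle for a log stream's primary storage.

    Returns (physically_deleted, moved_to_offload_data_sets), in blocks.
    """
    # Nothing happens until usage reaches the HIGHOFFLOAD threshold.
    if used * 100 < capacity * high_pct:
        return 0, 0

    # The aim is to get usage down to the LOWOFFLOAD threshold.
    target = capacity * low_pct // 100
    need_to_free = used - target

    # Cheap path: physically delete blocks already logically deleted.
    deleted = min(logically_deleted, need_to_free)

    # Expensive path: whatever is still above the target costs I/O to DASD.
    moved = need_to_free - deleted
    return deleted, moved


# Plenty of trimmed data: LOWOFFLOAD is reached with no offload I/O.
healthy = offload(capacity=1000, used=850, logically_deleted=600)

# Little trimmed data (for example a long-running UOW): most is offloaded.
unhealthy = offload(capacity=1000, used=850, logically_deleted=100)
```

In this model the second case writes 350 blocks out to DASD, which is the I/O cost the talk warns about when CICS cannot trim the log.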
[01:03:33] – Andy Wright
So I mentioned the z/OS logger. Well, it’s a separate address space, so if you think about your LPAR running z/OS work, you might have a number of CICS regions running as their address spaces in MVS. So CICS A and CICS B, and within them they’ve got the log domain, the Log Manager component that I mentioned. And that’s where we in CICS call the z/OS logger. And the z/OS logger is known as IXGLOGR, that’s its subsystem name. And there’s one of those running per LPAR. Now, there’s a number of things that you have to do when you set up the z/OS logger, because you want to store your data, and the data could be stored on a coupling facility, which is an optimized piece of accessible, rewritable fast access memory, so the logger needs to know how to deal with data on log streams on the coupling facility. And you have these couple data set entries for the LOGR component, which give the information to MVS about the attributes of the log streams that the z/OS logger is going to use. And as well as these log streams, there are things called staging data sets, where data can be held by the logger, and there can be offload data sets, those secondary data sets, and there can be tertiary storage onto emulated tape, I believe it is. So you’ve got primary storage in the logger, in the coupling facility and the staging data sets, you’ve got secondary storage on offload data sets, and you’ve got tertiary storage as well.
[01:05:15] – Andy Wright
And the log component, the z/OS logger, can use one of two destinations. We mentioned this earlier. It can use a coupling facility or it can use DASD only logging. So let’s think about the coupling facility for a minute. So now you’ve got z/OS running that system logger address space, and that knows about log streams that are defined to a structure, which is a component that lives on a coupling facility. And those log streams can hold data. And this is primary storage for the log data in those log streams on the coupling facility. And the logger can then duplicate that. It can duplicate it in memory in a data space owned by the system logger address space, or it could duplicate it to disk, the staging data set, and you can define in your settings what you want. And then you’ve got those offload data sets I mentioned where data gets moved to when the z/OS system logger has to do its housekeeping and it can’t delete enough data from the primary storage. That’s what we do with coupling facility log streams. Lots of customers use CF logging, but we also have customers who use DASD only logging. Like I say, CICS doesn’t care. It’s all below the level of CICS down at the z/OS logger level. Now with DASD only logging you haven’t got a coupling facility, so your primary storage is those staging data sets. So data is being written to disk, and that’s where it’s being written to and read back from by CICS if you have an ABEND or an EMERGENCY restart, and it’s duplicated to a data space within z/OS owned by the system logger address space. And once again you’ve got those offload data sets at the backend. So just to emphasize, CICS doesn’t care one way or the other whether it’s a DASD only log stream or a CF log stream, it’s all handled by the z/OS logger. As far as we’re concerned, we’re writing data out to a log and we’re reading it back if need be. And the log data is handled by the system logger in z/OS, and that writes and reads the data for us.
[01:07:38] – Andy Wright
Right. So we’re almost at the end now. Sometimes people want to analyze the data on the log. Now, those system logs we mentioned, DFHLOG and DFHSHUNT, we’ll write to them and we might read them back, if you have a unit of work for a task that’s ABENDing and needs backing out, or if you’re restarting your region, doing an EMERGENCY restart or a warm restart, and reading back down those logs. So they are readable. The general logs we mentioned, the autojournals, the terminal audit trails, the file control autojournals, we never read them back in CICS. They’re only ever written to. So once we’ve written them out to the logger, we don’t really care about the logger holding the data in a fast access place for reading if need be. It can be moved off to secondary storage and vanish off down the stream of offload data sets. Because general logs tend to be read back by offline utilities at some point in the future, not necessarily while CICS is running. Could be weeks or months later, for example, when an audit log is referenced. So we don’t really need to read them back quickly, and they’re never read back within CICS. Now there are a number of utilities out there that will let you read log streams, and CICS provides one called DFHJUP. It’s been around a very long time, and DFHJUP, you can point it at a log stream and you can print off the data on it, or you can copy log data to disk if you want to create some backups onto DASD data sets instead of the log stream, for example. It’s a very powerful utility, DFHJUP. There’s lots of information about it in the CICS documentation. Now, some customers still have old log data on their general logs that’s in an old style format, and this goes back many, many years now to when CICS wrote data in a different way. Now we don’t ever read it back. We don’t need that for our system log, but it’s possible there are some old programs, old utilities out there, batch programs, that still need the data presented to them in the old format.
And so DFHJUP gives you an option. If you tell it you want to be compatible with the old CICS 4.1 format, then we’ll give you back the data for your general logs in that form, and then the old programs are able to refer to it as usual.
[01:10:09] – Andy Wright
So I think we’ve come to the end, and I think we’ve got the Q&A full. So I’m going to stop sharing now, and then we’ll see if anyone has any questions. I’ll happily take any that anyone’s got. Let me stop sharing. There we go. Right, so if anyone has any questions, I’ll happily take any. If anything’s come through, Amanda, that people might like me to answer, I’ll happily do.
[01:10:41] – Amanda Hendley
So you’re welcome to drop them in the chat, or if we have anyone on the call that wants to raise their hand and voice them, we can do that as well. But I have a first question for you. Roughly what percentage of customers’ CICS resources are defined as recoverable as opposed to non-recoverable?
– Andy Wright
Okay, well, that’s good. Quite a few resources are not recoverable because you don’t always need to be recoverable. So in my experience, often you get a mixture. It could be 50/50 in some sites for non-recoverable data, and information that really matters, that you care about being undone or backed out, you would define as recoverable. So it can vary between any real percentage. But 50/50 is not uncommon at all in my experience.
[01:11:37] – Amanda Hendley
And tied to that, before we get to this next one, why would you have non-recoverable resources?
[01:11:44] – Andy Wright
Okay, well, you might have read-only files, for example, that are only ever read by your CICS applications. So if you’re never going to change the data in that file, in VSAM, then you don’t need to log anything because you’re only ever reading them. You’re only ever looking at information and getting it back. It’s never going to change. So there’s no point defining things as recoverable if you don’t need to. Another reason could be that it’s a test region, for example, when you don’t really care about recovery, you’re just proving that everything executes correctly. Or it could be files, I suppose, that get scratched and redefined every day that aren’t that important, that don’t really matter if anything goes wrong within one day, because you’re going to start with a clean version again the following day, again probably in a test environment.
[01:12:35] – Amanda Hendley
Great, thank you. Got a question in the chat, what is a good HIGHOFFLOAD and LOWOFFLOAD value combination?
[01:12:44] – Andy Wright
Okay, well, it depends. For DFHLOG, I think we recommend something like 85% for high because you want to get up near the top of the log stream, but still have a bit left over for new work to be put into while the z/OS logger is throwing away all those old records I mentioned. So HIGHOFFLOAD for DFHLOG will be about 85%. LOWOFFLOAD, I think it ranges between 40 and 60 as a recommendation. And you would tend to tune that depending on looking at that SMF88 data I mentioned, seeing how much the z/OS logger is having to do offloads and then tuning it accordingly. Now, for DFHSHUNT I think we recommend similar values, but if you think about general logs, which I mentioned, we never read them. They’re only ever written to within a CICS run. So the only point of trying to optimize storage in the logger is if you want to keep data in primary storage so you can read it quickly. And that would be if you’re ABENDing a task. Want to back it out quickly. If it’s just audit log information that won’t get read back, or maybe never, or perhaps backed up in a few days or weeks time, then we don’t really care. So when they hit HIGHOFFLOAD, they could offload all the way down to zero. Because there’s no point keeping any data in the primary structure. It could all be moved out to disk and then the primary storage made available again. So it does depend very much on what the log streams are used for. But there is good guidance on all of that in the CICS documentation on logging and recovery. Right. Are there any more questions in the chat, Amanda?
[01:14:37] – Amanda Hendley
Sorry, I was muted. There are. So there is a follow-up to that. How do we handle DFHLG0777?
[01:14:51] – Andy Wright
Okay, right. That’s a message. I haven’t mentioned it in the presentation. It’s issued when we’re doing work in CICS and we have to call the Log Manager and the Log Manager calls the z/OS logger and the z/OS logger says “Ah, there’s a temporary problem, temporary situation that I need to sort out.” And so we report it with a message in CICS. And what we would normally do is automatically wait and retry the request. And normally it gets sorted out very quickly. So it could be, if you’re opening a file, for example, and the file has a forward recovery log stream or a journal associated with it, and that log has to be opened as part of the file open. And it could take a little while for the logger to open that log stream. Because if you think about the system logs, log and shunt, that’s all done for you during CICS startup. They’re always ready to go when applications run, but there can be events within CICS where there could be a temporary glitch where it can take time and we will automatically wait and retry. So you shouldn’t have to do anything with a triple seven. It’s something that we put out and will resolve automatically very soon afterwards. Okay, another question?
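Editor's note: the "wait and retry" pattern described here is a general one, and a minimal Python sketch of it follows. It does not represent CICS or z/OS logger internals; `TemporaryCondition`, `call_with_retry`, and the attempt and wait values are all hypothetical names and numbers chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TemporaryCondition(Exception):
    """Hypothetical: stands in for a 'temporary problem, try again' response."""


def call_with_retry(
    request: Callable[[], T],
    attempts: int = 5,
    wait_seconds: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Issue the request, waiting and retrying while it reports a temporary condition."""
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except TemporaryCondition:
            if attempt == attempts:
                raise                 # still failing after the last attempt
            sleep(wait_seconds)       # wait, then go round again
    raise ValueError("attempts must be at least 1")
```

The caller sees either the eventual result or, if the condition never clears, the original exception, which is why in the normal case nothing needs to be done by hand.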
[01:16:14] – Andy Wright
So Ed says you touched upon an age-old question: to COLD start or not, that is the question. So what is your personal opinion? He says that they warm-start their regions and only INITIAL start them during upgrades.
[01:16:31] – Andy Wright
Okay, yeah, well again, that’s perfectly fine. And you could do that. And when you warm-start, your region will come up and look just like it did when you shut down. And that’s absolutely fine. So we will recover all the resources, the files and the terminals and the queues and everything that we’ve defined within CICS. We recover them from our catalog, which we write to and we read from. When you restart a region in a controlled way, we’re not getting them back off the CSD. The CSD is used on a COLD start when you install resources or an INITIAL start. So it’s absolutely fine to do what you’re doing, but some things will be tidied up if you do a COLD start or obviously an INITIAL start. You’ll reinstall things off the CSD. There’ll be a few bits and pieces inside that we will tidy up. Some use of storage could be optimized. So if you don’t need a COLD start, you don’t have to do a COLD start. What you’re doing is absolutely fine. And there’s no right and wrong answer to your question. But I know some customers who often COLD start or COLD start once a week, or some customers always COLD start just to have a clean or a re-installed environment from the CSD. What you’re doing is absolutely fine. I can see another question from Daniela. What about activity keypoints? What would be a recommended time interval between them? Well, if you remember, an activity keypoint is kicked off with that AKPFREQ SIT parameter, and that’s something that you define to CICS when you bring it up. And it’s saying every time that many log writes have happened, kick off one of these keypoint transactions to do some housekeeping. So if AKPFREQ is a big number, you have to have more log writes before you kick off a keypoint. If it’s a small number, you have fewer. I think the default I remember, I think it’s 1000, but you can change it up or down depending on what you want. Now, does it matter? It doesn’t matter because it doesn’t affect data integrity at all. 
What it does do is change the period of time between keypoints. Now I said earlier, if you keypoint more often, if you have a low AKPFREQ, then you’ll run these CSKP transactions in CICS more often, you’ll trim the log more often and tidy up stuff in the log stream. So when you do have to do a controlled restart or an EMERGENCY restart, there’s less far to read back down the log before CICS can come up and say control is given to CICS again. So the more often you keypoint, all things being equal, the quicker your restart can be. Some customers obviously care about data integrity very much, but they care more about time to availability or they equally care, I should say, about time to availability. They might have service level agreements for outages not to last more than a certain length of time, for example. So you might want restarts to be very, very quick. So you might keypoint more often, or you might keypoint less often. And all it means is if you have to read back down the log on an EMERGENCY restart, it will take a little bit longer until you find the data that you need to know what to do to recover the system. But if you don’t EMERGENCY restart, it doesn’t matter. So the long answer to a short good question is there’s no hard and fast rule. And because there’s no hard and fast rule for a value of AKPFREQ, the time between those keypoints, between those DFHRM0205 messages, can vary enormously. So on a busy system, when you’ve got lots of logging taking place, you’re going to get to that AKPFREQ number relatively quickly. You might be keypointing every 30 seconds. I’ve seen every 15 seconds on a busy system, meaning a busy system and/or a low AKPFREQ number. On a quiet system, say a system which runs work during business hours and is left up overnight, you might get hardly any work coming in overnight, say from remote countries logging through to it. 
So you won’t log much, so you won’t reach the AKPFREQ number so often, so you won’t attach a keypoint task so often. So again, there’s no hard and fast rule. I would say if you’re seeing keypointing every minute or so, even on a fast system, that’s quite frequent. So you might then want to review your AKPFREQ setting because you might be keypointing a little more often than you need. Or it might be set that way because you want to ensure a restart can happen very quickly if you have to restart, of course, like I say. Right, I hope I answered that. I can’t see any more questions on the chat. Did you have any more questions, Amanda, that you wanted to ask me?
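Editor's note: the relationship described in this answer, one keypoint each time AKPFREQ log writes have accumulated, can be put into a small Python sketch. It is a simplified model, not how CICS measures anything: the function name and the log-write rates are hypothetical, and 1000 is simply the figure recalled in the talk, which may not match the documented default.

```python
def keypoint_interval_seconds(akpfreq: int, log_writes_per_second: float) -> float:
    """Estimate the time between activity keypoints.

    Simplified model: a keypoint is taken every `akpfreq` log writes, so the
    interval is just that count divided by the rate of log writes.
    """
    if akpfreq <= 0:
        raise ValueError("akpfreq must be positive")
    if log_writes_per_second <= 0:
        return float("inf")           # no logging, so no keypoint is triggered
    return akpfreq / log_writes_per_second


# Busy daytime system versus the same region left up overnight (rates are made up).
print(keypoint_interval_seconds(1000, 50.0))   # seconds between keypoints
print(keypoint_interval_seconds(1000, 0.5))
```

The same setting gives very different intervals on a busy and a quiet system, which is why the answer gives no single recommended time between keypoints.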
[01:21:35] – Amanda Hendley
Could you touch on recovery with failures? If it’s unexpected, how long it can take you to recover and talk a little bit about EMERGENCY restarts?
[01:21:46] – Andy Wright
Yeah, sure. Well, I kind of mentioned a bit of that just now. When we have a crash, a system crash, a power outage or some sort of hardware check or machine failure of some kind, and you lose your CICS region, when you restart it, there’ll be lots of bits of work that were half done, some files that were left updated but never committed, or some queues that were read or some databases that were half-changed but not actually committed. And we have to undo that work. So we have to read back down the log to find what work was in the system, read back to a keypoint, and then we know from that what data is in the system and we know what needs to be backed out. Now, that could take a while. If AKPFREQ is set high, it could take a little bit longer to read back down the log to find a keypoint. Then the work in the system that needs backing out could involve a huge number of records. Say it was a batch-style transaction. And if that’s the case, that backout could take some time. So that’s kind of a weaselly way of saying there is no hard and fast length of time for these things. Restarts can often take sub-minute, a few seconds. I’ve seen some on a very busy system with long-running work that have taken hours. But we do put messages out during the restart to alert you of what’s happening, if that’s the case.
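Editor's note: the idea of reading back down the log to a keypoint to find the work that needs backing out can be illustrated with a toy Python model. This is a sketch under heavy assumptions and is not the CICS log format or recovery algorithm: the `Record` type, its three record kinds, and `find_backout_work` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    kind: str                              # "UPDATE", "COMMIT" or "KEYPOINT"
    uow: str = ""                          # unit of work that wrote the record
    in_flight: frozenset[str] = frozenset()  # on a keypoint: work active at that time


def find_backout_work(log: list[Record]) -> tuple[set[str], int]:
    """Read backwards from the end of the log to the most recent keypoint.

    Returns the units of work that were never committed, plus the number of
    records read to work that out.
    """
    committed: set[str] = set()
    seen: set[str] = set()
    records_read = 0
    for record in reversed(log):
        records_read += 1
        if record.kind == "COMMIT":
            committed.add(record.uow)
        elif record.kind == "UPDATE":
            seen.add(record.uow)
        elif record.kind == "KEYPOINT":
            seen |= record.in_flight       # work already active at the keypoint
            break                          # no need to read any further back
    return seen - committed, records_read
```

In this model, the further back the last keypoint sits, the more records are read before the in-flight work is known, which is the effect of a high AKPFREQ that the answer describes. The backout of each unit of work is a separate cost that the sketch does not model.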
[01:23:12] – Amanda Hendley
OK, well, it looks like that’s all the questions we have. So, Andy, thank you so much for joining us.
[01:23:19] – Andy Wright
You’re very welcome. Thank you for having me. I enjoyed it. I hope everyone did as well.
[01:23:23] – Amanda Hendley
Yeah, and for everyone on the session today, we’ll have the recording up along with the transcript in just a couple of days, so you can check that out on the user group website. As for the rest of our agenda today, I wanted to share with you a couple of articles and news pieces. You can do any of these QR code scans with your mobile phone in the picture app. We’ve got a paper on batch optimization that I want to call out from one of our partners. And Planet Mainframe has a new job board. So I pulled out an opportunity in Germany actually, for anyone that wants to check that out. And as always, we’re looking for contributors over at Planet Mainframe. If you’ve got any content that you’d like to share, blog posts, articles, profiles and that kind of stuff, we have our social media that you can follow along on. Well, there is YouTube, but I meant to say LinkedIn or Twitter or X. And our videos go up on the YouTube channel as well. And on our website and YouTube you can find all past sessions. So check us out there. Last, thank you to our partners Intellimagic, DataKinetics, Planet Mainframe, and Broadcom mainframe software for their support of the Virtual CICS User Group.
[01:24:55] – Amanda Hendley
And our next meeting is January 9. We’ve got Todd Havekost. He is with Intellimagic, and he’s going to be presenting, same time, same place. So again, thank you all for being here. Andy, thank you so much for presenting, and I hope everyone has a wonderful holiday season, and we’ll see you in the new year. Thank you so much.