
Virtual IMS User Group | August 2023

What IBM Z Cyber Vault means for an IMS environment

Tracy Dean
Product Manager
IMS Tools, and z/VM Tools IBM


[00:00:00] – Amanda Hendley – Co-Host (Planet Mainframe)

Welcome to today’s Virtual User Group IMS session. My name is Amanda Hendley. I am co-host with Trevor Eddolls. Trevor couldn’t be here today, unfortunately, but he has left it to me to host a really great session for you featuring Tracy Dean. So I’m excited for our program today, and I hope all of you have come with questions to ask during this session. You’ve probably found us by going to the iTech-Ed website on the IMS page, so I don’t need to bookmark that for you. But if you’re not familiar, if you’re interested in Db2 or CICS, there are user groups for those that actually meet in September as well. So you can find those on iTech-Ed.com as well. So thank you so much for joining us today. Our agenda is pretty simple as always.

[00:00:57] – Amanda Hendley

We have our introductions. We’re going to move into our presentation. We’ll have Q&A, we’ll talk about some articles and news, and we will end by announcing our next session. Within about a week, we’ll have the edited video up. Within about a day, we’ll have the presentation, and a couple of days, we’ll have a transcript for you as well. For our Q&A today, we want you to drop any of your questions that you have in the chat as they come up. Tracy will break periodically in order to address those questions because we absolutely don’t want you to get lost along the way or get stuck on one of those questions that noodles around in your head and then you can’t focus on anything else. So drop your questions in the chat as they come up. We will take periodic breaks for us to address any of those questions.

[00:02:04] – Amanda Hendley

I want to thank our partners for sponsoring the Virtual IMS User Group. BMC is an annual partner of the group, and Planet Mainframe is also a partner. So excited and thankful to have these awesome companies on board.

[00:02:22] – Amanda Hendley

And now we’re ready to get started. So I’m going to introduce Tracy as she takes over screen sharing. And we’re here for “What does IBM Z Cyber Vault mean for an IMS environment?” And Tracy is joining us. She’s a product manager for IMS Tools and z/VM Tools at IBM. She’s got 38 years of experience with Z, and she works with customers worldwide to understand their pain points and drive development teams to address their needs. So, Tracy, thank you so much for joining us today. I’m going to turn it over to you.

[00:03:03] – Tracy Dean – Product Manager, IMS Tools and z/VM Tools, IBM

(Slide 1 – IBM Z Cyber Vault and IMS) All right, very good. Thank you, Amanda. So today we’re going to dive into a topic called IBM Z Cyber Vault. Going to focus specifically on what it means for IMS. I will do a little bit of an introduction of what Cyber Vault is. This is not a deep dive on Z Cyber Vault. If you need more information on that, let me know and I’ll put you in touch with the right people. But I did want to give you some context. If your company is looking at Cyber Vault, what does that mean for you as an IMS person? And how will you or how should you be participating in those conversations and in the implementation and deployment? So I’m going to give you enough background on Z Cyber Vault to help you understand the context. I’m also going to tell you what it’s not, because it’s important to understand what Cyber Vault is not as well as what it is.

[00:03:53] – Tracy Dean

(Slide 2 – Logical data Corruption) So let’s start with what we’re focused on. We’re focused on logical data corruption. And so what does that really mean? What that means is the hardware components are working as expected, right? Everything is up and running. This is not flipping over to a Disaster Recovery (DR) site because you had a hardware failure or a site failure.

[00:04:14] – Tracy Dean

This is about logical data corruption, where the data becomes destroyed or corrupted at the content level, right? And so somebody’s deleting things, somebody’s encrypting things that aren’t supposed to be encrypted, somebody’s selectively manipulating things. And this is something that you cannot prevent with your traditional high availability disaster recovery solutions, right? Because those solutions are not content aware. If someone comes in and deletes data or changes data, your applications will continue to run and those solutions will continue to replicate those errors over to your DR or HA (High Availability) site, right? So everything is always in sync for HA and DR. That’s what we want. We want all that data copied over continuously, but when the data goes bad, copying it over is not helpful. We cannot recover at the DR site because the DR site is also corrupted. So this is more about corruption that goes undetected, or that takes time to detect: silent data corruption. Those are really the most dangerous types of errors because your applications might continue to run, you might get some strange errors, but you’re not going to get the whole system failing because you’re having a hardware failure or something, right? That’s what we’re focused on here. We’re focused on this logical data corruption, not on your high availability and disaster recovery solution.

[00:05:46] – Tracy Dean

(Slide 3 – Why traditional resiliency solutions will not protect you from logical data corruption) So let’s talk about kind of what you might be doing today and what Cyber Vault is really trying to address. So let’s start with what you might have today in terms of replication. As I just said, you’ve got data replicating continuously, but that means any errors or any corruption is also being replicated continuously. And so what you want and what Cyber Vault is trying to provide is a scheduled point in time copy stored in an isolated, secure environment. So, of course, I hope you already have image copies, you have backups, those kinds of things. But what we’re trying to do is address the issue of keeping multiple versions of your entire system at different points in time and being able to go back to those at any point in time. So we’ll talk about maybe how this is different than your normal image copy and how your normal image copy still needs to continue, because you might also be using that in your Cyber Vault environment. So we’ll get into that.

[00:06:46] – Tracy Dean

The second thing is error detection, right? I’m sure you already have great monitoring in place. You’re already detecting system and application outages. But this is “What if the data gets corrupted and the application continues to run – the system continues to run.” What Cyber Vault is looking to provide is regular data validation on these point in time copies to make sure that the data is still good, that you’re still in a good place, that people are not corrupting the data unbeknownst to you. And then recovery points. Typically, of course, you’ll have a single recovery point, or you might have multiple recovery points, but if someone is compromising your system, they might be compromising your backups as well. So they might be erasing those image copies, destroying those image copies, corrupting those image copies, those kinds of things. And so what we want in the Cyber Vault environment again is this isolated, secure location that cannot be accessed from the production system, with multiple recovery points. So if it takes you a while to go back, you have multiple copies along the way to know which one is the good one. Isolation. Right now, of course, in your production environment, your system, storage, and tape are often all in the same logical system structure. And what we’re looking at with Cyber Vault is an air-gapped system and storage so that these logical errors and these malicious intruders cannot get to that environment and cannot propagate the production errors over to the Cyber Vault. Cyber Vault will be completely air gapped. It’s not accessible from the production system. And then there’s recovery scope. Of course, what you’re doing today is you’re focused on high availability or continuous availability. You’re focused on DR. What we’re focusing on in Cyber Vault is when there is a corruption, being able to do forensic analysis to know what changed, and being able to do surgical or catastrophic recovery as well. So we’ll talk about each of these areas as we go through the presentation.

[00:08:52] – Tracy Dean

(Slide 4 – IBM Z Cyber Vault) So let’s look at Cyber Vault at a pretty high level. So we have our production system on the left, we have our production volumes. We’re using a storage technology called SafeGuarded copies. So this is part of your storage technology, nothing to do on z/OS. This is a storage technology to do SafeGuarded copies, and it will create those SafeGuarded copies in an air-gapped environment in your Cyber Vault. Those copies are immutable. They can only be read, they can never be modified. That’s the nature of the SafeGuarded copy. And so the idea is you take multiple SafeGuarded copies, you take them periodically, so you have multiple per day, perhaps, certainly multiple per week. And you have this period of time where you have a complete copy of your production system. This is not just an image copy of your IMS databases. Cyber Vault is looking at the whole z/OS environment and it’s doing a full copy of your production environment into these SafeGuarded copies. And then what happens is in the Cyber Vault, you have another LPAR and you can recover any of those SafeGuarded copies to the Cyber Vault volumes. So, again, I can never change the SafeGuarded copies, but I can copy them and use the recovery process of Cyber Vault to put those onto the Cyber Vault volumes and IPL the LPAR in the Cyber Vault and do testing to make sure the data is good. I can do validation, I can do forensic analysis if there’s a problem. I have full access to a copy of my production environment in this LPAR, but I’m not touching my actual production system. But again, the SafeGuarded copies are never changed, right? They’re just copied to the Cyber Vault volumes, where you can then IPL a system and do some testing.

[00:10:50] – Tracy Dean

(Slide 5 – IBM Z Cyber Vault – Focus areas for IMS) So what I’m going to focus on in this presentation is the IMS areas of Z Cyber Vault. Going to focus on, from an IMS perspective, what would we do for data and data structure validation, how would we do forensic analysis, and how would we do surgical recovery? I am not going to focus on other areas of Cyber Vault, like catastrophic recovery and offline backup, because those are really at a system level, not at an IMS level.

[00:11:19] – Tracy Dean

(Slide 6 – Validation – Forensic Analysis – Surgical Recovery in the Z Cyber Vault) So the Cyber Vault cycle sort of looks like this, right? When I copy those SafeGuarded copies to the Z Cyber Vault volumes and IPL the system, I can use that system now to do data validation. And that’s what I should be doing. That’s part of the point of Cyber Vault: when you create these SafeGuarded copies, you validate them. You make sure that the system is good at that point in time. And so you want to create a repeatable process, right? You want to automate that process so that you can do it fairly quickly. And it really is to validate that the system is good. If in that validation phase in the Z Cyber Vault you discover that there’s some kind of corruption, then we move into the forensic analysis phase. And this is where you’re using that Cyber Vault system. You’re still not on production. You’re using that Cyber Vault system to do forensic analysis of what happened, when did it happen, how did it happen, and how can I recover from it? So I can not only plan my recovery, I can actually practice my recovery. Once I know who, what, where, why, when and how I’m going to recover, then I move into the recovery phase where again, I might do this in the Cyber Vault environment, or I might do this in the production environment. And we’ll walk through each of those types of scenarios. In the Cyber Vault I can validate my recovery process, and I can then actually perform recovery, again either in production or in the Cyber Vault and copy it over. And now I have my production system back up and running with valid data and the corrupted data is no longer there. So these are kind of the phases that we’re going through in a Cyber Vault environment. And this is the big picture of why we are looking at Cyber Vault and why customers are looking at Cyber Vault.

[00:13:27] – Tracy Dean

(Slide 7 – IBM Z Cyber Vault – data validation) So let’s take each of those phases and talk about them. So in the Cyber Vault environment, let’s talk about data validation. I have my production volumes here on the left. I’ve done SafeGuarded copies. In my particular case, I’m looking at every hour. I’ve created a SafeGuarded copy every hour. And when I create the SafeGuarded copy, any of those that I’m going to do validation on, I can also flash copy to my Cyber Vault environment, to my Cyber Vault recovery volumes. And this allows me to start my data validation sooner. Rather than waiting for this SafeGuarded copy to occur, then the copy over to the recovery volumes to occur, I can actually just do the flash initially right away and I can get my data validation going sooner. Now, I want to say a couple of things about data validation. One is while in a perfect world, we would love to validate every copy, right, every SafeGuarded copy, we would love to validate that and know that it’s a good copy. In reality, if you’re taking copies every hour, that might not be possible, might not be practical.

[00:14:42] – Tracy Dean

And that’s okay. It’s okay to take more SafeGuarded copies than you validate. Maybe you only validate every other one or every third one. That’s okay. It’s still worthwhile to take the SafeGuarded copies because, in data validation, when you do discover a problem, you can now go back to the previous one that you might have never validated and see if it was good. And it might be good. You don’t have to go back 3 hours to the last one that you validated. You might only have to go back an hour. So it’s okay to take more SafeGuarded copies than you have time to validate. The validation just gives you a warm fuzzy that things are good and when they’re not, then you start going back in time to see which previous copy is good. So in the Cyber Vault environment, you IPL the flash copy if you’re doing it that way, or you IPL the copy that you copied from the Cyber Vault to the recovery volume. And you can use System Recovery Boost to help you with this. And you’re going to do some basic checking of the Sysplex infrastructure, for example.

[00:15:49] – Tracy Dean

And then you’re going to move on to data structure validation. This is what we call Type 2. What this means from an IMS perspective is we restart IMS and then we run things like a pointer checker. Right? Is my basic structure of my IMS database still good?

[00:16:07] – Tracy Dean

Now, if I have applications that require consistency between IMS and Db2, then I need to make sure Db2 is up as well. I can also do validation of my resources to make sure I have what I need for recovery if needed. So to make sure that I still have my image copies, they haven’t been destroyed, so I can do that kind of validation in the Cyber Vault environment as well. So I’m not just checking the status of my actual databases and my IMS system. I’m checking to make sure I actually have image copies that are available to me so that I can recover if I need to from this copy, from this SafeGuarded copy. Those are the easy things, relatively easy, right? Data structure validation, Sysplex validation, bringing the system up, those are pretty easy. The really interesting piece and the really difficult piece is Type 3, the data content. Has somebody been changing the content of the data? The structure of the database still looks good, but somebody’s been messing with the actual data contents. We don’t have anything out of the box that we can give you to do that. Because it’s your data, you know the structure of it. You know what it’s supposed to look like. So this is something your organization is really going to have to look at and think about and either work with IBM or work within your organization to find out the best way for you to do that. And start small. Start small and continue to build. It’s okay. You don’t have to do everything day one. The more sophisticated you get in Type 3 validation, the more confidence it’ll give you that that copy is good. Now, of course, if no issue is found, we might create a tape copy, for example, to spin that off, whatever it is that you want to do. But the point is, we’re going through this validation process. Now, in addition to this flash copy that we’re taking to the recovery volumes, I can have another set of recovery volumes so that if my data validation fails, I can also still be continuing my SafeGuarded copies while I’m doing other recovery activity, or forensic analysis. So I can have two sets of recovery volumes, one for my initial data validation and one for my actual forensic analysis and recovery. So that’s up to you. Not required, but it’s certainly something you can consider.

[00:18:38] – Tracy Dean

(Slide 8 – IBM Z Cyber Vault – data validation) So when I do this data validation, I also have what we call this permanent volume over here. So these are my recovery volumes, where I’m doing my actual testing. But I also have a permanent volume, and this volume does not change anytime I do a new recovery. The recovery volumes are different: anytime I bring over a new SafeGuarded copy, for example to do data validation on it, or anytime I do a recovery from the Cyber Vault, the recovery volumes get completely replaced – no data from the previous one is kept. So I have this permanent volume out here so that I can keep data between my IPLs of different SafeGuarded copies. So I can keep documented results over here, I can keep information about a previous IPL from a previous SafeGuarded copy, etcetera. So I always have this permanent volume also where I can keep data that does not get destroyed every time I do a new recovery. So again, I use this permanent volume maybe to keep the data. I might also have automation in place that lets people know that we have a good validation, for example.

[00:19:47] – Tracy Dean

(Slide 9 – IMS data structure validation) So when we talk about data structure validation, I mentioned pointer checking to validate the database structure. And it can also detect changes in your database characteristics, such as the size and number of segments. So if it detects there’s a significant change in your database characteristics, that might also be a flag in your data validation to say, hey, let’s go make sure this is correct, or this is an indicator that somebody’s been making changes, unauthorized changes to the database. So I work for IBM, so I’m going to talk about the IBM solutions here. Certainly other vendors have other solutions that can do these things for you, but the IBM solution is High Performance Pointer Checker for your full function databases, which we have available as a standalone product or in some of the packs, and then we have the Fast Path Solution Pack for your Fast Path databases. In terms of recovery readiness, I talked about verifying in the Cyber Vault that you have assets needed for recovery, right? Making sure that those things are available, that you could do a recovery if you needed to with that copy of the environment, with that SafeGuarded copy of your production environment. So the IBM solution in this space is Recovery Solution Pack. And then we also have our new web UI in IMS Admin Foundation as part of IMS Tools Base. And that’s going to show you your recovery readiness exceptions and reporting.
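As a rough illustration of the database characteristics check described above, the sketch below compares segment counts and sizes between two validation runs and flags large swings. The `DbStats` structure, the `flag_drift` name, and the 20% threshold are assumptions made for this example; they are not output formats or defaults of High Performance Pointer Checker or any other IBM tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DbStats:
    """Hypothetical per-database summary captured during one validation run."""
    name: str
    segment_count: int
    size_bytes: int


def relative_change(old: int, new: int) -> float:
    """Return |new - old| / old, treating a zero baseline as maximal change."""
    if old == 0:
        return 0.0 if new == 0 else float("inf")
    return abs(new - old) / old


def flag_drift(previous, current, threshold=0.20):
    """Return names of databases whose size or segment count moved by more
    than `threshold` (an assumed 20%) since the previous validated copy."""
    prev_by_name = {s.name: s for s in previous}
    flagged = []
    for cur in current:
        prev = prev_by_name.get(cur.name)
        if prev is None:
            flagged.append(cur.name)  # database not seen before: worth a look
            continue
        if (relative_change(prev.segment_count, cur.segment_count) > threshold
                or relative_change(prev.size_bytes, cur.size_bytes) > threshold):
            flagged.append(cur.name)
    return flagged
```

A flagged database is only a prompt to investigate; a legitimate batch load could trip the same threshold, so the cutoff would need tuning per database.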

[00:21:26] – Tracy Dean

(Slide 10 – Forensic analysis in general) So that’s data validation. We’ve made our SafeGuarded copies, we’ve IPL’d the SafeGuarded copy, or IPL’d the flash copy. We’ve done the data validation. What if, in that data validation process, we’ve now found an issue? So now we’re moving on to forensic analysis. So I’m working in the Cyber Vault environment. This is maybe the copy, this lower copy here. The third one is the one that had the error. It generated the issue and said, oh, there’s an error in data validation. So now I go back to the previous one, which I may or may not have ever validated. I go back to the previous one and IPL it in the Cyber Vault. Again, I’m making a copy to the recovery volumes – I never change these SafeGuarded copies. And I IPL it, and I run my validation, and I say “No, that one’s bad as well.” So clearly I didn’t run my validation on this one. And then I go to the next one in line, and I IPL that one, and now I find a good copy. So forensic analysis is finding the good copy, but also collecting data maybe from the bad copies to understand what happened. And that’s what this permanent volume can be used for, saving data. When I IPL this one and the database is corrupted, I might want to save some log data. Save that off to the permanent volume before I wipe it out with the next SafeGuarded copy. So I IPL the SafeGuarded copies. I can save logs or anything else that I think would be helpful to me that I want to have access to. I can then use some tools to understand the problem – look at the analysis, find a clean copy, and understand what changed between the good copy and the bad copy. Understand where things went wrong, what happened, and then start thinking about “How am I going to recover from this situation?” And I have this whole environment where I can practice that recovery and plan my recovery without impacting my production system.
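The walk back through earlier copies can be sketched as a simple loop. This is a minimal sketch, not part of any Cyber Vault tooling: `find_latest_good_copy` is a hypothetical name, and the `is_valid` callback stands in for the whole "recover the copy to the recovery volumes, IPL it, run validation" step.

```python
from typing import Callable, Optional, Sequence


def find_latest_good_copy(
    copies: Sequence[str],
    is_valid: Callable[[str], bool],
    failed_index: int,
) -> Optional[str]:
    """Walk backward from the copy that failed validation and return the
    most recent earlier copy that passes, or None if none does.

    `copies` is ordered oldest to newest. `is_valid` is a placeholder for
    recovering a copy, IPLing it in the vault, and running validation.
    """
    for index in range(failed_index - 1, -1, -1):
        if is_valid(copies[index]):
            return copies[index]
    return None
```

For example, with copies labeled `["08:00", "09:00", "10:00", "11:00"]`, a failure at index 3, and a check that rejects only `"11:00"`, the function returns `"10:00"`. The loop steps back one copy at a time, as described in the talk, so every copy between the failure and the good baseline gets checked.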

[00:23:31] – Tracy Dean

Because maybe it’s just one IMS database that’s bad. I don’t need to take down my whole production system. I can keep it running, take that database offline while I figure out what happened. But in the meantime, I have a whole safe environment to work in that’s not impacting my production environment.

[00:23:53] – Tracy Dean

(Slide 11 – IMS Forensic analysis) So when we talk about forensic analysis, the most important thing is we have to collect the data in the production environment. If we’re not collecting the data in the production environment that we need for forensic analysis, then it’s not being copied to the SafeGuarded copies, and it’s not being copied to the recovery volumes, and I don’t have access to the data. So most importantly, we need to make sure we’re collecting the data to do the forensic analysis while we’re running in production at all times. So with IMS, the simple thing, of course, is the log data, and that’s done for you automatically. So you don’t need to worry about that.

[00:24:28] – Tracy Dean

If you’re using IMS Connect or z/OS Connect or anything that’s coming in through IMS Connect to get to your IMS system, then IMS itself does not collect that data. So you’ll need tools to actually collect the data for any of your applications that are connecting to IMS via TCP/IP. In IBM’s case, we need IMS Connect Extensions. That’s what’s going to collect that instrumentation data similar to the IMS log, but it will collect the data about what’s going on in IMS Connect. So that needs to be running in production so that we’re collecting the data and we have it available in the Cyber Vault. And then I want to create reports, so I want to list maybe the transactions that have been processed during the time period between the good copy and the bad copy – let me see a report of all of the transactions that updated the database, for example. So I can put in time periods, all kinds of filters to get exactly the list of transactions that I’m looking for. And so the IBM solution in this space is Performance Analyzer for creating those reports. Once I have the report of the list of transactions, maybe I start looking at those, look at the details of those, and try and find the ones that are starting to be suspect and dive into those more to understand exactly what happened and what those transactions did. And that’s where we bring in IMS Problem Investigator to look at the individual transactions and the details behind them of what happened during that time period. So Performance Analyzer gives you a start of what happened during this time period. Problem Investigator is “Okay, I want to look at these specific transactions and deep dive into them and determine how I might want to recover.” If there’s just one corruption, I might be able to just fix it manually. If there’s full scale corruption, then I need to do a full recovery.
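To make the reporting idea concrete, here is a minimal sketch of the kind of filter described above: keep only the transactions that updated a database between the good copy and the bad copy. The `TxnRecord` fields and the `updates_in_window` name are hypothetical; they do not reflect the IMS log record layout or the report formats of Performance Analyzer.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TxnRecord:
    """Hypothetical, simplified view of one transaction taken from log data."""
    tran_code: str
    user_id: str
    completed_at: datetime
    updated_database: bool


def updates_in_window(records, good_copy_time, bad_copy_time):
    """Return update transactions completed after the good copy and at or
    before the bad copy, ordered by completion time."""
    selected = [
        r for r in records
        if r.updated_database
        and good_copy_time < r.completed_at <= bad_copy_time
    ]
    return sorted(selected, key=lambda r: r.completed_at)
```

The resulting short list is the starting point for the deeper, per-transaction investigation; read-only transactions are dropped because they cannot have changed the data.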

[00:26:48] – Tracy Dean

(Slide 12 – Forensic Analysis for the example scenario) So let’s talk about forensic analysis. I’m going to go through a scenario here where we have a corruption, and we determine that the 11:00AM copy – when we’ve run data validation on it – the 11:00AM copy of my SafeGuarded copies is corrupt. There’s something wrong. So I save the logs to the permanent volume. I’ve determined that the 11:00AM copy is corrupted, and I’m going to repeat this process until a clean copy is found. I find that the 10:00AM copy is good. Okay, so I know the 10:00AM copy is a good copy. The 11:00AM copy is bad. When I IPL the 10:00AM copy, I’m going to get transactionally consistent data, because when I IPL that, all the in-flight transactions will be backed out. So I can actually use my 10:00AM good copy as my baseline for my surgical recovery. Does not mean I have to go back to 10:00AM, right? We still haven’t determined yet where the issue occurred. It’s occurred sometime between 10AM and 11AM. So I know I have a good baseline at 10AM and I know I have a bad copy at 11AM. So now I’m going to use my tools to do reporting and deep dive analysis. And I’m going to determine in my scenario that the malicious activity occurred at 10:50AM. So the key here is that we’ve got an activity at 10:50AM that corrupted the data. So how can I use this information to help me plan my recovery?

[00:28:33] – Tracy Dean

(Slide 13 – Surgical Recovery – scenarios) So I’m going to talk about different recovery scenarios in this same context of an 11:00AM bad copy, a 10:00AM good copy, and a 10:50AM corruption. So I’m going to use these three scenarios. There are obviously many more permutations of these, but I think these address kind of the three big areas. And the first one is that the backups are available in production. I have good image copies. I’ve determined from my Cyber Vault that I have good image copies of production at 11:00AM, for example. So even though my database was bad at 11:00AM, my image copies are still intact. So that’s your best case scenario, if you will.

[00:29:14] – Tracy Dean

Scenario two that I’m going to talk about is the backups are available only in the Cyber Vault. That’s because the valid image copies no longer exist in the production environment: they were either corrupted, or they’ve timed out, or they’re only on tape, et cetera. But we are going to assume in this scenario that valid image copies exist on DASD in the Cyber Vault environment because they are in the 10:00AM copy. So they’re no longer in production. Sometime between 10AM and 11AM the corruption occurred and they got rid of the valid image copies. But we still have those valid image copies in the 10:00AM copy. Remember, this is the 10:00AM copy. The database is good, the image copies are good and it’s immutable. So I always have the ability to go back to this even though my current production environment no longer has valid image copies. And then the third scenario I’m going to go through is I don’t have any image copies at all. I don’t have image copies in the production environment. I don’t have image copies in my Cyber Vault on DASD. All my image copies are lost. How am I going to recover? So these are the three big scenarios I’m going to talk about.
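The three scenarios amount to a decision based on where valid image copies still exist. A minimal sketch of that decision follows, with hypothetical names (`Scenario`, `classify_scenario`) chosen for this example only:

```python
from enum import Enum


class Scenario(Enum):
    """The three recovery scenarios discussed in the talk."""
    BACKUPS_IN_PRODUCTION = 1
    BACKUPS_ONLY_IN_CYBER_VAULT = 2
    NO_IMAGE_COPIES = 3


def classify_scenario(valid_in_production: bool,
                      valid_in_cyber_vault: bool) -> Scenario:
    """Map image copy availability to a scenario. Production copies are
    preferred whenever they are intact, since that is the simplest case."""
    if valid_in_production:
        return Scenario.BACKUPS_IN_PRODUCTION
    if valid_in_cyber_vault:
        return Scenario.BACKUPS_ONLY_IN_CYBER_VAULT
    return Scenario.NO_IMAGE_COPIES
```

The two boolean inputs would come from the recovery readiness checks run in the Cyber Vault, not from anything this sketch can determine on its own.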

[00:30:37] – Tracy Dean

(Slide 14 – Detailed scenario description – (example)) So in this scenario, kind of the general architecture we have is we have an active Sysplex in two sites in City A, Global Mirror to City B, and then City B serves as the DR site. And then the SafeGuarded copies are implemented in City B and they’re taken every hour. And as I said, the 10:00AM copy is good and the 11:00AM copy is bad. I’m doing validation consistently, and I’ve got some applications running. So remember, I take the 11:00AM SafeGuarded copy and I run data validation at 11:25. The timeline I’m using in my scenario is at 11:25 is when I detect that the corruption has occurred, that the 11:00AM SafeGuarded copy is not good. I do my forensic analysis, and it shows that the corruption began at 10:50 and only two applications are impacted. So these two applications are stopped in production, but everything else gets to keep running. So my production system is still running, even though I know I have corruption on these particular applications.

[00:31:47] – Tracy Dean

(Slide 15 – IBM Z Cyber Vault (3-Site solution, virtual isolation)) So here’s kind of a picture of the environment. We have our production environment over here. I’m taking regular I/O consistent SafeGuarded copies at the top of each hour. I’m doing IPLs in the Cyber Vault environment, and I’ve discovered that the error occurred. I discovered at 11:25 that the 11:00AM copy is bad and the 10:00AM copy is good. So this is just sort of a picture of how this is fitting together. This is Copy Services Manager doing the SafeGuarded copies over to your Cyber Vault environment or your DR site. And then within the Cyber Vault environment, we’ll do the SafeGuarded copies.

[00:32:31] – Tracy Dean

(Slide 16 – Choosing a recovery solution) Okay, so I’m going to talk about scenario one. Just a reminder that’s where the image copies and the logs are available at the production site. In this particular case, in my scenario, I’m going to say that the image copies were taken after batch end at 05:10AM. So obviously they were taken before the 10:50AM corruption. And I still have these 05:10AM copies available in my production environment.

[00:32:56] – Tracy Dean

(Slide 17 – Surgical recovery – scenario 1) So in this case, what I’m doing is I’ve discovered this issue. I’ve done the forensic analysis. I’ve discovered when the issue occurred, and I’ve used the Cyber Vault to identify the time of the issue. And I actually can practice my recovery in the Cyber Vault, but I will more likely perform the recovery actually in production because I have the image copies over here. So this would be your normal recovery that you might do for any other reason. And I want to recover to a point in time of 10:49AM. So once you’ve done the work in Cyber Vault, to do the data validation, find the error, do the forensic analysis, determine the time, determine you have good image copies. I’ve done all of that in the Cyber Vault environment. Now I go back to production and just do my normal recovery process to 10:49AM. I don’t need the Cyber Vault to actually do the recovery. The advantage of Cyber Vault in this scenario is I had a place to play, I had a place to practice, I had a place to analyze without impacting my production system. So this one is the very simplest scenario because the end result is I’m just doing my normal recovery – normal Point In Time recovery.

[00:34:12] – Tracy Dean

(Slide 18 – Surgical recovery – scenario 1) So this is just a bit of a detailed description. Again, I discover and identify the malicious transaction, I do a Point In Time recovery, and I basically use the 05:10AM image copies and a forward log apply. It’s business as usual for how I would do a Point In Time recovery from my image copies to 10:49AM.
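The timeline arithmetic in this scenario can be sketched as follows. This is an illustration only: `plan_point_in_time_recovery` is a hypothetical helper, and the one-minute margin simply mirrors the choice in the example of recovering to 10:49 for a 10:50 corruption. An actual recovery point would be chosen from the log data and the recovery tooling in use.

```python
from datetime import datetime, timedelta


def plan_point_in_time_recovery(image_copy_times, corruption_time,
                                margin=timedelta(minutes=1)):
    """Pick the most recent image copy taken at or before the recovery
    target and return (image_copy_time, recovery_target).

    The recovery target is `corruption_time - margin`; logs would then be
    applied forward from the image copy up to that target.
    """
    recovery_target = corruption_time - margin
    candidates = [t for t in image_copy_times if t <= recovery_target]
    if not candidates:
        raise ValueError("no image copy predates the recovery target")
    return max(candidates), recovery_target
```

With image copies taken at 05:10 and a corruption at 10:50 on the same day, the function selects the 05:10 image copy and a recovery target of 10:49, matching the scenario.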

[00:34:32] – Tracy Dean

So now at a minimum, I have my database recovered to this consistent point in time before the malicious activity. It’s recovered to 10:49AM. Optionally, I can use something called Queue Control Facility to replay any “good transactions” that occurred after the 10:50AM malicious transaction. Right. Because remember, I didn’t discover this till 11:25AM, so I didn’t stop things in production until 11:25AM. So there might have also been some good transactions that I would like to preserve. Now, I’m going to caution you about using this option. This is an option you have to be very careful with. You have to know exactly what you’re doing, what your applications are doing, to make sure that you can replay those good transactions without replaying the malicious ones and still end up with a consistent database. And make sure that it’s going to be consistent with Db2, if that’s important to you. So this is an option that you have, but it’s one to be used with caution and with lots of knowledge about your environment. But it is an option to be able to replay those good transactions if you’re able to do that in your environment.
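To make the replay idea concrete, the sketch below filters a list of logged transactions down to the ones that ran after the recovery point, up to the time production was stopped, and were not flagged as malicious. The `LoggedTransaction` class, its fields, and the sample transactions are hypothetical; this is not the Queue Control Facility interface, and it does not check whether a "good" transaction depended on data the malicious one changed, which is exactly the consistency risk described above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LoggedTransaction:
    tran_id: str         # hypothetical identifier
    timestamp: datetime  # when the transaction ran
    malicious: bool      # flag set during forensic analysis


def select_for_replay(transactions, recovery_point, stop_time):
    """Return transactions after the recovery point that are safe candidates."""
    return [
        t for t in transactions
        if recovery_point < t.timestamp <= stop_time and not t.malicious
    ]


day = (2023, 8, 8)  # placeholder date
log = [
    LoggedTransaction("T1", datetime(*day, 10, 30), False),  # already in the recovered database
    LoggedTransaction("T2", datetime(*day, 10, 50), True),   # the malicious transaction
    LoggedTransaction("T3", datetime(*day, 10, 55), False),
    LoggedTransaction("T4", datetime(*day, 11, 20), False),
]

replay = select_for_replay(
    log,
    recovery_point=datetime(*day, 10, 49),
    stop_time=datetime(*day, 11, 25),
)
print([t.tran_id for t in replay])
```

The filter only narrows the candidate list. Deciding whether those candidates can really be replayed still takes the application knowledge the speaker cautions about.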

[00:35:53] – Tracy Dean

(Slide 19 – Surgical recovery – scenario 2) Okay, so I’m going to move on to scenario two. This is where the image copies are only available in the Cyber Vault. So the image copies in production are no longer available. Maybe as part of the corruption, the cyber activity destroyed my image copies. But I do have good image copies in the Cyber Vault because I have these immutable copies taken over time that can never be changed. So in this case, what I’m going to do is I’m going to take the 10:00AM image copies. I’m going to take the 11:00AM logs. Remember, I’m able to save those things off to the permanent volume so I can save the 11:00AM logs before IPL with the 10:00AM image copies, and I’m going to copy that over to Cyber Vault disks that have access to copy back to production. Okay, so this is another storage technology where I’m going to put them on the staging volumes. The staging volumes can then be copied to staging volumes in production using storage technology. And now I have good image copies sitting on a staging volume in production. So now, again, I can do a point in time recovery with those 10:00AM image copies and the 11:00AM logs to get me back to 10:49AM. But I had to copy some data from the Cyber Vault over to my production environment first because the data did not exist in my production environment.

[00:37:32] – Tracy Dean

(Slide 20 – Surgical recovery – scenario 2) So again, the assumptions in this particular scenario: IMS is available and continues to run in production for most of your applications. The log files are corrupted in production. The image copies are not accessible in production – either because they did not exist or they were corrupted by malicious activity, or maybe the tape catalog was corrupted. So my recovery approach again, I’m going to identify the malicious transaction that occurred at 10:50AM. I’m going to capture the image copy from the previous SafeGuarded copy, taken at 10:00AM. So I’m going to take the 10:00AM image copy, I’m going to put that on the staging volume, and then I’m going to copy the logs from my 11:00AM SafeGuarded copy to production using the staging volumes. And then I can recover the database from image copies and replay the good transactions, same as scenario one. Now, if the image copy is not available, I can actually create a clean image copy from a clean copy of the database at 10:00AM.

[00:38:39] – Tracy Dean

So remember, at 10:00AM I have a good system, I have a good database. I can IPL that and create a good image copy, put that on the permanent volume, and now I’m able to move that over to production and create a 10:00AM image copy, if you will. Now, alternatively, of course, I could just copy the data sets from the SafeGuarded copy to the staging volumes and replace the production data sets. But typically what we would want to do is go through a formal recovery process so that everything in IMS knows what’s happened. Again, the result is the same as scenario one. I have a database recovered to a consistent time before the malicious activity at 10:50AM. I can replay transactions after 10:50AM if I want to, if I’m able to do that without jeopardizing consistency.
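The choice of "the 10:00AM copy for the base, the 11:00AM copy for the logs" can be sketched as a lookup around the corruption time. The helper `copies_around` and the hourly schedule below are assumptions for illustration (the talk only names the 10:00AM and 11:00AM copies), and this is not an interface to any storage product.

```python
from bisect import bisect_left
from datetime import datetime


def copies_around(copy_times, corruption_time):
    """Return (last copy before corruption, first copy at or after it).

    The earlier copy is the clean base; the later one holds the logs that
    cover the period up to the recovery point.
    """
    ordered = sorted(copy_times)
    # bisect_left treats a copy taken exactly at the corruption time as
    # suspect, so it lands on the "after" side.
    i = bisect_left(ordered, corruption_time)
    if i == 0:
        raise ValueError("no SafeGuarded copy predates the corruption")
    if i == len(ordered):
        raise ValueError("no SafeGuarded copy taken after the corruption")
    return ordered[i - 1], ordered[i]


day = (2023, 8, 8)  # placeholder date
hourly = [datetime(*day, h, 0) for h in (9, 10, 11, 12)]  # assumed schedule

clean, with_logs = copies_around(hourly, datetime(*day, 10, 50))
print(f"base: {clean:%H:%M} copy, logs: {with_logs:%H:%M} copy")
```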

[00:39:34] – Tracy Dean

(Slide 21 – Surgical recovery – scenario 3, phase 1) Okay, so let’s move on to scenario three. This is the worst scenario. There’s no image copies in production. There’s no image copies in the Cyber Vault. And maybe that’s because all of my image copies go to tape. I don’t have any on disk, so they’re not copied to Z Cyber Vault, and image copies in production have been destroyed, either because the image copies themselves were destroyed or the tape catalog was destroyed, all as part of this malicious activity. So, again, I identify the base for recovery. I perform my forensic analysis. I determine that the 10:00AM copy is good, the 11:00AM is bad, and I’m able to copy the logs and the RECON from the 11:00AM copy over to the base of recovery – over to my 10:00AM system. I’m going to notify IMS that the 10:00AM copy is going to be used for recovery. I’m going to execute a Point In Time recovery. And in this case, in my case, I’m using IMS Database Recovery Facility and I’m replaying the good transactions. So again, I’m able to create an image copy from the previous good version of the database, and copy that over. And then I just start the applications in the Cyber Vault and check the status. So again, I have the Cyber Vault environment to verify everything before I actually return everything to production. So I can test in Cyber Vault, I can practice in Cyber Vault, and I can make sure that my recovery plan will actually work.

[00:41:14] – Tracy Dean

(Slide 22 – Surgical recovery – scenario 3, phase 2) So this is where we are showing a picture of that approach. We’re going to have our out of region DR site. We have our Cyber Vault over here. The 11:00AM is bad. I have the 10:00AM that’s good. I create a good image copy with the 10:00AM, and I copy the recovered database from the recovery volume to the staging volume in the Cyber Vault environment and then copy that using the copy technology into my production environment. And then I’m just doing normal recovery over in my production environment.

[00:41:54] – Tracy Dean

(Slide 23 – Surgical recovery – scenario 3) So again, in this case, the logs are not usable. The image copies are not available because they were maybe created only on tape in production. So I’m going to identify the malicious transaction, obtain the last clean copy of the database from 10:00AM, and I’m going to obtain the logs from the 11:00AM copy. I’m going to execute a Point In Time recovery in the Cyber Vault environment to 10:49AM using the 10:00AM database and the 11:00AM logs. I’m going to do a Point In Time recovery within the Cyber Vault environment to create a good copy of the database. And then I’m going to copy that over to my production environment. And again, I could use Queue Control Facility if I wanted to replay the “good” transactions after that, if I am able to do that again without jeopardizing consistency.

[00:42:45] – Tracy Dean

So in the end, all three scenarios have the same result. I have the database recovered to a consistent point in time before the malicious activity. And if I can and I have the knowledge and the ability to replay “good” transactions without jeopardizing consistency, I might even be able to recover to after 10:50AM, but eliminating the malicious activity.
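The three scenarios differ only in where usable image copies still exist, which can be summarized as a small decision function. This is a sketch of the reasoning in the talk, with hypothetical names (`choose_scenario`, `SCENARIO_SUMMARY`), not a feature of Cyber Vault or of any IMS tool.

```python
def choose_scenario(image_copies_in_production: bool,
                    image_copies_in_vault: bool) -> int:
    """Map image copy availability to the scenario number used in the talk."""
    if image_copies_in_production:
        return 1  # normal Point In Time recovery in production
    if image_copies_in_vault:
        return 2  # copy image copies and logs from the vault, then recover
    return 3      # recover inside the vault, then copy the database back


SCENARIO_SUMMARY = {
    1: "recover in production from production image copies",
    2: "stage image copies and logs from the Cyber Vault, recover in production",
    3: "recover in the Cyber Vault from a clean database copy, copy result to production",
}

for prod, vault in [(True, True), (False, True), (False, False)]:
    n = choose_scenario(prod, vault)
    print(f"production={prod}, vault={vault} -> scenario {n}: {SCENARIO_SUMMARY[n]}")
```

Checking production first reflects the point that scenario one is the simplest and fastest path whenever it is available.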

[00:43:13] – Tracy Dean

(Slide 24 – For all recovery scenarios) Okay, so those are the three scenarios. Any questions, comments?

[00:43:35] – Tracy Dean

The question is about IBM IMS. “It seems like you have to have IBM IMS tools in your environment already to use Cyber Vault effectively. What are the tools, and can we use third party tools?”

[00:43:52] – Tracy Dean

So I am going to cover a chart at the end about which IBM IMS tools you might find useful in a Cyber Vault environment. Many of those have an equivalent third-party vendor tool. So you’re certainly welcome to look at that list and say “Oh, yeah, we have that function with this other tool.” Cyber Vault itself does not require IBM IMS tools. These are all ways to make it better. And IBM IMS tools have functions here. You can look at your vendor tools and see if they meet those same needs. But I will have a chart at the end that I’ll go through about which of the tools you might find useful.

[00:44:27] – Tracy Dean

I will say that obviously, remember, Cyber Vault is just a copy of production. So whatever you want to use in the Cyber Vault, you need to have installed in your production environment. If you want an analysis tool, you need it installed in production because the only way it’s going to get to Cyber Vault is through these SafeGuarded copies. So everything that you want in the Cyber Vault for tooling needs to be installed in your production environment as well. But I will talk about and hopefully answer your question by the end here of what’s kind of the full list of tools to consider and what’s their role.

[00:45:11] – Tracy Dean

So something to keep in mind. This does not replace you creating image copies. Yes, I did show a scenario where no image copies were available, but nobody wants that scenario, right. The first scenario where you had good image copies is much easier and much faster to recover from than the scenario where you have no image copies. So regular image copies are still very important for your day-to-day recovery needs. And even when you’re in a Cyber Vault environment, you should not consider Cyber Vault your new backup policy. Recovering from the Cyber Vault will always take more time than recovering from your normal procedures in your production environment. The purpose of the Cyber Vault is to recover when the data is not available in production and also to allow you to do that data validation and forensic analysis in a protected environment without impacting your production system. The other thing you need to think about is the frequency and retention of your SafeGuarded copies and the frequency and retention of your image copies. So, for example, if your image copies are taken once a week, but the SafeGuarded copy retention is only two days, then you have five days where you have no image copies in the Cyber Vault. And so you really don’t have an easy way to recover. Again, we can always go to scenario three, but we should not be counting on scenario three. We should not be planning for scenario three. We should be planning to do the easier ones in scenarios one and two. So make sure your SafeGuarded copy retention policy and your image copy retention policy are in sync. And again, this is where somebody else in your organization, probably your storage team, is the one that’s actually looking at Cyber Vault. But you as an IMS person, need to insert yourself and say, we need to be in sync and understand what you’re doing so that we can make sure our image copies will meet your needs and also make sure we have the tools we want to be able to recover in a Cyber Vault environment.
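The retention example above is simple arithmetic, and a short sketch makes it easy to check your own policy numbers. The function name `uncovered_days` is an assumption for illustration, and the model is deliberately simple: it only compares the image copy interval with the SafeGuarded copy retention window.

```python
def uncovered_days(image_copy_interval_days: int,
                   safeguarded_retention_days: int) -> int:
    """Days per image copy cycle with no image copy inside the Cyber Vault.

    An image copy is only present in the vault while a SafeGuarded copy
    that captured it is still retained.
    """
    return max(0, image_copy_interval_days - safeguarded_retention_days)


# The example from the talk: weekly image copies, two days of retention.
gap = uncovered_days(image_copy_interval_days=7, safeguarded_retention_days=2)
print(f"{gap} days per week with no image copy in the Cyber Vault")

# A policy that is in sync: retention at least as long as the interval.
in_sync = uncovered_days(image_copy_interval_days=7, safeguarded_retention_days=7)
print(f"{in_sync} uncovered days when retention matches the interval")
```

A result greater than zero means there are days when only scenario three would be available, which is the situation the speaker advises against planning for.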

[00:47:19] – Tracy Dean

(Slide 25 – IMS recovery) So in terms of recovery, there’s kind of three different ways to recover. And I’ve really talked about two of them, but I really haven’t talked about the first one. So if in your data validation or even in your production environment, you discover that someone has corrupted some data, and it’s a very small amount of data, right, it’s not a whole full scale corruption – it’s a small amount of data. You really do not want to do a full recovery and lose transactions perhaps, to fix this problem. So we do have something called IMS Database Repair Facility. We always recommend you involve IBM Level 2 support when you’re using this facility, but you want to have it installed and running and know how to use it. It’ll allow you to, and I use the word loosely, “zap” segments, “zap” pointers. Do kind of a repair of a database – a very small change that needs to be made. Not a full scale recovery, but I just need to repair a couple of segments or a couple of pointers. So IMS Database Repair Facility is included in High Performance Pointer Checker for your full function databases, and it’s included in Fast Path Solution Pack for your Fast Path databases. So this is something you should know how to use. You should have it up and running. You should have it installed and configured. But we recommend you typically not use it on your own. You use it in conjunction with IBM through a case to make sure that you’re not doing further corruption.

[00:48:50] – Tracy Dean

So that’s the really simple case. The next case is where you actually have to recover the databases. We went through these scenarios in this presentation, and of course, recovering to a consistent point in time prior to corruption, perhaps synchronizing your recovery with IMS and Db2, and we have IMS Recovery Solution Pack from IBM to help you with that. And then the third area, which I’ve been mentioning, is this replaying of valid transactions. And this could be useful to you whether you’re recovering or not, but you might want to replay transactions either in a test environment or in your production environment if you’ve had to roll back and you want to replay some of your transactions. You can use Queue Control Facility to do that and you can tell it which transactions – it’ll load all the transactions before it plays them. You can unload the ones you don’t want to replay, those kinds of things. But again, you need to know a lot about your environment and those transactions to make sure you’re not going to end up with a database that’s not consistent – not transactionally consistent.

[00:49:57] – Tracy Dean

(Slide 26 – IBM Z Cyber Vault software selection – summary for IMS) So this is the list I promised. Don’t get overwhelmed because there are some repeats on the list, but I’ve broken it down into data validation, forensic analysis, and surgical recovery. And some tools play in all three areas so they’re repeated. And some of them are only useful if you have DEDBs. So if you’re not a Fast Path customer, you don’t need that. So when we go through this list we’re talking about for data validation, we talked about Pointer Checker, we talked about Fast Path Solution Pack if you have DEDBs. And even in data validation, Recovery Solution Pack gives you that ability to confirm that you have the assets needed for recovery. So there’s sensor collection going on, there’s policy comparisons, there’s exceptions generated if you suddenly find or if we suddenly find that you don’t have all the assets you would need to do a recovery.

[00:50:52] – Tracy Dean

When we move into the forensic analysis phase, IMS Connect Extensions is needed. We don’t really need that in the Cyber Vault, although it will be there, because we’re not running workload and creating transactions. But we do need it running in production to collect the data about your IMS Connect transactions so that we can look at that data in the Cyber Vault. This is one of the interesting ones where we don’t even use it in the Cyber Vault, but we do need to install it and use it in production to collect data that we can use in the Cyber Vault with Problem Investigator and Performance Analyzer. I talked about these two tools in terms of forensic analysis, to do deep dive analysis of your logs, to do reporting of transactions that occurred during a specified period of time.

[00:51:38] – Tracy Dean

And then when we move into the recovery phase, we’ve already mentioned Recovery Solution Pack, again, to actually perform the recovery. High Performance Pointer Checker, to actually do the repair – that Database Repair Facility that I talked about to repair segments. Same thing with Fast Path Solution Pack. I might use that in my recovery scenario if I needed to repair segments using that Database Repair Facility. And then I’ve been mentioning Queue Control Facility to replay specific transactions like the “good” transactions after my Point In Time recovery. So you can see there’s some duplication here. So this is a full list, but some things are duplicated. So the list is not quite that long. But I’ve kind of broken it down into why you would use it. So if you have other vendor tools, you can look at these words and see if your other tool will provide that function for you in each of these phases of data validation, forensic analysis, and surgical recovery.
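For readers without the slide, the list as spoken can be captured in a small mapping from phase to tools, which also shows how much of it is duplication. The structure and names below (`TOOLS_BY_PHASE`, `repeated`) are illustrative and include only the tools named in this transcript; the slide itself may list more.

```python
# Phase -> tools, as named in the talk (not necessarily the complete slide).
TOOLS_BY_PHASE = {
    "data validation": {
        "High Performance Pointer Checker",
        "Fast Path Solution Pack",   # only if you have DEDBs
        "Recovery Solution Pack",
    },
    "forensic analysis": {
        "IMS Connect Extensions",    # collects data in production
        "Problem Investigator",
        "Performance Analyzer",
    },
    "surgical recovery": {
        "Recovery Solution Pack",
        "High Performance Pointer Checker",
        "Fast Path Solution Pack",
        "Queue Control Facility",
    },
}

total_entries = sum(len(tools) for tools in TOOLS_BY_PHASE.values())
distinct = set().union(*TOOLS_BY_PHASE.values())
# Tools that appear in more than one phase.
repeated = {
    tool for tool in distinct
    if sum(tool in tools for tools in TOOLS_BY_PHASE.values()) > 1
}

print(f"{total_entries} entries, {len(distinct)} distinct tools")
print("repeated:", sorted(repeated))
```

Replacing the tool names with the equivalent functions from another vendor gives a quick way to see which phases are already covered in a given shop.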

[00:52:41] – Tracy Dean

(Slide 27) And one of the last things I want to leave you with is a pointer to a Redbook. So there’s a link in the lower right. This is a Redbook on getting started with IBM Z Cyber Vault. I will tell you there is a lot of information in here on the Cyber Vault setup and the value of Cyber Vault, what you’re going to be using it for. Lots of storage information for your storage administrators, those kinds of things. There is a chapter specifically for Db2, and there is a chapter specifically for IMS as well. So very good Redbook. I wouldn’t expect you to necessarily read it from start to finish, but it’s a good reference and a good pointer for maybe your storage people as well.

[00:53:23] – Tracy Dean

(Slide 28 – additional IMS and IMS Tools links) So this is, again, a little bit of background on Cyber Vault. Hopefully more information about how IMS plays in the Cyber Vault. What you as an IMS system programmer or DBA, need to be aware of if your company is looking at Cyber Vault. And how you might need to insert yourself and make sure that you’re staying in sync with your image copy frequencies and retentions, and making sure that you have what you need to be able to take advantage of Cyber Vault in terms of data validation, and make sure that IMS is participating in that, and making sure you know how to do forensic analysis if and when it happens, and making sure you know how to do recovery. Several links here to websites, the IMS website, the IMS Tools website, new functions for both IMS tools and for IMS as well. So there’s links here if you want to keep up to date on what’s going on with new functions. And of course, the listserv which many of you already participate in as well. Many of these websites, if you’re subscribed to IBM’s My Notifications and you’ve selected IMS, you will get notified when we update these websites. You’ll get notified when we update IMS Tools new functions. You’ll get updated when we update our website, talking about support for v.15 or managed ACBs or data set encryption, those kinds of things. So if you’re not using My Notifications, I strongly suggest you just Google IBM My Notifications, sign in with your IBM ID that you use when you’re opening cases and subscribe, so you’ll get notified when these things change. With that, I’ll say thank you and see if there’s any final questions.

[00:55:26] – Amanda Hendley

Thank you, Tracy.

[00:55:28] – Tracy Dean

All right. Thanks, Amanda.

[00:55:29] – Amanda Hendley

Thanks. Well, thanks everyone for joining us. And Tracy, that was great. The video is going to be available, like I said, in about a week, and the deck will be available, so you can get those links that Tracy posted. So don’t worry, you can check those out, and I’m sure if any questions come up, you can connect with Tracy and get those answered as well.

[00:56:00] – Amanda Hendley

So for the rest of our session today, and we’ll conclude just in a few moments, but I wanted to give you some news and articles. The first one is, since we last met, there has been an announcement from IBM about IMS. So they made an announcement about IMS 15.4. So there’s going to be some information and updates in that that you’ll want to check out if you haven’t already. We’ve also got a job posting to share if you’re interested in making a change. There’s a posting over on CMG’s job board for a remote position looking for IMS skills.

[00:57:14] – Amanda Hendley

I’m going to drop you all links to everything. And then there’s also a call for contributors over at Planet Mainframe. Now on that recent announcement from IBM, Tracy brought this to my attention, and I’m curious if y’all on the call want to talk about it or share anything about it. So, a deadline that was impending: several years ago, it was suggested that ACBs will be required to be managed by IMS in the future. And in this announcement, they put a deadline on it finally. So June 2025 is that deadline. I’m curious to know what impact that deadline will have on your own processes. Does anyone have any thoughts on that or want to share? You’re welcome to come off mute and share here now, or if you follow the QR code, we’ve just set up a quick poll. We’re interested in getting your feedback on it because there is going to be an article at Planet Mainframe soon.

[00:58:44] – Amanda Hendley

So Mainframe Virtual User Group is going to be the new place to go on Twitter and also on YouTube. And what you’ll find there is the entire collection of the user group videos. So you won’t have to go to different pages to access that information. And then lastly for today, I want to announce our next session. We’re going to be talking about System Z as the enterprise information server. Stan Muse is going to be joining us in a couple of months, and we’re ready for registration to be open. So you can scan this or hit the link up in our chat. And in the meantime, if you’re not subscribed, be on the lookout though, for our newsletter. You can subscribe to that mailing list on the user group virtual IMS page. And with that, I think we are done here. Any final remarks from you, Tracy?

[00:59:50] – Tracy Dean

Just thank you for inviting me. And thanks to everyone for attending.

[00:59:53] – Amanda Hendley

Thank you all so much.

Upcoming Virtual IMS Meetings

February 11, 2025

Virtual IMS User Group Meeting

Deal with growing cybersecurity risks and do Real-time and after-the-event IMS analytics, all at the same time

Sahil Gupta
Senior Product Developer
BMC

Santosh Belgaonkar
Staff Specialist Product Developer
BMC

April 8, 2024

Virtual IMS User Group Meeting

IMS Catalog implementation using Ansible Playbooks
Dennis Eichelberger
IBM

June 10, 2025

Virtual IMS User Group Meeting