Bug 1099391

Summary: cannot boot system installed on lvm on top of raid1 after update (Cannot activate LVs in VG vghome while PVs appear on duplicate devices)
Product: [openSUSE] openSUSE Tumbleweed
Reporter: Michael Hanscho <reset11>
Component: Basesystem
Assignee: heming zhao <heming.zhao>
Status: RESOLVED FIXED
QA Contact: E-mail List <qa-bugs>
Severity: Critical
Priority: P2 - High
CC: jxu, klaus.loehel, nmoreychaisemartin, opensuse-mail, reset11, stefan.schaefer
Version: Current
Target Milestone: ---
Hardware: x86-64
OS: openSUSE Factory
Attachments:
  vgs -vvv output
  The patch was missed in lvm2 v2.x.x branch
  lvm2 rpms which were built in my local machine
  lvm2_2.02.182 rpm, which also includes some patches to fix this bug
  rpms-for-comment43

Description Michael Hanscho 2018-06-27 20:03:25 UTC
Updating the system after some time offline (more than 3000 packages changed) resulted in a system that cannot boot any more.

The reason:
Cannot activate LVs in VG vghome while PVs appear on duplicate devices.

The system uses LVM on top of raid1. It seems that the PV of the raid1 is also found on the individual disks that make up the raid1 device:
[  147.121725] linux-472a dracut-initqueue[391]: WARNING: PV qG1QRz-Ivm1-QVwq-uaHV-va9w-wwXh-lIIOhV on /dev/sda2 was already found on /dev/md1.
[  147.123427] linux-472a dracut-initqueue[391]: WARNING: PV qG1QRz-Ivm1-QVwq-uaHV-va9w-wwXh-lIIOhV on /dev/sdb2 was already found on /dev/md1.
[  147.369863] linux-472a dracut-initqueue[391]: WARNING: PV qG1QRz-Ivm1-QVwq-uaHV-va9w-wwXh-lIIOhV prefers device /dev/md1 because device size is correct.
[  147.370597] linux-472a dracut-initqueue[391]: WARNING: PV qG1QRz-Ivm1-QVwq-uaHV-va9w-wwXh-lIIOhV prefers device /dev/md1 because device size is correct.
[  147.371698] linux-472a dracut-initqueue[391]: Cannot activate LVs in VG vghome while PVs appear on duplicate devices.

It ends up in the Dracut Emergency Shell.

Bug 1097425 is possibly a duplicate. Based on hints in the IRC chat I nevertheless filed a separate bug report.

Thanks
Michael
Comment 1 Michael Hanscho 2018-06-27 20:19:00 UTC
Sorry - I found out that there is already a duplicate, filed today...

*** This bug has been marked as a duplicate of bug 1099329 ***
Comment 2 Michael Hanscho 2018-09-02 22:44:17 UTC
With the recent upgrade of the combination of lvm2 (lvm2-2.02.180) and device-mapper (device-mapper-1.02.149) the situation gets even more annoying.

As booting from raid1 was not possible any more, I installed a fresh Tumbleweed on a separate hd and mounted /home and /data from the old setup (LVM on raid). After this upgrade that is no longer possible: duplicate physical volumes are now found on one of the raid disks and on the md device (/dev/sdc2 and /dev/md126), and because of the duplicate PVs the LVM cannot be activated, so I cannot mount /home and /data any more.

If I install lvm2 and device-mapper from Leap 15 (lvm2-2.02.177-lp151.6.1, liblvm2app2_2-2.02.177-lp151.6.1, liblvm2cmd2_02-2.02.177-lp151.6.1), everything works fine again.

Gruesse
Michael
Comment 3 Gang He 2018-09-26 12:40:37 UTC
I will look at this problem; it looks like a patch was missed during the upgrade.

Thanks
Gang
Comment 4 Gang He 2018-10-09 02:17:30 UTC
Hello Michael Hanscho,

I reviewed all the patches from lvm2 2.02.177 to lvm2 2.02.180; no patch was missed during this upgrade. I think the problem was probably introduced by an upstream code change.

Could you share the detailed steps to reproduce this issue?
If I use a data partition (not the "/" root partition), can I still encounter this problem? I'd like to reproduce this issue locally.

Thanks
Gang
Comment 5 Michael Hanscho 2018-10-09 21:14:52 UTC
1.) Yes - in my case a raid1 data partition alone is enough to trigger the problem.
2.) I will try to summarize the setup that triggers the issue as clearly as possible.

I have 4 hds in the setup.
sdb and sdc contain the raid1 disks; sdd holds the Tumbleweed system and sda the Leap system. I am running Tumbleweed, booting from sdd (UEFI based); sda is not active.


Tumbleweed system disk sdd:
###
gdisk -l /dev/sdd
GPT fdisk (gdisk) version 1.0.1

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sdd: 312581808 sectors, 149.0 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 1E044F67-FDA7-48A7-B991-D84EEAA7A7F4
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 312581774
Partitions will be aligned on 2048-sector boundaries
Total free space is 2014 sectors (1007.0 KiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048         1026047   500.0 MiB   EF00  
   2         1026048       231712767   110.0 GiB   8300  
   3       231712768       312581774   38.6 GiB    8200
###

raid disk 1
###
fdisk -l /dev/sdb
Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x0006c1d3

Device     Boot   Start        End    Sectors  Size Id Type
/dev/sdb1          2048    1028095    1026048  501M fd Linux raid autodetect
/dev/sdb2       1028096 3907028991 3906000896  1.8T fd Linux raid autodetect
###

raid disk 2
###
fdisk -l /dev/sdc
Disk /dev/sdc: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x0008497b

Device     Boot   Start        End    Sectors  Size Id Type
/dev/sdc1          2048    1028095    1026048  501M fd Linux raid autodetect
/dev/sdc2       1028096 3907028991 3906000896  1.8T fd Linux raid autodetect
###


The raid1 arrays consist of sdb2, sdc2 and sdb1, sdc1 (sdb1, sdc1 = md127 was the former boot partition and is no longer used; sdb2, sdc2 = md126 is the raid1 on top of which LVM is used):

###
cat /proc/mdstat 
Personalities : [raid1] 
md126 : active raid1 sdb2[0] sdc2[1]
      1953000312 blocks super 1.0 [2/2] [UU]
      bitmap: 0/15 pages [0KB], 65536KB chunk

md127 : active (auto-read-only) raid1 sdc1[1] sdb1[0]
      513012 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

unused devices: <none>
###

The physical volume:
###
pvs -v
    Wiping internal VG cache
    Wiping cache of LVM-capable devices
  PV         VG     Fmt  Attr PSize PFree   DevSize PV UUID                               
  /dev/md126 vghome lvm2 a--  1.82t 202.52g   1.82t qG1QRz-Ivm1-QVwq-uaHV-va9w-wwXh-lIIOhV
###

logical volumes:
###
lvs -v
  LV   VG     #Seg Attr       LSize   Maj Min KMaj KMin Pool Origin Data%  Meta%  Move Cpy%Sync Log Convert LV UUID                                LProfile
  data vghome    3 -wi-ao----   1.37t  -1  -1  254    1                                                     Qfmzak-t3DH-yWzQ-sEI3-WJ5H-leDr-a9lOU5         
  home vghome    2 -wi-ao---- 260.00g  -1  -1  254    0                                                     peXip6-9JWE-wccb-HnLL-2uJ7-ALv1-6SylWV
###

Both logical volumes (data and home) are ext4 formatted and mounted via fstab:
/dev/mapper/vghome-home                    /home                   ext4   acl,user_xattr                1  2
/dev/mapper/vghome-data                    /data                   ext4   acl,user_xattr                1  2


This works using Tumbleweed 
NAME="openSUSE Tumbleweed"
# VERSION="20181002"
when downgrading to the following packages:
lvm2-2.02.177-lp151.6.1.x86_64.rpm
liblvm2cmd2_02-2.02.177-lp151.6.1.x86_64.rpm
liblvm2app2_2-2.02.177-lp151.6.1.x86_64.rpm

When upgrading to the standard package versions, the boot fails:

pvs shows:
  WARNING: found device with duplicate /dev/sdc2
  WARNING: found device with duplicate /dev/md126
  WARNING: Disabling lvmetad cache which does not support duplicate PVs.
  WARNING: Scan found duplicate PVs.
  WARNING: Not using lvmetad because cache update failed.
  /dev/sde: open failed: No medium found
  WARNING: Not using device /dev/sdc2 for PV qG1QRz-Ivm1-QVwq-uaHV-va9w-wwXh-lIIOhV.
  WARNING: Not using device /dev/md126 for PV qG1QRz-Ivm1-QVwq-uaHV-va9w-wwXh-lIIOhV.
  WARNING: PV qG1QRz-Ivm1-QVwq-uaHV-va9w-wwXh-lIIOhV prefers device /dev/sdb2 because of previous preference.
  WARNING: PV qG1QRz-Ivm1-QVwq-uaHV-va9w-wwXh-lIIOhV prefers device /dev/sdb2 because of previous preference.
  PV         VG     Fmt  Attr PSize PFree  
  /dev/sdb2  vghome lvm2 a--  1.82t 202.52g


blkid (sda excluded):
/dev/sdb1: UUID="160998c8-7e21-bcff-9cea-0bbc46454716" UUID_SUB="9e7569b0-f44a-71cb-dc8b-9238eaf18f89" LABEL="linux:0" TYPE="linux_raid_member" PARTUUID="0006c1d3-01"
/dev/sdb2: UUID="17426969-03d7-bfa7-5be3-3b0b8171417a" UUID_SUB="556aa58a-05e1-7703-3aaa-cac3cf362bf0" LABEL="linux:1" TYPE="linux_raid_member" PARTUUID="0006c1d3-02"
/dev/sdc1: UUID="160998c8-7e21-bcff-9cea-0bbc46454716" UUID_SUB="9002a7c5-9eee-1c1b-f7dc-347dfc5ee387" LABEL="linux:0" TYPE="linux_raid_member" PARTUUID="0008497b-01"
/dev/sdc2: UUID="17426969-03d7-bfa7-5be3-3b0b8171417a" UUID_SUB="63ebc1f9-8081-22bd-cb9c-ab675a27656c" LABEL="linux:1" TYPE="linux_raid_member" PARTUUID="0008497b-02"
/dev/sdd1: SEC_TYPE="msdos" UUID="1230-5B54" TYPE="vfat" PARTUUID="554ac808-9cfb-4f77-8b34-4527e44887fd"
/dev/sdd2: UUID="4d5ec84b-8585-4039-88e6-23f6d6669b7f" UUID_SUB="8d0d06b5-0e51-4821-a250-977f1b49313e" TYPE="btrfs" PARTUUID="292e17f1-dbf0-41aa-8e5f-185936a15f5b"
/dev/sdd3: UUID="e2e698ad-835b-4e5b-9df8-193d94576d65" TYPE="swap" PARTUUID="c20ec374-2eb5-4ff3-9575-28df88dc9481"
/dev/md127: UUID="f1ac825c-dd82-408a-8bc3-9c027965fb42" TYPE="ext4"
/dev/md126: UUID="qG1QRz-Ivm1-QVwq-uaHV-va9w-wwXh-lIIOhV" TYPE="LVM2_member"
/dev/mapper/vghome-home: UUID="2ef2cbc1-64f4-494a-9426-11e6f8a1eb2b" TYPE="ext4"
/dev/mapper/vghome-data: UUID="f496aec8-9be7-4843-8cf3-847c8535c1c1" TYPE="ext4"

"Downgrading" again to leap packages allows boot again
Hope that helps...

Gruesse
Michael
Comment 6 Michael Hanscho 2018-10-09 21:35:54 UTC
Addition:

When the boot fails you end up in emergency mode.
An error message before the emergency mode states:
"Failed to start LVM2 PV scan on device 9:126" and suggests checking the details with:
"systemctl status lvm2-pvscan@9:126"

Running "systemctl status lvm2-pvscan@9:126" gives the following output:
 lvm2-pvscan@9:126.service - LVM2 PV scan on device 9:126
   Loaded: loaded (/usr/lib/systemd/system/lvm2-pvscan@.service; static; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2018-10-09 23:20:04 CEST; 4min 26s ago
     Docs: man:pvscan(8)
  Process: 861 ExecStart=/usr/sbin/lvm pvscan --cache --activate ay 9:126 (code=exited, status=5)
 Main PID: 861 (code=exited, status=5)

Oct 09 23:20:04 linux-dnetctw lvm[861]:   /dev/sde: open failed: No medium found
Oct 09 23:20:04 linux-dnetctw lvm[861]:   WARNING: Not using device /dev/sdc2 for PV qG1QRz-Ivm1-QVwq-uaHV-va9w-wwXh-lIIOhV.
Oct 09 23:20:04 linux-dnetctw lvm[861]:   WARNING: Not using device /dev/md126 for PV qG1QRz-Ivm1-QVwq-uaHV-va9w-wwXh-lIIOhV.
Oct 09 23:20:04 linux-dnetctw lvm[861]:   WARNING: PV qG1QRz-Ivm1-QVwq-uaHV-va9w-wwXh-lIIOhV prefers device /dev/sdb2 because of previous preference.
Oct 09 23:20:04 linux-dnetctw lvm[861]:   Cannot activate LVs in VG vghome while PVs appear on duplicate devices.
Oct 09 23:20:04 linux-dnetctw lvm[861]:   0 logical volume(s) in volume group "vghome" now active
Oct 09 23:20:04 linux-dnetctw lvm[861]:   vghome: autoactivation failed.
Oct 09 23:20:04 linux-dnetctw systemd[1]: lvm2-pvscan@9:126.service: Main process exited, code=exited, status=5/NOTINSTALLED
Oct 09 23:20:04 linux-dnetctw systemd[1]: lvm2-pvscan@9:126.service: Failed with result 'exit-code'.
Oct 09 23:20:04 linux-dnetctw systemd[1]: Failed to start LVM2 PV scan on device 9:126.

Gruesse
Michael
Comment 7 Gang He 2018-10-10 03:38:30 UTC
Hi 

I tried to reproduce this issue with a virtual machine and virtual disks, but I cannot reproduce it; when I reboot the virtual machine, the LV on MD is mounted to /ext4 automatically.

tb0528-nd1:~ # rpm -qa | grep device-
device-mapper-1.02.149-2.1.x86_64
tb0528-nd1:~ # rpm -qa | grep lvm2
liblvm2app2_2-2.02.180-2.1.x86_64
liblvm2cmd2_02-2.02.180-2.1.x86_64
lvm2-2.02.180-2.1.x86_64
tb0528-nd1:~ # lsblk
NAME                      MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sr0                        11:0    1  3.9G  0 rom   /run/media/ghe/openSUSE-Tumbleweed-DVD-x86_6423
vda                       253:0    0   40G  0 disk
├─vda1                    253:1    0    4G  0 part  [SWAP]
└─vda2                    253:2    0   36G  0 part  /
vdb                       253:16   0   30G  0 disk
└─md0                       9:0    0   30G  0 raid1
  └─cluster--vg1-test--lv 254:0    0   10G  0 lvm   /ext4
vdc                       253:32   0   30G  0 disk
└─md0                       9:0    0   30G  0 raid1
  └─cluster--vg1-test--lv 254:0    0   10G  0 lvm   /ext4
tb0528-nd1:~ # cat /etc/fstab
UUID=f61b09ba-4d99-41ba-ab75-9c166deaff05  /      ext4  acl,user_xattr  0  1
UUID=2b54b109-8b4d-421f-a88e-9c6ab3e8f389  swap   swap  defaults        0  0
/dev/cluster-vg1/test-lv                   /ext4  ext4  defaults        0  2
tb0528-nd1:~ # pvs
  PV         VG          Fmt  Attr PSize  PFree
  /dev/md0   cluster-vg1 lvm2 a--  29.98g 19.98g
tb0528-nd1:~ # vgs
  VG          #PV #LV #SN Attr   VSize  VFree
  cluster-vg1   1   1   0 wz--n- 29.98g 19.98g
tb0528-nd1:~ # lvs
  LV      VG          Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  test-lv cluster-vg1 -wi-ao---- 10.00g
tb0528-nd1:~ # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 vdb[0] vdc[1]
      31439872 blocks super 1.2 [2/2] [UU]

unused devices: <none>
tb0528-nd1:~ #
Comment 8 Gang He 2018-10-10 03:40:59 UTC
I will try to reproduce this bug with the configuration Michael provided; perhaps the bug is related to this specific scenario.

Thanks
Gang
Comment 9 Gang He 2018-10-10 08:48:06 UTC
I tried to use a similar configuration to reproduce this bug, but failed; I can reboot the system and both LVs are mounted.

tb0528-nd1:/ # rpm -qa | grep lvm2
liblvm2app2_2-2.02.180-2.1.x86_64
liblvm2cmd2_02-2.02.180-2.1.x86_64
lvm2-2.02.180-2.1.x86_64
tb0528-nd1:/ # rpm -qa | grep device-
device-mapper-1.02.149-2.1.x86_64
tb0528-nd1:/ # lsblk
NAME                         MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sr0                           11:0    1  3.9G  0 rom   /run/media/ghe/openSUSE-Tumbleweed-DVD-x86_6423
vda                          253:0    0   40G  0 disk
├─vda1                       253:1    0    4G  0 part  [SWAP]
└─vda2                       253:2    0   36G  0 part  /
vdb                          253:16   0   30G  0 disk
├─vdb1                       253:17   0   10G  0 part
│ └─md0                        9:0    0   10G  0 raid1
│   └─cluster--vg1-test--lv2 254:1    0   18G  0 lvm   /ext4
└─vdb2                       253:18   0   20G  0 part
  └─md1                        9:1    0   20G  0 raid1
    ├─cluster--vg1-test--lv1 254:0    0   10G  0 lvm   /home
    └─cluster--vg1-test--lv2 254:1    0   18G  0 lvm   /ext4
vdc                          253:32   0   30G  0 disk
├─vdc1                       253:33   0   10G  0 part
│ └─md0                        9:0    0   10G  0 raid1
│   └─cluster--vg1-test--lv2 254:1    0   18G  0 lvm   /ext4
└─vdc2                       253:34   0   20G  0 part
  └─md1                        9:1    0   20G  0 raid1
    ├─cluster--vg1-test--lv1 254:0    0   10G  0 lvm   /home
    └─cluster--vg1-test--lv2 254:1    0   18G  0 lvm   /ext4
tb0528-nd1:/ # cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 vdb2[0] vdc2[1]
      20960128 blocks super 1.0 [2/2] [UU]    <<== meta data version is 1.0
      bitmap: 0/1 pages [0KB], 65536KB chunk

md0 : active raid1 vdc1[1] vdb1[0]
      10485632 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

unused devices: <none>
tb0528-nd1:/ # cat /etc/fstab
UUID=f61b09ba-4d99-41ba-ab75-9c166deaff05  /      ext4  acl,user_xattr  0  1
UUID=2b54b109-8b4d-421f-a88e-9c6ab3e8f389  swap   swap  defaults        0  0
/dev/cluster-vg1/test-lv2                  /ext4  ext4  defaults        0  2
/dev/cluster-vg1/test-lv1                  /home  ext4  defaults        0  2
tb0528-nd1:/ #

I want to know what the bug's root cause is; it is a little weird.
Comment 10 Gang He 2018-10-11 02:56:13 UTC
I will use the latest openSUSE Tumbleweed Snapshot20181004 to try to reproduce this.
If I cannot reproduce this bug, maybe it is related to the hardware,
since I only have one physical machine on which to create virtual machines/disks for reproduction.

Thanks
Gang
Comment 11 Gang He 2018-10-12 08:59:27 UTC
Hello Michael,

I can not reproduce this problem, even using openSUSE Tumbleweed 2018100.
I sent this problem to the upstream list; the LVM2 developer is asking the questions below.
Could you help to get the related information?

Do these warnings only appear from "dracut-initqueue"?  Can you run and
send 'vgs -vvvv' from the command line?  If they don't appear from the
command line, then is "dracut-initqueue" using a different lvm.conf?
lvm.conf settings can affect this (filter, md_component_detection,
external_device_info_source).


> Is this a regression bug? The user did not encounter this problem with lvm2 v2.02.177.

It could be, since the new scanning changed how md detection works.  The
md superblock version affects how lvm detects this.  md superblock 1.0 (at
the end of the device) is not detected as easily as newer md versions
(1.1, 1.2) where the superblock is at the beginning.  Do you know which
this is?
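
For reference, the superblock location can be checked directly with mdadm; a minimal sketch, using the device names reported above (/dev/md126 and its member /dev/sdb2):

```
# metadata 1.0 stores the superblock at the end of the member device,
# while 1.1/1.2 store it near the beginning
mdadm --detail /dev/md126 | grep -i version
mdadm --examine /dev/sdb2 | grep -i version
```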
Comment 12 Michael Hanscho 2018-10-12 18:40:54 UTC
Created attachment 785899 [details]
vgs -vvv output
Comment 13 Michael Hanscho 2018-10-12 18:42:29 UTC
Hi Gang!

I did not mention until now that this is not a newly installed system - I cannot even remember which openSUSE version I was using when I created the raid1. These warnings appear during system startup and prevent mounting the two LVs, so I need to downgrade first before I can use these two volumes.

vgs -vvvv (when using the working lvm versions from Leap) produces the output found in the attachment (more than 65500 characters) - "vgs -vvv output".

If it helps:
mdadm --detail --scan -vvv
/dev/md/linux:0:
           Version : 1.0
     Creation Time : Sun Jul 22 22:49:21 2012
        Raid Level : raid1
        Array Size : 513012 (500.99 MiB 525.32 MB)
     Used Dev Size : 513012 (500.99 MiB 525.32 MB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Mon Jul 16 00:29:19 2018
             State : clean 
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : linux:0
              UUID : 160998c8:7e21bcff:9cea0bbc:46454716
            Events : 469

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
/dev/md/linux:1:
           Version : 1.0
     Creation Time : Sun Jul 22 22:49:22 2012
        Raid Level : raid1
        Array Size : 1953000312 (1862.53 GiB 1999.87 GB)
     Used Dev Size : 1953000312 (1862.53 GiB 1999.87 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Fri Oct 12 20:16:25 2018
             State : clean 
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : linux:1
              UUID : 17426969:03d7bfa7:5be33b0b:8171417a
            Events : 326248

    Number   Major   Minor   RaidDevice State
       0       8       18        0      active sync   /dev/sdb2
       1       8       34        1      active sync   /dev/sdc2

What would you need in addition?
What should I change - or should I back up the data and recreate the raid1?

Gruesse
Michael
Comment 14 Gang He 2018-10-16 03:14:53 UTC
Hello Michael,

There are some comments from upstream; could you help to try their suggestions, since I cannot reproduce this in my local environment?

> mdadm --detail --scan -vvv
> /dev/md/linux:0:
>            Version : 1.0

It has the old superblock version 1.0 located at the end of the device, so
lvm will not always see it.  (lvm will look for it when it's writing to
new devices to ensure it doesn't clobber an md component.)

(Also keep in mind that this md superblock is no longer recommended:
raid.wiki.kernel.org/index.php/RAID_superblock_formats)

There are various ways to make lvm handle this (see the lvm.conf sketch after this list):

- allow_changes_with_duplicate_pvs=1
- external_device_info_source="udev"
- reject sda2, sdb2 in lvm filter
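
A minimal sketch of how these suggestions could look in /etc/lvm/lvm.conf (the filter pattern is an assumption and must reject the md component partitions of the affected system; typically only one of the options would be enabled at a time, and the initrd likely needs to be rebuilt, e.g. with dracut -f, for the change to take effect at boot):

```
devices {
    # option 1: tolerate duplicate PVs during activation
    allow_changes_with_duplicate_pvs = 1

    # option 2: let lvm ask udev about devices, so md components are skipped
    external_device_info_source = "udev"

    # option 3: reject the raid member partitions explicitly, accept the rest
    # (device names here are examples; adjust to the actual components)
    filter = [ "r|^/dev/sda2$|", "r|^/dev/sdb2$|", "a|.*|" ]
}
```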
Comment 15 Michael Hanscho 2018-10-16 21:02:57 UTC
Hi Gang!

Sure I will try to help!
I tested the options in lvm.conf one by one.

The good news - enabling 
- external_device_info_source="udev"
- reject sda2, sdb2 in lvm filter

both work! The system activates the LVM on the proper raid1 device again.

The first option (allow_changes_with_duplicate_pvs=1) does not work.
systemctl status lvm2-pvscan@9:126 results in:

● lvm2-pvscan@9:126.service - LVM2 PV scan on device 9:126
   Loaded: loaded (/usr/lib/systemd/system/lvm2-pvscan@.service; static; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2018-10-16 22:53:57 CEST; 3min 4s ago
     Docs: man:pvscan(8)
  Process: 849 ExecStart=/usr/sbin/lvm pvscan --cache --activate ay 9:126 (code=exited, status=5)
 Main PID: 849 (code=exited, status=5)

Oct 16 22:53:57 linux-dnetctw lvm[849]:   WARNING: Not using device /dev/md126 for PV qG1QRz-Ivm1-QVwq-uaHV-va9w-wwXh-lIIOhV.
Oct 16 22:53:57 linux-dnetctw lvm[849]:   WARNING: PV qG1QRz-Ivm1-QVwq-uaHV-va9w-wwXh-lIIOhV prefers device /dev/sdb2 because of previous preference.
Oct 16 22:53:57 linux-dnetctw lvm[849]:   WARNING: PV qG1QRz-Ivm1-QVwq-uaHV-va9w-wwXh-lIIOhV prefers device /dev/sdb2 because of previous preference.
Oct 16 22:53:57 linux-dnetctw lvm[849]:   device-mapper: reload ioctl on  (254:0) failed: Device or resource busy
Oct 16 22:53:57 linux-dnetctw lvm[849]:   device-mapper: reload ioctl on  (254:0) failed: Device or resource busy
Oct 16 22:53:57 linux-dnetctw lvm[849]:   0 logical volume(s) in volume group "vghome" now active
Oct 16 22:53:57 linux-dnetctw lvm[849]:   vghome: autoactivation failed.
Oct 16 22:53:57 linux-dnetctw systemd[1]: lvm2-pvscan@9:126.service: Main process exited, code=exited, status=5/NOTINSTALLED
Oct 16 22:53:57 linux-dnetctw systemd[1]: lvm2-pvscan@9:126.service: Failed with result 'exit-code'.
Oct 16 22:53:57 linux-dnetctw systemd[1]: Failed to start LVM2 PV scan on device 9:126.

pvs shows:
  /dev/sde: open failed: No medium found
  WARNING: found device with duplicate /dev/sdc2
  WARNING: found device with duplicate /dev/md126
  WARNING: Disabling lvmetad cache which does not support duplicate PVs.
  WARNING: Scan found duplicate PVs.
  WARNING: Not using lvmetad because cache update failed.
  /dev/sde: open failed: No medium found
  WARNING: Not using device /dev/sdc2 for PV qG1QRz-Ivm1-QVwq-uaHV-va9w-wwXh-lIIOhV.
  WARNING: Not using device /dev/md126 for PV qG1QRz-Ivm1-QVwq-uaHV-va9w-wwXh-lIIOhV.
  WARNING: PV qG1QRz-Ivm1-QVwq-uaHV-va9w-wwXh-lIIOhV prefers device /dev/sdb2 because of previous preference.
  WARNING: PV qG1QRz-Ivm1-QVwq-uaHV-va9w-wwXh-lIIOhV prefers device /dev/sdb2 because of previous preference.
  PV         VG     Fmt  Attr PSize PFree  
  /dev/sdb2  vghome lvm2 a--  1.82t 202.52g

Is there anything else I can provide?

Gruesse
Michael
Comment 16 Gang He 2018-10-18 10:11:09 UTC
Created attachment 786400 [details]
The patch was missed in lvm2 v2.x.x branch
Comment 17 Gang He 2018-10-18 10:12:53 UTC
I will create a new lvm2 package including the above patch tomorrow; then you can download it to test whether it helps with this problem.

Thanks
Gang
Comment 18 Gang He 2018-10-19 05:35:15 UTC
Update from upstream:
there are three commits fixing this problem:
d1b652143abc tests: add new test for lvm on md devices
e7bb50880901 scan: enable full md filter when md 1.0 devices are present
de2863739f2e scan: use full md filter when md 1.0 devices are present


Hello Michael,

could you try the new rpms from my branch to verify whether the patches fix this bug?

https://build.opensuse.org/package/binaries/home:ganghe:branches:openSUSE:Factory/lvm2/openSUSE_Factory
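
For such a boot-time issue the initrd should probably be rebuilt after installing the test packages, since the failing activation runs from the dracut initqueue. A minimal sketch, assuming the rpms were downloaded into the current directory (the file names are placeholders):

```
# install the downloaded test rpms over the distribution versions
rpm -Uvh --force lvm2-*.rpm liblvm2*.rpm
# rebuild the initrd so early boot uses the same lvm2, then reboot to test
dracut -f
reboot
```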

Thanks
Gang
Comment 21 Gang He 2018-10-22 03:37:28 UTC
Created attachment 786623 [details]
lvm2 rpms which were built in my local machine
Comment 22 Gang He 2018-10-22 03:39:38 UTC
Hello Michael,

Could you try my local lvm2 rpms to verify this bug? 

Thanks
Gang
Comment 23 Michael Hanscho 2018-10-22 21:39:09 UTC
Hi Gang!

In a first round I installed lvm2-2.02.180-0.x86_64.rpm, liblvm2cmd2_02-2.02.180-0.x86_64.rpm and liblvm2app2_2-2.02.180-0.x86_64.rpm - but no luck; after reboot the same problem, ending up in the emergency console.

In the next round I additionally installed libdevmapper-event1_03-1.02.149-0.x86_64.rpm, libdevmapper1_03-1.02.149-0.x86_64.rpm and device-mapper-1.02.149-0.x86_64.rpm - again ending up in the emergency console.

systemctl status lvm2-pvscan@9:126 output: 
lvm2-pvscan@9:126.service - LVM2 PV scan on device 9:126
   Loaded: loaded (/usr/lib/systemd/system/lvm2-pvscan@.service; static; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2018-10-22 07:34:56 CEST; 5min ago
     Docs: man:pvscan(8)
  Process: 815 ExecStart=/usr/sbin/lvm pvscan --cache --activate ay 9:126 (code=exited, status=5)
 Main PID: 815 (code=exited, status=5)

Oct 22 07:34:55 linux-dnetctw lvm[815]:   WARNING: Autoactivation reading from disk instead of lvmetad.
Oct 22 07:34:56 linux-dnetctw lvm[815]:   /dev/sde: open failed: No medium found
Oct 22 07:34:56 linux-dnetctw lvm[815]:   WARNING: Not using device /dev/md126 for PV qG1QRz-Ivm1-QVwq-uaHV-va9w-wwXh-lIIOhV.
Oct 22 07:34:56 linux-dnetctw lvm[815]:   WARNING: PV qG1QRz-Ivm1-QVwq-uaHV-va9w-wwXh-lIIOhV prefers device /dev/sdb2 because of previous preference.
Oct 22 07:34:56 linux-dnetctw lvm[815]:   Cannot activate LVs in VG vghome while PVs appear on duplicate devices.
Oct 22 07:34:56 linux-dnetctw lvm[815]:   0 logical volume(s) in volume group "vghome" now active
Oct 22 07:34:56 linux-dnetctw lvm[815]:   vghome: autoactivation failed.
Oct 22 07:34:56 linux-dnetctw systemd[1]: lvm2-pvscan@9:126.service: Main process exited, code=exited, status=5/NOTINSTALLED
Oct 22 07:34:56 linux-dnetctw systemd[1]: lvm2-pvscan@9:126.service: Failed with result 'exit-code'.
Oct 22 07:34:56 linux-dnetctw systemd[1]: Failed to start LVM2 PV scan on device 9:126.

Gruesse
Michael
Comment 24 Gang He 2018-10-23 02:28:33 UTC
Hello Michael,

Thanks for your efforts; I will discuss this feedback with upstream.
By the way, could you try solution 1) to see if it works?

There are various ways to make lvm handle this:

- allow_changes_with_duplicate_pvs=1
- external_device_info_source="udev"
- reject sda2, sdb2 in lvm filter


Thanks
Gang
Comment 27 Gang He 2018-12-06 05:29:17 UTC
Created attachment 792013 [details]
lvm2_2.02.182 rpm, which also includes some patches to fix this bug
Comment 28 Gang He 2018-12-06 05:33:11 UTC
Hello Michael,

Sorry for the interruption again.
The upstream developer submitted more patches to fix this bug.
I have created new lvm2 rpms, which include these patches; see comment #27.

Could you help to verify whether the new rpms work in your case?


Thanks
Gang
Comment 29 Michael Hanscho 2018-12-08 23:23:02 UTC
Hi Gang!

Sorry for the delayed answer:
In the meantime the base system changed a little bit - I updated Tumbleweed to:
openSUSE Tumbleweed"
# VERSION="20181129"

In addition I added 2 more HDs and copied the data to the new ones - again a raid1, but of course created with the current configuration.
The "old raid1" is still available, and therefore I think the tests are still valid.

I installed the following packages and disabled the workaround in lvm.conf:
device-mapper-1.02.152-0.x86_64.rpm
libdevmapper1_03-1.02.152-0.x86_64
libdevmapper-event1_03-1.02.152-0.x86_64
liblvm2cmd2_02-2.02.182-0.x86_64
liblvm2app2_2-2.02.182-0.x86_64
lvm2-2.02.182-0.x86_64

It still does not work and ends up in the emergency console, although the error messages are different now.
In the journal (journalctl -xb):

Unit lvm2-pvscan@259:0.service has finished starting up.
[...]
Dec 08 15:27:28 big lvmetad[560]: vg_lookup vgid 6zO4JX-piC2-OaCG-GptZ-MT2F-NH4J-jL6gNj name vghome found incomplete mapping uuid none name none
[...]
Dec 08 15:28:57 big systemd[1]: dev-vghome-data.device: Job dev-vghome-data.device/start timed out.
Dec 08 15:28:57 big systemd[1]: Timed out waiting for device dev-vghome-data.device.
-- Subject: Unit dev-vghome-data.device has failed
[...]

During boot the following can be seen:
"A start job is running for dev-vghome-data.device (counting secs/1min 30s)"

This one fails and the emergency console starts.

What can I provide additionally?

Gruesse
Michael
Comment 30 Michael Hanscho 2018-12-09 16:42:35 UTC
Hi Gang!

Another piece of information:
When ending up in the emergency console it is possible to get the needed LVs online by running "systemctl restart lvm2-pvscan@9:127"...

This is not possible with the lvm2 (and device-mapper) version delivered with Tumbleweed 20181206.
So definitely an improvement...

Gruesse
Michael
Comment 31 Nicolas Morey-Chaisemartin 2019-05-10 15:20:08 UTC
I just encountered something similar after upgrading my NAS from Leap 15.0 to 15.1

It uses a soft raid5 on 5 disks.
It works fine on 15.0.
lvmetad complains about duplicate PVs after the update to 15.1 and the lvm2-pvscan service fails to start, causing the system to go to the emergency console.

I downgraded lvm2 and the liblvm2* packages to the 15.0 version (2.02.177) and it is working again.
Comment 32 Klaus Loehel 2019-05-19 11:10:59 UTC
I encountered something similar after upgrading from Leap 15.0 to 15.1, too, but I use a RAID1. The lvm2-pvscan service fails to start, causing the system to go to the emergency console. Could you please fix this.
Comment 33 Matthew Gibbs 2019-05-28 14:45:49 UTC
I also unfortunately encountered this issue upgrading from 15.0 to 15.1. I have an LVM-on-MD raid1 setup that I created a long time ago on a previous openSUSE version. Maybe this should be added to the Most Annoying Bugs list? One should definitely not upgrade with this configuration until they can determine whether they will be affected. As these are pretty mature and reliable subsystems, I was quite surprised to reboot my server, not have the data stores come up, and end up in an emergency console.

Perhaps a small migration tool to upgrade the array to a newer metadata version, as noted earlier, would help avoid this problem in the future.

Thank you,

Matt
Comment 34 Stefan Schäfer 2019-10-11 06:42:45 UTC
Hi,

Same problem here. After upgrading to Leap 15.1 the system refuses to start, supposedly because of a degraded md device. Booting the rescue system shows that all md devices are ok.

It seems to happen only if there is more than one raid1 md-device. I will verify that.

I also use lvm on top of md raid.

This is very critical!

Stefan
Comment 35 Stefan Schäfer 2019-10-11 06:58:44 UTC
The issue is reproducible. A cleanly reinstalled Leap 15.1 system refuses to start after installation.

The disk setup:

2 1TB disks -> raid1 -> md0
2 2TB disks -> raid1 -> md1

Both md devices are in the same volume group "backup". The group contains 3 logical volumes: "swap", "root" and "srv".

lv "root" is mounted to / and has a btrfs fs.
lv "srv" is mounted to /srv and has a ext4 fs.

Same as before, starting a rescue system shows that everything is ok.

Stefan
Comment 36 Stefan Schäfer 2019-10-11 08:51:00 UTC
Now I tried a step-by-step setup.

Installing the system as mentioned before, but with just 2 disks and the same md and lvm setup: everything is fine.

Next step: adding the next 2 disks and building a raid1 device from partitions "sdc1" and "sdd1".

Reboot works.

Turning md1 into an LVM PV.

Reboot works.

Adding the new pv to the existing vg.

Reboot ends up in an emergency console.

Stefan
Comment 37 heming zhao 2019-10-11 09:11:31 UTC
(In reply to Stefan Schäfer from comment #36)
Hello Stefan Schäfer,

Thank you for your information.
Could you show your steps as commands?
I'm having trouble understanding your statements (both comment #35 & #36).

Thank you.  

> Now i tried a step by step setup.
> 
> Installing the system as mentioned before, but with just 2 disks and the
> same md and lvm setup. Everything ist fine.
> 
> Next step, adding the next 2 disks and build a raid1 device with partitions
> "sdc1" and "sdd1".
> 
> Reboot works.
> 
> Turning md1 into a lvm pv.
> 
> Reboot works.
> 
> Adding the new pv to the existing vg.
> 
> Reboot ends up in an emergency console.
> 
> Stefan
Comment 38 Stefan Schäfer 2019-10-11 10:45:28 UTC
I did the initial setup with YaST and the YaST partitioner.

Two disks with two partitions (sda1/sdb1 = 8 MB, type "BIOS boot partition", then the whole remaining space in sda2/sdb2, type Linux RAID).

Then, with YaST, create a raid1 device md0, which is the only member of an LVM volume group called "backup" (in my case). In this VG, create three volumes "swap", "root" and "srv", as described before (a command-line sketch of this setup follows below).
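
Since the steps were requested in command style in comment #37, here is a rough command-line equivalent of this YaST setup. The device names, sizes and filesystem choices below are assumptions, not the exact values YaST used:

```
# initial setup: raid1 from sda2/sdb2, one VG "backup" with three LVs
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
pvcreate /dev/md0
vgcreate backup /dev/md0
lvcreate -L 2G  -n swap backup
lvcreate -L 40G -n root backup
lvcreate -l 100%FREE -n srv backup
mkswap     /dev/backup/swap
mkfs.btrfs /dev/backup/root
mkfs.ext4  /dev/backup/srv
```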

After the first reboot, add the next two disks, create one primary partition of type Linux RAID on each, and create a second raid1 device:

mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1

Create a physical volume from this new md-Device:

pvcreate /dev/md1

Add this new pv to the existing Volume Group:

vgextend backup /dev/md1

Reboot fails after the last step.

The good news: It works with the packages mentioned in comment #27

Stefan
Comment 39 heming zhao 2019-10-12 07:29:21 UTC
Hello Stefan Schäfer,

Regarding your comment #38: which lvm2 version do you use?

The comment #27 fix is contained in the latest Tumbleweed lvm2 (version 2.03.05).
If you use an older Tumbleweed lvm2, you can redo the test with the latest Tumbleweed; this bug should vanish.
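
A quick way to check the installed versions (the same kind of query used in comment #7):

```
rpm -q lvm2 device-mapper
```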
Comment 40 Stefan Schäfer 2019-10-12 08:07:00 UTC
Hi,

Sorry, I know that this bug report is about Tumbleweed, but we are using Leap 15.1. We have the bug there too, which means the fixed packages should also be released for Leap 15.1.

The Leap 15.1 update repository ships lvm2 version 2.02.180-lp151.4.3.1.

Stefan
Comment 41 heming zhao 2019-10-14 06:52:57 UTC
Bugs 1099391, 1099329 & 1136641 are the same bug.
They need the patches below (from the lvm2 stable-2.02 branch);
these patches have already been merged in Tumbleweed (lvm2-2.03.05).

I will push these patches into SLES 15 SP1.
Patches:
```
commit a188b1e513ed5ca0f5f3702c823490f5610d4495
Author: David Teigland <teigland@redhat.com>
Date:   Fri Nov 30 16:32:32 2018 -0600

    pvscan lvmetad: use udev info to improve md component detection


commit a01e1fec0fe7c2fa61577c0e636e907cde7279ea
Author: David Teigland <teigland@redhat.com>
Date:   Thu Nov 29 14:06:20 2018 -0600

    pvscan lvmetad: use full md filter when md 1.0 devices are present


commit 0e42ebd6d4012d210084a9ccf8d76f853726de3c
Author: Peter Rajnoha <prajnoha@redhat.com>
Date:   Thu Nov 29 11:51:05 2018 -0600

    scan: md metadata version 0.90 is at the end of disk


commit e7bb50880901a4462e350ce0d272a63aa8440781
Author: David Teigland <teigland@redhat.com>
Date:   Thu Oct 18 11:32:32 2018 -0500

    scan: enable full md filter when md 1.0 devices are present


commit de2863739f2ea17d89d0e442379109f967b5919d
Author: David Teigland <teigland@redhat.com>
Date:   Fri Jun 15 11:42:10 2018 -0500

    scan: use full md filter when md 1.0 devices are present


commit c527a0cbfc391645d30407d2dc4a30275c6472f1
Author: David Teigland <teigland@redhat.com>
Date:   Mon Aug 27 11:15:35 2018 -0500

    lvmetad: improve scan for pvscan all
```
Comment 42 heming zhao 2019-10-14 06:57:50 UTC
Bugs 1099391, 1099329 & 1136641 are the same as bug 1145231.
Comment 43 heming zhao 2019-10-14 07:26:02 UTC
Hello Stefan Schäfer,

Could you test the rpm packages from my home project:
https://build.suse.de/package/show/home:hmzhao:branches:SUSE:SLE-15-SP1:Update/lvm2
Comment 44 Stefan Schäfer 2019-10-14 08:02:46 UTC
Hello Heming,

I cannot access your link. The host build.suse.de seems to be unresolvable.

Stefan
Comment 45 heming zhao 2019-10-14 08:14:39 UTC
Created attachment 821377 [details]
rpms-for-comment43

I downloaded them for you. Please test them.
Comment 46 Stefan Schäfer 2019-10-14 17:54:40 UTC
Looks good. I'm not able to reproduce the problem with your packages installed on my test system.

Stefan
Comment 47 heming zhao 2019-10-15 11:00:01 UTC
Thank you for your reply; please wait for the code to be pushed into SLES 15 SP1.
Comment 48 Stefan Schäfer 2019-11-16 18:19:30 UTC
The fixed packages haven't been pushed to Leap 15.1 yet. What's the problem?

Stefan
Comment 49 heming zhao 2019-11-18 02:56:00 UTC
Thank you for your information.
Leap 15.1 is generated by automatically backporting from SUSE SLES 15 SP1. I pushed the fixed code to 15 SP1 about 20 days ago, but the fix request is still in the test phase.
I can only ask you to please be patient.

I am not sure whether you can see my private project. You can download the fixed packages from it:
https://build.opensuse.org/package/show/home:hmzhao:branches:openSUSE:Leap:15.1:Update/lvm2
The patch files related to this bug are named bug-1145231_xxx.patch.
Comment 50 heming zhao 2020-01-06 02:38:10 UTC
Closing; the fix has been merged into Leap 15.1.