SSE instructions: which CPUs can do atomic 16B memory operations?

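Consider a single-memory-access SSE instruction (a single read or a single write, not read+write) on an x86 CPU. The instruction accesses 16 bytes (128 bits) of memory, and the accessed memory location is aligned to 16 bytes.

The document "Intel® 64 Architecture Memory Ordering White Paper" states that for "Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary" the memory operation appears to execute as a single memory access regardless of memory type.

The question: Do there exist Intel/AMD/etc x86 CPUs which guarantee that reading or writing 16 bytes (128 bits) aligned to a 16 byte boundary executes as a single memory access? If so, which particular type of CPU is it (Core2 / Atom / K8 / Phenom / ...)? If you provide an answer (yes/no) to this question, please also specify the method that was used to determine the answer - PDF document lookup, brute force testing, math proof, or whatever other method you used.

This question relates to problems such as http://research.swtch.com/2010/02/off-to-races.html

UPDATE:

I created a simple test program in C that you can run on your computers. Please compile and run it on your Phenom, Athlon, Bobcat, Core2, Atom, Sandy Bridge or whatever SSE2-capable CPU you happen to have. Thanks.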

// Compile with:
//   gcc -o a a.c -pthread -msse2 -std=c99 -Wall -O2
//
// Make sure you have at least two physical CPU cores or hyper-threading.

#include <pthread.h>
#include <emmintrin.h>
#include <stdio.h>
#include <stdlib.h>   // abort()
#include <stdint.h>
#include <string.h>

typedef int v4si __attribute__ ((vector_size (16)));
volatile v4si x;

unsigned n1[16] __attribute__((aligned(64)));
unsigned n2[16] __attribute__((aligned(64)));

void* thread1(void *arg) {
        for (int i=0; i<100*1000*1000; i++) {
                int mask = _mm_movemask_ps((__m128)x);
                n1[mask]++;

                x = (v4si){0,0,0,0};
        }
        return NULL;
}

void* thread2(void *arg) {
        for (int i=0; i<100*1000*1000; i++) {
                int mask = _mm_movemask_ps((__m128)x);
                n2[mask]++;

                x = (v4si){-1,-1,-1,-1};
        }
        return NULL;
}

int main() {
        // Check memory alignment
        if ( (((uintptr_t)&x) & 0x0f) != 0 )
                abort();

        memset(n1, 0, sizeof(n1));
        memset(n2, 0, sizeof(n2));

        pthread_t t1, t2;
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_create(&t2, NULL, thread2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        for (unsigned i=0; i<16; i++) {
                for (int j=3; j>=0; j--)
                        printf("%d", (i>>j)&1);

                printf("  %10u %10u", n1[i], n2[i]);
                if(i>0 && i<0x0f) {
                        if(n1[i] || n2[i])
                                printf("  Not a single memory access!");
                }

                printf("\n");
        }

        return 0;
}

The CPU in my notebook is a Core Duo (not Core2). This particular CPU fails the test: it implements 16-byte memory reads/writes with a granularity of 8 bytes. The output is:

0000    96905702      10512
0001           0          0
0010           0          0
0011          22      12924  Not a single memory access!
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100     3092557       1175  Not a single memory access!
1101           0          0
1110           0          0
1111        1719   99975389
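To read the table: each row is the 4-bit result of _mm_movemask_ps, one bit per 32-bit lane of x. Since the two threads only ever store all-zeros or all-ones, any row other than 0000 or 1111 is a torn access. A small sketch of that classification follows; the helper names are hypothetical and not part of the test program.

```c
/* mask is the 4-bit value from _mm_movemask_ps:
   bit k is the sign bit of 32-bit lane k of the 16-byte vector. */

/* A torn access mixes lanes from the all-zeros and all-ones stores. */
static int is_torn(int mask)
{
    return mask != 0x0 && mask != 0xF;
}

/* 0011 and 1100 mean the low and high 8-byte halves disagree,
   i.e. the 16-byte access was split at the 8-byte boundary. */
static int torn_on_8byte_halves(int mask)
{
    return mask == 0x3 || mask == 0xC;
}
```

In the Core Duo output above, the only non-zero torn rows are 0011 and 1100, which is what an 8-byte access granularity would produce.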

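6 Answers

The Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3A, which nowadays contains the specifications of the memory ordering white paper you mention, says in section 8.2.3.1, as you note yourself, that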

The Intel-64 memory ordering model guarantees that, for each of the following 
memory-access instructions, the constituent memory operation appears to execute 
as a single memory access:

• Instructions that read or write a single byte.
• Instructions that read or write a word (2 bytes) whose address is aligned on a 2
byte boundary.
• Instructions that read or write a doubleword (4 bytes) whose address is aligned
on a 4 byte boundary.
• Instructions that read or write a quadword (8 bytes) whose address is aligned on
an 8 byte boundary.

Any locked instruction (either the XCHG instruction or another read-modify-write
 instruction with a LOCK prefix) appears to execute as an indivisible and 
uninterruptible sequence of load(s) followed by store(s) regardless of alignment.

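Now, since the above list does NOT contain the same language for a double quadword (16 bytes), it follows that the architecture does NOT guarantee that instructions which access 16 bytes of memory are atomic.

That being said, the last paragraph does hint at a way out, namely the CMPXCHG16B instruction with the LOCK prefix. You can use the CPUID instruction to figure out whether your processor supports CMPXCHG16B (the "CX16" feature bit).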
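A minimal sketch of the CPUID check, assuming GCC or Clang on x86 (the <cpuid.h> helper header is compiler-specific, and the function name has_cmpxchg16b is made up for illustration). CX16 is reported in bit 13 of ECX for CPUID leaf 1.

```c
#include <cpuid.h>   /* GCC/Clang helper for the CPUID instruction */

/* Returns 1 if CPUID leaf 1 reports CMPXCHG16B support
   (the "CX16" feature bit, ECX bit 13), otherwise 0. */
static int has_cmpxchg16b(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                 /* CPUID leaf 1 not available */

    return (int)((ecx >> 13) & 1u);
}
```

A program would call has_cmpxchg16b() once at startup and fall back to a lock if it returns 0.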

In the corresponding AMD documentation, AMD64 Technology AMD64 Architecture Programmer's Manual Volume 2: System Programming, I can't find similarly clear language.

EDIT: Test program results

(Modified the test program to increase the number of iterations by a factor of 10)

On a Xeon X3450 (x86-64):

0000   999998139       1572
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          0
1101           0          0
1110           0          0
1111        1861  999998428

On a Xeon 5150 (32-bit):

0000   999243100     283087
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          0
1101           0          0
1110           0          0
1111      756900  999716913

On an Opteron 2435 (x86-64):

0000   999995893       1901
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          0
1101           0          0
1110           0          0
1111        4107  999998099

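Does this mean that Intel and/or AMD guarantee that 16-byte memory accesses are atomic on these machines? IMHO, it does not. It is not documented as guaranteed architectural behavior, so one cannot know whether 16-byte memory accesses really are atomic on these particular processors or whether the test program merely fails to trigger a torn access for one reason or another. Relying on it is therefore dangerous.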

EDIT 2: How to make the test program fail

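Ha! I managed to make the test program fail. On the same Opteron 2435 as above, with the same binary, but now running it via the "numactl" tool and specifying that each thread runs on a separate socket, I got: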

0000   999998634       5990
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          1  Not a single memory access!
1101           0          0
1110           0          0
1111        1366  999994009

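So what does this imply? Well, the Opteron 2435 may or may not guarantee that 16-byte memory accesses are atomic for intra-socket accesses, but at least the cache coherency protocol running on the HyperTransport interconnect between the two sockets does not provide such a guarantee.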
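The placement can also be forced from inside the program instead of through numactl. The following is only a sketch, assuming Linux with glibc (pthread_setaffinity_np is a GNU extension); the helper name pin_self_to_cpu is hypothetical, and which logical CPU numbers sit on which socket is machine-specific, so check the topology (for example with lscpu) before picking them.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* for cpu_set_t and pthread_setaffinity_np */
#endif
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one logical CPU.
   Returns 0 on success or an errno-style code on failure. */
static int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

In the test program, thread1 and thread2 would each call this first thing, with CPU numbers chosen so that the two threads land on different sockets.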

EDIT 3: ASM for the thread functions, on request of "GJ."

This is the asm generated for the thread functions by the GCC 4.4 x86-64 version used on the Opteron 2435 system:

.globl thread2
        .type   thread2, @function
thread2:
.LFB537:
        .cfi_startproc
        movdqa  .LC3(%rip), %xmm1
        xorl    %eax, %eax
        .p2align 5,,24
        .p2align 3
.L11:
        movaps  x(%rip), %xmm0
        incl    %eax
        movaps  %xmm1, x(%rip)
        movmskps        %xmm0, %edx
        movslq  %edx, %rdx
        incl    n2(,%rdx,4)
        cmpl    $1000000000, %eax
        jne     .L11
        xorl    %eax, %eax
        ret
        .cfi_endproc
.LFE537:
        .size   thread2, .-thread2
        .p2align 5,,31
.globl thread1
        .type   thread1, @function
thread1:
.LFB536:
        .cfi_startproc
        pxor    %xmm1, %xmm1
        xorl    %eax, %eax
        .p2align 5,,24
        .p2align 3
.L15:
        movaps  x(%rip), %xmm0
        incl    %eax
        movaps  %xmm1, x(%rip)
        movmskps        %xmm0, %edx
        movslq  %edx, %rdx
        incl    n1(,%rdx,4)
        cmpl    $1000000000, %eax
        jne     .L15
        xorl    %eax, %eax
        ret
        .cfi_endproc

For completeness, .LC3 is the static data containing the (-1, -1, -1, -1) vector used by thread2:

.LC3:
        .long   -1
        .long   -1
        .long   -1
        .long   -1
        .ident  "GCC: (GNU) 4.4.4 20100726 (Red Hat 4.4.4-13)"
        .section        .note.GNU-stack,"",@progbits

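Also note that this is AT&T ASM syntax, not the Intel syntax Windows programmers might be more familiar with. Finally, this is with -march=native, which makes GCC prefer MOVAPS; but it doesn't matter: if I use -march=core2 it uses MOVDQA for the store to x, and I can still reproduce the failures.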
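If you want the 16-byte load and store to be explicit in the C source rather than left to vector-type assignments, the SSE2 intrinsics can be used. This is a sketch with made-up helper names; it only asks the compiler for one aligned 128-bit load or store and, as discussed above, does not make the access architecturally atomic. Because the pointers are not volatile, a racy test loop would also need a compiler barrier or volatile access so the load is not hoisted out of the loop.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* One aligned 128-bit load (MOVDQA/MOVAPS), reduced to the 4-bit sign mask. */
static inline int load_mask(const __m128i *p)
{
    __m128i v = _mm_load_si128(p);
    return _mm_movemask_ps(_mm_castsi128_ps(v));
}

/* One aligned 128-bit store of the same 32-bit value in all four lanes. */
static inline void store_all(__m128i *p, int value)
{
    _mm_store_si128(p, _mm_set1_epi32(value));
}
```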

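Answered 2024-05-17T15:43:51+08:00

2

"AMD Architecture Programmer's Manual Volume 1: Application Programming" says in section 3.9.1: "CMPXCHG16B can be used to perform 16-byte atomic accesses in 64-bit mode (with certain alignment restrictions)."

However, there is no such comment about SSE instructions. In fact, there is a comment in 4.8.3 that the LOCK prefix "causes an invalid-opcode exception when used with 128-bit media instructions". It therefore seems pretty conclusive to me that AMD processors do NOT guarantee atomic 128-bit accesses for SSE instructions, and that the only way to do an atomic 128-bit access is to use CMPXCHG16B.

"Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3A: System Programming Guide, Part 1" says in 8.1.1: "An x87 instruction or an SSE instruction that accesses data larger than a quadword may be implemented using multiple memory accesses." This is pretty conclusive that 128-bit SSE instructions are not guaranteed to be atomic by the ISA. Volume 2A of the Intel docs says of CMPXCHG16B: "This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically."

Further, CPU manufacturers haven't published written guarantees of atomic 128b SSE operations for specific CPU models where that happens to be the case.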
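For illustration, here is a hedged sketch of 16-byte atomic load and store built on LOCK CMPXCHG16B via GCC/Clang inline asm. It assumes x86-64, a CPU with the CX16 feature, and a 16-byte-aligned, writable target; the type and function names (u128_t, cas16, load16, store16) are invented for this example and do not come from either manual.

```c
#include <stdint.h>

/* 16-byte value as two 64-bit halves; the type is 16-byte aligned. */
typedef struct { uint64_t lo, hi; } __attribute__((aligned(16))) u128_t;

/* LOCK CMPXCHG16B: if *p equals *expected, store desired and return 1;
   otherwise copy the current value of *p into *expected and return 0. */
static int cas16(volatile u128_t *p, u128_t *expected, u128_t desired)
{
    unsigned char ok;

    __asm__ __volatile__(
        "lock cmpxchg16b %1\n\t"
        "sete %0"
        : "=q"(ok), "+m"(*p), "+a"(expected->lo), "+d"(expected->hi)
        : "b"(desired.lo), "c"(desired.hi)
        : "cc", "memory");
    return ok;
}

/* Atomic load: compare against 0 and swap in 0. If *p was 0 it is
   rewritten with 0; otherwise the current value is returned in v. */
static u128_t load16(volatile u128_t *p)
{
    u128_t v = {0, 0};

    cas16(p, &v, v);
    return v;
}

/* Atomic store: retry until the compare-and-swap succeeds. */
static void store16(volatile u128_t *p, u128_t desired)
{
    u128_t cur = {0, 0};

    while (!cas16(p, &cur, desired)) {
        /* cur now holds the latest value; try again */
    }
}
```

Note the cost of this design: even the "load" is a locked read-modify-write, so it needs writable memory and is much slower than a plain MOVDQA.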

Answered 2024-05-17T15:43:51+08:00
3

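The Intel Architecture Manual Vol. 3A actually has a warning. Section 8.1.1 (May 2011), under the section on guaranteed atomic operations:

An x87 instruction or an SSE instruction that accesses data larger than a quadword may be implemented using multiple memory accesses. If such an instruction stores to memory, some of the accesses may complete (writing to memory) while another causes the operation to fault for architectural reasons (e.g. due to a page-table entry that is marked "not present"). In this case, the effects of the completed accesses may be visible to software even though the overall instruction caused a fault. If TLB invalidation has been delayed (see Section 4.10.4.4), such page faults may occur even if all accesses are to the same page.

Thus SSE instructions are not guaranteed to be atomic, even if the underlying architecture does use a single memory access (this is one reason why memory fencing was introduced).

Combine that with this statement from the Intel Optimization Manual, Section 13.3 (April 2011)

AVX and FMA instructions do not introduce any new guaranteed atomic memory operations.

and the fact that none of the SIMD load or store operations guarantee atomicity, and we can conclude that Intel doesn't support any form of atomic SIMD (yet).

As an extra bit, if the memory is split along cache lines or page boundaries (when using things like movdqu, which allow unaligned access), the following processors will not perform atomic accesses, regardless of alignment, but later processors will (again from the Intel Architecture Manual):

Intel Core 2 Duo, Intel® Atom™, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors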

Answered 2024-05-17T15:43:51+08:00
32

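The x86 ISA doesn't guarantee atomicity for anything larger than 8B, so implementations are free to implement SSE / AVX support the way Pentium III / Pentium M / Core Duo do: internally, data is handled in 64-bit halves. A 128-bit store is done as two 64-bit stores. The data path to/from cache is only 64b wide in the Yonah microarchitecture (Core Duo). (Source: Agner Fog's microarch doc.)

More recent implementations do have wider data paths internally, and handle 128b instructions as a single op. Core 2 Duo (Conroe/Merom) was the first Intel P6-descended microarchitecture with 128b data paths. (IDK about P4, but fortunately it's old enough to be totally irrelevant.)

This is why the OP finds that 128b ops are not atomic on Intel Core Duo (Yonah), while other posters find that they are atomic on later Intel designs, starting with Core 2 (Merom).

The diagrams in the Realworldtech writeup about Merom vs. Yonah show the 128-bit path between ALU and L1 data cache in Merom (and P4), while the low-power Yonah has a 64-bit data path. The data path between L1 and L2 cache is 256b in all 3 designs.

The next jump in data path width came with Intel's Haswell, featuring 256b (32B) AVX/AVX2 loads/stores, and a 64-byte path between L1 and L2 cache. I expect that 256b loads/stores are atomic in Haswell, Broadwell, and Skylake, but I don't have one to test. I forget if Skylake again widened the paths in preparation for AVX512 in Skylake-EP (the server version), or if perhaps the initial implementation of AVX512 will be like SnB/IvB's AVX, and have 512b loads/stores occupy a load/store port for 2 cycles.

As janneb points out in his excellent experimental answer, the cache-coherency protocol between sockets in a multi-socket system might be narrower than what you get within a CPU that shares a last-level cache. There is no architectural requirement on atomicity for wide loads/stores, so designers are free to make them atomic within a socket but non-atomic across sockets if that's convenient. IDK how wide the inter-socket logical data path is for AMD's Bulldozer family, or for Intel. (I say "logical" because even if the data is transferred in smaller chunks, it might not modify a cache line until it is fully received.)

Finding similar articles about AMD CPUs should allow drawing reasonable conclusions about whether 128b ops are atomic or not. Just checking instruction tables is some help:

K8 decodes movaps reg, [mem] to 2 m-ops, while K10 and Bulldozer-family decode it to 1 m-op. AMD's low-power Bobcat decodes it to 2 ops, while Jaguar decodes 128b movaps to 1 m-op. (Jaguar supports AVX1 similarly to Bulldozer-family CPUs: 256b insns (even ALU ops) are split into two 128b ops. Intel SnB only splits 256b loads/stores, while having full-width ALUs.)

janneb's Opteron 2435 is a 6-core Istanbul CPU, which is part of the K10 family, so this single-m-op -> atomic conclusion appears accurate within a single socket.

Intel Silvermont does 128b loads/stores with a single uop, and a throughput of one per clock. This is the same as for integer loads/stores, so it's quite likely atomic.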

Answered 2024-05-17T15:43:51+08:00

-1

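EDIT: Over the last two days I ran several tests on my three PCs and did not see any memory error, so I can't say anything more precise. Maybe this memory error also depends on the OS.

EDIT: I program in Delphi rather than C, but I should be able to understand C. So I have translated the code; here are the thread procedures, with the main part written in assembler: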

procedure TThread1.Execute;
var
  n             :cardinal;
const
  ConstAll0     :array[0..3] of integer =(0,0,0,0);
begin
  for n := 0 to 100000000 do
    asm
      movdqa    xmm0, dqword [x]
      movmskps  eax, xmm0
      inc       dword ptr[n1 + eax *4]
      movdqu    xmm0, dqword [ConstAll0]
      movdqa    dqword [x], xmm0
    end;
end;

{ TThread2 }

procedure TThread2.Execute;
var
  n             :cardinal;
const
  ConstAll1     :array[0..3] of integer =(-1,-1,-1,-1);
begin
  for n := 0 to 100000000 do
    asm
      movdqa    xmm0, dqword [x]
      movmskps  eax, xmm0
      inc       dword ptr[n2 + eax *4]
      movdqu    xmm0, dqword [ConstAll1]
      movdqa    dqword [x], xmm0
    end;
end;

Result: no mistake on my quad core PC and no mistake on my dual core PC, as expected!

PC with Intel Pentium4 CPU
PC with Intel Core2 Quad CPU Q6600
PC with Intel Core2 Duo CPU P8400

Can you show how the debugger sees your thread procedure code? Please...

Answered 2024-05-17T15:43:51+08:00

0

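A lot of answers have been posted so far, so a lot of information is already available (and, as a side effect, a lot of confusion too). I would like to put down the facts from the Intel manual regarding hardware-guaranteed atomic operations...

In Intel's latest processors of the Nehalem and Sandy Bridge families, reading or writing a quadword aligned on a 64-bit boundary is guaranteed to be atomic.

Even unaligned 2, 4 or 8 byte reads or writes are guaranteed to be atomic, provided they are to cached memory and fit within a cache line.

Having said that, the test posted in this question passes on a Sandy Bridge based Intel i5 processor.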
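The "fits within a cache line" condition can be checked from the address and size alone. A small sketch, assuming 64-byte cache lines (typical for these processors, but an assumption here) and using a hypothetical helper name:

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u   /* assumption: 64-byte cache lines */

/* Returns 1 if the `size` bytes starting at `addr` lie entirely inside
   one cache line, 0 if they straddle a line boundary (or size is 0). */
static int fits_in_one_cache_line(const void *addr, size_t size)
{
    uintptr_t first = (uintptr_t)addr;
    uintptr_t last  = first + size - 1;

    return size != 0 &&
           (first / CACHE_LINE_SIZE) == (last / CACHE_LINE_SIZE);
}
```

For example, an 8-byte access at offset 56 of a line fits, while the same access at offset 60 crosses into the next line and falls outside the guarantee quoted above.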

Answered 2024-05-17T15:43:51+08:00
