llvm.org GIT mirror llvm / a813b38
Author: Sander de Smalen

Change semantics of fadd/fmul vector reductions.

This patch changes how LLVM handles the accumulator/start value in the reduction, by never ignoring it regardless of the presence of fast-math flags on callsites. This change introduces the following new intrinsics to replace the existing ones:

  llvm.experimental.vector.reduce.fadd -> llvm.experimental.vector.reduce.v2.fadd
  llvm.experimental.vector.reduce.fmul -> llvm.experimental.vector.reduce.v2.fmul

and adds functionality to auto-upgrade existing LLVM IR and bitcode.

Reviewers: RKSimon, greened, dmgreen, nikic, simoll, aemerson
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D60261

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@363035 91177308-0d34-0410-b5e6-96231b3b80d8
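To illustrate the new contract (a sketch based on the auto-upgrade tests in this patch, not additional functionality): a call that carried the 'fast' flag previously ignored its accumulator, so it is upgraded with the neutral element as the start value, while a call without fast-math flags keeps its accumulator.

  ; before upgrade
  %fast = call fast float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float undef, <4 x float> %v)
  %ord  = call float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float %acc, <4 x float> %v)

  ; after upgrade: the 'fast' call gets the neutral start value, the ordered call keeps %acc
  %fast = call fast float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float 0.000000e+00, <4 x float> %v)
  %ord  = call float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %acc, <4 x float> %v)

The fmul upgrade is analogous, with 1.0 as the neutral start value.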
19 changed file(s) with 812 addition(s) and 627 deletion(s).
1373213732 """"""""""
1373313733 The argument to this intrinsic must be a vector of integer values.
1373413734
13735 '``llvm.experimental.vector.reduce.fadd.*``' Intrinsic
13736 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
13737
13738 Syntax:
13739 """""""
13740
13741 ::
13742
13743 declare float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float %acc, <4 x float> %a)
13744 declare double @llvm.experimental.vector.reduce.fadd.f64.v2f64(double %acc, <2 x double> %a)
13745
13746 Overview:
13747 """""""""
13748
13749 The '``llvm.experimental.vector.reduce.fadd.*``' intrinsics do a floating-point
13735 '``llvm.experimental.vector.reduce.v2.fadd.*``' Intrinsic
13736 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
13737
13738 Syntax:
13739 """""""
13740
13741 ::
13742
13743 declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start_value, <4 x float> %a)
13744 declare double @llvm.experimental.vector.reduce.v2.fadd.f64.v2f64(double %start_value, <2 x double> %a)
13745
13746 Overview:
13747 """""""""
13748
13749 The '``llvm.experimental.vector.reduce.v2.fadd.*``' intrinsics do a floating-point
1375013750 ``ADD`` reduction of a vector, returning the result as a scalar. The return type
1375113751 matches the element-type of the vector input.
1375213752
13753 If the intrinsic call has fast-math flags, then the reduction will not preserve
13754 the associativity of an equivalent scalarized counterpart. If it does not have
13755 fast-math flags, then the reduction will be *ordered*, implying that the
13756 operation respects the associativity of a scalarized reduction.
13757
13758
13759 Arguments:
13760 """"""""""
13761 The first argument to this intrinsic is a scalar accumulator value, which is
13762 only used when there are no fast-math flags attached. This argument may be undef
13763 when fast-math flags are used. The type of the accumulator matches the
13764 element-type of the vector input.
13765
13753 If the intrinsic call has the 'reassoc' or 'fast' flags set, then the
13754 reduction will not preserve the associativity of an equivalent scalarized
13755 counterpart. Otherwise the reduction will be *ordered*, thus implying that
13756 the operation respects the associativity of a scalarized reduction.
13757
13758
13759 Arguments:
13760 """"""""""
13761 The first argument to this intrinsic is a scalar start value for the reduction.
13762 The type of the start value matches the element-type of the vector input.
1376613763 The second argument must be a vector of floating-point values.
1376713764
1376813765 Examples:
1377013767
1377113768 .. code-block:: llvm
1377213769
13773 %fast = call fast float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float undef, <4 x float> %input) ; fast reduction
13774 %ord = call float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float %acc, <4 x float> %input) ; ordered reduction
13770 %unord = call reassoc float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float 0.0, <4 x float> %input) ; unordered reduction
13771 %ord = call float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start_value, <4 x float> %input) ; ordered reduction
1377513772
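As an illustrative, non-normative expansion of the *ordered* form above (assuming a ``<4 x float>`` input), the reduction is equivalent to the following left-to-right scalarized sequence:

.. code-block:: llvm

      %e0 = extractelement <4 x float> %input, i32 0
      %a0 = fadd float %start_value, %e0
      %e1 = extractelement <4 x float> %input, i32 1
      %a1 = fadd float %a0, %e1
      %e2 = extractelement <4 x float> %input, i32 2
      %a2 = fadd float %a1, %e2
      %e3 = extractelement <4 x float> %input, i32 3
      %res = fadd float %a2, %e3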
1377613773
1377713774 '``llvm.experimental.vector.reduce.mul.*``' Intrinsic
1379613793 """"""""""
1379713794 The argument to this intrinsic must be a vector of integer values.
1379813795
13799 '``llvm.experimental.vector.reduce.fmul.*``' Intrinsic
13800 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
13801
13802 Syntax:
13803 """""""
13804
13805 ::
13806
13807 declare float @llvm.experimental.vector.reduce.fmul.f32.v4f32(float %acc, <4 x float> %a)
13808 declare double @llvm.experimental.vector.reduce.fmul.f64.v2f64(double %acc, <2 x double> %a)
13809
13810 Overview:
13811 """""""""
13812
13813 The '``llvm.experimental.vector.reduce.fmul.*``' intrinsics do a floating-point
13796 '``llvm.experimental.vector.reduce.v2.fmul.*``' Intrinsic
13797 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
13798
13799 Syntax:
13800 """""""
13801
13802 ::
13803
13804 declare float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float %start_value, <4 x float> %a)
13805 declare double @llvm.experimental.vector.reduce.v2.fmul.f64.v2f64(double %start_value, <2 x double> %a)
13806
13807 Overview:
13808 """""""""
13809
13810 The '``llvm.experimental.vector.reduce.v2.fmul.*``' intrinsics do a floating-point
1381413811 ``MUL`` reduction of a vector, returning the result as a scalar. The return type
1381513812 matches the element-type of the vector input.
1381613813
13817 If the intrinsic call has fast-math flags, then the reduction will not preserve
13818 the associativity of an equivalent scalarized counterpart. If it does not have
13819 fast-math flags, then the reduction will be *ordered*, implying that the
13820 operation respects the associativity of a scalarized reduction.
13821
13822
13823 Arguments:
13824 """"""""""
13825 The first argument to this intrinsic is a scalar accumulator value, which is
13826 only used when there are no fast-math flags attached. This argument may be undef
13827 when fast-math flags are used. The type of the accumulator matches the
13828 element-type of the vector input.
13829
13814 If the intrinsic call has the 'reassoc' or 'fast' flags set, then the
13815 reduction will not preserve the associativity of an equivalent scalarized
13816 counterpart. Otherwise the reduction will be *ordered*, thus implying that
13817 the operation respects the associativity of a scalarized reduction.
13818
13819
13820 Arguments:
13821 """"""""""
13822 The first argument to this intrinsic is a scalar start value for the reduction.
13823 The type of the start value matches the element-type of the vector input.
1383013824 The second argument must be a vector of floating-point values.
1383113825
1383213826 Examples:
1383413828
1383513829 .. code-block:: llvm
1383613830
13837 %fast = call fast float @llvm.experimental.vector.reduce.fmul.f32.v4f32(float undef, <4 x float> %input) ; fast reduction
13838 %ord = call float @llvm.experimental.vector.reduce.fmul.f32.v4f32(float %acc, <4 x float> %input) ; ordered reduction
13831 %unord = call reassoc float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float 1.0, <4 x float> %input) ; unordered reduction
13832 %ord = call float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float %start_value, <4 x float> %input) ; ordered reduction
1383913833
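For illustration only (one possible lowering, not mandated by the semantics), a 'reassoc' reduction of a ``<4 x float>`` input may be reassociated into a tree, with the start value multiplied in once at the end:

.. code-block:: llvm

      %e0 = extractelement <4 x float> %input, i32 0
      %e1 = extractelement <4 x float> %input, i32 1
      %e2 = extractelement <4 x float> %input, i32 2
      %e3 = extractelement <4 x float> %input, i32 3
      %m0 = fmul reassoc float %e0, %e2
      %m1 = fmul reassoc float %e1, %e3
      %m2 = fmul reassoc float %m0, %m1
      %res = fmul reassoc float %start_value, %m2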
1384013834 '``llvm.experimental.vector.reduce.and.*``' Intrinsic
1384113835 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
10691069 case Intrinsic::experimental_vector_reduce_and:
10701070 case Intrinsic::experimental_vector_reduce_or:
10711071 case Intrinsic::experimental_vector_reduce_xor:
1072 case Intrinsic::experimental_vector_reduce_fadd:
1073 case Intrinsic::experimental_vector_reduce_fmul:
1072 case Intrinsic::experimental_vector_reduce_v2_fadd:
1073 case Intrinsic::experimental_vector_reduce_v2_fmul:
10741074 case Intrinsic::experimental_vector_reduce_smax:
10751075 case Intrinsic::experimental_vector_reduce_smin:
10761076 case Intrinsic::experimental_vector_reduce_fmax:
12601260 case Intrinsic::experimental_vector_reduce_xor:
12611261 return ConcreteTTI->getArithmeticReductionCost(Instruction::Xor, Tys[0],
12621262 /*IsPairwiseForm=*/false);
1263 case Intrinsic::experimental_vector_reduce_fadd:
1264 return ConcreteTTI->getArithmeticReductionCost(Instruction::FAdd, Tys[0],
1265 /*IsPairwiseForm=*/false);
1266 case Intrinsic::experimental_vector_reduce_fmul:
1267 return ConcreteTTI->getArithmeticReductionCost(Instruction::FMul, Tys[0],
1268 /*IsPairwiseForm=*/false);
1263 case Intrinsic::experimental_vector_reduce_v2_fadd:
1264 return ConcreteTTI->getArithmeticReductionCost(
1265 Instruction::FAdd, Tys[0],
1266 /*IsPairwiseForm=*/false); // FIXME: Add new flag for cost of strict
1267 // reductions.
1268 case Intrinsic::experimental_vector_reduce_v2_fmul:
1269 return ConcreteTTI->getArithmeticReductionCost(
1270 Instruction::FMul, Tys[0],
1271 /*IsPairwiseForm=*/false); // FIXME: Add new flag for cost of strict
1272 // reductions.
12691273 case Intrinsic::experimental_vector_reduce_smax:
12701274 case Intrinsic::experimental_vector_reduce_smin:
12711275 case Intrinsic::experimental_vector_reduce_fmax:
11391139
11401140 //===------------------------ Reduction Intrinsics ------------------------===//
11411141 //
1142 def int_experimental_vector_reduce_fadd : Intrinsic<[llvm_anyfloat_ty],
1143 [LLVMMatchType<0>,
1144 llvm_anyvector_ty],
1145 [IntrNoMem]>;
1146 def int_experimental_vector_reduce_fmul : Intrinsic<[llvm_anyfloat_ty],
1147 [LLVMMatchType<0>,
1148 llvm_anyvector_ty],
1149 [IntrNoMem]>;
1142 def int_experimental_vector_reduce_v2_fadd : Intrinsic<[llvm_anyfloat_ty],
1143 [LLVMMatchType<0>,
1144 llvm_anyvector_ty],
1145 [IntrNoMem]>;
1146 def int_experimental_vector_reduce_v2_fmul : Intrinsic<[llvm_anyfloat_ty],
1147 [LLVMMatchType<0>,
1148 llvm_anyvector_ty],
1149 [IntrNoMem]>;
11501150 def int_experimental_vector_reduce_add : Intrinsic<[llvm_anyint_ty],
11511151 [llvm_anyvector_ty],
11521152 [IntrNoMem]>;
2828
2929 unsigned getOpcode(Intrinsic::ID ID) {
3030 switch (ID) {
31 case Intrinsic::experimental_vector_reduce_fadd:
31 case Intrinsic::experimental_vector_reduce_v2_fadd:
3232 return Instruction::FAdd;
33 case Intrinsic::experimental_vector_reduce_fmul:
33 case Intrinsic::experimental_vector_reduce_v2_fmul:
3434 return Instruction::FMul;
3535 case Intrinsic::experimental_vector_reduce_add:
3636 return Instruction::Add;
8282 Worklist.push_back(II);
8383
8484 for (auto *II : Worklist) {
85 if (!TTI->shouldExpandReduction(II))
86 continue;
87
88 FastMathFlags FMF =
 89	        isa<FPMathOperator>(II) ? II->getFastMathFlags() : FastMathFlags{};
90 Intrinsic::ID ID = II->getIntrinsicID();
91 RecurrenceDescriptor::MinMaxRecurrenceKind MRK = getMRK(ID);
92
93 Value *Rdx = nullptr;
8594 IRBuilder<> Builder(II);
86 bool IsOrdered = false;
87 Value *Acc = nullptr;
88 Value *Vec = nullptr;
89 auto ID = II->getIntrinsicID();
90 auto MRK = RecurrenceDescriptor::MRK_Invalid;
95 IRBuilder<>::FastMathFlagGuard FMFGuard(Builder);
96 Builder.setFastMathFlags(FMF);
9197 switch (ID) {
92 case Intrinsic::experimental_vector_reduce_fadd:
93 case Intrinsic::experimental_vector_reduce_fmul:
98 case Intrinsic::experimental_vector_reduce_v2_fadd:
99 case Intrinsic::experimental_vector_reduce_v2_fmul: {
94100 // FMFs must be attached to the call, otherwise it's an ordered reduction
95101 // and it can't be handled by generating a shuffle sequence.
96 if (!II->getFastMathFlags().isFast())
97 IsOrdered = true;
98 Acc = II->getArgOperand(0);
99 Vec = II->getArgOperand(1);
100 break;
102 Value *Acc = II->getArgOperand(0);
103 Value *Vec = II->getArgOperand(1);
104 if (!FMF.allowReassoc())
105 Rdx = getOrderedReduction(Builder, Acc, Vec, getOpcode(ID), MRK);
106 else {
107 Rdx = getShuffleReduction(Builder, Vec, getOpcode(ID), MRK);
108 Rdx = Builder.CreateBinOp((Instruction::BinaryOps)getOpcode(ID),
109 Acc, Rdx, "bin.rdx");
110 }
111 } break;
101112 case Intrinsic::experimental_vector_reduce_add:
102113 case Intrinsic::experimental_vector_reduce_mul:
103114 case Intrinsic::experimental_vector_reduce_and:
108119 case Intrinsic::experimental_vector_reduce_umax:
109120 case Intrinsic::experimental_vector_reduce_umin:
110121 case Intrinsic::experimental_vector_reduce_fmax:
111 case Intrinsic::experimental_vector_reduce_fmin:
112 Vec = II->getArgOperand(0);
113 MRK = getMRK(ID);
114 break;
122 case Intrinsic::experimental_vector_reduce_fmin: {
123 Value *Vec = II->getArgOperand(0);
124 Rdx = getShuffleReduction(Builder, Vec, getOpcode(ID), MRK);
125 } break;
115126 default:
116127 continue;
117128 }
118 if (!TTI->shouldExpandReduction(II))
119 continue;
120 // Propagate FMF using the builder.
121 FastMathFlags FMF =
122	        isa<FPMathOperator>(II) ? II->getFastMathFlags() : FastMathFlags{};
123 IRBuilder<>::FastMathFlagGuard FMFGuard(Builder);
124 Builder.setFastMathFlags(FMF);
125 Value *Rdx =
126 IsOrdered ? getOrderedReduction(Builder, Acc, Vec, getOpcode(ID), MRK)
127 : getShuffleReduction(Builder, Vec, getOpcode(ID), MRK);
128129 II->replaceAllUsesWith(Rdx);
129130 II->eraseFromParent();
130131 Changed = true;
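Sketch of the IR this pass now produces for a 'reassoc' v2.fadd reduction of a <4 x float> vector (names are illustrative; compare the updated expand-reduction test further down): the usual log2 shuffle reduction, followed by one scalar fadd that folds in the start value.

  %rdx.shuf  = shufflevector <4 x float> %vec, <4 x float> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %bin.rdx   = fadd fast <4 x float> %vec, %rdx.shuf
  %rdx.shuf1 = shufflevector <4 x float> %bin.rdx, <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %bin.rdx2  = fadd fast <4 x float> %bin.rdx, %rdx.shuf1
  %elt       = extractelement <4 x float> %bin.rdx2, i32 0
  %res       = fadd fast float %start, %elt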
67356735 LowerDeoptimizeCall(&I);
67366736 return;
67376737
6738 case Intrinsic::experimental_vector_reduce_fadd:
6739 case Intrinsic::experimental_vector_reduce_fmul:
6738 case Intrinsic::experimental_vector_reduce_v2_fadd:
6739 case Intrinsic::experimental_vector_reduce_v2_fmul:
67406740 case Intrinsic::experimental_vector_reduce_add:
67416741 case Intrinsic::experimental_vector_reduce_mul:
67426742 case Intrinsic::experimental_vector_reduce_and:
87948794 FMF = I.getFastMathFlags();
87958795
87968796 switch (Intrinsic) {
8797 case Intrinsic::experimental_vector_reduce_fadd:
8798 if (FMF.isFast())
8799 Res = DAG.getNode(ISD::VECREDUCE_FADD, dl, VT, Op2);
8797 case Intrinsic::experimental_vector_reduce_v2_fadd:
8798 if (FMF.allowReassoc())
8799 Res = DAG.getNode(ISD::FADD, dl, VT, Op1,
8800 DAG.getNode(ISD::VECREDUCE_FADD, dl, VT, Op2));
88008801 else
88018802 Res = DAG.getNode(ISD::VECREDUCE_STRICT_FADD, dl, VT, Op1, Op2);
88028803 break;
8803 case Intrinsic::experimental_vector_reduce_fmul:
8804 if (FMF.isFast())
8805 Res = DAG.getNode(ISD::VECREDUCE_FMUL, dl, VT, Op2);
8804 case Intrinsic::experimental_vector_reduce_v2_fmul:
8805 if (FMF.allowReassoc())
8806 Res = DAG.getNode(ISD::FMUL, dl, VT, Op1,
8807 DAG.getNode(ISD::VECREDUCE_FMUL, dl, VT, Op2));
88068808 else
88078809 Res = DAG.getNode(ISD::VECREDUCE_STRICT_FMUL, dl, VT, Op1, Op2);
88088810 break;
601601 }
602602 break;
603603 }
604 case 'e': {
605	    SmallVector<StringRef, 2> Groups;
606 Regex R("^experimental.vector.reduce.([a-z]+)\\.[fi][0-9]+");
607 if (R.match(Name, &Groups)) {
608 Intrinsic::ID ID = Intrinsic::not_intrinsic;
609 if (Groups[1] == "fadd")
610 ID = Intrinsic::experimental_vector_reduce_v2_fadd;
611 if (Groups[1] == "fmul")
612 ID = Intrinsic::experimental_vector_reduce_v2_fmul;
613
614 if (ID != Intrinsic::not_intrinsic) {
615 rename(F);
616 auto Args = F->getFunctionType()->params();
617 Type *Tys[] = {F->getFunctionType()->getReturnType(), Args[1]};
618 NewFn = Intrinsic::getDeclaration(F->getParent(), ID, Tys);
619 return true;
620 }
621 }
622 break;
623 }
604624 case 'i':
605625 case 'l': {
606626 bool IsLifetimeStart = Name.startswith("lifetime.start");
34663486 DefaultCase();
34673487 return;
34683488 }
3469
3489 case Intrinsic::experimental_vector_reduce_v2_fmul: {
3490	  case Intrinsic::experimental_vector_reduce_v2_fmul: {
3491	    SmallVector<Value *, 2> Args;
3491 if (CI->isFast())
3492 Args.push_back(ConstantFP::get(CI->getOperand(0)->getType(), 1.0));
3493 else
3494 Args.push_back(CI->getOperand(0));
3495 Args.push_back(CI->getOperand(1));
3496 NewCall = Builder.CreateCall(NewFn, Args);
3497	    cast<Instruction>(NewCall)->copyFastMathFlags(CI);
3498 break;
3499 }
3500 case Intrinsic::experimental_vector_reduce_v2_fadd: {
3501	    SmallVector<Value *, 2> Args;
3502 if (CI->isFast())
3503 Args.push_back(Constant::getNullValue(CI->getOperand(0)->getType()));
3504 else
3505 Args.push_back(CI->getOperand(0));
3506 Args.push_back(CI->getOperand(1));
3507 NewCall = Builder.CreateCall(NewFn, Args);
3508	    cast<Instruction>(NewCall)->copyFastMathFlags(CI);
3509 break;
3510 }
34703511 case Intrinsic::arm_neon_vld1:
34713512 case Intrinsic::arm_neon_vld2:
34723513 case Intrinsic::arm_neon_vld3:
322322 Value *Ops[] = {Acc, Src};
323323 Type *Tys[] = {Acc->getType(), Src->getType()};
324324 auto Decl = Intrinsic::getDeclaration(
325 M, Intrinsic::experimental_vector_reduce_fadd, Tys);
325 M, Intrinsic::experimental_vector_reduce_v2_fadd, Tys);
326326 return createCallHelper(Decl, Ops, this);
327327 }
328328
331331 Value *Ops[] = {Acc, Src};
332332 Type *Tys[] = {Acc->getType(), Src->getType()};
333333 auto Decl = Intrinsic::getDeclaration(
334 M, Intrinsic::experimental_vector_reduce_fmul, Tys);
334 M, Intrinsic::experimental_vector_reduce_v2_fmul, Tys);
335335 return createCallHelper(Decl, Ops, this);
336336 }
337337
800	800	                                 ArrayRef<Value *> RedOps) {
801801 assert(isa(Src->getType()) && "Type must be a vector");
802802
803 Value *ScalarUdf = UndefValue::get(Src->getType()->getVectorElementType());
804	803	  std::function<Value *()> BuildFunc;
805804 using RD = RecurrenceDescriptor;
806805 RD::MinMaxRecurrenceKind MinMaxKind = RD::MRK_Invalid;
807 // TODO: Support creating ordered reductions.
808 FastMathFlags FMFFast;
809 FMFFast.setFast();
810806
811807 switch (Opcode) {
812808 case Instruction::Add:
826822 break;
827823 case Instruction::FAdd:
828824 BuildFunc = [&]() {
829 auto Rdx = Builder.CreateFAddReduce(ScalarUdf, Src);
830 cast(Rdx)->setFastMathFlags(FMFFast);
825 auto Rdx = Builder.CreateFAddReduce(
826 Constant::getNullValue(Src->getType()->getVectorElementType()), Src);
831827 return Rdx;
832828 };
833829 break;
834830 case Instruction::FMul:
835831 BuildFunc = [&]() {
836 auto Rdx = Builder.CreateFMulReduce(ScalarUdf, Src);
837 cast(Rdx)->setFastMathFlags(FMFFast);
832 Type *Ty = Src->getType()->getVectorElementType();
833 auto Rdx = Builder.CreateFMulReduce(ConstantFP::get(Ty, 1.0), Src);
838834 return Rdx;
839835 };
840836 break;
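Assuming a <4 x float> FAdd reduction, the call this helper now emits would look roughly like the line below (illustrative only; any fast-math flags on the call come from the caller's IRBuilder settings rather than being forced to 'fast' as before):

  %rdx = call float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float 0.000000e+00, <4 x float> %vec)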
0 ; RUN: not opt -S < %s 2>&1 | FileCheck %s
11
22 ; CHECK: Intrinsic has incorrect argument type!
3 ; CHECK-NEXT: float (double, <2 x double>)* @llvm.experimental.vector.reduce.fadd.f32.f64.v2f64
3 ; CHECK-NEXT: float (double, <2 x double>)* @llvm.experimental.vector.reduce.v2.fadd.f32.f64.v2f64
44 define float @fadd_invalid_scalar_res(double %acc, <2 x double> %in) {
5 %res = call float @llvm.experimental.vector.reduce.fadd.f32.f64.v2f64(double %acc, <2 x double> %in)
5 %res = call float @llvm.experimental.vector.reduce.v2.fadd.f32.f64.v2f64(double %acc, <2 x double> %in)
66 ret float %res
77 }
88
99 ; CHECK: Intrinsic has incorrect argument type!
10 ; CHECK-NEXT: double (float, <2 x double>)* @llvm.experimental.vector.reduce.fadd.f64.f32.v2f64
10 ; CHECK-NEXT: double (float, <2 x double>)* @llvm.experimental.vector.reduce.v2.fadd.f64.f32.v2f64
1111 define double @fadd_invalid_scalar_start(float %acc, <2 x double> %in) {
12 %res = call double @llvm.experimental.vector.reduce.fadd.f64.f32.v2f64(float %acc, <2 x double> %in)
12 %res = call double @llvm.experimental.vector.reduce.v2.fadd.f64.f32.v2f64(float %acc, <2 x double> %in)
1313 ret double %res
1414 }
1515
1616 ; CHECK: Intrinsic has incorrect argument type!
17 ; CHECK-NEXT: <2 x double> (double, <2 x double>)* @llvm.experimental.vector.reduce.fadd.v2f64.f64.v2f64
17 ; CHECK-NEXT: <2 x double> (double, <2 x double>)* @llvm.experimental.vector.reduce.v2.fadd.v2f64.f64.v2f64
1818 define <2 x double> @fadd_invalid_vector_res(double %acc, <2 x double> %in) {
19 %res = call <2 x double> @llvm.experimental.vector.reduce.fadd.v2f64.f64.v2f64(double %acc, <2 x double> %in)
19 %res = call <2 x double> @llvm.experimental.vector.reduce.v2.fadd.v2f64.f64.v2f64(double %acc, <2 x double> %in)
2020 ret <2 x double> %res
2121 }
2222
2323 ; CHECK: Intrinsic has incorrect argument type!
24 ; CHECK-NEXT: double (<2 x double>, <2 x double>)* @llvm.experimental.vector.reduce.fadd.f64.v2f64.v2f64
24 ; CHECK-NEXT: double (<2 x double>, <2 x double>)* @llvm.experimental.vector.reduce.v2.fadd.f64.v2f64.v2f64
2525 define double @fadd_invalid_vector_start(<2 x double> %in, <2 x double> %acc) {
26 %res = call double @llvm.experimental.vector.reduce.fadd.f64.v2f64.v2f64(<2 x double> %acc, <2 x double> %in)
26 %res = call double @llvm.experimental.vector.reduce.v2.fadd.f64.v2f64.v2f64(<2 x double> %acc, <2 x double> %in)
2727 ret double %res
2828 }
2929
30 declare float @llvm.experimental.vector.reduce.fadd.f32.f64.v2f64(double %acc, <2 x double> %in)
31 declare double @llvm.experimental.vector.reduce.fadd.f64.f32.v2f64(float %acc, <2 x double> %in)
32 declare double @llvm.experimental.vector.reduce.fadd.f64.v2f64.v2f64(<2 x double> %acc, <2 x double> %in)
33 declare <2 x double> @llvm.experimental.vector.reduce.fadd.v2f64.f64.v2f64(double %acc, <2 x double> %in)
30 declare float @llvm.experimental.vector.reduce.v2.fadd.f32.f64.v2f64(double %acc, <2 x double> %in)
31 declare double @llvm.experimental.vector.reduce.v2.fadd.f64.f32.v2f64(float %acc, <2 x double> %in)
32 declare double @llvm.experimental.vector.reduce.v2.fadd.f64.v2f64.v2f64(<2 x double> %acc, <2 x double> %in)
33 declare <2 x double> @llvm.experimental.vector.reduce.v2.fadd.v2f64.f64.v2f64(double %acc, <2 x double> %in)
0 ; RUN: opt -S < %s | FileCheck %s
1 ; RUN: llvm-dis < %s.bc | FileCheck %s
2
3 define float @fadd_acc(<4 x float> %in, float %acc) {
4 ; CHECK-LABEL: @fadd_acc
5 ; CHECK: %res = call float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %acc, <4 x float> %in)
6 %res = call float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float %acc, <4 x float> %in)
7 ret float %res
8 }
9
10 define float @fadd_undef(<4 x float> %in) {
11 ; CHECK-LABEL: @fadd_undef
12 ; CHECK: %res = call float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float undef, <4 x float> %in)
13 %res = call float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float undef, <4 x float> %in)
14 ret float %res
15 }
16
17 define float @fadd_fast_acc(<4 x float> %in, float %acc) {
18 ; CHECK-LABEL: @fadd_fast_acc
19 ; CHECK: %res = call fast float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float 0.000000e+00, <4 x float> %in)
20 %res = call fast float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float %acc, <4 x float> %in)
21 ret float %res
22 }
23
24 define float @fadd_fast_undef(<4 x float> %in) {
25 ; CHECK-LABEL: @fadd_fast_undef
26 ; CHECK: %res = call fast float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float 0.000000e+00, <4 x float> %in)
27 %res = call fast float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float undef, <4 x float> %in)
28 ret float %res
29 }
30
31 define float @fmul_acc(<4 x float> %in, float %acc) {
32 ; CHECK-LABEL: @fmul_acc
33 ; CHECK: %res = call float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float %acc, <4 x float> %in)
34 %res = call float @llvm.experimental.vector.reduce.fmul.f32.v4f32(float %acc, <4 x float> %in)
35 ret float %res
36 }
37
38 define float @fmul_undef(<4 x float> %in) {
39 ; CHECK-LABEL: @fmul_undef
40 ; CHECK: %res = call float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float undef, <4 x float> %in)
41 %res = call float @llvm.experimental.vector.reduce.fmul.f32.v4f32(float undef, <4 x float> %in)
42 ret float %res
43 }
44
45 define float @fmul_fast_acc(<4 x float> %in, float %acc) {
46 ; CHECK-LABEL: @fmul_fast_acc
47 ; CHECK: %res = call fast float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float 1.000000e+00, <4 x float> %in)
48 %res = call fast float @llvm.experimental.vector.reduce.fmul.f32.v4f32(float %acc, <4 x float> %in)
49 ret float %res
50 }
51
52 define float @fmul_fast_undef(<4 x float> %in) {
53 ; CHECK-LABEL: @fmul_fast_undef
54 ; CHECK: %res = call fast float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float 1.000000e+00, <4 x float> %in)
55 %res = call fast float @llvm.experimental.vector.reduce.fmul.f32.v4f32(float undef, <4 x float> %in)
56 ret float %res
57 }
58
59 declare float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float, <4 x float>)
60 ; CHECK: declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float, <4 x float>)
61
62 declare float @llvm.experimental.vector.reduce.fmul.f32.v4f32(float, <4 x float>)
63 ; CHECK: declare float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float, <4 x float>)
0 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
11 ; RUN: llc < %s -mtriple=aarch64-none-linux-gnu -mattr=+neon | FileCheck %s --check-prefix=CHECK
22
3 declare half @llvm.experimental.vector.reduce.fadd.f16.v1f16(half, <1 x half>)
4 declare float @llvm.experimental.vector.reduce.fadd.f32.v1f32(float, <1 x float>)
5 declare double @llvm.experimental.vector.reduce.fadd.f64.v1f64(double, <1 x double>)
6 declare fp128 @llvm.experimental.vector.reduce.fadd.f128.v1f128(fp128, <1 x fp128>)
3 declare half @llvm.experimental.vector.reduce.v2.fadd.f16.v1f16(half, <1 x half>)
4 declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v1f32(float, <1 x float>)
5 declare double @llvm.experimental.vector.reduce.v2.fadd.f64.v1f64(double, <1 x double>)
6 declare fp128 @llvm.experimental.vector.reduce.v2.fadd.f128.v1f128(fp128, <1 x fp128>)
77
8 declare float @llvm.experimental.vector.reduce.fadd.f32.v3f32(float, <3 x float>)
9 declare fp128 @llvm.experimental.vector.reduce.fadd.f128.v2f128(fp128, <2 x fp128>)
10 declare float @llvm.experimental.vector.reduce.fadd.f32.v16f32(float, <16 x float>)
8 declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v3f32(float, <3 x float>)
9 declare fp128 @llvm.experimental.vector.reduce.v2.fadd.f128.v2f128(fp128, <2 x fp128>)
10 declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v16f32(float, <16 x float>)
1111
1212 define half @test_v1f16(<1 x half> %a) nounwind {
1313 ; CHECK-LABEL: test_v1f16:
1414 ; CHECK: // %bb.0:
1515 ; CHECK-NEXT: ret
16 %b = call fast nnan half @llvm.experimental.vector.reduce.fadd.f16.v1f16(half 0.0, <1 x half> %a)
16 %b = call fast nnan half @llvm.experimental.vector.reduce.v2.fadd.f16.v1f16(half 0.0, <1 x half> %a)
1717 ret half %b
1818 }
1919
2323 ; CHECK-NEXT: // kill: def $d0 killed $d0 def $q0
2424 ; CHECK-NEXT: // kill: def $s0 killed $s0 killed $q0
2525 ; CHECK-NEXT: ret
26 %b = call fast nnan float @llvm.experimental.vector.reduce.fadd.f32.v1f32(float 0.0, <1 x float> %a)
26 %b = call fast nnan float @llvm.experimental.vector.reduce.v2.fadd.f32.v1f32(float 0.0, <1 x float> %a)
2727 ret float %b
2828 }
2929
3131 ; CHECK-LABEL: test_v1f64:
3232 ; CHECK: // %bb.0:
3333 ; CHECK-NEXT: ret
34 %b = call fast nnan double @llvm.experimental.vector.reduce.fadd.f64.v1f64(double 0.0, <1 x double> %a)
34 %b = call fast nnan double @llvm.experimental.vector.reduce.v2.fadd.f64.v1f64(double 0.0, <1 x double> %a)
3535 ret double %b
3636 }
3737
3939 ; CHECK-LABEL: test_v1f128:
4040 ; CHECK: // %bb.0:
4141 ; CHECK-NEXT: ret
42 %b = call fast nnan fp128 @llvm.experimental.vector.reduce.fadd.f128.v1f128(fp128 zeroinitializer, <1 x fp128> %a)
42 %b = call fast nnan fp128 @llvm.experimental.vector.reduce.v2.fadd.f128.v1f128(fp128 zeroinitializer, <1 x fp128> %a)
4343 ret fp128 %b
4444 }
4545
5252 ; CHECK-NEXT: fadd v0.2s, v0.2s, v1.2s
5353 ; CHECK-NEXT: faddp s0, v0.2s
5454 ; CHECK-NEXT: ret
55 %b = call fast nnan float @llvm.experimental.vector.reduce.fadd.f32.v3f32(float 0.0, <3 x float> %a)
55 %b = call fast nnan float @llvm.experimental.vector.reduce.v2.fadd.f32.v3f32(float 0.0, <3 x float> %a)
5656 ret float %b
5757 }
5858
6363 ; CHECK-NEXT: bl __addtf3
6464 ; CHECK-NEXT: ldr x30, [sp], #16 // 8-byte Folded Reload
6565 ; CHECK-NEXT: ret
66 %b = call fast nnan fp128 @llvm.experimental.vector.reduce.fadd.f128.v2f128(fp128 zeroinitializer, <2 x fp128> %a)
66 %b = call fast nnan fp128 @llvm.experimental.vector.reduce.v2.fadd.f128.v2f128(fp128 zeroinitializer, <2 x fp128> %a)
6767 ret fp128 %b
6868 }
6969
7777 ; CHECK-NEXT: fadd v0.2s, v0.2s, v1.2s
7878 ; CHECK-NEXT: faddp s0, v0.2s
7979 ; CHECK-NEXT: ret
80 %b = call fast nnan float @llvm.experimental.vector.reduce.fadd.f32.v16f32(float 0.0, <16 x float> %a)
80 %b = call fast nnan float @llvm.experimental.vector.reduce.v2.fadd.f32.v16f32(float 0.0, <16 x float> %a)
8181 ret float %b
8282 }
44 ; CHECK-LABEL: add_HalfS:
55 ; CHECK: faddp s0, v0.2s
66 ; CHECK-NEXT: ret
7 %r = call fast float @llvm.experimental.vector.reduce.fadd.f32.v2f32(float undef, <2 x float> %bin.rdx)
7 %r = call fast float @llvm.experimental.vector.reduce.v2.fadd.f32.v2f32(float 0.0, <2 x float> %bin.rdx)
88 ret float %r
99 }
1010
2222 ; CHECKNOFP16-NOT: fadd h{{[0-9]+}}
2323 ; CHECKNOFP16-NOT: fadd v{{[0-9]+}}.{{[0-9]}}h
2424 ; CHECKNOFP16: ret
25 %r = call fast half @llvm.experimental.vector.reduce.fadd.f16.v4f16(half undef, <4 x half> %bin.rdx)
25 %r = call fast half @llvm.experimental.vector.reduce.v2.fadd.f16.v4f16(half 0.0, <4 x half> %bin.rdx)
2626 ret half %r
2727 }
2828
4444 ; CHECKNOFP16-NOT: fadd h{{[0-9]+}}
4545 ; CHECKNOFP16-NOT: fadd v{{[0-9]+}}.{{[0-9]}}h
4646 ; CHECKNOFP16: ret
47 %r = call fast half @llvm.experimental.vector.reduce.fadd.f16.v8f16(half undef, <8 x half> %bin.rdx)
47 %r = call fast half @llvm.experimental.vector.reduce.v2.fadd.f16.v8f16(half 0.0, <8 x half> %bin.rdx)
4848 ret half %r
4949 }
5050
5454 ; CHECK-NEXT: fadd v0.2s, v0.2s, v1.2s
5555 ; CHECK-NEXT: faddp s0, v0.2s
5656 ; CHECK-NEXT: ret
57 %r = call fast float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float undef, <4 x float> %bin.rdx)
57 %r = call fast float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float 0.0, <4 x float> %bin.rdx)
5858 ret float %r
5959 }
6060
6262 ; CHECK-LABEL: add_D:
6363 ; CHECK: faddp d0, v0.2d
6464 ; CHECK-NEXT: ret
65 %r = call fast double @llvm.experimental.vector.reduce.fadd.f64.v2f64(double undef, <2 x double> %bin.rdx)
65 %r = call fast double @llvm.experimental.vector.reduce.v2.fadd.f64.v2f64(double 0.0, <2 x double> %bin.rdx)
6666 ret double %r
6767 }
6868
8383 ; CHECKNOFP16-NOT: fadd h{{[0-9]+}}
8484 ; CHECKNOFP16-NOT: fadd v{{[0-9]+}}.{{[0-9]}}h
8585 ; CHECKNOFP16: ret
86 %r = call fast half @llvm.experimental.vector.reduce.fadd.f16.v16f16(half undef, <16 x half> %bin.rdx)
86 %r = call fast half @llvm.experimental.vector.reduce.v2.fadd.f16.v16f16(half 0.0, <16 x half> %bin.rdx)
8787 ret half %r
8888 }
8989
9494 ; CHECK-NEXT: fadd v0.2s, v0.2s, v1.2s
9595 ; CHECK-NEXT: faddp s0, v0.2s
9696 ; CHECK-NEXT: ret
97 %r = call fast float @llvm.experimental.vector.reduce.fadd.f32.v8f32(float undef, <8 x float> %bin.rdx)
97 %r = call fast float @llvm.experimental.vector.reduce.v2.fadd.f32.v8f32(float 0.0, <8 x float> %bin.rdx)
9898 ret float %r
9999 }
100100
103103 ; CHECK: fadd v0.2d, v0.2d, v1.2d
104104 ; CHECK-NEXT: faddp d0, v0.2d
105105 ; CHECK-NEXT: ret
106 %r = call fast double @llvm.experimental.vector.reduce.fadd.f64.v4f64(double undef, <4 x double> %bin.rdx)
106 %r = call fast double @llvm.experimental.vector.reduce.v2.fadd.f64.v4f64(double 0.0, <4 x double> %bin.rdx)
107107 ret double %r
108108 }
109109
110110 ; Function Attrs: nounwind readnone
111 declare half @llvm.experimental.vector.reduce.fadd.f16.v4f16(half, <4 x half>)
112 declare half @llvm.experimental.vector.reduce.fadd.f16.v8f16(half, <8 x half>)
113 declare half @llvm.experimental.vector.reduce.fadd.f16.v16f16(half, <16 x half>)
114 declare float @llvm.experimental.vector.reduce.fadd.f32.v2f32(float, <2 x float>)
115 declare float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float, <4 x float>)
116 declare float @llvm.experimental.vector.reduce.fadd.f32.v8f32(float, <8 x float>)
117 declare double @llvm.experimental.vector.reduce.fadd.f64.v2f64(double, <2 x double>)
118 declare double @llvm.experimental.vector.reduce.fadd.f64.v4f64(double, <4 x double>)
111 declare half @llvm.experimental.vector.reduce.v2.fadd.f16.v4f16(half, <4 x half>)
112 declare half @llvm.experimental.vector.reduce.v2.fadd.f16.v8f16(half, <8 x half>)
113 declare half @llvm.experimental.vector.reduce.v2.fadd.f16.v16f16(half, <16 x half>)
114 declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v2f32(float, <2 x float>)
115 declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float, <4 x float>)
116 declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v8f32(float, <8 x float>)
117 declare double @llvm.experimental.vector.reduce.v2.fadd.f64.v2f64(double, <2 x double>)
118 declare double @llvm.experimental.vector.reduce.v2.fadd.f64.v4f64(double, <4 x double>)
66 declare i64 @llvm.experimental.vector.reduce.or.i64.v2i64(<2 x i64>)
77 declare i64 @llvm.experimental.vector.reduce.xor.i64.v2i64(<2 x i64>)
88
9 declare float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float, <4 x float>)
10 declare float @llvm.experimental.vector.reduce.fmul.f32.v4f32(float, <4 x float>)
9 declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float, <4 x float>)
10 declare float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float, <4 x float>)
1111
1212 declare i64 @llvm.experimental.vector.reduce.smax.i64.v2i64(<2 x i64>)
1313 declare i64 @llvm.experimental.vector.reduce.smin.i64.v2i64(<2 x i64>)
9191 ; CHECK-NEXT: [[RDX_SHUF1:%.*]] = shufflevector <4 x float> [[BIN_RDX]], <4 x float> undef, <4 x i32>
9292 ; CHECK-NEXT: [[BIN_RDX2:%.*]] = fadd fast <4 x float> [[BIN_RDX]], [[RDX_SHUF1]]
9393 ; CHECK-NEXT: [[TMP0:%.*]] = extractelement <4 x float> [[BIN_RDX2]], i32 0
94 ; CHECK-NEXT: ret float [[TMP0]]
95 ;
96 entry:
97 %r = call fast float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float undef, <4 x float> %vec)
94 ; CHECK-NEXT: [[TMP1:%.*]] = fadd fast float 0.000000e+00, [[TMP0]]
95 ; CHECK-NEXT: ret float [[TMP1]]
96 ;
97 entry:
98 %r = call fast float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float 0.0, <4 x float> %vec)
9899 ret float %r
99100 }
100101
106107 ; CHECK-NEXT: [[RDX_SHUF1:%.*]] = shufflevector <4 x float> [[BIN_RDX]], <4 x float> undef, <4 x i32>
107108 ; CHECK-NEXT: [[BIN_RDX2:%.*]] = fadd fast <4 x float> [[BIN_RDX]], [[RDX_SHUF1]]
108109 ; CHECK-NEXT: [[TMP0:%.*]] = extractelement <4 x float> [[BIN_RDX2]], i32 0
109 ; CHECK-NEXT: ret float [[TMP0]]
110 ;
111 entry:
112 %r = call fast float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float %accum, <4 x float> %vec)
110 ; CHECK-NEXT: [[TMP1:%.*]] = fadd fast float %accum, [[TMP0]]
111 ; CHECK-NEXT: ret float [[TMP1]]
112 ;
113 entry:
114 %r = call fast float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %accum, <4 x float> %vec)
113115 ret float %r
114116 }
115117
127129 ; CHECK-NEXT: ret float [[BIN_RDX3]]
128130 ;
129131 entry:
130 %r = call float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float undef, <4 x float> %vec)
132 %r = call float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float undef, <4 x float> %vec)
131133 ret float %r
132134 }
133135
145147 ; CHECK-NEXT: ret float [[BIN_RDX3]]
146148 ;
147149 entry:
148 %r = call float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float %accum, <4 x float> %vec)
150 %r = call float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %accum, <4 x float> %vec)
149151 ret float %r
150152 }
151153
157159 ; CHECK-NEXT: [[RDX_SHUF1:%.*]] = shufflevector <4 x float> [[BIN_RDX]], <4 x float> undef, <4 x i32>
158160 ; CHECK-NEXT: [[BIN_RDX2:%.*]] = fmul fast <4 x float> [[BIN_RDX]], [[RDX_SHUF1]]
159161 ; CHECK-NEXT: [[TMP0:%.*]] = extractelement <4 x float> [[BIN_RDX2]], i32 0
160 ; CHECK-NEXT: ret float [[TMP0]]
161 ;
162 entry:
163 %r = call fast float @llvm.experimental.vector.reduce.fmul.f32.v4f32(float undef, <4 x float> %vec)
162 ; CHECK-NEXT: [[TMP1:%.*]] = fmul fast float 1.000000e+00, [[TMP0]]
163 ; CHECK-NEXT: ret float [[TMP1]]
164 ;
165 entry:
166 %r = call fast float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float 1.0, <4 x float> %vec)
164167 ret float %r
165168 }
166169
172175 ; CHECK-NEXT: [[RDX_SHUF1:%.*]] = shufflevector <4 x float> [[BIN_RDX]], <4 x float> undef, <4 x i32>
173176 ; CHECK-NEXT: [[BIN_RDX2:%.*]] = fmul fast <4 x float> [[BIN_RDX]], [[RDX_SHUF1]]
174177 ; CHECK-NEXT: [[TMP0:%.*]] = extractelement <4 x float> [[BIN_RDX2]], i32 0
175 ; CHECK-NEXT: ret float [[TMP0]]
176 ;
177 entry:
178 %r = call fast float @llvm.experimental.vector.reduce.fmul.f32.v4f32(float %accum, <4 x float> %vec)
178 ; CHECK-NEXT: [[TMP1:%.*]] = fmul fast float %accum, [[TMP0]]
179 ; CHECK-NEXT: ret float [[TMP1]]
180 ;
181 entry:
182 %r = call fast float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float %accum, <4 x float> %vec)
179183 ret float %r
180184 }
181185
193197 ; CHECK-NEXT: ret float [[BIN_RDX3]]
194198 ;
195199 entry:
196 %r = call float @llvm.experimental.vector.reduce.fmul.f32.v4f32(float undef, <4 x float> %vec)
200 %r = call float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float undef, <4 x float> %vec)
197201 ret float %r
198202 }
199203
211215 ; CHECK-NEXT: ret float [[BIN_RDX3]]
212216 ;
213217 entry:
214 %r = call float @llvm.experimental.vector.reduce.fmul.f32.v4f32(float %accum, <4 x float> %vec)
218 %r = call float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float %accum, <4 x float> %vec)
215219 ret float %r
216220 }
217221
16271627 ; Repeat tests from general reductions to verify output for hoppy targets:
16281628 ; PR38971: https://bugs.llvm.org/show_bug.cgi?id=38971
16291629
1630 declare float @llvm.experimental.vector.reduce.fadd.f32.v8f32(float, <8 x float>)
1631 declare double @llvm.experimental.vector.reduce.fadd.f64.v4f64(double, <4 x double>)
1630 declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v8f32(float, <8 x float>)
1631 declare double @llvm.experimental.vector.reduce.v2.fadd.f64.v4f64(double, <4 x double>)
16321632
16331633 define float @fadd_reduce_v8f32(float %a0, <8 x float> %a1) {
16341634 ; SSE3-SLOW-LABEL: fadd_reduce_v8f32:
16371637 ; SSE3-SLOW-NEXT: movaps %xmm1, %xmm2
16381638 ; SSE3-SLOW-NEXT: unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
16391639 ; SSE3-SLOW-NEXT: addps %xmm1, %xmm2
1640 ; SSE3-SLOW-NEXT: movshdup {{.*#+}} xmm0 = xmm2[1,1,3,3]
1641 ; SSE3-SLOW-NEXT: addss %xmm2, %xmm0
1640 ; SSE3-SLOW-NEXT: movshdup {{.*#+}} xmm1 = xmm2[1,1,3,3]
1641 ; SSE3-SLOW-NEXT: addss %xmm2, %xmm1
1642 ; SSE3-SLOW-NEXT: addss %xmm1, %xmm0
16421643 ; SSE3-SLOW-NEXT: retq
16431644 ;
16441645 ; SSE3-FAST-LABEL: fadd_reduce_v8f32:
16451646 ; SSE3-FAST: # %bb.0:
16461647 ; SSE3-FAST-NEXT: addps %xmm2, %xmm1
1647 ; SSE3-FAST-NEXT: movaps %xmm1, %xmm0
1648 ; SSE3-FAST-NEXT: unpckhpd {{.*#+}} xmm0 = xmm0[1],xmm1[1]
1649 ; SSE3-FAST-NEXT: addps %xmm1, %xmm0
1650 ; SSE3-FAST-NEXT: haddps %xmm0, %xmm0
1648 ; SSE3-FAST-NEXT: movaps %xmm1, %xmm2
1649 ; SSE3-FAST-NEXT: unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
1650 ; SSE3-FAST-NEXT: addps %xmm1, %xmm2
1651 ; SSE3-FAST-NEXT: haddps %xmm2, %xmm2
1652 ; SSE3-FAST-NEXT: addss %xmm2, %xmm0
16511653 ; SSE3-FAST-NEXT: retq
16521654 ;
16531655 ; AVX-SLOW-LABEL: fadd_reduce_v8f32:
16541656 ; AVX-SLOW: # %bb.0:
1655 ; AVX-SLOW-NEXT: vextractf128 $1, %ymm1, %xmm0
1656 ; AVX-SLOW-NEXT: vaddps %xmm0, %xmm1, %xmm0
1657 ; AVX-SLOW-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
1658 ; AVX-SLOW-NEXT: vaddps %xmm1, %xmm0, %xmm0
1659 ; AVX-SLOW-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
1657 ; AVX-SLOW-NEXT: vextractf128 $1, %ymm1, %xmm2
1658 ; AVX-SLOW-NEXT: vaddps %xmm2, %xmm1, %xmm1
1659 ; AVX-SLOW-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
1660 ; AVX-SLOW-NEXT: vaddps %xmm2, %xmm1, %xmm1
1661 ; AVX-SLOW-NEXT: vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
1662 ; AVX-SLOW-NEXT: vaddss %xmm2, %xmm1, %xmm1
16601663 ; AVX-SLOW-NEXT: vaddss %xmm1, %xmm0, %xmm0
16611664 ; AVX-SLOW-NEXT: vzeroupper
16621665 ; AVX-SLOW-NEXT: retq
16631666 ;
16641667 ; AVX-FAST-LABEL: fadd_reduce_v8f32:
16651668 ; AVX-FAST: # %bb.0:
1666 ; AVX-FAST-NEXT: vextractf128 $1, %ymm1, %xmm0
1667 ; AVX-FAST-NEXT: vaddps %xmm0, %xmm1, %xmm0
1668 ; AVX-FAST-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
1669 ; AVX-FAST-NEXT: vaddps %xmm1, %xmm0, %xmm0
1670 ; AVX-FAST-NEXT: vhaddps %xmm0, %xmm0, %xmm0
1671 ; AVX-FAST-NEXT: vzeroupper
1672 ; AVX-FAST-NEXT: retq
1673 %r = call fast float @llvm.experimental.vector.reduce.fadd.f32.v8f32(float %a0, <8 x float> %a1)
1669 ; AVX-FAST-NEXT: vextractf128 $1, %ymm1, %xmm2
1670 ; AVX-FAST-NEXT: vaddps %xmm2, %xmm1, %xmm1
1671 ; AVX-FAST-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
1672 ; AVX-FAST-NEXT: vaddps %xmm2, %xmm1, %xmm1
1673 ; AVX-FAST-NEXT: vhaddps %xmm1, %xmm1, %xmm1
1674 ; AVX-FAST-NEXT: vaddss %xmm1, %xmm0, %xmm0
1675 ; AVX-FAST-NEXT: vzeroupper
1676 ; AVX-FAST-NEXT: retq
1677 %r = call fast float @llvm.experimental.vector.reduce.v2.fadd.f32.v8f32(float %a0, <8 x float> %a1)
16741678 ret float %r
16751679 }
16761680
16781682 ; SSE3-SLOW-LABEL: fadd_reduce_v4f64:
16791683 ; SSE3-SLOW: # %bb.0:
16801684 ; SSE3-SLOW-NEXT: addpd %xmm2, %xmm1
1681 ; SSE3-SLOW-NEXT: movapd %xmm1, %xmm0
1682 ; SSE3-SLOW-NEXT: unpckhpd {{.*#+}} xmm0 = xmm0[1],xmm1[1]
1683 ; SSE3-SLOW-NEXT: addsd %xmm1, %xmm0
1685 ; SSE3-SLOW-NEXT: movapd %xmm1, %xmm2
1686 ; SSE3-SLOW-NEXT: unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
1687 ; SSE3-SLOW-NEXT: addsd %xmm1, %xmm2
1688 ; SSE3-SLOW-NEXT: addsd %xmm2, %xmm0
16841689 ; SSE3-SLOW-NEXT: retq
16851690 ;
16861691 ; SSE3-FAST-LABEL: fadd_reduce_v4f64:
16871692 ; SSE3-FAST: # %bb.0:
1688 ; SSE3-FAST-NEXT: movapd %xmm1, %xmm0
1689 ; SSE3-FAST-NEXT: addpd %xmm2, %xmm0
1690 ; SSE3-FAST-NEXT: haddpd %xmm0, %xmm0
1693 ; SSE3-FAST-NEXT: addpd %xmm2, %xmm1
1694 ; SSE3-FAST-NEXT: haddpd %xmm1, %xmm1
1695 ; SSE3-FAST-NEXT: addsd %xmm1, %xmm0
16911696 ; SSE3-FAST-NEXT: retq
16921697 ;
16931698 ; AVX-SLOW-LABEL: fadd_reduce_v4f64:
16941699 ; AVX-SLOW: # %bb.0:
1695 ; AVX-SLOW-NEXT: vextractf128 $1, %ymm1, %xmm0
1696 ; AVX-SLOW-NEXT: vaddpd %xmm0, %xmm1, %xmm0
1697 ; AVX-SLOW-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
1700 ; AVX-SLOW-NEXT: vextractf128 $1, %ymm1, %xmm2
1701 ; AVX-SLOW-NEXT: vaddpd %xmm2, %xmm1, %xmm1
1702 ; AVX-SLOW-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
1703 ; AVX-SLOW-NEXT: vaddsd %xmm2, %xmm1, %xmm1
16981704 ; AVX-SLOW-NEXT: vaddsd %xmm1, %xmm0, %xmm0
16991705 ; AVX-SLOW-NEXT: vzeroupper
17001706 ; AVX-SLOW-NEXT: retq
17011707 ;
17021708 ; AVX-FAST-LABEL: fadd_reduce_v4f64:
17031709 ; AVX-FAST: # %bb.0:
1704 ; AVX-FAST-NEXT: vextractf128 $1, %ymm1, %xmm0
1705 ; AVX-FAST-NEXT: vaddpd %xmm0, %xmm1, %xmm0
1706 ; AVX-FAST-NEXT: vhaddpd %xmm0, %xmm0, %xmm0
1707 ; AVX-FAST-NEXT: vzeroupper
1708 ; AVX-FAST-NEXT: retq
1709 %r = call fast double @llvm.experimental.vector.reduce.fadd.f64.v4f64(double %a0, <4 x double> %a1)
1710 ; AVX-FAST-NEXT: vextractf128 $1, %ymm1, %xmm2
1711 ; AVX-FAST-NEXT: vaddpd %xmm2, %xmm1, %xmm1
1712 ; AVX-FAST-NEXT: vhaddpd %xmm1, %xmm1, %xmm1
1713 ; AVX-FAST-NEXT: vaddsd %xmm1, %xmm0, %xmm0
1714 ; AVX-FAST-NEXT: vzeroupper
1715 ; AVX-FAST-NEXT: retq
1716 %r = call fast double @llvm.experimental.vector.reduce.v2.fadd.f64.v4f64(double %a0, <4 x double> %a1)
17101717 ret double %r
17111718 }
17121719
1313 define float @test_v2f32(float %a0, <2 x float> %a1) {
1414 ; SSE2-LABEL: test_v2f32:
1515 ; SSE2: # %bb.0:
16 ; SSE2-NEXT: movaps %xmm1, %xmm0
17 ; SSE2-NEXT: shufps {{.*#+}} xmm0 = xmm0[1,1],xmm1[2,3]
18 ; SSE2-NEXT: addss %xmm1, %xmm0
16 ; SSE2-NEXT: movaps %xmm1, %xmm2
17 ; SSE2-NEXT: shufps {{.*#+}} xmm2 = xmm2[1,1],xmm1[2,3]
18 ; SSE2-NEXT: addss %xmm1, %xmm2
19 ; SSE2-NEXT: addss %xmm2, %xmm0
1920 ; SSE2-NEXT: retq
2021 ;
2122 ; SSE41-LABEL: test_v2f32:
2223 ; SSE41: # %bb.0:
23 ; SSE41-NEXT: movshdup {{.*#+}} xmm0 = xmm1[1,1,3,3]
24 ; SSE41-NEXT: addss %xmm1, %xmm0
24 ; SSE41-NEXT: movshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
25 ; SSE41-NEXT: addss %xmm1, %xmm2
26 ; SSE41-NEXT: addss %xmm2, %xmm0
2527 ; SSE41-NEXT: retq
2628 ;
2729 ; AVX1-SLOW-LABEL: test_v2f32:
2830 ; AVX1-SLOW: # %bb.0:
29 ; AVX1-SLOW-NEXT: vmovshdup {{.*#+}} xmm0 = xmm1[1,1,3,3]
30 ; AVX1-SLOW-NEXT: vaddss %xmm0, %xmm1, %xmm0
31 ; AVX1-SLOW-NEXT: vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
32 ; AVX1-SLOW-NEXT: vaddss %xmm2, %xmm1, %xmm1
33 ; AVX1-SLOW-NEXT: vaddss %xmm1, %xmm0, %xmm0
3134 ; AVX1-SLOW-NEXT: retq
3235 ;
3336 ; AVX1-FAST-LABEL: test_v2f32:
3437 ; AVX1-FAST: # %bb.0:
35 ; AVX1-FAST-NEXT: vhaddps %xmm1, %xmm1, %xmm0
38 ; AVX1-FAST-NEXT: vhaddps %xmm1, %xmm1, %xmm1
39 ; AVX1-FAST-NEXT: vaddss %xmm1, %xmm0, %xmm0
3640 ; AVX1-FAST-NEXT: retq
3741 ;
3842 ; AVX2-LABEL: test_v2f32:
3943 ; AVX2: # %bb.0:
40 ; AVX2-NEXT: vmovshdup {{.*#+}} xmm0 = xmm1[1,1,3,3]
41 ; AVX2-NEXT: vaddss %xmm0, %xmm1, %xmm0
44 ; AVX2-NEXT: vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
45 ; AVX2-NEXT: vaddss %xmm2, %xmm1, %xmm1
46 ; AVX2-NEXT: vaddss %xmm1, %xmm0, %xmm0
4247 ; AVX2-NEXT: retq
4348 ;
4449 ; AVX512-LABEL: test_v2f32:
4550 ; AVX512: # %bb.0:
46 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm0 = xmm1[1,1,3,3]
47 ; AVX512-NEXT: vaddss %xmm0, %xmm1, %xmm0
48 ; AVX512-NEXT: retq
49 %1 = call fast float @llvm.experimental.vector.reduce.fadd.f32.v2f32(float %a0, <2 x float> %a1)
51 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
52 ; AVX512-NEXT: vaddss %xmm2, %xmm1, %xmm1
53 ; AVX512-NEXT: vaddss %xmm1, %xmm0, %xmm0
54 ; AVX512-NEXT: retq
55 %1 = call fast float @llvm.experimental.vector.reduce.v2.fadd.f32.v2f32(float %a0, <2 x float> %a1)
5056 ret float %1
5157 }
5258
5662 ; SSE2-NEXT: movaps %xmm1, %xmm2
5763 ; SSE2-NEXT: unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
5864 ; SSE2-NEXT: addps %xmm1, %xmm2
59 ; SSE2-NEXT: movaps %xmm2, %xmm0
60 ; SSE2-NEXT: shufps {{.*#+}} xmm0 = xmm0[1,1],xmm2[2,3]
61 ; SSE2-NEXT: addss %xmm2, %xmm0
65 ; SSE2-NEXT: movaps %xmm2, %xmm1
66 ; SSE2-NEXT: shufps {{.*#+}} xmm1 = xmm1[1,1],xmm2[2,3]
67 ; SSE2-NEXT: addss %xmm2, %xmm1
68 ; SSE2-NEXT: addss %xmm1, %xmm0
6269 ; SSE2-NEXT: retq
6370 ;
6471 ; SSE41-LABEL: test_v4f32:
6673 ; SSE41-NEXT: movaps %xmm1, %xmm2
6774 ; SSE41-NEXT: unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
6875 ; SSE41-NEXT: addps %xmm1, %xmm2
69 ; SSE41-NEXT: movshdup {{.*#+}} xmm0 = xmm2[1,1,3,3]
70 ; SSE41-NEXT: addss %xmm2, %xmm0
76 ; SSE41-NEXT: movshdup {{.*#+}} xmm1 = xmm2[1,1,3,3]
77 ; SSE41-NEXT: addss %xmm2, %xmm1
78 ; SSE41-NEXT: addss %xmm1, %xmm0
7179 ; SSE41-NEXT: retq
7280 ;
7381 ; AVX1-SLOW-LABEL: test_v4f32:
7482 ; AVX1-SLOW: # %bb.0:
75 ; AVX1-SLOW-NEXT: vpermilpd {{.*#+}} xmm0 = xmm1[1,0]
76 ; AVX1-SLOW-NEXT: vaddps %xmm0, %xmm1, %xmm0
77 ; AVX1-SLOW-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
83 ; AVX1-SLOW-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
84 ; AVX1-SLOW-NEXT: vaddps %xmm2, %xmm1, %xmm1
85 ; AVX1-SLOW-NEXT: vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
86 ; AVX1-SLOW-NEXT: vaddss %xmm2, %xmm1, %xmm1
7887 ; AVX1-SLOW-NEXT: vaddss %xmm1, %xmm0, %xmm0
7988 ; AVX1-SLOW-NEXT: retq
8089 ;
8190 ; AVX1-FAST-LABEL: test_v4f32:
8291 ; AVX1-FAST: # %bb.0:
83 ; AVX1-FAST-NEXT: vpermilpd {{.*#+}} xmm0 = xmm1[1,0]
84 ; AVX1-FAST-NEXT: vaddps %xmm0, %xmm1, %xmm0
85 ; AVX1-FAST-NEXT: vhaddps %xmm0, %xmm0, %xmm0
92 ; AVX1-FAST-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
93 ; AVX1-FAST-NEXT: vaddps %xmm2, %xmm1, %xmm1
94 ; AVX1-FAST-NEXT: vhaddps %xmm1, %xmm1, %xmm1
95 ; AVX1-FAST-NEXT: vaddss %xmm1, %xmm0, %xmm0
8696 ; AVX1-FAST-NEXT: retq
8797 ;
8898 ; AVX2-LABEL: test_v4f32:
8999 ; AVX2: # %bb.0:
90 ; AVX2-NEXT: vpermilpd {{.*#+}} xmm0 = xmm1[1,0]
91 ; AVX2-NEXT: vaddps %xmm0, %xmm1, %xmm0
92 ; AVX2-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
100 ; AVX2-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
101 ; AVX2-NEXT: vaddps %xmm2, %xmm1, %xmm1
102 ; AVX2-NEXT: vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
103 ; AVX2-NEXT: vaddss %xmm2, %xmm1, %xmm1
93104 ; AVX2-NEXT: vaddss %xmm1, %xmm0, %xmm0
94105 ; AVX2-NEXT: retq
95106 ;
96107 ; AVX512-LABEL: test_v4f32:
97108 ; AVX512: # %bb.0:
98 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm0 = xmm1[1,0]
99 ; AVX512-NEXT: vaddps %xmm0, %xmm1, %xmm0
100 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
109 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
110 ; AVX512-NEXT: vaddps %xmm2, %xmm1, %xmm1
111 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
112 ; AVX512-NEXT: vaddss %xmm2, %xmm1, %xmm1
101113 ; AVX512-NEXT: vaddss %xmm1, %xmm0, %xmm0
102114 ; AVX512-NEXT: retq
103 %1 = call fast float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float %a0, <4 x float> %a1)
115 %1 = call fast float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %a0, <4 x float> %a1)
104116 ret float %1
105117 }
106118
111123 ; SSE2-NEXT: movaps %xmm1, %xmm2
112124 ; SSE2-NEXT: unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
113125 ; SSE2-NEXT: addps %xmm1, %xmm2
114 ; SSE2-NEXT: movaps %xmm2, %xmm0
115 ; SSE2-NEXT: shufps {{.*#+}} xmm0 = xmm0[1,1],xmm2[2,3]
116 ; SSE2-NEXT: addss %xmm2, %xmm0
126 ; SSE2-NEXT: movaps %xmm2, %xmm1
127 ; SSE2-NEXT: shufps {{.*#+}} xmm1 = xmm1[1,1],xmm2[2,3]
128 ; SSE2-NEXT: addss %xmm2, %xmm1
129 ; SSE2-NEXT: addss %xmm1, %xmm0
117130 ; SSE2-NEXT: retq
118131 ;
119132 ; SSE41-LABEL: test_v8f32:
122135 ; SSE41-NEXT: movaps %xmm1, %xmm2
123136 ; SSE41-NEXT: unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
124137 ; SSE41-NEXT: addps %xmm1, %xmm2
125 ; SSE41-NEXT: movshdup {{.*#+}} xmm0 = xmm2[1,1,3,3]
126 ; SSE41-NEXT: addss %xmm2, %xmm0
138 ; SSE41-NEXT: movshdup {{.*#+}} xmm1 = xmm2[1,1,3,3]
139 ; SSE41-NEXT: addss %xmm2, %xmm1
140 ; SSE41-NEXT: addss %xmm1, %xmm0
127141 ; SSE41-NEXT: retq
128142 ;
129143 ; AVX1-SLOW-LABEL: test_v8f32:
130144 ; AVX1-SLOW: # %bb.0:
131 ; AVX1-SLOW-NEXT: vextractf128 $1, %ymm1, %xmm0
132 ; AVX1-SLOW-NEXT: vaddps %xmm0, %xmm1, %xmm0
133 ; AVX1-SLOW-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
134 ; AVX1-SLOW-NEXT: vaddps %xmm1, %xmm0, %xmm0
135 ; AVX1-SLOW-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
145 ; AVX1-SLOW-NEXT: vextractf128 $1, %ymm1, %xmm2
146 ; AVX1-SLOW-NEXT: vaddps %xmm2, %xmm1, %xmm1
147 ; AVX1-SLOW-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
148 ; AVX1-SLOW-NEXT: vaddps %xmm2, %xmm1, %xmm1
149 ; AVX1-SLOW-NEXT: vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
150 ; AVX1-SLOW-NEXT: vaddss %xmm2, %xmm1, %xmm1
136151 ; AVX1-SLOW-NEXT: vaddss %xmm1, %xmm0, %xmm0
137152 ; AVX1-SLOW-NEXT: vzeroupper
138153 ; AVX1-SLOW-NEXT: retq
139154 ;
140155 ; AVX1-FAST-LABEL: test_v8f32:
141156 ; AVX1-FAST: # %bb.0:
142 ; AVX1-FAST-NEXT: vextractf128 $1, %ymm1, %xmm0
143 ; AVX1-FAST-NEXT: vaddps %xmm0, %xmm1, %xmm0
144 ; AVX1-FAST-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
145 ; AVX1-FAST-NEXT: vaddps %xmm1, %xmm0, %xmm0
146 ; AVX1-FAST-NEXT: vhaddps %xmm0, %xmm0, %xmm0
157 ; AVX1-FAST-NEXT: vextractf128 $1, %ymm1, %xmm2
158 ; AVX1-FAST-NEXT: vaddps %xmm2, %xmm1, %xmm1
159 ; AVX1-FAST-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
160 ; AVX1-FAST-NEXT: vaddps %xmm2, %xmm1, %xmm1
161 ; AVX1-FAST-NEXT: vhaddps %xmm1, %xmm1, %xmm1
162 ; AVX1-FAST-NEXT: vaddss %xmm1, %xmm0, %xmm0
147163 ; AVX1-FAST-NEXT: vzeroupper
148164 ; AVX1-FAST-NEXT: retq
149165 ;
150166 ; AVX2-LABEL: test_v8f32:
151167 ; AVX2: # %bb.0:
152 ; AVX2-NEXT: vextractf128 $1, %ymm1, %xmm0
153 ; AVX2-NEXT: vaddps %xmm0, %xmm1, %xmm0
154 ; AVX2-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
155 ; AVX2-NEXT: vaddps %xmm1, %xmm0, %xmm0
156 ; AVX2-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
168 ; AVX2-NEXT: vextractf128 $1, %ymm1, %xmm2
169 ; AVX2-NEXT: vaddps %xmm2, %xmm1, %xmm1
170 ; AVX2-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
171 ; AVX2-NEXT: vaddps %xmm2, %xmm1, %xmm1
172 ; AVX2-NEXT: vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
173 ; AVX2-NEXT: vaddss %xmm2, %xmm1, %xmm1
157174 ; AVX2-NEXT: vaddss %xmm1, %xmm0, %xmm0
158175 ; AVX2-NEXT: vzeroupper
159176 ; AVX2-NEXT: retq
160177 ;
161178 ; AVX512-LABEL: test_v8f32:
162179 ; AVX512: # %bb.0:
163 ; AVX512-NEXT: vextractf128 $1, %ymm1, %xmm0
164 ; AVX512-NEXT: vaddps %xmm0, %xmm1, %xmm0
165 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
166 ; AVX512-NEXT: vaddps %xmm1, %xmm0, %xmm0
167 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
180 ; AVX512-NEXT: vextractf128 $1, %ymm1, %xmm2
181 ; AVX512-NEXT: vaddps %xmm2, %xmm1, %xmm1
182 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
183 ; AVX512-NEXT: vaddps %xmm2, %xmm1, %xmm1
184 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
185 ; AVX512-NEXT: vaddss %xmm2, %xmm1, %xmm1
168186 ; AVX512-NEXT: vaddss %xmm1, %xmm0, %xmm0
169187 ; AVX512-NEXT: vzeroupper
170188 ; AVX512-NEXT: retq
171 %1 = call fast float @llvm.experimental.vector.reduce.fadd.f32.v8f32(float %a0, <8 x float> %a1)
189 %1 = call fast float @llvm.experimental.vector.reduce.v2.fadd.f32.v8f32(float %a0, <8 x float> %a1)
172190 ret float %1
173191 }
174192
181199 ; SSE2-NEXT: movaps %xmm1, %xmm2
182200 ; SSE2-NEXT: unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
183201 ; SSE2-NEXT: addps %xmm1, %xmm2
184 ; SSE2-NEXT: movaps %xmm2, %xmm0
185 ; SSE2-NEXT: shufps {{.*#+}} xmm0 = xmm0[1,1],xmm2[2,3]
186 ; SSE2-NEXT: addss %xmm2, %xmm0
202 ; SSE2-NEXT: movaps %xmm2, %xmm1
203 ; SSE2-NEXT: shufps {{.*#+}} xmm1 = xmm1[1,1],xmm2[2,3]
204 ; SSE2-NEXT: addss %xmm2, %xmm1
205 ; SSE2-NEXT: addss %xmm1, %xmm0
187206 ; SSE2-NEXT: retq
188207 ;
189208 ; SSE41-LABEL: test_v16f32:
194213 ; SSE41-NEXT: movaps %xmm1, %xmm2
195214 ; SSE41-NEXT: unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
196215 ; SSE41-NEXT: addps %xmm1, %xmm2
197 ; SSE41-NEXT: movshdup {{.*#+}} xmm0 = xmm2[1,1,3,3]
198 ; SSE41-NEXT: addss %xmm2, %xmm0
216 ; SSE41-NEXT: movshdup {{.*#+}} xmm1 = xmm2[1,1,3,3]
217 ; SSE41-NEXT: addss %xmm2, %xmm1
218 ; SSE41-NEXT: addss %xmm1, %xmm0
199219 ; SSE41-NEXT: retq
200220 ;
201221 ; AVX1-SLOW-LABEL: test_v16f32:
202222 ; AVX1-SLOW: # %bb.0:
203 ; AVX1-SLOW-NEXT: vaddps %ymm2, %ymm1, %ymm0
204 ; AVX1-SLOW-NEXT: vextractf128 $1, %ymm0, %xmm1
205 ; AVX1-SLOW-NEXT: vaddps %xmm1, %xmm0, %xmm0
206 ; AVX1-SLOW-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
207 ; AVX1-SLOW-NEXT: vaddps %xmm1, %xmm0, %xmm0
208 ; AVX1-SLOW-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
223 ; AVX1-SLOW-NEXT: vaddps %ymm2, %ymm1, %ymm1
224 ; AVX1-SLOW-NEXT: vextractf128 $1, %ymm1, %xmm2
225 ; AVX1-SLOW-NEXT: vaddps %xmm2, %xmm1, %xmm1
226 ; AVX1-SLOW-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
227 ; AVX1-SLOW-NEXT: vaddps %xmm2, %xmm1, %xmm1
228 ; AVX1-SLOW-NEXT: vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
229 ; AVX1-SLOW-NEXT: vaddss %xmm2, %xmm1, %xmm1
209230 ; AVX1-SLOW-NEXT: vaddss %xmm1, %xmm0, %xmm0
210231 ; AVX1-SLOW-NEXT: vzeroupper
211232 ; AVX1-SLOW-NEXT: retq
212233 ;
213234 ; AVX1-FAST-LABEL: test_v16f32:
214235 ; AVX1-FAST: # %bb.0:
215 ; AVX1-FAST-NEXT: vaddps %ymm2, %ymm1, %ymm0
216 ; AVX1-FAST-NEXT: vextractf128 $1, %ymm0, %xmm1
217 ; AVX1-FAST-NEXT: vaddps %xmm1, %xmm0, %xmm0
218 ; AVX1-FAST-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
219 ; AVX1-FAST-NEXT: vaddps %xmm1, %xmm0, %xmm0
220 ; AVX1-FAST-NEXT: vhaddps %xmm0, %xmm0, %xmm0
236 ; AVX1-FAST-NEXT: vaddps %ymm2, %ymm1, %ymm1
237 ; AVX1-FAST-NEXT: vextractf128 $1, %ymm1, %xmm2
238 ; AVX1-FAST-NEXT: vaddps %xmm2, %xmm1, %xmm1
239 ; AVX1-FAST-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
240 ; AVX1-FAST-NEXT: vaddps %xmm2, %xmm1, %xmm1
241 ; AVX1-FAST-NEXT: vhaddps %xmm1, %xmm1, %xmm1
242 ; AVX1-FAST-NEXT: vaddss %xmm1, %xmm0, %xmm0
221243 ; AVX1-FAST-NEXT: vzeroupper
222244 ; AVX1-FAST-NEXT: retq
223245 ;
224246 ; AVX2-LABEL: test_v16f32:
225247 ; AVX2: # %bb.0:
226 ; AVX2-NEXT: vaddps %ymm2, %ymm1, %ymm0
227 ; AVX2-NEXT: vextractf128 $1, %ymm0, %xmm1
228 ; AVX2-NEXT: vaddps %xmm1, %xmm0, %xmm0
229 ; AVX2-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
230 ; AVX2-NEXT: vaddps %xmm1, %xmm0, %xmm0
231 ; AVX2-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
248 ; AVX2-NEXT: vaddps %ymm2, %ymm1, %ymm1
249 ; AVX2-NEXT: vextractf128 $1, %ymm1, %xmm2
250 ; AVX2-NEXT: vaddps %xmm2, %xmm1, %xmm1
251 ; AVX2-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
252 ; AVX2-NEXT: vaddps %xmm2, %xmm1, %xmm1
253 ; AVX2-NEXT: vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
254 ; AVX2-NEXT: vaddss %xmm2, %xmm1, %xmm1
232255 ; AVX2-NEXT: vaddss %xmm1, %xmm0, %xmm0
233256 ; AVX2-NEXT: vzeroupper
234257 ; AVX2-NEXT: retq
235258 ;
236259 ; AVX512-LABEL: test_v16f32:
237260 ; AVX512: # %bb.0:
238 ; AVX512-NEXT: vextractf64x4 $1, %zmm1, %ymm0
239 ; AVX512-NEXT: vaddps %zmm0, %zmm1, %zmm0
240 ; AVX512-NEXT: vextractf128 $1, %ymm0, %xmm1
241 ; AVX512-NEXT: vaddps %xmm1, %xmm0, %xmm0
242 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
243 ; AVX512-NEXT: vaddps %xmm1, %xmm0, %xmm0
244 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
261 ; AVX512-NEXT: vextractf64x4 $1, %zmm1, %ymm2
262 ; AVX512-NEXT: vaddps %zmm2, %zmm1, %zmm1
263 ; AVX512-NEXT: vextractf128 $1, %ymm1, %xmm2
264 ; AVX512-NEXT: vaddps %xmm2, %xmm1, %xmm1
265 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
266 ; AVX512-NEXT: vaddps %xmm2, %xmm1, %xmm1
267 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
268 ; AVX512-NEXT: vaddss %xmm2, %xmm1, %xmm1
245269 ; AVX512-NEXT: vaddss %xmm1, %xmm0, %xmm0
246270 ; AVX512-NEXT: vzeroupper
247271 ; AVX512-NEXT: retq
248 %1 = call fast float @llvm.experimental.vector.reduce.fadd.f32.v16f32(float %a0, <16 x float> %a1)
272 %1 = call fast float @llvm.experimental.vector.reduce.v2.fadd.f32.v16f32(float %a0, <16 x float> %a1)
249273 ret float %1
250274 }
251275
290314 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
291315 ; AVX512-NEXT: vaddss %xmm1, %xmm0, %xmm0
292316 ; AVX512-NEXT: retq
293 %1 = call fast float @llvm.experimental.vector.reduce.fadd.f32.v2f32(float 0.0, <2 x float> %a0)
317 %1 = call fast float @llvm.experimental.vector.reduce.v2.fadd.f32.v2f32(float 0.0, <2 x float> %a0)
294318 ret float %1
295319 }
296320
345369 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
346370 ; AVX512-NEXT: vaddss %xmm1, %xmm0, %xmm0
347371 ; AVX512-NEXT: retq
348 %1 = call fast float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float 0.0, <4 x float> %a0)
372 %1 = call fast float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float 0.0, <4 x float> %a0)
349373 ret float %1
350374 }
351375
414438 ; AVX512-NEXT: vaddss %xmm1, %xmm0, %xmm0
415439 ; AVX512-NEXT: vzeroupper
416440 ; AVX512-NEXT: retq
417 %1 = call fast float @llvm.experimental.vector.reduce.fadd.f32.v8f32(float 0.0, <8 x float> %a0)
441 %1 = call fast float @llvm.experimental.vector.reduce.v2.fadd.f32.v8f32(float 0.0, <8 x float> %a0)
418442 ret float %1
419443 }
420444
492516 ; AVX512-NEXT: vaddss %xmm1, %xmm0, %xmm0
493517 ; AVX512-NEXT: vzeroupper
494518 ; AVX512-NEXT: retq
495 %1 = call fast float @llvm.experimental.vector.reduce.fadd.f32.v16f32(float 0.0, <16 x float> %a0)
519 %1 = call fast float @llvm.experimental.vector.reduce.v2.fadd.f32.v16f32(float 0.0, <16 x float> %a0)
496520 ret float %1
497521 }
498522
537561 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
538562 ; AVX512-NEXT: vaddss %xmm1, %xmm0, %xmm0
539563 ; AVX512-NEXT: retq
540 %1 = call fast float @llvm.experimental.vector.reduce.fadd.f32.v2f32(float undef, <2 x float> %a0)
564 %1 = call fast float @llvm.experimental.vector.reduce.v2.fadd.f32.v2f32(float 0.0, <2 x float> %a0)
541565 ret float %1
542566 }
543567
592616 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
593617 ; AVX512-NEXT: vaddss %xmm1, %xmm0, %xmm0
594618 ; AVX512-NEXT: retq
595 %1 = call fast float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float undef, <4 x float> %a0)
619 %1 = call fast float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float 0.0, <4 x float> %a0)
596620 ret float %1
597621 }
598622
661685 ; AVX512-NEXT: vaddss %xmm1, %xmm0, %xmm0
662686 ; AVX512-NEXT: vzeroupper
663687 ; AVX512-NEXT: retq
664 %1 = call fast float @llvm.experimental.vector.reduce.fadd.f32.v8f32(float undef, <8 x float> %a0)
688 %1 = call fast float @llvm.experimental.vector.reduce.v2.fadd.f32.v8f32(float 0.0, <8 x float> %a0)
665689 ret float %1
666690 }
667691
739763 ; AVX512-NEXT: vaddss %xmm1, %xmm0, %xmm0
740764 ; AVX512-NEXT: vzeroupper
741765 ; AVX512-NEXT: retq
742 %1 = call fast float @llvm.experimental.vector.reduce.fadd.f32.v16f32(float undef, <16 x float> %a0)
766 %1 = call fast float @llvm.experimental.vector.reduce.v2.fadd.f32.v16f32(float 0.0, <16 x float> %a0)
743767 ret float %1
744768 }
745769
750774 define double @test_v2f64(double %a0, <2 x double> %a1) {
751775 ; SSE-LABEL: test_v2f64:
752776 ; SSE: # %bb.0:
753 ; SSE-NEXT: movapd %xmm1, %xmm0
754 ; SSE-NEXT: unpckhpd {{.*#+}} xmm0 = xmm0[1],xmm1[1]
755 ; SSE-NEXT: addsd %xmm1, %xmm0
777 ; SSE-NEXT: movapd %xmm1, %xmm2
778 ; SSE-NEXT: unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
779 ; SSE-NEXT: addsd %xmm1, %xmm2
780 ; SSE-NEXT: addsd %xmm2, %xmm0
756781 ; SSE-NEXT: retq
757782 ;
758783 ; AVX1-SLOW-LABEL: test_v2f64:
759784 ; AVX1-SLOW: # %bb.0:
760 ; AVX1-SLOW-NEXT: vpermilpd {{.*#+}} xmm0 = xmm1[1,0]
761 ; AVX1-SLOW-NEXT: vaddsd %xmm0, %xmm1, %xmm0
785 ; AVX1-SLOW-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
786 ; AVX1-SLOW-NEXT: vaddsd %xmm2, %xmm1, %xmm1
787 ; AVX1-SLOW-NEXT: vaddsd %xmm1, %xmm0, %xmm0
762788 ; AVX1-SLOW-NEXT: retq
763789 ;
764790 ; AVX1-FAST-LABEL: test_v2f64:
765791 ; AVX1-FAST: # %bb.0:
766 ; AVX1-FAST-NEXT: vhaddpd %xmm1, %xmm1, %xmm0
792 ; AVX1-FAST-NEXT: vhaddpd %xmm1, %xmm1, %xmm1
793 ; AVX1-FAST-NEXT: vaddsd %xmm1, %xmm0, %xmm0
767794 ; AVX1-FAST-NEXT: retq
768795 ;
769796 ; AVX2-LABEL: test_v2f64:
770797 ; AVX2: # %bb.0:
771 ; AVX2-NEXT: vpermilpd {{.*#+}} xmm0 = xmm1[1,0]
772 ; AVX2-NEXT: vaddsd %xmm0, %xmm1, %xmm0
798 ; AVX2-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
799 ; AVX2-NEXT: vaddsd %xmm2, %xmm1, %xmm1
800 ; AVX2-NEXT: vaddsd %xmm1, %xmm0, %xmm0
773801 ; AVX2-NEXT: retq
774802 ;
775803 ; AVX512-LABEL: test_v2f64:
776804 ; AVX512: # %bb.0:
777 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm0 = xmm1[1,0]
778 ; AVX512-NEXT: vaddsd %xmm0, %xmm1, %xmm0
779 ; AVX512-NEXT: retq
780 %1 = call fast double @llvm.experimental.vector.reduce.fadd.f64.v2f64(double %a0, <2 x double> %a1)
805 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
806 ; AVX512-NEXT: vaddsd %xmm2, %xmm1, %xmm1
807 ; AVX512-NEXT: vaddsd %xmm1, %xmm0, %xmm0
808 ; AVX512-NEXT: retq
809 %1 = call fast double @llvm.experimental.vector.reduce.v2.fadd.f64.v2f64(double %a0, <2 x double> %a1)
781810 ret double %1
782811 }
783812
785814 ; SSE-LABEL: test_v4f64:
786815 ; SSE: # %bb.0:
787816 ; SSE-NEXT: addpd %xmm2, %xmm1
788 ; SSE-NEXT: movapd %xmm1, %xmm0
789 ; SSE-NEXT: unpckhpd {{.*#+}} xmm0 = xmm0[1],xmm1[1]
790 ; SSE-NEXT: addsd %xmm1, %xmm0
817 ; SSE-NEXT: movapd %xmm1, %xmm2
818 ; SSE-NEXT: unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
819 ; SSE-NEXT: addsd %xmm1, %xmm2
820 ; SSE-NEXT: addsd %xmm2, %xmm0
791821 ; SSE-NEXT: retq
792822 ;
793823 ; AVX1-SLOW-LABEL: test_v4f64:
794824 ; AVX1-SLOW: # %bb.0:
795 ; AVX1-SLOW-NEXT: vextractf128 $1, %ymm1, %xmm0
796 ; AVX1-SLOW-NEXT: vaddpd %xmm0, %xmm1, %xmm0
797 ; AVX1-SLOW-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
825 ; AVX1-SLOW-NEXT: vextractf128 $1, %ymm1, %xmm2
826 ; AVX1-SLOW-NEXT: vaddpd %xmm2, %xmm1, %xmm1
827 ; AVX1-SLOW-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
828 ; AVX1-SLOW-NEXT: vaddsd %xmm2, %xmm1, %xmm1
798829 ; AVX1-SLOW-NEXT: vaddsd %xmm1, %xmm0, %xmm0
799830 ; AVX1-SLOW-NEXT: vzeroupper
800831 ; AVX1-SLOW-NEXT: retq
801832 ;
802833 ; AVX1-FAST-LABEL: test_v4f64:
803834 ; AVX1-FAST: # %bb.0:
804 ; AVX1-FAST-NEXT: vextractf128 $1, %ymm1, %xmm0
805 ; AVX1-FAST-NEXT: vaddpd %xmm0, %xmm1, %xmm0
806 ; AVX1-FAST-NEXT: vhaddpd %xmm0, %xmm0, %xmm0
835 ; AVX1-FAST-NEXT: vextractf128 $1, %ymm1, %xmm2
836 ; AVX1-FAST-NEXT: vaddpd %xmm2, %xmm1, %xmm1
837 ; AVX1-FAST-NEXT: vhaddpd %xmm1, %xmm1, %xmm1
838 ; AVX1-FAST-NEXT: vaddsd %xmm1, %xmm0, %xmm0
807839 ; AVX1-FAST-NEXT: vzeroupper
808840 ; AVX1-FAST-NEXT: retq
809841 ;
810842 ; AVX2-LABEL: test_v4f64:
811843 ; AVX2: # %bb.0:
812 ; AVX2-NEXT: vextractf128 $1, %ymm1, %xmm0
813 ; AVX2-NEXT: vaddpd %xmm0, %xmm1, %xmm0
814 ; AVX2-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
844 ; AVX2-NEXT: vextractf128 $1, %ymm1, %xmm2
845 ; AVX2-NEXT: vaddpd %xmm2, %xmm1, %xmm1
846 ; AVX2-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
847 ; AVX2-NEXT: vaddsd %xmm2, %xmm1, %xmm1
815848 ; AVX2-NEXT: vaddsd %xmm1, %xmm0, %xmm0
816849 ; AVX2-NEXT: vzeroupper
817850 ; AVX2-NEXT: retq
818851 ;
819852 ; AVX512-LABEL: test_v4f64:
820853 ; AVX512: # %bb.0:
821 ; AVX512-NEXT: vextractf128 $1, %ymm1, %xmm0
822 ; AVX512-NEXT: vaddpd %xmm0, %xmm1, %xmm0
823 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
854 ; AVX512-NEXT: vextractf128 $1, %ymm1, %xmm2
855 ; AVX512-NEXT: vaddpd %xmm2, %xmm1, %xmm1
856 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
857 ; AVX512-NEXT: vaddsd %xmm2, %xmm1, %xmm1
824858 ; AVX512-NEXT: vaddsd %xmm1, %xmm0, %xmm0
825859 ; AVX512-NEXT: vzeroupper
826860 ; AVX512-NEXT: retq
827 %1 = call fast double @llvm.experimental.vector.reduce.fadd.f64.v4f64(double %a0, <4 x double> %a1)
861 %1 = call fast double @llvm.experimental.vector.reduce.v2.fadd.f64.v4f64(double %a0, <4 x double> %a1)
828862 ret double %1
829863 }
830864
834868 ; SSE-NEXT: addpd %xmm4, %xmm2
835869 ; SSE-NEXT: addpd %xmm3, %xmm1
836870 ; SSE-NEXT: addpd %xmm2, %xmm1
837 ; SSE-NEXT: movapd %xmm1, %xmm0
838 ; SSE-NEXT: unpckhpd {{.*#+}} xmm0 = xmm0[1],xmm1[1]
839 ; SSE-NEXT: addsd %xmm1, %xmm0
871 ; SSE-NEXT: movapd %xmm1, %xmm2
872 ; SSE-NEXT: unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
873 ; SSE-NEXT: addsd %xmm1, %xmm2
874 ; SSE-NEXT: addsd %xmm2, %xmm0
840875 ; SSE-NEXT: retq
841876 ;
842877 ; AVX1-SLOW-LABEL: test_v8f64:
843878 ; AVX1-SLOW: # %bb.0:
844 ; AVX1-SLOW-NEXT: vaddpd %ymm2, %ymm1, %ymm0
845 ; AVX1-SLOW-NEXT: vextractf128 $1, %ymm0, %xmm1
846 ; AVX1-SLOW-NEXT: vaddpd %xmm1, %xmm0, %xmm0
847 ; AVX1-SLOW-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
879 ; AVX1-SLOW-NEXT: vaddpd %ymm2, %ymm1, %ymm1
880 ; AVX1-SLOW-NEXT: vextractf128 $1, %ymm1, %xmm2
881 ; AVX1-SLOW-NEXT: vaddpd %xmm2, %xmm1, %xmm1
882 ; AVX1-SLOW-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
883 ; AVX1-SLOW-NEXT: vaddsd %xmm2, %xmm1, %xmm1
848884 ; AVX1-SLOW-NEXT: vaddsd %xmm1, %xmm0, %xmm0
849885 ; AVX1-SLOW-NEXT: vzeroupper
850886 ; AVX1-SLOW-NEXT: retq
851887 ;
852888 ; AVX1-FAST-LABEL: test_v8f64:
853889 ; AVX1-FAST: # %bb.0:
854 ; AVX1-FAST-NEXT: vaddpd %ymm2, %ymm1, %ymm0
855 ; AVX1-FAST-NEXT: vextractf128 $1, %ymm0, %xmm1
856 ; AVX1-FAST-NEXT: vaddpd %xmm1, %xmm0, %xmm0
857 ; AVX1-FAST-NEXT: vhaddpd %xmm0, %xmm0, %xmm0
890 ; AVX1-FAST-NEXT: vaddpd %ymm2, %ymm1, %ymm1
891 ; AVX1-FAST-NEXT: vextractf128 $1, %ymm1, %xmm2
892 ; AVX1-FAST-NEXT: vaddpd %xmm2, %xmm1, %xmm1
893 ; AVX1-FAST-NEXT: vhaddpd %xmm1, %xmm1, %xmm1
894 ; AVX1-FAST-NEXT: vaddsd %xmm1, %xmm0, %xmm0
858895 ; AVX1-FAST-NEXT: vzeroupper
859896 ; AVX1-FAST-NEXT: retq
860897 ;
861898 ; AVX2-LABEL: test_v8f64:
862899 ; AVX2: # %bb.0:
863 ; AVX2-NEXT: vaddpd %ymm2, %ymm1, %ymm0
864 ; AVX2-NEXT: vextractf128 $1, %ymm0, %xmm1
865 ; AVX2-NEXT: vaddpd %xmm1, %xmm0, %xmm0
866 ; AVX2-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
900 ; AVX2-NEXT: vaddpd %ymm2, %ymm1, %ymm1
901 ; AVX2-NEXT: vextractf128 $1, %ymm1, %xmm2
902 ; AVX2-NEXT: vaddpd %xmm2, %xmm1, %xmm1
903 ; AVX2-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
904 ; AVX2-NEXT: vaddsd %xmm2, %xmm1, %xmm1
867905 ; AVX2-NEXT: vaddsd %xmm1, %xmm0, %xmm0
868906 ; AVX2-NEXT: vzeroupper
869907 ; AVX2-NEXT: retq
870908 ;
871909 ; AVX512-LABEL: test_v8f64:
872910 ; AVX512: # %bb.0:
873 ; AVX512-NEXT: vextractf64x4 $1, %zmm1, %ymm0
874 ; AVX512-NEXT: vaddpd %zmm0, %zmm1, %zmm0
875 ; AVX512-NEXT: vextractf128 $1, %ymm0, %xmm1
876 ; AVX512-NEXT: vaddpd %xmm1, %xmm0, %xmm0
877 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
911 ; AVX512-NEXT: vextractf64x4 $1, %zmm1, %ymm2
912 ; AVX512-NEXT: vaddpd %zmm2, %zmm1, %zmm1
913 ; AVX512-NEXT: vextractf128 $1, %ymm1, %xmm2
914 ; AVX512-NEXT: vaddpd %xmm2, %xmm1, %xmm1
915 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
916 ; AVX512-NEXT: vaddsd %xmm2, %xmm1, %xmm1
878917 ; AVX512-NEXT: vaddsd %xmm1, %xmm0, %xmm0
879918 ; AVX512-NEXT: vzeroupper
880919 ; AVX512-NEXT: retq
881 %1 = call fast double @llvm.experimental.vector.reduce.fadd.f64.v8f64(double %a0, <8 x double> %a1)
920 %1 = call fast double @llvm.experimental.vector.reduce.v2.fadd.f64.v8f64(double %a0, <8 x double> %a1)
882921 ret double %1
883922 }
884923
892931 ; SSE-NEXT: addpd {{[0-9]+}}(%rsp), %xmm4
893932 ; SSE-NEXT: addpd %xmm2, %xmm4
894933 ; SSE-NEXT: addpd %xmm1, %xmm4
895 ; SSE-NEXT: movapd %xmm4, %xmm0
896 ; SSE-NEXT: unpckhpd {{.*#+}} xmm0 = xmm0[1],xmm4[1]
897 ; SSE-NEXT: addsd %xmm4, %xmm0
934 ; SSE-NEXT: movapd %xmm4, %xmm1
935 ; SSE-NEXT: unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm4[1]
936 ; SSE-NEXT: addsd %xmm4, %xmm1
937 ; SSE-NEXT: addsd %xmm1, %xmm0
898938 ; SSE-NEXT: retq
899939 ;
900940 ; AVX1-SLOW-LABEL: test_v16f64:
901941 ; AVX1-SLOW: # %bb.0:
902 ; AVX1-SLOW-NEXT: vaddpd %ymm4, %ymm2, %ymm0
942 ; AVX1-SLOW-NEXT: vaddpd %ymm4, %ymm2, %ymm2
903943 ; AVX1-SLOW-NEXT: vaddpd %ymm3, %ymm1, %ymm1
904 ; AVX1-SLOW-NEXT: vaddpd %ymm0, %ymm1, %ymm0
905 ; AVX1-SLOW-NEXT: vextractf128 $1, %ymm0, %xmm1
906 ; AVX1-SLOW-NEXT: vaddpd %xmm1, %xmm0, %xmm0
907 ; AVX1-SLOW-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
944 ; AVX1-SLOW-NEXT: vaddpd %ymm2, %ymm1, %ymm1
945 ; AVX1-SLOW-NEXT: vextractf128 $1, %ymm1, %xmm2
946 ; AVX1-SLOW-NEXT: vaddpd %xmm2, %xmm1, %xmm1
947 ; AVX1-SLOW-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
948 ; AVX1-SLOW-NEXT: vaddsd %xmm2, %xmm1, %xmm1
908949 ; AVX1-SLOW-NEXT: vaddsd %xmm1, %xmm0, %xmm0
909950 ; AVX1-SLOW-NEXT: vzeroupper
910951 ; AVX1-SLOW-NEXT: retq
911952 ;
912953 ; AVX1-FAST-LABEL: test_v16f64:
913954 ; AVX1-FAST: # %bb.0:
914 ; AVX1-FAST-NEXT: vaddpd %ymm4, %ymm2, %ymm0
955 ; AVX1-FAST-NEXT: vaddpd %ymm4, %ymm2, %ymm2
915956 ; AVX1-FAST-NEXT: vaddpd %ymm3, %ymm1, %ymm1
916 ; AVX1-FAST-NEXT: vaddpd %ymm0, %ymm1, %ymm0
917 ; AVX1-FAST-NEXT: vextractf128 $1, %ymm0, %xmm1
918 ; AVX1-FAST-NEXT: vaddpd %xmm1, %xmm0, %xmm0
919 ; AVX1-FAST-NEXT: vhaddpd %xmm0, %xmm0, %xmm0
957 ; AVX1-FAST-NEXT: vaddpd %ymm2, %ymm1, %ymm1
958 ; AVX1-FAST-NEXT: vextractf128 $1, %ymm1, %xmm2
959 ; AVX1-FAST-NEXT: vaddpd %xmm2, %xmm1, %xmm1
960 ; AVX1-FAST-NEXT: vhaddpd %xmm1, %xmm1, %xmm1
961 ; AVX1-FAST-NEXT: vaddsd %xmm1, %xmm0, %xmm0
920962 ; AVX1-FAST-NEXT: vzeroupper
921963 ; AVX1-FAST-NEXT: retq
922964 ;
923965 ; AVX2-LABEL: test_v16f64:
924966 ; AVX2: # %bb.0:
925 ; AVX2-NEXT: vaddpd %ymm4, %ymm2, %ymm0
967 ; AVX2-NEXT: vaddpd %ymm4, %ymm2, %ymm2
926968 ; AVX2-NEXT: vaddpd %ymm3, %ymm1, %ymm1
927 ; AVX2-NEXT: vaddpd %ymm0, %ymm1, %ymm0
928 ; AVX2-NEXT: vextractf128 $1, %ymm0, %xmm1
929 ; AVX2-NEXT: vaddpd %xmm1, %xmm0, %xmm0
930 ; AVX2-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
969 ; AVX2-NEXT: vaddpd %ymm2, %ymm1, %ymm1
970 ; AVX2-NEXT: vextractf128 $1, %ymm1, %xmm2
971 ; AVX2-NEXT: vaddpd %xmm2, %xmm1, %xmm1
972 ; AVX2-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
973 ; AVX2-NEXT: vaddsd %xmm2, %xmm1, %xmm1
931974 ; AVX2-NEXT: vaddsd %xmm1, %xmm0, %xmm0
932975 ; AVX2-NEXT: vzeroupper
933976 ; AVX2-NEXT: retq
934977 ;
935978 ; AVX512-LABEL: test_v16f64:
936979 ; AVX512: # %bb.0:
937 ; AVX512-NEXT: vaddpd %zmm2, %zmm1, %zmm0
938 ; AVX512-NEXT: vextractf64x4 $1, %zmm0, %ymm1
939 ; AVX512-NEXT: vaddpd %zmm1, %zmm0, %zmm0
940 ; AVX512-NEXT: vextractf128 $1, %ymm0, %xmm1
941 ; AVX512-NEXT: vaddpd %xmm1, %xmm0, %xmm0
942 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
980 ; AVX512-NEXT: vaddpd %zmm2, %zmm1, %zmm1
981 ; AVX512-NEXT: vextractf64x4 $1, %zmm1, %ymm2
982 ; AVX512-NEXT: vaddpd %zmm2, %zmm1, %zmm1
983 ; AVX512-NEXT: vextractf128 $1, %ymm1, %xmm2
984 ; AVX512-NEXT: vaddpd %xmm2, %xmm1, %xmm1
985 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
986 ; AVX512-NEXT: vaddsd %xmm2, %xmm1, %xmm1
943987 ; AVX512-NEXT: vaddsd %xmm1, %xmm0, %xmm0
944988 ; AVX512-NEXT: vzeroupper
945989 ; AVX512-NEXT: retq
946 %1 = call fast double @llvm.experimental.vector.reduce.fadd.f64.v16f64(double %a0, <16 x double> %a1)
990 %1 = call fast double @llvm.experimental.vector.reduce.v2.fadd.f64.v16f64(double %a0, <16 x double> %a1)
947991 ret double %1
948992 }
949993
9821026 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
9831027 ; AVX512-NEXT: vaddsd %xmm1, %xmm0, %xmm0
9841028 ; AVX512-NEXT: retq
985 %1 = call fast double @llvm.experimental.vector.reduce.fadd.f64.v2f64(double 0.0, <2 x double> %a0)
1029 %1 = call fast double @llvm.experimental.vector.reduce.v2.fadd.f64.v2f64(double 0.0, <2 x double> %a0)
9861030 ret double %1
9871031 }
9881032
10301074 ; AVX512-NEXT: vaddsd %xmm1, %xmm0, %xmm0
10311075 ; AVX512-NEXT: vzeroupper
10321076 ; AVX512-NEXT: retq
1033 %1 = call fast double @llvm.experimental.vector.reduce.fadd.f64.v4f64(double 0.0, <4 x double> %a0)
1077 %1 = call fast double @llvm.experimental.vector.reduce.v2.fadd.f64.v4f64(double 0.0, <4 x double> %a0)
10341078 ret double %1
10351079 }
10361080
10851129 ; AVX512-NEXT: vaddsd %xmm1, %xmm0, %xmm0
10861130 ; AVX512-NEXT: vzeroupper
10871131 ; AVX512-NEXT: retq
1088 %1 = call fast double @llvm.experimental.vector.reduce.fadd.f64.v8f64(double 0.0, <8 x double> %a0)
1132 %1 = call fast double @llvm.experimental.vector.reduce.v2.fadd.f64.v8f64(double 0.0, <8 x double> %a0)
10891133 ret double %1
10901134 }
10911135
11501194 ; AVX512-NEXT: vaddsd %xmm1, %xmm0, %xmm0
11511195 ; AVX512-NEXT: vzeroupper
11521196 ; AVX512-NEXT: retq
1153 %1 = call fast double @llvm.experimental.vector.reduce.fadd.f64.v16f64(double 0.0, <16 x double> %a0)
1197 %1 = call fast double @llvm.experimental.vector.reduce.v2.fadd.f64.v16f64(double 0.0, <16 x double> %a0)
11541198 ret double %1
11551199 }
11561200
11891233 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
11901234 ; AVX512-NEXT: vaddsd %xmm1, %xmm0, %xmm0
11911235 ; AVX512-NEXT: retq
1192 %1 = call fast double @llvm.experimental.vector.reduce.fadd.f64.v2f64(double undef, <2 x double> %a0)
1236 %1 = call fast double @llvm.experimental.vector.reduce.v2.fadd.f64.v2f64(double 0.0, <2 x double> %a0)
11931237 ret double %1
11941238 }
11951239
12371281 ; AVX512-NEXT: vaddsd %xmm1, %xmm0, %xmm0
12381282 ; AVX512-NEXT: vzeroupper
12391283 ; AVX512-NEXT: retq
1240 %1 = call fast double @llvm.experimental.vector.reduce.fadd.f64.v4f64(double undef, <4 x double> %a0)
1284 %1 = call fast double @llvm.experimental.vector.reduce.v2.fadd.f64.v4f64(double 0.0, <4 x double> %a0)
12411285 ret double %1
12421286 }
12431287
12921336 ; AVX512-NEXT: vaddsd %xmm1, %xmm0, %xmm0
12931337 ; AVX512-NEXT: vzeroupper
12941338 ; AVX512-NEXT: retq
1295 %1 = call fast double @llvm.experimental.vector.reduce.fadd.f64.v8f64(double undef, <8 x double> %a0)
1339 %1 = call fast double @llvm.experimental.vector.reduce.v2.fadd.f64.v8f64(double 0.0, <8 x double> %a0)
12961340 ret double %1
12971341 }
12981342
13571401 ; AVX512-NEXT: vaddsd %xmm1, %xmm0, %xmm0
13581402 ; AVX512-NEXT: vzeroupper
13591403 ; AVX512-NEXT: retq
1360 %1 = call fast double @llvm.experimental.vector.reduce.fadd.f64.v16f64(double undef, <16 x double> %a0)
1404 %1 = call fast double @llvm.experimental.vector.reduce.v2.fadd.f64.v16f64(double 0.0, <16 x double> %a0)
13611405 ret double %1
13621406 }
13631407
1364 declare float @llvm.experimental.vector.reduce.fadd.f32.v2f32(float, <2 x float>)
1365 declare float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float, <4 x float>)
1366 declare float @llvm.experimental.vector.reduce.fadd.f32.v8f32(float, <8 x float>)
1367 declare float @llvm.experimental.vector.reduce.fadd.f32.v16f32(float, <16 x float>)
1368
1369 declare double @llvm.experimental.vector.reduce.fadd.f64.v2f64(double, <2 x double>)
1370 declare double @llvm.experimental.vector.reduce.fadd.f64.v4f64(double, <4 x double>)
1371 declare double @llvm.experimental.vector.reduce.fadd.f64.v8f64(double, <8 x double>)
1372 declare double @llvm.experimental.vector.reduce.fadd.f64.v16f64(double, <16 x double>)
1408 declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v2f32(float, <2 x float>)
1409 declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float, <4 x float>)
1410 declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v8f32(float, <8 x float>)
1411 declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v16f32(float, <16 x float>)
1412
1413 declare double @llvm.experimental.vector.reduce.v2.fadd.f64.v2f64(double, <2 x double>)
1414 declare double @llvm.experimental.vector.reduce.v2.fadd.f64.v4f64(double, <4 x double>)
1415 declare double @llvm.experimental.vector.reduce.v2.fadd.f64.v8f64(double, <8 x double>)
1416 declare double @llvm.experimental.vector.reduce.v2.fadd.f64.v16f64(double, <16 x double>)
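
For context on what the regenerated checks above exercise: the tests now call the v2 form of the fadd reduction with an explicit start value, and even the `fast` variants fold that start value into the result (the extra trailing vaddss/addsd against %xmm0 in the updated assembly). Below is a minimal, hand-written sketch of the two calling styles; it is not part of this commit and the function names are illustrative only.

; Sketch only: ordered vs. fast-math use of the v2 fadd reduction.
declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float, <4 x float>)

define float @sketch_ordered(float %acc, <4 x float> %v) {
  ; No fast-math flags: strictly ordered reduction seeded with %acc.
  %r = call float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %acc, <4 x float> %v)
  ret float %r
}

define float @sketch_fast(float %acc, <4 x float> %v) {
  ; With 'fast' the reassociation is unordered, but %acc is still added in.
  %r = call fast float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %acc, <4 x float> %v)
  ret float %r
}
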
3838 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm1 = xmm1[1,1,3,3]
3939 ; AVX512-NEXT: vaddss %xmm1, %xmm0, %xmm0
4040 ; AVX512-NEXT: retq
41 %1 = call float @llvm.experimental.vector.reduce.fadd.f32.v2f32(float %a0, <2 x float> %a1)
41 %1 = call float @llvm.experimental.vector.reduce.v2.fadd.f32.v2f32(float %a0, <2 x float> %a1)
4242 ret float %1
4343 }
4444
8989 ; AVX512-NEXT: vpermilps {{.*#+}} xmm1 = xmm1[3,1,2,3]
9090 ; AVX512-NEXT: vaddss %xmm1, %xmm0, %xmm0
9191 ; AVX512-NEXT: retq
92 %1 = call float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float %a0, <4 x float> %a1)
92 %1 = call float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %a0, <4 x float> %a1)
9393 ret float %1
9494 }
9595
175175 ; AVX512-NEXT: vaddss %xmm1, %xmm0, %xmm0
176176 ; AVX512-NEXT: vzeroupper
177177 ; AVX512-NEXT: retq
178 %1 = call float @llvm.experimental.vector.reduce.fadd.f32.v8f32(float %a0, <8 x float> %a1)
178 %1 = call float @llvm.experimental.vector.reduce.v2.fadd.f32.v8f32(float %a0, <8 x float> %a1)
179179 ret float %1
180180 }
181181
326326 ; AVX512-NEXT: vaddss %xmm1, %xmm0, %xmm0
327327 ; AVX512-NEXT: vzeroupper
328328 ; AVX512-NEXT: retq
329 %1 = call float @llvm.experimental.vector.reduce.fadd.f32.v16f32(float %a0, <16 x float> %a1)
329 %1 = call float @llvm.experimental.vector.reduce.v2.fadd.f32.v16f32(float %a0, <16 x float> %a1)
330330 ret float %1
331331 }
332332
366366 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm0 = xmm0[1,1,3,3]
367367 ; AVX512-NEXT: vaddss %xmm0, %xmm1, %xmm0
368368 ; AVX512-NEXT: retq
369 %1 = call float @llvm.experimental.vector.reduce.fadd.f32.v2f32(float 0.0, <2 x float> %a0)
369 %1 = call float @llvm.experimental.vector.reduce.v2.fadd.f32.v2f32(float 0.0, <2 x float> %a0)
370370 ret float %1
371371 }
372372
421421 ; AVX512-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[3,1,2,3]
422422 ; AVX512-NEXT: vaddss %xmm0, %xmm1, %xmm0
423423 ; AVX512-NEXT: retq
424 %1 = call float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float 0.0, <4 x float> %a0)
424 %1 = call float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float 0.0, <4 x float> %a0)
425425 ret float %1
426426 }
427427
511511 ; AVX512-NEXT: vaddss %xmm0, %xmm1, %xmm0
512512 ; AVX512-NEXT: vzeroupper
513513 ; AVX512-NEXT: retq
514 %1 = call float @llvm.experimental.vector.reduce.fadd.f32.v8f32(float 0.0, <8 x float> %a0)
514 %1 = call float @llvm.experimental.vector.reduce.v2.fadd.f32.v8f32(float 0.0, <8 x float> %a0)
515515 ret float %1
516516 }
517517
666666 ; AVX512-NEXT: vaddss %xmm0, %xmm1, %xmm0
667667 ; AVX512-NEXT: vzeroupper
668668 ; AVX512-NEXT: retq
669 %1 = call float @llvm.experimental.vector.reduce.fadd.f32.v16f32(float 0.0, <16 x float> %a0)
669 %1 = call float @llvm.experimental.vector.reduce.v2.fadd.f32.v16f32(float 0.0, <16 x float> %a0)
670670 ret float %1
671671 }
672672
698698 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm0 = xmm0[1,1,3,3]
699699 ; AVX512-NEXT: vaddss {{.*}}(%rip), %xmm0, %xmm0
700700 ; AVX512-NEXT: retq
701 %1 = call float @llvm.experimental.vector.reduce.fadd.f32.v2f32(float undef, <2 x float> %a0)
701 %1 = call float @llvm.experimental.vector.reduce.v2.fadd.f32.v2f32(float undef, <2 x float> %a0)
702702 ret float %1
703703 }
704704
745745 ; AVX512-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[3,1,2,3]
746746 ; AVX512-NEXT: vaddss %xmm0, %xmm1, %xmm0
747747 ; AVX512-NEXT: retq
748 %1 = call float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float undef, <4 x float> %a0)
748 %1 = call float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float undef, <4 x float> %a0)
749749 ret float %1
750750 }
751751
827827 ; AVX512-NEXT: vaddss %xmm0, %xmm1, %xmm0
828828 ; AVX512-NEXT: vzeroupper
829829 ; AVX512-NEXT: retq
830 %1 = call float @llvm.experimental.vector.reduce.fadd.f32.v8f32(float undef, <8 x float> %a0)
830 %1 = call float @llvm.experimental.vector.reduce.v2.fadd.f32.v8f32(float undef, <8 x float> %a0)
831831 ret float %1
832832 }
833833
974974 ; AVX512-NEXT: vaddss %xmm0, %xmm1, %xmm0
975975 ; AVX512-NEXT: vzeroupper
976976 ; AVX512-NEXT: retq
977 %1 = call float @llvm.experimental.vector.reduce.fadd.f32.v16f32(float undef, <16 x float> %a0)
977 %1 = call float @llvm.experimental.vector.reduce.v2.fadd.f32.v16f32(float undef, <16 x float> %a0)
978978 ret float %1
979979 }
980980
10031003 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm1 = xmm1[1,0]
10041004 ; AVX512-NEXT: vaddsd %xmm1, %xmm0, %xmm0
10051005 ; AVX512-NEXT: retq
1006 %1 = call double @llvm.experimental.vector.reduce.fadd.f64.v2f64(double %a0, <2 x double> %a1)
1006 %1 = call double @llvm.experimental.vector.reduce.v2.fadd.f64.v2f64(double %a0, <2 x double> %a1)
10071007 ret double %1
10081008 }
10091009
10411041 ; AVX512-NEXT: vaddsd %xmm1, %xmm0, %xmm0
10421042 ; AVX512-NEXT: vzeroupper
10431043 ; AVX512-NEXT: retq
1044 %1 = call double @llvm.experimental.vector.reduce.fadd.f64.v4f64(double %a0, <4 x double> %a1)
1044 %1 = call double @llvm.experimental.vector.reduce.v2.fadd.f64.v4f64(double %a0, <4 x double> %a1)
10451045 ret double %1
10461046 }
10471047
11001100 ; AVX512-NEXT: vaddsd %xmm1, %xmm0, %xmm0
11011101 ; AVX512-NEXT: vzeroupper
11021102 ; AVX512-NEXT: retq
1103 %1 = call double @llvm.experimental.vector.reduce.fadd.f64.v8f64(double %a0, <8 x double> %a1)
1103 %1 = call double @llvm.experimental.vector.reduce.v2.fadd.f64.v8f64(double %a0, <8 x double> %a1)
11041104 ret double %1
11051105 }
11061106
12011201 ; AVX512-NEXT: vaddsd %xmm1, %xmm0, %xmm0
12021202 ; AVX512-NEXT: vzeroupper
12031203 ; AVX512-NEXT: retq
1204 %1 = call double @llvm.experimental.vector.reduce.fadd.f64.v16f64(double %a0, <16 x double> %a1)
1204 %1 = call double @llvm.experimental.vector.reduce.v2.fadd.f64.v16f64(double %a0, <16 x double> %a1)
12051205 ret double %1
12061206 }
12071207
12331233 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm0 = xmm0[1,0]
12341234 ; AVX512-NEXT: vaddsd %xmm0, %xmm1, %xmm0
12351235 ; AVX512-NEXT: retq
1236 %1 = call double @llvm.experimental.vector.reduce.fadd.f64.v2f64(double 0.0, <2 x double> %a0)
1236 %1 = call double @llvm.experimental.vector.reduce.v2.fadd.f64.v2f64(double 0.0, <2 x double> %a0)
12371237 ret double %1
12381238 }
12391239
12741274 ; AVX512-NEXT: vaddsd %xmm0, %xmm1, %xmm0
12751275 ; AVX512-NEXT: vzeroupper
12761276 ; AVX512-NEXT: retq
1277 %1 = call double @llvm.experimental.vector.reduce.fadd.f64.v4f64(double 0.0, <4 x double> %a0)
1277 %1 = call double @llvm.experimental.vector.reduce.v2.fadd.f64.v4f64(double 0.0, <4 x double> %a0)
12781278 ret double %1
12791279 }
12801280
13361336 ; AVX512-NEXT: vaddsd %xmm0, %xmm1, %xmm0
13371337 ; AVX512-NEXT: vzeroupper
13381338 ; AVX512-NEXT: retq
1339 %1 = call double @llvm.experimental.vector.reduce.fadd.f64.v8f64(double 0.0, <8 x double> %a0)
1339 %1 = call double @llvm.experimental.vector.reduce.v2.fadd.f64.v8f64(double 0.0, <8 x double> %a0)
13401340 ret double %1
13411341 }
13421342
14391439 ; AVX512-NEXT: vaddsd %xmm1, %xmm0, %xmm0
14401440 ; AVX512-NEXT: vzeroupper
14411441 ; AVX512-NEXT: retq
1442 %1 = call double @llvm.experimental.vector.reduce.fadd.f64.v16f64(double 0.0, <16 x double> %a0)
1442 %1 = call double @llvm.experimental.vector.reduce.v2.fadd.f64.v16f64(double 0.0, <16 x double> %a0)
14431443 ret double %1
14441444 }
14451445
14651465 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm0 = xmm0[1,0]
14661466 ; AVX512-NEXT: vaddsd {{.*}}(%rip), %xmm0, %xmm0
14671467 ; AVX512-NEXT: retq
1468 %1 = call double @llvm.experimental.vector.reduce.fadd.f64.v2f64(double undef, <2 x double> %a0)
1468 %1 = call double @llvm.experimental.vector.reduce.v2.fadd.f64.v2f64(double undef, <2 x double> %a0)
14691469 ret double %1
14701470 }
14711471
15001500 ; AVX512-NEXT: vaddsd %xmm0, %xmm1, %xmm0
15011501 ; AVX512-NEXT: vzeroupper
15021502 ; AVX512-NEXT: retq
1503 %1 = call double @llvm.experimental.vector.reduce.fadd.f64.v4f64(double undef, <4 x double> %a0)
1503 %1 = call double @llvm.experimental.vector.reduce.v2.fadd.f64.v4f64(double undef, <4 x double> %a0)
15041504 ret double %1
15051505 }
15061506
15561556 ; AVX512-NEXT: vaddsd %xmm0, %xmm1, %xmm0
15571557 ; AVX512-NEXT: vzeroupper
15581558 ; AVX512-NEXT: retq
1559 %1 = call double @llvm.experimental.vector.reduce.fadd.f64.v8f64(double undef, <8 x double> %a0)
1559 %1 = call double @llvm.experimental.vector.reduce.v2.fadd.f64.v8f64(double undef, <8 x double> %a0)
15601560 ret double %1
15611561 }
15621562
16531653 ; AVX512-NEXT: vaddsd %xmm1, %xmm0, %xmm0
16541654 ; AVX512-NEXT: vzeroupper
16551655 ; AVX512-NEXT: retq
1656 %1 = call double @llvm.experimental.vector.reduce.fadd.f64.v16f64(double undef, <16 x double> %a0)
1656 %1 = call double @llvm.experimental.vector.reduce.v2.fadd.f64.v16f64(double undef, <16 x double> %a0)
16571657 ret double %1
16581658 }
16591659
1660 declare float @llvm.experimental.vector.reduce.fadd.f32.v2f32(float, <2 x float>)
1661 declare float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float, <4 x float>)
1662 declare float @llvm.experimental.vector.reduce.fadd.f32.v8f32(float, <8 x float>)
1663 declare float @llvm.experimental.vector.reduce.fadd.f32.v16f32(float, <16 x float>)
1664
1665 declare double @llvm.experimental.vector.reduce.fadd.f64.v2f64(double, <2 x double>)
1666 declare double @llvm.experimental.vector.reduce.fadd.f64.v4f64(double, <4 x double>)
1667 declare double @llvm.experimental.vector.reduce.fadd.f64.v8f64(double, <8 x double>)
1668 declare double @llvm.experimental.vector.reduce.fadd.f64.v16f64(double, <16 x double>)
1660 declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v2f32(float, <2 x float>)
1661 declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float, <4 x float>)
1662 declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v8f32(float, <8 x float>)
1663 declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v16f32(float, <16 x float>)
1664
1665 declare double @llvm.experimental.vector.reduce.v2.fadd.f64.v2f64(double, <2 x double>)
1666 declare double @llvm.experimental.vector.reduce.v2.fadd.f64.v4f64(double, <4 x double>)
1667 declare double @llvm.experimental.vector.reduce.v2.fadd.f64.v8f64(double, <8 x double>)
1668 declare double @llvm.experimental.vector.reduce.v2.fadd.f64.v16f64(double, <16 x double>)
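
The hunks above cover the calls without fast-math flags; there the lowering performs the additions strictly in order, seeded with the scalar start value. For illustration only (the function name is hypothetical), the <2 x float> case is equivalent to this hand-scalarized sequence:

define float @sketch_scalarized(float %acc, <2 x float> %v) {
  ; Computes ((%acc + v0) + v1), strictly left to right.
  %v0 = extractelement <2 x float> %v, i32 0
  %v1 = extractelement <2 x float> %v, i32 1
  %s0 = fadd float %acc, %v0
  %s1 = fadd float %s0, %v1
  ret float %s1
}
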
1212 define float @test_v2f32(float %a0, <2 x float> %a1) {
1313 ; SSE2-LABEL: test_v2f32:
1414 ; SSE2: # %bb.0:
15 ; SSE2-NEXT: movaps %xmm1, %xmm0
16 ; SSE2-NEXT: shufps {{.*#+}} xmm0 = xmm0[1,1],xmm1[2,3]
17 ; SSE2-NEXT: mulss %xmm1, %xmm0
15 ; SSE2-NEXT: movaps %xmm1, %xmm2
16 ; SSE2-NEXT: shufps {{.*#+}} xmm2 = xmm2[1,1],xmm1[2,3]
17 ; SSE2-NEXT: mulss %xmm1, %xmm2
18 ; SSE2-NEXT: mulss %xmm2, %xmm0
1819 ; SSE2-NEXT: retq
1920 ;
2021 ; SSE41-LABEL: test_v2f32:
2122 ; SSE41: # %bb.0:
22 ; SSE41-NEXT: movshdup {{.*#+}} xmm0 = xmm1[1,1,3,3]
23 ; SSE41-NEXT: mulss %xmm1, %xmm0
23 ; SSE41-NEXT: movshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
24 ; SSE41-NEXT: mulss %xmm1, %xmm2
25 ; SSE41-NEXT: mulss %xmm2, %xmm0
2426 ; SSE41-NEXT: retq
2527 ;
2628 ; AVX-LABEL: test_v2f32:
2729 ; AVX: # %bb.0:
28 ; AVX-NEXT: vmovshdup {{.*#+}} xmm0 = xmm1[1,1,3,3]
29 ; AVX-NEXT: vmulss %xmm0, %xmm1, %xmm0
30 ; AVX-NEXT: vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
31 ; AVX-NEXT: vmulss %xmm2, %xmm1, %xmm1
32 ; AVX-NEXT: vmulss %xmm1, %xmm0, %xmm0
3033 ; AVX-NEXT: retq
3134 ;
3235 ; AVX512-LABEL: test_v2f32:
3336 ; AVX512: # %bb.0:
34 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm0 = xmm1[1,1,3,3]
35 ; AVX512-NEXT: vmulss %xmm0, %xmm1, %xmm0
36 ; AVX512-NEXT: retq
37 %1 = call fast float @llvm.experimental.vector.reduce.fmul.f32.v2f32(float %a0, <2 x float> %a1)
37 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
38 ; AVX512-NEXT: vmulss %xmm2, %xmm1, %xmm1
39 ; AVX512-NEXT: vmulss %xmm1, %xmm0, %xmm0
40 ; AVX512-NEXT: retq
41 %1 = call fast float @llvm.experimental.vector.reduce.v2.fmul.f32.v2f32(float %a0, <2 x float> %a1)
3842 ret float %1
3943 }
4044
4448 ; SSE2-NEXT: movaps %xmm1, %xmm2
4549 ; SSE2-NEXT: unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
4650 ; SSE2-NEXT: mulps %xmm1, %xmm2
47 ; SSE2-NEXT: movaps %xmm2, %xmm0
48 ; SSE2-NEXT: shufps {{.*#+}} xmm0 = xmm0[1,1],xmm2[2,3]
49 ; SSE2-NEXT: mulss %xmm2, %xmm0
51 ; SSE2-NEXT: movaps %xmm2, %xmm1
52 ; SSE2-NEXT: shufps {{.*#+}} xmm1 = xmm1[1,1],xmm2[2,3]
53 ; SSE2-NEXT: mulss %xmm2, %xmm1
54 ; SSE2-NEXT: mulss %xmm1, %xmm0
5055 ; SSE2-NEXT: retq
5156 ;
5257 ; SSE41-LABEL: test_v4f32:
5459 ; SSE41-NEXT: movaps %xmm1, %xmm2
5560 ; SSE41-NEXT: unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
5661 ; SSE41-NEXT: mulps %xmm1, %xmm2
57 ; SSE41-NEXT: movshdup {{.*#+}} xmm0 = xmm2[1,1,3,3]
58 ; SSE41-NEXT: mulss %xmm2, %xmm0
62 ; SSE41-NEXT: movshdup {{.*#+}} xmm1 = xmm2[1,1,3,3]
63 ; SSE41-NEXT: mulss %xmm2, %xmm1
64 ; SSE41-NEXT: mulss %xmm1, %xmm0
5965 ; SSE41-NEXT: retq
6066 ;
6167 ; AVX-LABEL: test_v4f32:
6268 ; AVX: # %bb.0:
63 ; AVX-NEXT: vpermilpd {{.*#+}} xmm0 = xmm1[1,0]
64 ; AVX-NEXT: vmulps %xmm0, %xmm1, %xmm0
65 ; AVX-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
69 ; AVX-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
70 ; AVX-NEXT: vmulps %xmm2, %xmm1, %xmm1
71 ; AVX-NEXT: vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
72 ; AVX-NEXT: vmulss %xmm2, %xmm1, %xmm1
6673 ; AVX-NEXT: vmulss %xmm1, %xmm0, %xmm0
6774 ; AVX-NEXT: retq
6875 ;
6976 ; AVX512-LABEL: test_v4f32:
7077 ; AVX512: # %bb.0:
71 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm0 = xmm1[1,0]
72 ; AVX512-NEXT: vmulps %xmm0, %xmm1, %xmm0
73 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
74 ; AVX512-NEXT: vmulss %xmm1, %xmm0, %xmm0
75 ; AVX512-NEXT: retq
76 %1 = call fast float @llvm.experimental.vector.reduce.fmul.f32.v4f32(float %a0, <4 x float> %a1)
78 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
79 ; AVX512-NEXT: vmulps %xmm2, %xmm1, %xmm1
80 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
81 ; AVX512-NEXT: vmulss %xmm2, %xmm1, %xmm1
82 ; AVX512-NEXT: vmulss %xmm1, %xmm0, %xmm0
83 ; AVX512-NEXT: retq
84 %1 = call fast float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float %a0, <4 x float> %a1)
7785 ret float %1
7886 }
7987
8492 ; SSE2-NEXT: movaps %xmm1, %xmm2
8593 ; SSE2-NEXT: unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
8694 ; SSE2-NEXT: mulps %xmm1, %xmm2
87 ; SSE2-NEXT: movaps %xmm2, %xmm0
88 ; SSE2-NEXT: shufps {{.*#+}} xmm0 = xmm0[1,1],xmm2[2,3]
89 ; SSE2-NEXT: mulss %xmm2, %xmm0
95 ; SSE2-NEXT: movaps %xmm2, %xmm1
96 ; SSE2-NEXT: shufps {{.*#+}} xmm1 = xmm1[1,1],xmm2[2,3]
97 ; SSE2-NEXT: mulss %xmm2, %xmm1
98 ; SSE2-NEXT: mulss %xmm1, %xmm0
9099 ; SSE2-NEXT: retq
91100 ;
92101 ; SSE41-LABEL: test_v8f32:
95104 ; SSE41-NEXT: movaps %xmm1, %xmm2
96105 ; SSE41-NEXT: unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
97106 ; SSE41-NEXT: mulps %xmm1, %xmm2
98 ; SSE41-NEXT: movshdup {{.*#+}} xmm0 = xmm2[1,1,3,3]
99 ; SSE41-NEXT: mulss %xmm2, %xmm0
107 ; SSE41-NEXT: movshdup {{.*#+}} xmm1 = xmm2[1,1,3,3]
108 ; SSE41-NEXT: mulss %xmm2, %xmm1
109 ; SSE41-NEXT: mulss %xmm1, %xmm0
100110 ; SSE41-NEXT: retq
101111 ;
102112 ; AVX-LABEL: test_v8f32:
103113 ; AVX: # %bb.0:
104 ; AVX-NEXT: vextractf128 $1, %ymm1, %xmm0
105 ; AVX-NEXT: vmulps %xmm0, %xmm1, %xmm0
106 ; AVX-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
107 ; AVX-NEXT: vmulps %xmm1, %xmm0, %xmm0
108 ; AVX-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
114 ; AVX-NEXT: vextractf128 $1, %ymm1, %xmm2
115 ; AVX-NEXT: vmulps %xmm2, %xmm1, %xmm1
116 ; AVX-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
117 ; AVX-NEXT: vmulps %xmm2, %xmm1, %xmm1
118 ; AVX-NEXT: vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
119 ; AVX-NEXT: vmulss %xmm2, %xmm1, %xmm1
109120 ; AVX-NEXT: vmulss %xmm1, %xmm0, %xmm0
110121 ; AVX-NEXT: vzeroupper
111122 ; AVX-NEXT: retq
112123 ;
113124 ; AVX512-LABEL: test_v8f32:
114125 ; AVX512: # %bb.0:
115 ; AVX512-NEXT: vextractf128 $1, %ymm1, %xmm0
116 ; AVX512-NEXT: vmulps %xmm0, %xmm1, %xmm0
117 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
118 ; AVX512-NEXT: vmulps %xmm1, %xmm0, %xmm0
119 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
120 ; AVX512-NEXT: vmulss %xmm1, %xmm0, %xmm0
121 ; AVX512-NEXT: vzeroupper
122 ; AVX512-NEXT: retq
123 %1 = call fast float @llvm.experimental.vector.reduce.fmul.f32.v8f32(float %a0, <8 x float> %a1)
126 ; AVX512-NEXT: vextractf128 $1, %ymm1, %xmm2
127 ; AVX512-NEXT: vmulps %xmm2, %xmm1, %xmm1
128 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
129 ; AVX512-NEXT: vmulps %xmm2, %xmm1, %xmm1
130 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
131 ; AVX512-NEXT: vmulss %xmm2, %xmm1, %xmm1
132 ; AVX512-NEXT: vmulss %xmm1, %xmm0, %xmm0
133 ; AVX512-NEXT: vzeroupper
134 ; AVX512-NEXT: retq
135 %1 = call fast float @llvm.experimental.vector.reduce.v2.fmul.f32.v8f32(float %a0, <8 x float> %a1)
124136 ret float %1
125137 }
126138
133145 ; SSE2-NEXT: movaps %xmm1, %xmm2
134146 ; SSE2-NEXT: unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
135147 ; SSE2-NEXT: mulps %xmm1, %xmm2
136 ; SSE2-NEXT: movaps %xmm2, %xmm0
137 ; SSE2-NEXT: shufps {{.*#+}} xmm0 = xmm0[1,1],xmm2[2,3]
138 ; SSE2-NEXT: mulss %xmm2, %xmm0
148 ; SSE2-NEXT: movaps %xmm2, %xmm1
149 ; SSE2-NEXT: shufps {{.*#+}} xmm1 = xmm1[1,1],xmm2[2,3]
150 ; SSE2-NEXT: mulss %xmm2, %xmm1
151 ; SSE2-NEXT: mulss %xmm1, %xmm0
139152 ; SSE2-NEXT: retq
140153 ;
141154 ; SSE41-LABEL: test_v16f32:
146159 ; SSE41-NEXT: movaps %xmm1, %xmm2
147160 ; SSE41-NEXT: unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
148161 ; SSE41-NEXT: mulps %xmm1, %xmm2
149 ; SSE41-NEXT: movshdup {{.*#+}} xmm0 = xmm2[1,1,3,3]
150 ; SSE41-NEXT: mulss %xmm2, %xmm0
162 ; SSE41-NEXT: movshdup {{.*#+}} xmm1 = xmm2[1,1,3,3]
163 ; SSE41-NEXT: mulss %xmm2, %xmm1
164 ; SSE41-NEXT: mulss %xmm1, %xmm0
151165 ; SSE41-NEXT: retq
152166 ;
153167 ; AVX-LABEL: test_v16f32:
154168 ; AVX: # %bb.0:
155 ; AVX-NEXT: vmulps %ymm2, %ymm1, %ymm0
156 ; AVX-NEXT: vextractf128 $1, %ymm0, %xmm1
157 ; AVX-NEXT: vmulps %xmm1, %xmm0, %xmm0
158 ; AVX-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
159 ; AVX-NEXT: vmulps %xmm1, %xmm0, %xmm0
160 ; AVX-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
169 ; AVX-NEXT: vmulps %ymm2, %ymm1, %ymm1
170 ; AVX-NEXT: vextractf128 $1, %ymm1, %xmm2
171 ; AVX-NEXT: vmulps %xmm2, %xmm1, %xmm1
172 ; AVX-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
173 ; AVX-NEXT: vmulps %xmm2, %xmm1, %xmm1
174 ; AVX-NEXT: vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
175 ; AVX-NEXT: vmulss %xmm2, %xmm1, %xmm1
161176 ; AVX-NEXT: vmulss %xmm1, %xmm0, %xmm0
162177 ; AVX-NEXT: vzeroupper
163178 ; AVX-NEXT: retq
164179 ;
165180 ; AVX512-LABEL: test_v16f32:
166181 ; AVX512: # %bb.0:
167 ; AVX512-NEXT: vextractf64x4 $1, %zmm1, %ymm0
168 ; AVX512-NEXT: vmulps %zmm0, %zmm1, %zmm0
169 ; AVX512-NEXT: vextractf128 $1, %ymm0, %xmm1
170 ; AVX512-NEXT: vmulps %xmm1, %xmm0, %xmm0
171 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
172 ; AVX512-NEXT: vmulps %xmm1, %xmm0, %xmm0
173 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
174 ; AVX512-NEXT: vmulss %xmm1, %xmm0, %xmm0
175 ; AVX512-NEXT: vzeroupper
176 ; AVX512-NEXT: retq
177 %1 = call fast float @llvm.experimental.vector.reduce.fmul.f32.v16f32(float %a0, <16 x float> %a1)
182 ; AVX512-NEXT: vextractf64x4 $1, %zmm1, %ymm2
183 ; AVX512-NEXT: vmulps %zmm2, %zmm1, %zmm1
184 ; AVX512-NEXT: vextractf128 $1, %ymm1, %xmm2
185 ; AVX512-NEXT: vmulps %xmm2, %xmm1, %xmm1
186 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
187 ; AVX512-NEXT: vmulps %xmm2, %xmm1, %xmm1
188 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
189 ; AVX512-NEXT: vmulss %xmm2, %xmm1, %xmm1
190 ; AVX512-NEXT: vmulss %xmm1, %xmm0, %xmm0
191 ; AVX512-NEXT: vzeroupper
192 ; AVX512-NEXT: retq
193 %1 = call fast float @llvm.experimental.vector.reduce.v2.fmul.f32.v16f32(float %a0, <16 x float> %a1)
178194 ret float %1
179195 }
180196
208224 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
209225 ; AVX512-NEXT: vmulss %xmm1, %xmm0, %xmm0
210226 ; AVX512-NEXT: retq
211 %1 = call fast float @llvm.experimental.vector.reduce.fmul.f32.v2f32(float 1.0, <2 x float> %a0)
227 %1 = call fast float @llvm.experimental.vector.reduce.v2.fmul.f32.v2f32(float 1.0, <2 x float> %a0)
212228 ret float %1
213229 }
214230
248264 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
249265 ; AVX512-NEXT: vmulss %xmm1, %xmm0, %xmm0
250266 ; AVX512-NEXT: retq
251 %1 = call fast float @llvm.experimental.vector.reduce.fmul.f32.v4f32(float 1.0, <4 x float> %a0)
267 %1 = call fast float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float 1.0, <4 x float> %a0)
252268 ret float %1
253269 }
254270
296312 ; AVX512-NEXT: vmulss %xmm1, %xmm0, %xmm0
297313 ; AVX512-NEXT: vzeroupper
298314 ; AVX512-NEXT: retq
299 %1 = call fast float @llvm.experimental.vector.reduce.fmul.f32.v8f32(float 1.0, <8 x float> %a0)
315 %1 = call fast float @llvm.experimental.vector.reduce.v2.fmul.f32.v8f32(float 1.0, <8 x float> %a0)
300316 ret float %1
301317 }
302318
351367 ; AVX512-NEXT: vmulss %xmm1, %xmm0, %xmm0
352368 ; AVX512-NEXT: vzeroupper
353369 ; AVX512-NEXT: retq
354 %1 = call fast float @llvm.experimental.vector.reduce.fmul.f32.v16f32(float 1.0, <16 x float> %a0)
370 %1 = call fast float @llvm.experimental.vector.reduce.v2.fmul.f32.v16f32(float 1.0, <16 x float> %a0)
355371 ret float %1
356372 }
357373
385401 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
386402 ; AVX512-NEXT: vmulss %xmm1, %xmm0, %xmm0
387403 ; AVX512-NEXT: retq
388 %1 = call fast float @llvm.experimental.vector.reduce.fmul.f32.v2f32(float undef, <2 x float> %a0)
404 %1 = call fast float @llvm.experimental.vector.reduce.v2.fmul.f32.v2f32(float 1.0, <2 x float> %a0)
389405 ret float %1
390406 }
391407
425441 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
426442 ; AVX512-NEXT: vmulss %xmm1, %xmm0, %xmm0
427443 ; AVX512-NEXT: retq
428 %1 = call fast float @llvm.experimental.vector.reduce.fmul.f32.v4f32(float undef, <4 x float> %a0)
444 %1 = call fast float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float 1.0, <4 x float> %a0)
429445 ret float %1
430446 }
431447
473489 ; AVX512-NEXT: vmulss %xmm1, %xmm0, %xmm0
474490 ; AVX512-NEXT: vzeroupper
475491 ; AVX512-NEXT: retq
476 %1 = call fast float @llvm.experimental.vector.reduce.fmul.f32.v8f32(float undef, <8 x float> %a0)
492 %1 = call fast float @llvm.experimental.vector.reduce.v2.fmul.f32.v8f32(float 1.0, <8 x float> %a0)
477493 ret float %1
478494 }
479495
528544 ; AVX512-NEXT: vmulss %xmm1, %xmm0, %xmm0
529545 ; AVX512-NEXT: vzeroupper
530546 ; AVX512-NEXT: retq
531 %1 = call fast float @llvm.experimental.vector.reduce.fmul.f32.v16f32(float undef, <16 x float> %a0)
547 %1 = call fast float @llvm.experimental.vector.reduce.v2.fmul.f32.v16f32(float 1.0, <16 x float> %a0)
532548 ret float %1
533549 }
534550
539555 define double @test_v2f64(double %a0, <2 x double> %a1) {
540556 ; SSE-LABEL: test_v2f64:
541557 ; SSE: # %bb.0:
542 ; SSE-NEXT: movapd %xmm1, %xmm0
543 ; SSE-NEXT: unpckhpd {{.*#+}} xmm0 = xmm0[1],xmm1[1]
544 ; SSE-NEXT: mulsd %xmm1, %xmm0
558 ; SSE-NEXT: movapd %xmm1, %xmm2
559 ; SSE-NEXT: unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
560 ; SSE-NEXT: mulsd %xmm1, %xmm2
561 ; SSE-NEXT: mulsd %xmm2, %xmm0
545562 ; SSE-NEXT: retq
546563 ;
547564 ; AVX-LABEL: test_v2f64:
548565 ; AVX: # %bb.0:
549 ; AVX-NEXT: vpermilpd {{.*#+}} xmm0 = xmm1[1,0]
550 ; AVX-NEXT: vmulsd %xmm0, %xmm1, %xmm0
566 ; AVX-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
567 ; AVX-NEXT: vmulsd %xmm2, %xmm1, %xmm1
568 ; AVX-NEXT: vmulsd %xmm1, %xmm0, %xmm0
551569 ; AVX-NEXT: retq
552570 ;
553571 ; AVX512-LABEL: test_v2f64:
554572 ; AVX512: # %bb.0:
555 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm0 = xmm1[1,0]
556 ; AVX512-NEXT: vmulsd %xmm0, %xmm1, %xmm0
557 ; AVX512-NEXT: retq
558 %1 = call fast double @llvm.experimental.vector.reduce.fmul.f64.v2f64(double %a0, <2 x double> %a1)
573 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
574 ; AVX512-NEXT: vmulsd %xmm2, %xmm1, %xmm1
575 ; AVX512-NEXT: vmulsd %xmm1, %xmm0, %xmm0
576 ; AVX512-NEXT: retq
577 %1 = call fast double @llvm.experimental.vector.reduce.v2.fmul.f64.v2f64(double %a0, <2 x double> %a1)
559578 ret double %1
560579 }
561580
563582 ; SSE-LABEL: test_v4f64:
564583 ; SSE: # %bb.0:
565584 ; SSE-NEXT: mulpd %xmm2, %xmm1
566 ; SSE-NEXT: movapd %xmm1, %xmm0
567 ; SSE-NEXT: unpckhpd {{.*#+}} xmm0 = xmm0[1],xmm1[1]
568 ; SSE-NEXT: mulsd %xmm1, %xmm0
585 ; SSE-NEXT: movapd %xmm1, %xmm2
586 ; SSE-NEXT: unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
587 ; SSE-NEXT: mulsd %xmm1, %xmm2
588 ; SSE-NEXT: mulsd %xmm2, %xmm0
569589 ; SSE-NEXT: retq
570590 ;
571591 ; AVX-LABEL: test_v4f64:
572592 ; AVX: # %bb.0:
573 ; AVX-NEXT: vextractf128 $1, %ymm1, %xmm0
574 ; AVX-NEXT: vmulpd %xmm0, %xmm1, %xmm0
575 ; AVX-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
593 ; AVX-NEXT: vextractf128 $1, %ymm1, %xmm2
594 ; AVX-NEXT: vmulpd %xmm2, %xmm1, %xmm1
595 ; AVX-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
596 ; AVX-NEXT: vmulsd %xmm2, %xmm1, %xmm1
576597 ; AVX-NEXT: vmulsd %xmm1, %xmm0, %xmm0
577598 ; AVX-NEXT: vzeroupper
578599 ; AVX-NEXT: retq
579600 ;
580601 ; AVX512-LABEL: test_v4f64:
581602 ; AVX512: # %bb.0:
582 ; AVX512-NEXT: vextractf128 $1, %ymm1, %xmm0
583 ; AVX512-NEXT: vmulpd %xmm0, %xmm1, %xmm0
584 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
585 ; AVX512-NEXT: vmulsd %xmm1, %xmm0, %xmm0
586 ; AVX512-NEXT: vzeroupper
587 ; AVX512-NEXT: retq
588 %1 = call fast double @llvm.experimental.vector.reduce.fmul.f64.v4f64(double %a0, <4 x double> %a1)
603 ; AVX512-NEXT: vextractf128 $1, %ymm1, %xmm2
604 ; AVX512-NEXT: vmulpd %xmm2, %xmm1, %xmm1
605 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
606 ; AVX512-NEXT: vmulsd %xmm2, %xmm1, %xmm1
607 ; AVX512-NEXT: vmulsd %xmm1, %xmm0, %xmm0
608 ; AVX512-NEXT: vzeroupper
609 ; AVX512-NEXT: retq
610 %1 = call fast double @llvm.experimental.vector.reduce.v2.fmul.f64.v4f64(double %a0, <4 x double> %a1)
589611 ret double %1
590612 }
591613
595617 ; SSE-NEXT: mulpd %xmm4, %xmm2
596618 ; SSE-NEXT: mulpd %xmm3, %xmm1
597619 ; SSE-NEXT: mulpd %xmm2, %xmm1
598 ; SSE-NEXT: movapd %xmm1, %xmm0
599 ; SSE-NEXT: unpckhpd {{.*#+}} xmm0 = xmm0[1],xmm1[1]
600 ; SSE-NEXT: mulsd %xmm1, %xmm0
620 ; SSE-NEXT: movapd %xmm1, %xmm2
621 ; SSE-NEXT: unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
622 ; SSE-NEXT: mulsd %xmm1, %xmm2
623 ; SSE-NEXT: mulsd %xmm2, %xmm0
601624 ; SSE-NEXT: retq
602625 ;
603626 ; AVX-LABEL: test_v8f64:
604627 ; AVX: # %bb.0:
605 ; AVX-NEXT: vmulpd %ymm2, %ymm1, %ymm0
606 ; AVX-NEXT: vextractf128 $1, %ymm0, %xmm1
607 ; AVX-NEXT: vmulpd %xmm1, %xmm0, %xmm0
608 ; AVX-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
628 ; AVX-NEXT: vmulpd %ymm2, %ymm1, %ymm1
629 ; AVX-NEXT: vextractf128 $1, %ymm1, %xmm2
630 ; AVX-NEXT: vmulpd %xmm2, %xmm1, %xmm1
631 ; AVX-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
632 ; AVX-NEXT: vmulsd %xmm2, %xmm1, %xmm1
609633 ; AVX-NEXT: vmulsd %xmm1, %xmm0, %xmm0
610634 ; AVX-NEXT: vzeroupper
611635 ; AVX-NEXT: retq
612636 ;
613637 ; AVX512-LABEL: test_v8f64:
614638 ; AVX512: # %bb.0:
615 ; AVX512-NEXT: vextractf64x4 $1, %zmm1, %ymm0
616 ; AVX512-NEXT: vmulpd %zmm0, %zmm1, %zmm0
617 ; AVX512-NEXT: vextractf128 $1, %ymm0, %xmm1
618 ; AVX512-NEXT: vmulpd %xmm1, %xmm0, %xmm0
619 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
620 ; AVX512-NEXT: vmulsd %xmm1, %xmm0, %xmm0
621 ; AVX512-NEXT: vzeroupper
622 ; AVX512-NEXT: retq
623 %1 = call fast double @llvm.experimental.vector.reduce.fmul.f64.v8f64(double %a0, <8 x double> %a1)
639 ; AVX512-NEXT: vextractf64x4 $1, %zmm1, %ymm2
640 ; AVX512-NEXT: vmulpd %zmm2, %zmm1, %zmm1
641 ; AVX512-NEXT: vextractf128 $1, %ymm1, %xmm2
642 ; AVX512-NEXT: vmulpd %xmm2, %xmm1, %xmm1
643 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
644 ; AVX512-NEXT: vmulsd %xmm2, %xmm1, %xmm1
645 ; AVX512-NEXT: vmulsd %xmm1, %xmm0, %xmm0
646 ; AVX512-NEXT: vzeroupper
647 ; AVX512-NEXT: retq
648 %1 = call fast double @llvm.experimental.vector.reduce.v2.fmul.f64.v8f64(double %a0, <8 x double> %a1)
624649 ret double %1
625650 }
626651
634659 ; SSE-NEXT: mulpd {{[0-9]+}}(%rsp), %xmm4
635660 ; SSE-NEXT: mulpd %xmm2, %xmm4
636661 ; SSE-NEXT: mulpd %xmm1, %xmm4
637 ; SSE-NEXT: movapd %xmm4, %xmm0
638 ; SSE-NEXT: unpckhpd {{.*#+}} xmm0 = xmm0[1],xmm4[1]
639 ; SSE-NEXT: mulsd %xmm4, %xmm0
662 ; SSE-NEXT: movapd %xmm4, %xmm1
663 ; SSE-NEXT: unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm4[1]
664 ; SSE-NEXT: mulsd %xmm4, %xmm1
665 ; SSE-NEXT: mulsd %xmm1, %xmm0
640666 ; SSE-NEXT: retq
641667 ;
642668 ; AVX-LABEL: test_v16f64:
643669 ; AVX: # %bb.0:
644 ; AVX-NEXT: vmulpd %ymm4, %ymm2, %ymm0
670 ; AVX-NEXT: vmulpd %ymm4, %ymm2, %ymm2
645671 ; AVX-NEXT: vmulpd %ymm3, %ymm1, %ymm1
646 ; AVX-NEXT: vmulpd %ymm0, %ymm1, %ymm0
647 ; AVX-NEXT: vextractf128 $1, %ymm0, %xmm1
648 ; AVX-NEXT: vmulpd %xmm1, %xmm0, %xmm0
649 ; AVX-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
672 ; AVX-NEXT: vmulpd %ymm2, %ymm1, %ymm1
673 ; AVX-NEXT: vextractf128 $1, %ymm1, %xmm2
674 ; AVX-NEXT: vmulpd %xmm2, %xmm1, %xmm1
675 ; AVX-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
676 ; AVX-NEXT: vmulsd %xmm2, %xmm1, %xmm1
650677 ; AVX-NEXT: vmulsd %xmm1, %xmm0, %xmm0
651678 ; AVX-NEXT: vzeroupper
652679 ; AVX-NEXT: retq
653680 ;
654681 ; AVX512-LABEL: test_v16f64:
655682 ; AVX512: # %bb.0:
656 ; AVX512-NEXT: vmulpd %zmm2, %zmm1, %zmm0
657 ; AVX512-NEXT: vextractf64x4 $1, %zmm0, %ymm1
658 ; AVX512-NEXT: vmulpd %zmm1, %zmm0, %zmm0
659 ; AVX512-NEXT: vextractf128 $1, %ymm0, %xmm1
660 ; AVX512-NEXT: vmulpd %xmm1, %xmm0, %xmm0
661 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
662 ; AVX512-NEXT: vmulsd %xmm1, %xmm0, %xmm0
663 ; AVX512-NEXT: vzeroupper
664 ; AVX512-NEXT: retq
665 %1 = call fast double @llvm.experimental.vector.reduce.fmul.f64.v16f64(double %a0, <16 x double> %a1)
683 ; AVX512-NEXT: vmulpd %zmm2, %zmm1, %zmm1
684 ; AVX512-NEXT: vextractf64x4 $1, %zmm1, %ymm2
685 ; AVX512-NEXT: vmulpd %zmm2, %zmm1, %zmm1
686 ; AVX512-NEXT: vextractf128 $1, %ymm1, %xmm2
687 ; AVX512-NEXT: vmulpd %xmm2, %xmm1, %xmm1
688 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
689 ; AVX512-NEXT: vmulsd %xmm2, %xmm1, %xmm1
690 ; AVX512-NEXT: vmulsd %xmm1, %xmm0, %xmm0
691 ; AVX512-NEXT: vzeroupper
692 ; AVX512-NEXT: retq
693 %1 = call fast double @llvm.experimental.vector.reduce.v2.fmul.f64.v16f64(double %a0, <16 x double> %a1)
666694 ret double %1
667695 }
668696
690718 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
691719 ; AVX512-NEXT: vmulsd %xmm1, %xmm0, %xmm0
692720 ; AVX512-NEXT: retq
693 %1 = call fast double @llvm.experimental.vector.reduce.fmul.f64.v2f64(double 1.0, <2 x double> %a0)
721 %1 = call fast double @llvm.experimental.vector.reduce.v2.fmul.f64.v2f64(double 1.0, <2 x double> %a0)
694722 ret double %1
695723 }
696724
721749 ; AVX512-NEXT: vmulsd %xmm1, %xmm0, %xmm0
722750 ; AVX512-NEXT: vzeroupper
723751 ; AVX512-NEXT: retq
724 %1 = call fast double @llvm.experimental.vector.reduce.fmul.f64.v4f64(double 1.0, <4 x double> %a0)
752 %1 = call fast double @llvm.experimental.vector.reduce.v2.fmul.f64.v4f64(double 1.0, <4 x double> %a0)
725753 ret double %1
726754 }
727755
757785 ; AVX512-NEXT: vmulsd %xmm1, %xmm0, %xmm0
758786 ; AVX512-NEXT: vzeroupper
759787 ; AVX512-NEXT: retq
760 %1 = call fast double @llvm.experimental.vector.reduce.fmul.f64.v8f64(double 1.0, <8 x double> %a0)
788 %1 = call fast double @llvm.experimental.vector.reduce.v2.fmul.f64.v8f64(double 1.0, <8 x double> %a0)
761789 ret double %1
762790 }
763791
799827 ; AVX512-NEXT: vmulsd %xmm1, %xmm0, %xmm0
800828 ; AVX512-NEXT: vzeroupper
801829 ; AVX512-NEXT: retq
802 %1 = call fast double @llvm.experimental.vector.reduce.fmul.f64.v16f64(double 1.0, <16 x double> %a0)
830 %1 = call fast double @llvm.experimental.vector.reduce.v2.fmul.f64.v16f64(double 1.0, <16 x double> %a0)
803831 ret double %1
804832 }
805833
827855 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
828856 ; AVX512-NEXT: vmulsd %xmm1, %xmm0, %xmm0
829857 ; AVX512-NEXT: retq
830 %1 = call fast double @llvm.experimental.vector.reduce.fmul.f64.v2f64(double undef, <2 x double> %a0)
858 %1 = call fast double @llvm.experimental.vector.reduce.v2.fmul.f64.v2f64(double 1.0, <2 x double> %a0)
831859 ret double %1
832860 }
833861
858886 ; AVX512-NEXT: vmulsd %xmm1, %xmm0, %xmm0
859887 ; AVX512-NEXT: vzeroupper
860888 ; AVX512-NEXT: retq
861 %1 = call fast double @llvm.experimental.vector.reduce.fmul.f64.v4f64(double undef, <4 x double> %a0)
889 %1 = call fast double @llvm.experimental.vector.reduce.v2.fmul.f64.v4f64(double 1.0, <4 x double> %a0)
862890 ret double %1
863891 }
864892
894922 ; AVX512-NEXT: vmulsd %xmm1, %xmm0, %xmm0
895923 ; AVX512-NEXT: vzeroupper
896924 ; AVX512-NEXT: retq
897 %1 = call fast double @llvm.experimental.vector.reduce.fmul.f64.v8f64(double undef, <8 x double> %a0)
925 %1 = call fast double @llvm.experimental.vector.reduce.v2.fmul.f64.v8f64(double 1.0, <8 x double> %a0)
898926 ret double %1
899927 }
900928
936964 ; AVX512-NEXT: vmulsd %xmm1, %xmm0, %xmm0
937965 ; AVX512-NEXT: vzeroupper
938966 ; AVX512-NEXT: retq
939 %1 = call fast double @llvm.experimental.vector.reduce.fmul.f64.v16f64(double undef, <16 x double> %a0)
940 ret double %1
941 }
942
943 declare float @llvm.experimental.vector.reduce.fmul.f32.v2f32(float, <2 x float>)
944 declare float @llvm.experimental.vector.reduce.fmul.f32.v4f32(float, <4 x float>)
945 declare float @llvm.experimental.vector.reduce.fmul.f32.v8f32(float, <8 x float>)
946 declare float @llvm.experimental.vector.reduce.fmul.f32.v16f32(float, <16 x float>)
947
948 declare double @llvm.experimental.vector.reduce.fmul.f64.v2f64(double, <2 x double>)
949 declare double @llvm.experimental.vector.reduce.fmul.f64.v4f64(double, <4 x double>)
950 declare double @llvm.experimental.vector.reduce.fmul.f64.v8f64(double, <8 x double>)
951 declare double @llvm.experimental.vector.reduce.fmul.f64.v16f64(double, <16 x double>)
967 %1 = call fast double @llvm.experimental.vector.reduce.v2.fmul.f64.v16f64(double 1.0, <16 x double> %a0)
968 ret double %1
969 }
970
971 declare float @llvm.experimental.vector.reduce.v2.fmul.f32.v2f32(float, <2 x float>)
972 declare float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float, <4 x float>)
973 declare float @llvm.experimental.vector.reduce.v2.fmul.f32.v8f32(float, <8 x float>)
974 declare float @llvm.experimental.vector.reduce.v2.fmul.f32.v16f32(float, <16 x float>)
975
976 declare double @llvm.experimental.vector.reduce.v2.fmul.f64.v2f64(double, <2 x double>)
977 declare double @llvm.experimental.vector.reduce.v2.fmul.f64.v4f64(double, <4 x double>)
978 declare double @llvm.experimental.vector.reduce.v2.fmul.f64.v8f64(double, <8 x double>)
979 declare double @llvm.experimental.vector.reduce.v2.fmul.f64.v16f64(double, <16 x double>)
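
The fmul tests above mirror the fadd ones: the regenerated fast-math code now multiplies the start value into the result (the trailing vmulss/mulsd against %xmm0), and the tests that previously passed undef now pass the multiplicative identity 1.0. A minimal sketch of such a call follows; the function name is illustrative and not part of this commit.

declare double @llvm.experimental.vector.reduce.v2.fmul.f64.v2f64(double, <2 x double>)

define double @sketch_fmul_fast(<2 x double> %v) {
  ; Start value 1.0 is the identity, but under the v2 semantics it is
  ; still multiplied into the result rather than ignored.
  %r = call fast double @llvm.experimental.vector.reduce.v2.fmul.f64.v2f64(double 1.0, <2 x double> %v)
  ret double %r
}
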
3737 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm1 = xmm1[1,1,3,3]
3838 ; AVX512-NEXT: vmulss %xmm1, %xmm0, %xmm0
3939 ; AVX512-NEXT: retq
40 %1 = call float @llvm.experimental.vector.reduce.fmul.f32.v2f32(float %a0, <2 x float> %a1)
40 %1 = call float @llvm.experimental.vector.reduce.v2.fmul.f32.v2f32(float %a0, <2 x float> %a1)
4141 ret float %1
4242 }
4343
8888 ; AVX512-NEXT: vpermilps {{.*#+}} xmm1 = xmm1[3,1,2,3]
8989 ; AVX512-NEXT: vmulss %xmm1, %xmm0, %xmm0
9090 ; AVX512-NEXT: retq
91 %1 = call float @llvm.experimental.vector.reduce.fmul.f32.v4f32(float %a0, <4 x float> %a1)
91 %1 = call float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float %a0, <4 x float> %a1)
9292 ret float %1
9393 }
9494
174174 ; AVX512-NEXT: vmulss %xmm1, %xmm0, %xmm0
175175 ; AVX512-NEXT: vzeroupper
176176 ; AVX512-NEXT: retq
177 %1 = call float @llvm.experimental.vector.reduce.fmul.f32.v8f32(float %a0, <8 x float> %a1)
177 %1 = call float @llvm.experimental.vector.reduce.v2.fmul.f32.v8f32(float %a0, <8 x float> %a1)
178178 ret float %1
179179 }
180180
325325 ; AVX512-NEXT: vmulss %xmm1, %xmm0, %xmm0
326326 ; AVX512-NEXT: vzeroupper
327327 ; AVX512-NEXT: retq
328 %1 = call float @llvm.experimental.vector.reduce.fmul.f32.v16f32(float %a0, <16 x float> %a1)
328 %1 = call float @llvm.experimental.vector.reduce.v2.fmul.f32.v16f32(float %a0, <16 x float> %a1)
329329 ret float %1
330330 }
331331
359359 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
360360 ; AVX512-NEXT: vmulss %xmm1, %xmm0, %xmm0
361361 ; AVX512-NEXT: retq
362 %1 = call float @llvm.experimental.vector.reduce.fmul.f32.v2f32(float 1.0, <2 x float> %a0)
362 %1 = call float @llvm.experimental.vector.reduce.v2.fmul.f32.v2f32(float 1.0, <2 x float> %a0)
363363 ret float %1
364364 }
365365
406406 ; AVX512-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[3,1,2,3]
407407 ; AVX512-NEXT: vmulss %xmm0, %xmm1, %xmm0
408408 ; AVX512-NEXT: retq
409 %1 = call float @llvm.experimental.vector.reduce.fmul.f32.v4f32(float 1.0, <4 x float> %a0)
409 %1 = call float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float 1.0, <4 x float> %a0)
410410 ret float %1
411411 }
412412
488488 ; AVX512-NEXT: vmulss %xmm0, %xmm1, %xmm0
489489 ; AVX512-NEXT: vzeroupper
490490 ; AVX512-NEXT: retq
491 %1 = call float @llvm.experimental.vector.reduce.fmul.f32.v8f32(float 1.0, <8 x float> %a0)
491 %1 = call float @llvm.experimental.vector.reduce.v2.fmul.f32.v8f32(float 1.0, <8 x float> %a0)
492492 ret float %1
493493 }
494494
635635 ; AVX512-NEXT: vmulss %xmm0, %xmm1, %xmm0
636636 ; AVX512-NEXT: vzeroupper
637637 ; AVX512-NEXT: retq
638 %1 = call float @llvm.experimental.vector.reduce.fmul.f32.v16f32(float 1.0, <16 x float> %a0)
638 %1 = call float @llvm.experimental.vector.reduce.v2.fmul.f32.v16f32(float 1.0, <16 x float> %a0)
639639 ret float %1
640640 }
641641
667667 ; AVX512-NEXT: vmovshdup {{.*#+}} xmm0 = xmm0[1,1,3,3]
668668 ; AVX512-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm0
669669 ; AVX512-NEXT: retq
670 %1 = call float @llvm.experimental.vector.reduce.fmul.f32.v2f32(float undef, <2 x float> %a0)
670 %1 = call float @llvm.experimental.vector.reduce.v2.fmul.f32.v2f32(float undef, <2 x float> %a0)
671671 ret float %1
672672 }
673673
714714 ; AVX512-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[3,1,2,3]
715715 ; AVX512-NEXT: vmulss %xmm0, %xmm1, %xmm0
716716 ; AVX512-NEXT: retq
717 %1 = call float @llvm.experimental.vector.reduce.fmul.f32.v4f32(float undef, <4 x float> %a0)
717 %1 = call float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float undef, <4 x float> %a0)
718718 ret float %1
719719 }
720720
796796 ; AVX512-NEXT: vmulss %xmm0, %xmm1, %xmm0
797797 ; AVX512-NEXT: vzeroupper
798798 ; AVX512-NEXT: retq
799 %1 = call float @llvm.experimental.vector.reduce.fmul.f32.v8f32(float undef, <8 x float> %a0)
799 %1 = call float @llvm.experimental.vector.reduce.v2.fmul.f32.v8f32(float undef, <8 x float> %a0)
800800 ret float %1
801801 }
802802
943943 ; AVX512-NEXT: vmulss %xmm0, %xmm1, %xmm0
944944 ; AVX512-NEXT: vzeroupper
945945 ; AVX512-NEXT: retq
946 %1 = call float @llvm.experimental.vector.reduce.fmul.f32.v16f32(float undef, <16 x float> %a0)
946 %1 = call float @llvm.experimental.vector.reduce.v2.fmul.f32.v16f32(float undef, <16 x float> %a0)
947947 ret float %1
948948 }
949949
972972 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm1 = xmm1[1,0]
973973 ; AVX512-NEXT: vmulsd %xmm1, %xmm0, %xmm0
974974 ; AVX512-NEXT: retq
975 %1 = call double @llvm.experimental.vector.reduce.fmul.f64.v2f64(double %a0, <2 x double> %a1)
975 %1 = call double @llvm.experimental.vector.reduce.v2.fmul.f64.v2f64(double %a0, <2 x double> %a1)
976976 ret double %1
977977 }
978978
10101010 ; AVX512-NEXT: vmulsd %xmm1, %xmm0, %xmm0
10111011 ; AVX512-NEXT: vzeroupper
10121012 ; AVX512-NEXT: retq
1013 %1 = call double @llvm.experimental.vector.reduce.fmul.f64.v4f64(double %a0, <4 x double> %a1)
1013 %1 = call double @llvm.experimental.vector.reduce.v2.fmul.f64.v4f64(double %a0, <4 x double> %a1)
10141014 ret double %1
10151015 }
10161016
10691069 ; AVX512-NEXT: vmulsd %xmm1, %xmm0, %xmm0
10701070 ; AVX512-NEXT: vzeroupper
10711071 ; AVX512-NEXT: retq
1072 %1 = call double @llvm.experimental.vector.reduce.fmul.f64.v8f64(double %a0, <8 x double> %a1)
1072 %1 = call double @llvm.experimental.vector.reduce.v2.fmul.f64.v8f64(double %a0, <8 x double> %a1)
10731073 ret double %1
10741074 }
10751075
11701170 ; AVX512-NEXT: vmulsd %xmm1, %xmm0, %xmm0
11711171 ; AVX512-NEXT: vzeroupper
11721172 ; AVX512-NEXT: retq
1173 %1 = call double @llvm.experimental.vector.reduce.fmul.f64.v16f64(double %a0, <16 x double> %a1)
1173 %1 = call double @llvm.experimental.vector.reduce.v2.fmul.f64.v16f64(double %a0, <16 x double> %a1)
11741174 ret double %1
11751175 }
11761176
11981198 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
11991199 ; AVX512-NEXT: vmulsd %xmm1, %xmm0, %xmm0
12001200 ; AVX512-NEXT: retq
1201 %1 = call double @llvm.experimental.vector.reduce.fmul.f64.v2f64(double 1.0, <2 x double> %a0)
1201 %1 = call double @llvm.experimental.vector.reduce.v2.fmul.f64.v2f64(double 1.0, <2 x double> %a0)
12021202 ret double %1
12031203 }
12041204
12351235 ; AVX512-NEXT: vmulsd %xmm0, %xmm1, %xmm0
12361236 ; AVX512-NEXT: vzeroupper
12371237 ; AVX512-NEXT: retq
1238 %1 = call double @llvm.experimental.vector.reduce.fmul.f64.v4f64(double 1.0, <4 x double> %a0)
1238 %1 = call double @llvm.experimental.vector.reduce.v2.fmul.f64.v4f64(double 1.0, <4 x double> %a0)
12391239 ret double %1
12401240 }
12411241
12931293 ; AVX512-NEXT: vmulsd %xmm0, %xmm1, %xmm0
12941294 ; AVX512-NEXT: vzeroupper
12951295 ; AVX512-NEXT: retq
1296 %1 = call double @llvm.experimental.vector.reduce.fmul.f64.v8f64(double 1.0, <8 x double> %a0)
1296 %1 = call double @llvm.experimental.vector.reduce.v2.fmul.f64.v8f64(double 1.0, <8 x double> %a0)
12971297 ret double %1
12981298 }
12991299
13911391 ; AVX512-NEXT: vmulsd %xmm1, %xmm0, %xmm0
13921392 ; AVX512-NEXT: vzeroupper
13931393 ; AVX512-NEXT: retq
1394 %1 = call double @llvm.experimental.vector.reduce.fmul.f64.v16f64(double 1.0, <16 x double> %a0)
1394 %1 = call double @llvm.experimental.vector.reduce.v2.fmul.f64.v16f64(double 1.0, <16 x double> %a0)
13951395 ret double %1
13961396 }
13971397
14171417 ; AVX512-NEXT: vpermilpd {{.*#+}} xmm0 = xmm0[1,0]
14181418 ; AVX512-NEXT: vmulsd {{.*}}(%rip), %xmm0, %xmm0
14191419 ; AVX512-NEXT: retq
1420 %1 = call double @llvm.experimental.vector.reduce.fmul.f64.v2f64(double undef, <2 x double> %a0)
1420 %1 = call double @llvm.experimental.vector.reduce.v2.fmul.f64.v2f64(double undef, <2 x double> %a0)
14211421 ret double %1
14221422 }
14231423
14521452 ; AVX512-NEXT: vmulsd %xmm0, %xmm1, %xmm0
14531453 ; AVX512-NEXT: vzeroupper
14541454 ; AVX512-NEXT: retq
1455 %1 = call double @llvm.experimental.vector.reduce.fmul.f64.v4f64(double undef, <4 x double> %a0)
1455 %1 = call double @llvm.experimental.vector.reduce.v2.fmul.f64.v4f64(double undef, <4 x double> %a0)
14561456 ret double %1
14571457 }
14581458
15081508 ; AVX512-NEXT: vmulsd %xmm0, %xmm1, %xmm0
15091509 ; AVX512-NEXT: vzeroupper
15101510 ; AVX512-NEXT: retq
1511 %1 = call double @llvm.experimental.vector.reduce.fmul.f64.v8f64(double undef, <8 x double> %a0)
1511 %1 = call double @llvm.experimental.vector.reduce.v2.fmul.f64.v8f64(double undef, <8 x double> %a0)
15121512 ret double %1
15131513 }
15141514
16051605 ; AVX512-NEXT: vmulsd %xmm1, %xmm0, %xmm0
16061606 ; AVX512-NEXT: vzeroupper
16071607 ; AVX512-NEXT: retq
1608 %1 = call double @llvm.experimental.vector.reduce.fmul.f64.v16f64(double undef, <16 x double> %a0)
1608 %1 = call double @llvm.experimental.vector.reduce.v2.fmul.f64.v16f64(double undef, <16 x double> %a0)
16091609 ret double %1
16101610 }
16111611
1612 declare float @llvm.experimental.vector.reduce.fmul.f32.v2f32(float, <2 x float>)
1613 declare float @llvm.experimental.vector.reduce.fmul.f32.v4f32(float, <4 x float>)
1614 declare float @llvm.experimental.vector.reduce.fmul.f32.v8f32(float, <8 x float>)
1615 declare float @llvm.experimental.vector.reduce.fmul.f32.v16f32(float, <16 x float>)
1616
1617 declare double @llvm.experimental.vector.reduce.fmul.f64.v2f64(double, <2 x double>)
1618 declare double @llvm.experimental.vector.reduce.fmul.f64.v4f64(double, <4 x double>)
1619 declare double @llvm.experimental.vector.reduce.fmul.f64.v8f64(double, <8 x double>)
1620 declare double @llvm.experimental.vector.reduce.fmul.f64.v16f64(double, <16 x double>)
1612 declare float @llvm.experimental.vector.reduce.v2.fmul.f32.v2f32(float, <2 x float>)
1613 declare float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float, <4 x float>)
1614 declare float @llvm.experimental.vector.reduce.v2.fmul.f32.v8f32(float, <8 x float>)
1615 declare float @llvm.experimental.vector.reduce.v2.fmul.f32.v16f32(float, <16 x float>)
1616
1617 declare double @llvm.experimental.vector.reduce.v2.fmul.f64.v2f64(double, <2 x double>)
1618 declare double @llvm.experimental.vector.reduce.v2.fmul.f64.v4f64(double, <4 x double>)
1619 declare double @llvm.experimental.vector.reduce.v2.fmul.f64.v8f64(double, <8 x double>)
1620 declare double @llvm.experimental.vector.reduce.v2.fmul.f64.v16f64(double, <16 x double>)
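For orientation only, a minimal sketch (not part of this diff) of how the renamed fmul reduction intrinsic is called; the function name reduce_fmul_example and the final fadd that combines the two results are hypothetical, chosen just to show both call forms side by side. The intrinsic declaration matches the ones added above.

; Sketch: ordered vs. reassociable use of the v2 fmul reduction.
define float @reduce_fmul_example(float %start, <4 x float> %v) {
  ; No fast-math flags on the call: the reduction is ordered, with the start
  ; value multiplied in first and the lanes combined strictly in sequence.
  %ord = call float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float %start, <4 x float> %v)
  ; 'reassoc' on the call: the lanes may be reassociated (e.g. reduced as a
  ; tree), while the start value still contributes to the result.
  %relaxed = call reassoc float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float %start, <4 x float> %v)
  ; Combine the two results only so both calls are live in this example.
  %r = fadd float %ord, %relaxed
  ret float %r
}

declare float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float, <4 x float>)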