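All,
I'm writing some performance-sensitive code, including a 3D vector class that will be computing lots of cross products. As a long-time C++ programmer, I know all about the evils of macros and the various benefits of inline functions. I've long been under the impression that inline functions should be roughly as fast as macros. However, in performance testing a macro against an inline function, I've made an interesting discovery that I hope is due to a silly mistake on my part somewhere: the macro version of my function appears to be over 8 times faster than the inline version!
First, a ridiculously trimmed-down version of a simple vector class: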
class Vector3d
{
public:
    double m_tX, m_tY, m_tZ;

    Vector3d() : m_tX(0), m_tY(0), m_tZ(0) {}

    Vector3d(const double &tX, const double &tY, const double &tZ):
        m_tX(tX), m_tY(tY), m_tZ(tZ) {}

    static inline void CrossAndAssign ( const Vector3d& cV1, const Vector3d& cV2, Vector3d& cV )
    {
        cV.m_tX = cV1.m_tY * cV2.m_tZ - cV1.m_tZ * cV2.m_tY;
        cV.m_tY = cV1.m_tZ * cV2.m_tX - cV1.m_tX * cV2.m_tZ;
        cV.m_tZ = cV1.m_tX * cV2.m_tY - cV1.m_tY * cV2.m_tX;
    }

#define FastVectorCrossAndAssign(cV1,cV2,cVOut) { \
    cVOut.m_tX = cV1.m_tY * cV2.m_tZ - cV1.m_tZ * cV2.m_tY; \
    cVOut.m_tY = cV1.m_tZ * cV2.m_tX - cV1.m_tX * cV2.m_tZ; \
    cVOut.m_tZ = cV1.m_tX * cV2.m_tY - cV1.m_tY * cV2.m_tX; }
};
Here is my sample benchmarking code:
Vector3d right;
Vector3d forward(1.0, 2.2, 3.6);
Vector3d up(3.2, 1.4, 23.6);
clock_t start = clock();
for (long l=0; l < 100000000; l++)
{
    Vector3d::CrossAndAssign(forward, up, right); // static inline version
}
clock_t end = clock();

std::cout << end - start << std::endl;

clock_t start2 = clock();
for (long l=0; l<100000000; l++)
{
    FastVectorCrossAndAssign(forward, up, right); // macro version
}
clock_t end2 = clock();

std::cout << end2 - start2 << std::endl;
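The end result: with optimizations turned completely off, the inline version takes 3200 ticks and the macro version 500 ticks. With optimization turned on (/O2, maximize speed, and other speed tweaks), I can get the inline version down to 1100 ticks, which is better but still not the same.
So I appeal to all of you: is this really true? Have I made a silly mistake somewhere? Or are inline functions really this much slower, and if so, why?
Note: After this answer was posted, the original question was edited to remove this problem. I'll leave the answer in place, as it is instructive on several levels.
The loops differ in what they do!
If we manually expand the macro, we get: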
for (long l=0; l<100000000; l++)
    right.m_tX = forward.m_tY * up.m_tZ - forward.m_tZ * up.m_tY;
    right.m_tY = forward.m_tZ * up.m_tX - forward.m_tX * up.m_tZ;
    right.m_tZ = forward.m_tX * up.m_tY - forward.m_tY * up.m_tX;
Note the absence of curly braces. So the compiler sees this as:
for (long l=0; l<100000000; l++)
{
    right.m_tX = forward.m_tY * up.m_tZ - forward.m_tZ * up.m_tY;
}
right.m_tY = forward.m_tZ * up.m_tX - forward.m_tX * up.m_tZ;
right.m_tZ = forward.m_tX * up.m_tY - forward.m_tY * up.m_tX;
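This makes it obvious why the second loop is so much faster: only the first assignment runs on every iteration, while the other two run just once, after the loop has finished.
Update: This is also a good example of why macros are evil :)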
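To make the pitfall concrete, here is a minimal sketch that counts how often each statement of a multi-statement macro actually runs inside a brace-less loop. The macro names, the Counts struct, and the two helper functions are hypothetical and invented for this illustration; they are not from the original post. The second macro uses the common do { ... } while (0) wrapper, which is one conventional way to make a macro expand to a single statement.

```cpp
// Hypothetical macros for illustration only (names are assumptions).

// Brace-less, multi-statement macro: only the first statement binds to a
// preceding `for` or `if` that has no braces of its own.
#define INCREMENT_BOTH_UNSAFE(a, b) ++(a); ++(b)

// Wrapped in do { ... } while (0): expands to exactly one statement, so the
// whole body stays inside the loop and the caller's semicolon ends it.
#define INCREMENT_BOTH_SAFE(a, b) do { ++(a); ++(b); } while (0)

struct Counts
{
    int first;   // incremented by the first statement of the macro
    int second;  // incremented by the second statement of the macro
};

inline Counts runUnsafe(int iterations)
{
    Counts c{0, 0};
    for (int i = 0; i < iterations; ++i)
        INCREMENT_BOTH_UNSAFE(c.first, c.second);  // ++c.second lands after the loop
    return c;
}

inline Counts runSafe(int iterations)
{
    Counts c{0, 0};
    for (int i = 0; i < iterations; ++i)
        INCREMENT_BOTH_SAFE(c.first, c.second);    // both increments stay in the loop
    return c;
}
```

Calling runUnsafe(5) leaves first at 5 but second at 1, whereas runSafe(5) leaves both at 5. An inline function sidesteps the issue entirely, which is the usual argument for preferring it over a macro.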