A 400GB table, one query - need tuning ideas (SQL2005)

Author: ar_wen2402851455 | 2023-09-02 10:17

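I have a single large table that I would like to optimize. I am using MS-SQL 2005 Server. I will try to describe how the table is used, and I would be very grateful for any suggestions.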

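The table is about 400GB and holds about 100 million rows, with about 1 million rows inserted each day. The table has 8 columns: 1 data column and 7 columns used for lookups/ordering.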

 k1 k2 k3 k4 k5 k6 k7 d1

where

 k1: varchar(3), primary key - clustered index, 10 possible values
 k2: bigint, primary key - clustered index, total rows/10 possible values
 k3: int, 10 possible values
 k4: money, 100 possible values
 k5: bool
 k6: bool
 k7: DateTime
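For readers who want to reproduce the setup, the description above can be written as a table definition. This is only a sketch: the table name `dbo.BigTable`, the constraint name, the nullability of each column, and the type of `d1` are assumptions, since the post does not state them.

```sql
-- Hypothetical reconstruction of the table described above.
-- Assumed: table name, constraint name, NOT NULL on the key columns, type of d1.
CREATE TABLE dbo.BigTable
(
    k1 varchar(3) NOT NULL,  -- about 10 distinct values
    k2 bigint     NOT NULL,  -- roughly (total rows / 10) distinct values
    k3 int        NOT NULL,  -- about 10 distinct values
    k4 money      NOT NULL,  -- about 100 distinct values
    k5 bit        NOT NULL,  -- SQL Server has no bool type; bit is the usual stand-in
    k6 bit        NOT NULL,
    k7 datetime   NOT NULL,  -- used only in ORDER BY
    d1 varbinary(max) NULL,  -- assumption: the post does not give the data column's type
    CONSTRAINT PK_BigTable PRIMARY KEY CLUSTERED (k1, k2)
);
```

With 400GB spread over about 100 million rows, the average row is about 4KB, so most of each row is presumably the `d1` payload.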

Only one SELECT query is run, and it looks like this:

 SELECT TOP(g) d1 FROM table WITH(NOLOCK)
  WHERE k1 = a
  AND k3 = c
  AND k4 = d
  AND k5 = e
  AND k6 = f
  ORDER BY k7

where g = about 1 million. We run this query about 10 times per day (often while inserts are happening), and it takes about 5-30 minutes.

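So at the moment I only have a clustered index on the two primary key columns. My question is: what indexes should I add to improve this query's performance?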

Would a separate index on every column be a good choice? I think a single index would take up about 5-8GB. The DB server has 8GB of RAM in total.
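The size of existing indexes can be measured rather than estimated. The query below is a minimal sketch that reads `sys.dm_db_partition_stats` (available in SQL Server 2005); the table name `dbo.BigTable` is a placeholder, not the real name.

```sql
-- Used space per index on one table, in MB (a page is 8 KB).
-- dbo.BigTable is a hypothetical table name.
SELECT i.name      AS index_name,
       i.type_desc AS index_type,
       SUM(ps.used_page_count) * 8 / 1024 AS used_mb
FROM sys.dm_db_partition_stats AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id
WHERE ps.object_id = OBJECT_ID(N'dbo.BigTable')
GROUP BY i.name, i.type_desc
ORDER BY used_mb DESC;
```

Running it before and after creating a trial index on a copy of the data would show whether the 5-8GB guess holds.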

Please don't just say that the best thing to do is to experiment. That is akin to saying "I don't know, work it out yourself" :)

Any tips are much appreciated!

Edit by doofledorfer -

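You have triggered an outbreak of premature optimization here, if not outright suggestions that "the best thing to do is experiment". You need to clarify a number of points if you want useful help.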

- doofledorfer

EDIT: Comments on the posts so far are now given below, along with the query plan - Mr. Flibble

You are probably I/O bound

Yes, it is not CPU bound. Disk access is high. All available RAM seems to be used. Whether it is used wisely remains to be seen.
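One way to put numbers on the disk access is to run the query with I/O and timing statistics switched on. This is a sketch only: `dbo.BigTable` and the parameter values are placeholders, and the query is the one from the question with variables substituted for a, c, d, e, f and g.

```sql
SET STATISTICS IO ON;    -- reports logical, physical and read-ahead reads per table
SET STATISTICS TIME ON;  -- reports CPU time and elapsed time

-- Placeholder parameter values; SQL Server 2005 needs a separate assignment.
DECLARE @a varchar(3), @c int, @d money, @e bit, @f bit;
SELECT @a = 'abc', @c = 1, @d = 1.00, @e = 1, @f = 0;

SELECT TOP (1000000) d1
FROM dbo.BigTable WITH (NOLOCK)
WHERE k1 = @a
  AND k3 = @c
  AND k4 = @d
  AND k5 = @e
  AND k6 = @f
ORDER BY k7;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

A large gap between elapsed time and CPU time, together with a high physical read count, would be consistent with the query being I/O bound.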

You say you can't split the data because all of the data is used: impossible

I mean that all data is used at some point - not that all data is used by each user in each query. I can certainly split the data but, so far, I don't understand why partitioning the table is any better than using a clustered index.

Why did you choose these types? VARCHAR probably should have been INT, as it can only take a few values. The rest are sensible enough: Money represents a money value in real life, bigint is an ID, and the bools are on/off type things :)

By any chance, could we have a look at the insert statement, whether it is T-SQL or a bulk insert?

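T-SQL. It is basically INSERT INTO table VALUES (k1,k2,k3,k4,k5,k6,d1). The only thing that is at all interesting is that many duplicate inserts are attempted, and the PK constraint on k1 and k2 is used to keep duplicate data out of the database. I believed at design time (and still do) that this was as quick a way as any to filter out duplicate data.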
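The duplicate-rejection pattern described above can be sketched on the server side as follows. The real inserts are issued from ADO.NET, so this T-SQL is only an analogue of what the client code presumably does; the table name, the explicit column list (including k7) and all values are assumptions.

```sql
-- Placeholder values for one row.
DECLARE @k1 varchar(3), @k2 bigint, @k3 int, @k4 money,
        @k5 bit, @k6 bit, @k7 datetime, @d1 varbinary(max);
SELECT @k1 = 'abc', @k2 = 42, @k3 = 1, @k4 = 1.00,
       @k5 = 1, @k6 = 0, @k7 = GETDATE(), @d1 = 0x00;

BEGIN TRY
    INSERT INTO dbo.BigTable (k1, k2, k3, k4, k5, k6, k7, d1)
    VALUES (@k1, @k2, @k3, @k4, @k5, @k6, @k7, @d1);
END TRY
BEGIN CATCH
    -- Error 2627 is a PRIMARY KEY / UNIQUE constraint violation:
    -- the row already exists, so it is silently skipped.
    IF ERROR_NUMBER() <> 2627
    BEGIN
        -- Anything else is re-raised (SQL Server 2005 has no THROW).
        DECLARE @msg nvarchar(2048);
        SET @msg = ERROR_MESSAGE();
        RAISERROR(N'%s', 16, 1, @msg);
    END
END CATCH;
```

Each rejected duplicate still costs a seek into the clustered index plus the error handling, which may matter given how many duplicates are attempted.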

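Can you say how often your inserts happen? Every 10 minutes or so the inserts run (ADO.NET), maybe 10K at a time, and take a few minutes. I estimate that a full day's inserts currently take up 40% of the time in the day.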
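The figures above can be cross-checked with a little arithmetic, written here as a throwaway query. It assumes exactly one run every 10 minutes and an even spread of the 1 million daily rows, neither of which the post guarantees.

```sql
SELECT
    24 * 60 / 10                      AS runs_per_day,           -- 144 runs
    0.40 * 24 * 60                    AS insert_minutes_per_day, -- 576 minutes
    (0.40 * 24 * 60) / (24 * 60 / 10) AS minutes_per_run,        -- 4 minutes
    1000000 / (24 * 60 / 10)          AS avg_new_rows_per_run;   -- 6944 (integer division)
```

About 4 minutes per run agrees with "a few minutes". The average of roughly 6,900 new rows per run is below the quoted 10K, which might be explained by the rejected duplicates, though that is a guess.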

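Does the DateTime field contain the date of the insert? There is actually another DateTime column, but it is not retrieved in any SELECT query, so I did not mention it for the sake of simplicity.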

How did you arrive at this? More by thinking in terms of man-days.

if you're interested only in the last data, deleting/archiving the useless data could make sense (start from scratch every morning)

I am not interested in recent data only. A query may select some of the very first data that was inserted into the table all the way up to data inserted minutes ago. But as the data is filtered this does not mean that all the data in the DB is requested in that query.

if there is only one "inserter" and only one "reader", you may want to switch to a specialised type (hashmap/list/deque/stack) or something more elaborated, in a programming language.

I will probably stick with MSSQL for the moment. It's not broke yet, just a little slow.

liggett78, do you suggest a clustered index on columns k1,k4,k5,k6,k3 or a non-clustered index on those columns?

My main question right now is should I extend the current clustered index to contain k4 also (this is the col with next most possible values) or should I just add a non-clustered index to k4.
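To make the non-clustered options concrete, here is a sketch of what they might look like. The first statement uses the column order attributed to liggett78 above; the second appends k7, which is an assumption of this sketch and not something proposed in the thread. The table and index names are placeholders, and only one of the two would be created in practice.

```sql
-- Option 1: non-clustered index on the filter columns, in liggett78's order.
CREATE NONCLUSTERED INDEX IX_BigTable_k1_k4_k5_k6_k3
    ON dbo.BigTable (k1, k4, k5, k6, k3);

-- Option 2 (assumption): same leading columns plus k7. Because the query
-- tests every leading column for equality, the matching index entries are
-- already stored in k7 order, which may let the optimizer skip the sort.
CREATE NONCLUSTERED INDEX IX_BigTable_k1_k4_k5_k6_k3_k7
    ON dbo.BigTable (k1, k4, k5, k6, k3, k7);
```

Either index stores only the key columns plus the clustered key (k1, k2), so it should be far smaller than the table. On the other hand, every returned row still needs a lookup into the clustered index to fetch d1, and with up to 1 million rows per query that lookup cost may dominate.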

Would adding all of k1-k6 to the clustered index be an option, with a separate non-clustered index on the DateTime column for the ORDER BY? Am I correct in thinking that this would not cause any major increase in DB size and would only affect insert times? Can anyone guesstimate the effect this would have on inserts?

I think that if adding indexes to all the columns will double the DB size then it is not viable without large (ie. hardware) changes.

The following plan was run with an index (non clustered) on the DATE column.

EDIT: Not sure if you can see the XML below so here is a link to it: http://conormccarthy.com/box/queryplan.sqlplan.txt
