sql – 连续重复/重复的有序计数

发布时间：2021-01-24 10:41:25 所属栏目：MsSql教程来源：网络整理

导读：副标题#e# 我非常怀疑我是以最有效的方式做到这一点,这就是我在这里标记plpgsql的原因.对于一千个测量系统,我需要在20亿行上运行它. 您有测量系统,当它们失去连接时经常报告先前的值,并且它们经常失去连接,但有时会长时间失去连接.您需要聚合,但是当您这样

副标题[/!--empirenews.page--]

我非常怀疑我是以最有效的方式做到这一点,这就是我在这里标记plpgsql的原因.对于一千个测量系统,我需要在20亿行上运行它.

您有测量系统,当它们失去连接时经常报告先前的值,并且它们经常失去连接,但有时会长时间失去连接.您需要聚合,但是当您这样做时,您需要查看它重复多长时间并根据该信息制作各种过滤器.假设你在汽车上测量mpg,但它在20英里/加仑的速度下停留一小时,而不是在20.1左右,依此类推.你会想要在卡住时评估准确性.您还可以放置一些替代规则来查找汽车在高速公路上的情况,并且通过窗口功能,您可以生成汽车的“状态”并且可以分组.无需再费周折：

--here's my data,you have different systems,the time of measurement,and the actual measurement
--as well,the raw data has whether or not it's a repeat (hense the included window function
select * into temporary table cumulative_repeat_calculator_data
FROM
    (
    select 
    system_measured,time_of_measurement,measurement,case when 
     measurement = lag(measurement,1) over (partition by system_measured order by time_of_measurement asc) 
     then 1 else 0 end as repeat
    FROM
    (
    SELECT 5 as measurement,1 as time_of_measurement,1 as system_measured
    UNION
    SELECT 150 as measurement,2 as time_of_measurement,1 as system_measured
    UNION
    SELECT 5 as measurement,3 as time_of_measurement,4 as time_of_measurement,2 as system_measured
    UNION
    SELECT 5 as measurement,2 as system_measured
    UNION
    SELECT 150 as measurement,5 as time_of_measurement,6 as time_of_measurement,7 as time_of_measurement,8 as time_of_measurement,2 as system_measured
    ) as data
) as data;

--unfortunately you can't have window functions within window functions,so I had to break it down into subquery
--what we need is something to partion on,the 'state' of the system if you will,so I ran a running total of the nonrepeats
--this creates a row that stays the same when your data is repeating - aka something you can partition/group on
select * into temporary table cumulative_repeat_calculator_step_1
FROM
    (
    select 
    *,sum(case when repeat = 0 then 1 else 0 end) over (partition by system_measured order by time_of_measurement asc) as cumlative_sum_of_nonrepeats_by_system
    from cumulative_repeat_calculator_data
    order by system_measured,time_of_measurement
) as data;

--finally,the query. I didn't bother showing my desired output,because this (finally) got it
--I wanted a sequential count of repeats that restarts when it stops repeating,and starts with the first repeat
--what you can do now is take the average measurement under some condition based on how long it was repeating,for example  
select *,case when repeat = 0 then 0
else
row_number() over (partition by cumlative_sum_of_nonrepeats_by_system,system_measured order by time_of_measurement) - 1
end as ordered_repeat
from cumulative_repeat_calculator_step_1
order by system_measured,time_of_measurement

那么,为了在巨大的桌子上运行它,或者你会使用什么替代工具,你会采取哪些不同的做法？我正在考虑plpgsql,因为我怀疑这需要在数据库中完成,或者在数据插入过程中,尽管我通常在数据加载后处理它.有没有办法在一次扫描中得到这个而不诉诸子查询？

我已经测试了一种替代方法,但它仍然依赖于子查询,我认为这更快.对于该方法,您可以使用start_timestamp,end_timestamp,system创建“启动和停止”表.然后你加入更大的表,如果时间戳在那些之间,你将它归类为处于该状态,这实际上是cumlative_sum_of_nonrepeats_by_system的替代.但是当你这样做时,你加入1 = 1的数千个设备和数千或数百万’事件’.你认为这是一个更好的方式吗？

解决方法

测试用例

首先,一个更有用的方式来呈现您的数据 – 甚至更好,在sqlfiddle中,准备好玩：

CREATE TEMP TABLE data(
   system_measured int,time_of_measurement int,measurement int
);

INSERT INTO data VALUES
 (1,1,5),(1,2,150),3,4,(2,5,6,7,8,5);

简化查询

由于目前尚不清楚,我只假设上述情况.
接下来,我简化了您的查询以达到：

WITH x AS (
   SELECT *,CASE WHEN lag(measurement) OVER (PARTITION BY system_measured
                               ORDER BY time_of_measurement) = measurement
                  THEN 0 ELSE 1 END AS step
   FROM   data
   ),y AS (
   SELECT *,sum(step) OVER(PARTITION BY system_measured
                            ORDER BY time_of_measurement) AS grp
   FROM   x
   )
SELECT *,row_number() OVER (PARTITION BY system_measured,grp
                             ORDER BY time_of_measurement) - 1 AS repeat_ct
FROM   y
ORDER  BY system_measured,time_of_measurement;

现在,虽然使用纯SQL一切都很好,但是使用plpgsql函数会更快,因为它可以在单个表扫描中执行此操作,此查询至少需要三次扫描.

使用plpgsql函数更快：

CREATE OR REPLACE FUNCTION x.f_repeat_ct()
  RETURNS TABLE (
    system_measured int,measurement int,repeat_ct int
  )  LANGUAGE plpgsql AS
$func$
DECLARE
   r    data;     -- table name serves as record type
   r0   data;
BEGIN

-- SET LOCAL work_mem = '1000 MB';  -- uncomment an adapt if needed,see below!

repeat_ct := 0;   -- init

FOR r IN
   SELECT * FROM data d ORDER BY d.system_measured,d.time_of_measurement
LOOP
   IF  r.system_measured = r0.system_measured
       AND r.measurement = r0.measurement THEN
      repeat_ct := repeat_ct + 1;   -- start new array
   ELSE
      repeat_ct := 0;               -- start new count
   END IF;

   RETURN QUERY SELECT r.*,repeat_ct;

   r0 := r;                         -- remember last row
END LOOP;

END
$func$;

呼叫：

SELECT * FROM x.f_repeat_ct();

确保在这种plpgsql函数中始终对列名进行表限定,因为我们使用相同的名称作为输出参数,如果不合格则优先使用.

数十亿行

如果您有数十亿行,则可能需要将此操作拆分.我引用手册here：

（编辑：济南站长网）

【声明】本站内容均来自网络，其相关言论仅代表作者个人观点，不代表本站立场。若无意侵犯到您的权利，请及时与联系站长删除相关内容!

1/3

尾页