Profiling/improving memory usage and/or GC time

Author: 手机用户2402851335 | 2023-09-09 11:52


Original

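I'm trying to aggregate a CSV file and am seeing [what I consider to be] excessive memory usage and/or GC effort. The problem seems to appear as the number of groups increases. With hundreds or thousands of keys there is no problem, but once the keys reach the tens of thousands the program quickly starts spending most of its time in GC.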

Update

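Moving from Data.ByteString.Lazy.ByteString to Data.ByteString.Short.ShortByteString significantly reduced memory consumption (to a level I consider reasonable). However, the time spent in GC still seems far higher than I would expect to be necessary. I moved from Data.HashMap.Strict.HashMap to Data.HashTable.ST.Basic.HashTable to see whether mutation in ST would help, but it did not appear to. Below is the current full test code, including generateFile, which creates the test sample: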

{-# LANGUAGE OverloadedStrings #-}

module Main where

import System.IO (withFile, IOMode(WriteMode))
import qualified System.Random as Random

import qualified Data.ByteString.Short as BSS
import qualified Data.ByteString.Lazy.Char8 as BL
import qualified Data.Vector as V
import qualified Data.Vector.Mutable as MV
import qualified Control.Monad.ST as ST

import qualified Data.HashTable.ST.Basic as HT
import qualified Data.HashTable.Class as HT (toList)
import Data.Hashable (Hashable, hashWithSalt)

import Data.List (unfoldr)

import qualified Data.Traversable as T
import Control.Monad (forM_)

instance Hashable a => Hashable (V.Vector a) where
  hashWithSalt s = hashWithSalt s . V.toList

data CSVFormat = CSVFormat {
  csvSeparator :: Char,
  csvWrapper :: Char
}

readCSV :: CSVFormat -> Int -> FilePath -> IO [V.Vector BSS.ShortByteString]
readCSV format skip filepath = BL.readFile filepath >>= return . parseCSV format skip

parseCSV :: CSVFormat -> Int -> BL.ByteString -> [V.Vector BSS.ShortByteString]
parseCSV (CSVFormat sep wrp) skp = drop skp . unfoldr (\bs -> if BL.null bs then Nothing else Just (apfst V.fromList (parseLine bs)))
  where
    {-# INLINE apfst #-}
    apfst f (x,y) = (f x,y)

    {-# INLINE isCr #-}
    isCr c = c == '\r'

    {-# INLINE isLf #-}
    isLf c = c == '\n'

    {-# INLINE isSep #-}
    isSep c = c == sep || isLf c || isCr c

    {-# INLINE isWrp #-}
    isWrp c = c == wrp

    {-# INLINE parseLine #-}
    parseLine :: BL.ByteString -> ([BSS.ShortByteString], BL.ByteString)
    parseLine bs =
      let (field,bs') = parseField bs in
      case BL.uncons bs' of
        Just (c,bs1)
          | isLf c -> (field : [],bs1)
          | isCr c ->
              case BL.uncons bs1 of
                Just (c,bs2) | isLf c -> (field : [],bs2)
                _ -> (field : [],bs1)
          | otherwise -> apfst (field :) (parseLine bs1)
        Nothing -> (field : [],BL.empty)

    {-# INLINE parseField #-}
    parseField :: BL.ByteString -> (BSS.ShortByteString, BL.ByteString)
    parseField bs =
      case BL.uncons bs of
        Just (c,bs')
          | isWrp c -> apfst (BSS.toShort . BL.toStrict . BL.concat) (parseEscaped bs')
          | otherwise -> apfst (BSS.toShort . BL.toStrict) (BL.break isSep bs)
        Nothing -> (BSS.empty,BL.empty)

    {-# INLINE parseEscaped #-}
    parseEscaped :: BL.ByteString -> ([BL.ByteString], BL.ByteString)
    parseEscaped bs =
      let (chunk,bs') = BL.break isWrp bs in
      case BL.uncons bs' of
        Just (_,bs1) ->
          case BL.uncons bs1 of
            Just (c,bs2)
              | isWrp c -> apfst (\xs -> chunk : BL.singleton wrp : xs) (parseEscaped bs2)
              | otherwise -> (chunk : [],bs1)
            Nothing -> (chunk : [],BL.empty)
        Nothing -> error "EOF within quoted string"

aggregate :: [Int]
          -> Int
          -> [V.Vector BSS.ShortByteString]
          -> [V.Vector BSS.ShortByteString]
aggregate groups size records =
  let indices = [0..size - 1] in

  ST.runST $ do
    state <- HT.new

    forM_ records (\record -> do
        let key = V.fromList (map (\g -> record V.! g) groups)

        existing <- HT.lookup state key
        case existing of
          Just x ->
            forM_ indices (\i -> do
                current <- MV.read x i
                MV.write x i $! const current (record V.! i)
              )
          Nothing -> do
            x <- MV.new size
            forM_ indices (\i -> MV.write x i $! record V.! i)
            HT.insert state key x
      )

    HT.toList state >>= T.traverse V.unsafeFreeze . map snd

filedata :: IO ([Int],Int,[V.Vector BSS.ShortByteString])
filedata = do
  records <- readCSV (CSVFormat ',' '"') 1 "file.csv"
  return ([0,1,2],18,records)

main :: IO ()
main = do
  (key,len,records) <- filedata
  print (length (aggregate key len records))

generateFile :: IO ()
generateFile = do
  withFile "file.csv" WriteMode $ \handle -> do
    forM_ [0..650000] $ \_ -> do
      x <- BL.pack . show . truncate . (* 15 ) <$> (Random.randomIO :: IO Double)
      y <- BL.pack . show . truncate . (* 50 ) <$> (Random.randomIO :: IO Double)
      z <- BL.pack . show . truncate . (* 200) <$> (Random.randomIO :: IO Double)
      BL.hPut handle (BL.intercalate "," (x:y:z:replicate 15 (BL.replicate 20 ' ')))
      BL.hPut handle "\n"
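
The ByteString change described in the update can be isolated in a few lines. This is a minimal sketch, not part of the program above: the helper name toField and the sample input are assumptions made for illustration. It shows the same BSS.toShort . BL.toStrict conversion that parseField uses, which copies each field out of the lazy input into a compact, unpinned ShortByteString so that long-lived keys do not keep chunks of the input file alive.

```haskell
module Main where

import qualified Data.ByteString.Lazy.Char8 as BL
import qualified Data.ByteString.Short as BSS

-- Hypothetical helper: copy one lazy field into a ShortByteString.
-- The copy is what detaches the field from the lazy input chunk.
toField :: BL.ByteString -> BSS.ShortByteString
toField = BSS.toShort . BL.toStrict

main :: IO ()
main = do
  -- Sample line invented for this sketch; real input comes from file.csv.
  let line   = BL.pack "12,34,56"
      fields = map toField (BL.split ',' line)
  -- Print the byte length of each converted field.
  print (map BSS.length fields)
```

The conversion copies every field once, so it trades some allocation for lower residency.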

I get the following profiling results:

17,525,392,208 bytes allocated in the heap
27,394,021,360 bytes copied during GC
   285,382,192 bytes maximum residency (129 sample(s))
     3,714,296 bytes maximum slop
           831 MB total memory in use (0 MB lost due to fragmentation)

                                   Tot time (elapsed)  Avg pause  Max pause
Gen  0       577 colls,     0 par    1.576s   1.500s     0.0026s    0.0179s
Gen  1       129 colls,     0 par   25.335s  25.663s     0.1989s    0.2889s

TASKS: 3 (1 bound, 2 peak workers (2 total), using -N1)

SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

INIT    time    0.000s  (  0.002s elapsed)
MUT     time   11.965s  ( 23.939s elapsed)
GC      time   15.148s  ( 15.400s elapsed)
RP      time    0.000s  (  0.000s elapsed)
PROF    time   11.762s  ( 11.763s elapsed)
EXIT    time    0.000s  (  0.088s elapsed)
Total   time   38.922s  ( 39.429s elapsed)

Alloc rate    1,464,687,582 bytes per MUT second

Productivity  30.9% of total user, 30.5% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
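
The same counters can also be read from inside the program, which makes it easier to compare runs with different key counts. The following is a rough sketch under stated assumptions: the helper gcShare is hypothetical, and the statistics are only available when the program is run with +RTS -T (the binary may need to be linked with -rtsopts). It uses GHC.Stats from base to report the GC share of CPU time. With the figures above (GC 15.148s, MUT 11.965s) that share would be roughly 56% of GC plus mutator time.

```haskell
module Main where

import Data.Int (Int64)
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

-- Hypothetical helper: fraction of (GC + mutator) CPU time spent in GC.
-- Returns 0 when no time has been recorded, to avoid dividing by zero.
gcShare :: Int64 -> Int64 -> Double
gcShare gcNs mutNs
  | total <= 0 = 0
  | otherwise  = fromIntegral gcNs / fromIntegral total
  where
    total = gcNs + mutNs

main :: IO ()
main = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then putStrLn "RTS statistics are off; rerun with +RTS -T"
    else do
      s <- getRTSStats
      putStrLn ("allocated bytes: " ++ show (allocated_bytes s))
      putStrLn ("copied bytes:    " ++ show (copied_bytes s))
      putStrLn ("max live bytes:  " ++ show (max_live_bytes s))
      putStrLn ("GC share of CPU: " ++ show (gcShare (gc_cpu_ns s) (mutator_cpu_ns s)))
```

In a real run the getRTSStats call would go after the aggregation has been forced, so that the counters cover the work being measured.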

And the following heap visualization (image not available):
