
Splitting a file into chunks


I am trying to split a file that has this format:

@some 
@garbage
@lines
@target G0.S0
@type xy
 -0.108847E+02  0.489034E-04
 -0.108711E+02  0.491023E-04
 -0.108574E+02  0.493062E-04
 -0.108438E+02  0.495075E-04
 -0.108302E+02  0.497094E-04
 ....Unknown line numbers...
&
@target G0.S1
@type xy
 -0.108847E+02  0.315559E-04
 -0.108711E+02  0.316844E-04
 -0.108574E+02  0.318134E-04
 ....Unknown line numbers...
&
@target G1.S0
@type xy
 -0.108847E+02  0.350450E-04
 -0.108711E+02  0.351669E-04
 -0.108574E+02  0.352908E-04
&
@target G1.S1
@type xy
 -0.108847E+02  0.216396E-04
 -0.108711E+02  0.217122E-04
 -0.108574E+02  0.217843E-04
 -0.108438E+02  0.218622E-04

The @target Gx.Sy combinations are unique, and each set of data is always delimited by a &.
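
For reference, the Gx.Sy label on a @target line can be turned into an integer (G, S) pair; a minimal sketch (the helper name parse_target is mine, for illustration only):

def parse_target(line):
    """Turn '@target G0.S1' into the integer pair (0, 1)."""
    g_str, s_str = line.split()[-1].split(".")   # ["G0", "S1"]
    return int(g_str[1:]), int(s_str[1:])        # strip the leading "G"/"S"

print(parse_target("@target G0.S1"))  # -> (0, 1)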

I have managed to split the file into chunks:

#!/usr/bin/env python3
import os
import sys
import itertools as it
import numpy as np
import matplotlib.pyplot as plt

try:
  filename = sys.argv[1]
  print(filename)
except IndexError:
  print("ERROR: Required filename not provided")

with open(filename, "r") as f:
  for line in f:
    if line.startswith("@target"):
      print(line.split()[-1].split("."))

x=[];y=[]
with open(filename, "r") as f:
  for key,group in it.groupby(f,lambda line: line.startswith('@target')):
    print(key)
    if not key:
        group = list(group)
        group.pop(0)
        # group.pop(-1)
        print(group)
        for i in range(len(group)):
          x.append(group[i].split()[0])
          y.append(group[i].split()[1])
        nx=np.array(x)
        ny=np.array(y)

I have two problems:

1) The preamble lines before the real data also get grouped, so the script does not work if there is any preamble. There is no way to predict how many preamble lines there will be; what I am trying to do is group only what comes after @target (see the sketch after the update below).

2) I would like to name the arrays as G0[S0, S1] and G1[S0, S1], but I cannot manage to do that.

Please help.

UPDATE: I am trying to store these data in nested np arrays, as G0[S0, S1, ...] and G1[S0, S1, ...], so that I can use them in matplotlib.
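
For problem 1, one way to skip the unknown-length preamble is to drop lines until the first @target line; a minimal sketch (using the same filename variable as above):

import itertools as it

with open(filename, "r") as f:
    # discard everything before the first "@target" line
    data_lines = it.dropwhile(lambda line: not line.startswith("@target"), f)
    for line in data_lines:
        print(line.rstrip())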

3 Answers

  • 1

    The following functions do the job:

    import numpy as np
    from collections import defaultdict
    
    def read_without_preamble(filename):
        # return the file's lines starting from the first "@target" line
        with open(filename, 'r') as f:
            lines = f.readlines()
        for i, line in enumerate(lines):
            if line.startswith('@target'):
                return lines[i:]
    
    def split_into_chunks(lines):
        chunks = defaultdict(dict)
        for line in lines:
            if line.startswith('@target'):
                # "@target G0.S1" -> G = 0, S = 1; start a new, empty chunk
                GS_str = line.strip().split()[-1].split('.')
                G, S = map(lambda x: int(x[1:]), GS_str)
                chunks[G][S] = []
            elif line.startswith('@type xy'):
                pass
            elif line.startswith('&'):
                # "&" ends the current chunk: freeze it into a NumPy array
                chunks[G][S] = np.asarray(chunks[G][S])
            else:
                # data row: parse both columns as floats
                # (wrap map() in list() so Python 3 stores numbers, not lazy map objects)
                xy_str = line.strip().split()
                chunks[G][S].append(list(map(float, xy_str)))
        return chunks
    

    To split your file into chunks, just run this code:

    import sys

    try:
      filename = sys.argv[1]
      print(filename)
    except IndexError:
      print("ERROR: Required filename not provided")
    
    data = read_without_preamble(filename)
    chunks = split_into_chunks(data)
    

    Step-by-step walkthrough

    chunks is a dictionary whose keys are the values of G (0 and 1 in this example):

    In [415]: type(chunks)
    Out[415]: dict
    
    In [416]: for k in chunks.keys(): print(k)
    0
    1
    

    Each value of the chunks dictionary is in turn a dictionary, whose keys are the values of S (0, 1 and 2 in this example) and whose values are NumPy arrays containing the numerical data of Gi.Sn. You can access this chunk data as chunks[i][n], where the indices i and n are the values of G and S, respectively.

    In [417]: type(chunks[0])
    Out[417]: dict
    
    In [418]: for k in chunks[0].keys(): print(k)
    0
    1
    2
    
    In [419]: type(chunks[1][2])
    Out[419]: numpy.ndarray
    
    In [420]: chunks[1][2]
    Out[420]: 
    array([[ -1.08851000e+01,   2.53058000e-05],
           [ -1.08715000e+01,   2.55353000e-05],
           [ -1.08579000e+01,   2.57745000e-05],
           [ -1.08443000e+01,   2.60225000e-05],
           [ -1.08306000e+01,   2.62617000e-05],
           [ -1.08170000e+01,   2.65097000e-05],
           [ -1.08034000e+01,   2.67666000e-05]])
    

    For any i and n, chunks[i][n].shape[1] is 2, but chunks[i][n].shape[0] can take any value, i.e. the number of rows of numerical data may vary from one chunk to another.
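
    Since the stated goal is to use these chunks in matplotlib, here is a minimal plotting sketch built on the chunks dictionary above (the label format is just an illustration):

    import matplotlib.pyplot as plt

    for G, series in chunks.items():
        for S, xy in series.items():
            # column 0 holds x, column 1 holds y
            plt.plot(xy[:, 0], xy[:, 1], label='G{}.S{}'.format(G, S))
    plt.legend()
    plt.show()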

    formatted_file.txt

    This is the file I used in the sample run. It consists of six chunks, namely G0.S0, G0.S1, G0.S2, G1.S0, G1.S1 and G1.S2.

    @some 
    @garbage
    @lines
    @target G0.S0
    @type xy
     -0.108851E+02  0.127435E-03
     -0.108715E+02  0.127829E-03
     -0.108579E+02  0.128191E-03
     -0.108443E+02  0.128502E-03
     -0.108306E+02  0.128726E-03
     -0.108170E+02  0.128838E-03
     -0.108034E+02  0.128751E-03
    &
    @target G0.S1
    @type xy
     -0.108851E+02  0.472694E-04
     -0.108715E+02  0.474233E-04
     -0.108579E+02  0.475837E-04
     -0.108443E+02  0.477448E-04
     -0.108306E+02  0.479052E-04
     -0.108170E+02  0.480669E-04
     -0.108034E+02  0.482279E-04
    &
    @target G0.S2
    @type xy
     -0.108851E+02  0.253654E-04
     -0.108715E+02  0.255956E-04
     -0.108579E+02  0.258346E-04
     -0.108443E+02  0.260825E-04
     -0.108306E+02  0.263303E-04
     -0.108170E+02  0.265781E-04
     -0.108034E+02  0.268349E-04
    &
    @target G1.S0
    @type xy
     -0.108851E+02  0.108786E-03
     -0.108715E+02  0.109216E-03
     -0.108579E+02  0.109651E-03
     -0.108443E+02  0.110116E-03
     -0.108306E+02  0.110552E-03
     -0.108170E+02  0.111011E-03
     -0.108034E+02  0.111489E-03
    &
    @target G1.S1
    @type xy
     -0.108851E+02  0.278045E-04
     -0.108715E+02  0.278711E-04
     -0.108579E+02  0.279384E-04
     -0.108443E+02  0.280050E-04
     -0.108306E+02  0.280723E-04
     -0.108170E+02  0.281395E-04
     -0.108034E+02  0.282074E-04
    &
    @target G1.S2
    @type xy
     -0.108851E+02  0.253058E-04
     -0.108715E+02  0.255353E-04
     -0.108579E+02  0.257745E-04
     -0.108443E+02  0.260225E-04
     -0.108306E+02  0.262617E-04
     -0.108170E+02  0.265097E-04
     -0.108034E+02  0.267666E-04
    &
    
  • 1

    Here is an approach using generators and np.genfromtxt. Advantage: it is memory-friendly. It filters the file on the fly, so there is no need to load the entire content into memory for processing.

    UPDATE:

    I streamlined the code and changed the output format to an array of arrays. For example, if G ranges over 0...3 and S ranges over 0...5, it creates a 4x6 array that contains the data arrays.

    import numpy as np
    from itertools import dropwhile, groupby
    from operator import itemgetter
    
    def load_chunks(f):
        # accept either a file name or an already opened file object
        f = open(f, 'rt') if isinstance(f, str) else f
        # drop blank lines and the "&" chunk terminators
        f = filter(lambda l: not l.strip() in ("", "&"), f)
        tok = "@target", "@type"
        # group lines into alternating header/data blocks, dropping the preamble
        fg = dropwhile(itemgetter(0), groupby(f, lambda l: not l.split()[0] in tok))
        I, D = [], []
        for k, g in fg:
            # header block: read "Gx.Sy" from the @target line
            info = next(l.split() for l in g)[1]
            I.append([int(key[1:]) for key in info.split('.')])
            # the next group is the data block: parse it with np.genfromtxt
            D.append(np.genfromtxt((l.encode() for l in next(fg)[1])))
        # scatter the chunks into a (max G + 1) x (max S + 1) object array
        G, S = np.array(I).T
        res = np.empty((np.max(G)+1, np.max(S)+1), dtype=object)
        res[G, S] = D
        return res
    
    fn = <your_file_name>
    
    ara = load_chunks(fn)
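
    A short usage sketch (assuming fn points at the file from the question, so both G and S range over 0..1; np.ndenumerate walks the object array cell by cell):

    # ara is a 2-D object array indexed as ara[G, S]; each cell holds an
    # (n, 2) float array with that chunk's x/y data, or None if no such chunk exists
    print(ara.shape)
    print(ara[0, 1])                    # data of the G0.S1 chunk
    for (G, S), chunk in np.ndenumerate(ara):
        if chunk is not None:
            print(G, S, chunk.shape)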
    
  • 1

    EDIT - I took the feedback on my list-based approach and decided to switch it to dictionaries. This solution has the slight advantages of lower memory consumption and of being fully dynamic (i.e. it does not rely on knowing the number of G chunks a priori).


    I have used the re package, which is similar to how numpy handles I/O under the hood in loadtxt(). Also, since there is really no point in creating a nested NumPy array of NumPy arrays, I simply return a nested built-in dict of NumPy arrays. Since your data is ragged, this works just as well (and is simpler):

    import numpy as np
    import re
    from collections import defaultdict
    
    COMMENT_REGEX = re.compile(r'@')
    TERMINATION_REGEX = re.compile(r'&')
    TARGET_REGEX = re.compile(r'@target G(\d+)\.S(\d+)')
    
    
    def load(filename):
        X = []
        g = None
        # chunkd maps G -> {S -> np.array of that chunk's data}
        chunkd = defaultdict(dict)
    
        with open(filename) as fh:
            for line in fh:
                # comments match
                if COMMENT_REGEX.match(line):
                    target_match = TARGET_REGEX.match(line)
                    # look for target info
                    if target_match:
                        # start keeping track of g for the new group
                        g, s = [int(x) for x in target_match.groups()]
                        # reset x
                        X = []
                # chunk termination string match
                elif TERMINATION_REGEX.match(line):
                    if g is not None:
                        # create a np.array out of the previous chunk's data
                        X = np.array(X)
                        chunkd[g][s] = X
                # data found
                else:
                    # append data as a 2-element tuple onto a 1D list
                    X.append(tuple([float(x) for x in line.split()]))
    
        return chunkd
    

    Just pass the correct G, S coordinates into the returned dictionary to access the data:

    arr = load('chunks.txt')
    print(arr[1][1])
    [[ -1.08847000e+01   4.89034000e-05]
     [ -1.08711000e+01   4.91023000e-05]
     [ -1.08574000e+01   4.93062000e-05]
     [ -1.08438000e+01   4.95075000e-05]
     [ -1.08302000e+01   4.97094000e-05]]
    
