首页 文章

在PYTHON中的两个CSV文件中查找公共区域

提问于
浏览
0

我有两个CSV文件,每个文件有10列,第一列称为“主键” .

我需要使用Python来查找两个CSV文件之间的公共区域 . 例如,我应该能够检测到CSV1中的行27-45等于CSV2中的行125-145,依此类推 .

我只是比较主键(第一列) . 其余数据不考虑进行比较 . 我需要在两个单独的CSV文件中提取这些公共区域(一个用于CSV1,一个用于CSV2) .

我已经解析并将两个CSV文件的行存储在两个'list of lists', lstCAN_LOG_TABLElstSHADOW_LOG_TABLE 中,因此问题会缩小到比较这两个列表列表 .

我目前假设的是,如果有10个后续比赛( MAX_COMMON_THRESHOLD ),我已到达公共区域的开头 . 我不能记录单行(比较真实),因为会有区域相等(按主键)和我需要识别的区域 .

for index in range(len(lstCAN_LOG_TABLE)):
    for l_index in range(len(lstSHADOW_LOG_TABLE)):
        if(lstSHADOW_LOG_TABLE[l_index][1] == lstCAN_LOG_TABLE[index][1]):  #Consider for comparison only CAN IDs
            index_can_log = index                                           #Position where CAN Log is to be compared
            index_shadow_log = l_index                                      #Position from where CAN Shadow Log is to be considered
            start = index_shadow_log
            if((index_shadow_log + MAX_COMMON_THRESHOLD) <= (input_file_two_row_count-1)):
                end = index_shadow_log + MAX_COMMON_THRESHOLD
            else:
                end = (index_shadow_log) + ((input_file_two_row_count-1) - (index_shadow_log))
            can_index = index
            bPreScreened = 1
            for num in range(start,end):
                if(lstSHADOW_LOG_TABLE[num][1] == lstCAN_LOG_TABLE[can_index][1]):
                    if((can_index + 1) < (input_file_one_row_count-1)):
                        can_index = can_index + 1                           
                    else:
                        break   
                else:
                    bPreScreened = 0
                    print("No Match")
                    break
            #we might have found start of common region         
            if(bPreScreened == 1):      
                print("Start={0} End={1} can_index={2}".format(start,end,can_index))
                for number in range(start,end):
                    if(lstSHADOW_LOG_TABLE[number][1] == lstCAN_LOG_TABLE[index][1]):                           
                        writer_two.writerow(lstSHADOW_LOG_TABLE[number][0])
                        writer_one.writerow(lstCAN_LOG_TABLE[index][0])
                        if((index + 1) < (input_file_one_row_count-1)):
                            index = index + 1                           
                        else:
                            dump_file.close()   
                            print("\nCommon Region in Two CSVs identifed and recorded\n")                           
                            return
dump_file.close()   
print("\nCommon Region in Two CSVs identifed and recorded\n")

我的输出很奇怪 . 即使第一个CSV文件只有1880行,但在第一个CSV的公共区域CSV中,我得到的条目也多得多 . 我没有得到理想的输出 .

EDITED FROM HERE

CSV1:

216 0.000238225 F4  41  C0  FB  28  0   0   0   MS CAN
109 0.0002256   15  8B  31  0   8   43  58  0   HS CAN
216 0.000238025 FB  47  C6  1   28  0   0   0   MS CAN
340 0.000240175 0A  18  0   C2  0   0   6F  FF  MS CAN
216 0.000240225 24  70  EF  28  28  0   0   0   MS CAN
216 0.000236225 2B  77  F7  2F  28  0   0   0   MS CAN
216 0.0002278   31  7D  FD  35  28  0   0   0   MS CAN

CSV2:

216 0.0002361   0F  5C  DB  14  28  0   0   0   MS CAN
216 0.000236225 16  63  E2  1B  28  0   0   0   MS CAN
109 0.0001412   16  A3  31  0   8   63  58  0   HS CAN
216 0.000234075 1C  6A  E9  22  28  0   0   0   MS CAN
40A 0.000259925 C1  1   46  54  30  44  47  36  HS CAN
4A  0.000565975 2   0   0   0   0   0   0   C0  MS CAN
340 0.000240175 0A  18  0   C2  0   0   6F  FF  MS CAN
216 0.000240225 24  70  EF  28  28  0   0   0   MS CAN
216 0.000236225 2B  77  F7  2F  28  0   0   0   MS CAN
216 0.0002278   31  7D  FD  35  28  0   0   0   MS CAN

EXPECTED OUTPUT CSV1:

340 0.000240175 0A  18  0   C2  0   0   6F  FF  MS CAN
216 0.000240225 24  70  EF  28  28  0   0   0   MS CAN
216 0.000236225 2B  77  F7  2F  28  0   0   0   MS CAN
216 0.0002278   31  7D  FD  35  28  0   0   0   MS CAN

EXPECTED OUTPUT CSV2:

340 0.000240175 0A  18  0   C2  0   0   6F  FF  MS CAN
216 0.000240225 24  70  EF  28  28  0   0   0   MS CAN
216 0.000236225 2B  77  F7  2F  28  0   0   0   MS CAN
216 0.0002278   31  7D  FD  35  28  0   0   0   MS CAN

OBSERVED OUTPUT CSV1

340 0.000240175 0A  18  0   C2  0   0   6F  FF  MS CAN
216 0.000240225 24  70  EF  28  28  0   0   0   MS CAN
216 0.000236225 2B  77  F7  2F  28  0   0   0   MS CAN
216 0.0002278   31  7D  FD  35  28  0   0   0   MS CAN

还有成千上万的冗余行数据

EDITED - SOLVED AS PER ADVICE (CHANGED FOR TO WHILE):

LEARNING: 在Python中,FOR循环索引无法在RunTime中更改

dump_file=open("MATCH_PATTERN.txt",'w+')
print("Number of Entries CAN LOG={0}".format(len(lstCAN_LOG_TABLE)))
print("Number of Entries SHADOW LOG={0}".format(len(lstSHADOW_LOG_TABLE)))  
index = 0   
while(index < (input_file_one_row_count - 1)):
    l_index = 0
    while(l_index < (input_file_two_row_count - 1)):
        if(lstSHADOW_LOG_TABLE[l_index][1] == lstCAN_LOG_TABLE[index][1]):  #Consider for comparison only CAN IDs
            index_can_log = index                                           #Position where CAN Log is to be compared
            index_shadow_log = l_index                                      #Position from where CAN Shadow Log is to be considered
            start = index_shadow_log
            can_index = index
            if((index_shadow_log + MAX_COMMON_THRESHOLD) <= (input_file_two_row_count-1)):
                end = index_shadow_log + MAX_COMMON_THRESHOLD
            else:
                end = (index_shadow_log) + ((input_file_two_row_count-1) - (index_shadow_log))              
            bPreScreened = 1
            for num in range(start,end):
                if(lstSHADOW_LOG_TABLE[num][1] == lstCAN_LOG_TABLE[can_index][1]):                      
                    if((can_index + 1) < (input_file_one_row_count-1)):
                        can_index = can_index + 1                           
                    else:
                        break   
                else:
                    bPreScreened = 0
                    break
            #we might have found start of common region         
            if(bPreScreened == 1):      
                print("Shadow Start={0} Shadow End={1} CAN INDEX={2}".format(start,end,index))
                for number in range(start,end):
                    if(lstSHADOW_LOG_TABLE[number][1] == lstCAN_LOG_TABLE[index][1]):                           
                        writer_two.writerow(lstSHADOW_LOG_TABLE[number][0])
                        writer_one.writerow(lstCAN_LOG_TABLE[index][0])
                        if((index + 1) < (input_file_one_row_count-1)):
                            index = index + 1
                        if((l_index + 1) < (input_file_two_row_count-1)):
                            l_index = l_index + 1                               
                        else:
                            dump_file.close()   
                            print("\nCommon Region in Two CSVs identifed and recorded\n")                           
                            return
            else:
                l_index = l_index + 1
        else:
            l_index = l_index + 1
    index = index + 1   
dump_file.close()   
print("\nCommon Region in Two CSVs identifed and recorded\n")

1 回答

  • 0

    indexfor 循环中的迭代器 . 如果你在循环中更改它,它将在每次迭代后重新分配 .

    比如, for 循环中的 index = 5index += 1 执行3次 . 现在 index = 8 . 但是在此迭代结束后,当您的代码返回 for 时,索引将被分配给 index x = 6 .

    试试以下示例:

    for index in range(0,5):
        print 'iterator:', index
        index = index + 2
        print 'index:', index
    

    输出将是:

    iterator: 0
    index: 2
    iterator: 1
    index: 3
    iterator: 2
    index: 4
    iterator: 3
    index: 5
    iterator: 4
    index: 6
    

    要解决此问题,您可能希望将 for 循环更改为 while 循环

    EDIT: 如果我在两个文件中没有't understand wrong, you were trying to find '相同的列并存储它们 . 如果是这种情况,实际上您可以使用以下代码轻松完成工作:

    import csv # import csv module to read csv files
    
    file1 = 'csv1.csv' # input file 1
    file2 = 'csv2.csv' # input file 2
    outfile = 'csv3.csv' # only have one output file since two output files will be the same
    
    read1 = csv.reader(open(file1, 'r')) # read input file 1
    write = csv.writer(open(outfile, 'w')) # write to output file
    
    # for each row in input file 1, compare it with each row in input file 2
    # if they are the same, write that row into output file
    for row1 in read1:
        read2 = csv.reader(open(file2, 'r'))
        for row2 in read2:
            if row1 == row2:
                write.writerow(row1)
    
    read1.close()
    write.close()
    

相关问题