我有一个很大的Fasta文件,我想修改它 . 它基本上由许多序列组成, Headers 以“>”开头 . 我的问题是,即使序列是唯一的,一些 Headers 也不是唯一的 .
例:
acrdi | AD19082 STSTAFPLLTQFYGCAIIILVLAMCCSCLVYAMYFMNSSGLQTHESTVTQKVKDFSLQ WLQPILFGCSWRHRLIAKSRRNRSKIQPMTGTEPPWNESKDAFENLKTWALNKQNRNCLL EINFLEAKDFIVMCKDVVCFEEDDKDERNLNLCLKTLTEAFRFLRNCCAETPKNQSFVIS SGVAKQAIEVILILLRPVFQEREKGTEVITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCP QFFLDVLLSRHHSIQDCLCMIIFNCLNQQRRLQLVNNPKIISQIVHLCADKSLLEWGYFI LDCLICEGLFPDLYQGMEFDPLARIILLDLFQVKITDALDESSERTERTETPKELYASSL NYLAEQFETYFIDIIQRLQQLDYSSNDFFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASL LETCVDLLRETSKPEAKAAFKRPGTSYWEYVLPTFP> acrdi | AD19082 MLRQSEPPWNESKDAFENLKTWALNKQNRNCLLEINFLEAKDFIVMCKDVVCFEEDDKDE RNLNLCLKTLTEAFRFLRNCCAETPKNQSFVISSGVAKQAIEVILILLRPVFQEREKGTE VITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCPQFFLDVLLSRHHSIQDCLCMIIFNCLN QQRRLQLVNNPKIISQIVHLCADKSLLEWGYFILDCLICEGLFPDLYQGMEFDPLARIIL LDLFQVKITDALDESSERTERTETPKELYASSLNYLAEQFETYFIDIIQRLQQLDYSSND FFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASLLETCVDLLRETSKPEAKAAFSNVSSFP HSVDSGRISPSHGFQRDLVRVIGNMCYQHFPNQEKVRELDGIPLLLDHCNIDDHNPYICQ WAIFAIRNVLENNKENQDIVASIHPLGLADMSRLQQFGVDAVEFDGEKI
现在我想在我的大Fasta文件中找到所有重复项并将数字附加到重复项,以便我知道它是哪个副本(1,2,3,...,x) . 当找到新的副本(一个带有另一个 Headers )时,计数器应该从头开始 .
输出应该是这样的:
acrdi | AD19082 STSTAFPLLTQFYGCAIIILVLAMCCSCLVYAMYFMNSSGLQTHESTVTQKVKDFSLQ WLQPILFGCSWRHRLIAKSRRNRSKIQPMTGTEPPWNESKDAFENLKTWALNKQNRNCLL EINFLEAKDFIVMCKDVVCFEEDDKDERNLNLCLKTLTEAFRFLRNCCAETPKNQSFVIS SGVAKQAIEVILILLRPVFQEREKGTEVITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCP QFFLDVLLSRHHSIQDCLCMIIFNCLNQQRRLQLVNNPKIISQIVHLCADKSLLEWGYFI LDCLICEGLFPDLYQGMEFDPLARIILLDLFQVKITDALDESSERTERTETPKELYASSL NYLAEQFETYFIDIIQRLQQLDYSSNDFFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASL LETCVDLLRETSKPEAKAAFKRPGTSYWEYVLPTFP> acrdi | AD19082-1 MLRQSEPPWNESKDAFENLKTWALNKQNRNCLLEINFLEAKDFIVMCKDVVCFEEDDKDE RNLNLCLKTLTEAFRFLRNCCAETPKNQSFVISSGVAKQAIEVILILLRPVFQEREKGTE VITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCPQFFLDVLLSRHHSIQDCLCMIIFNCLN QQRRLQLVNNPKIISQIVHLCADKSLLEWGYFILDCLICEGLFPDLYQGMEFDPLARIIL LDLFQVKITDALDESSERTERTETPKELYASSLNYLAEQFETYFIDIIQRLQQLDYSSND FFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASLLETCVDLLRETSKPEAKAAFSNVSSFP HSVDSGRISPSHGFQRDLVRVIGNMCYQHFPNQEKVRELDGIPLLLDHCNIDDHNPYICQ WAIFAIRNVLENNKENQDIVASIHPLGLADMSRLQQFGVDAVEFDGEKI
我更喜欢使用awk或sed的方法,这样我就可以轻松地修改代码以在目录中的所有文件上运行 .
我不得不承认,我刚开始学习编程和解析,但我希望这不是一个愚蠢的问题 .
THX提前寻求帮助 .
1 回答
一个awk脚本:
上面使用“>”作为记录分隔符,并检查第一个字段(可以复制的 Headers 名称) . 对于它打印的每一行,它在该字段出现的每个额外时间后在 Headers 名称后面添加一个后缀(即第一个副本为“-1”,第二个为“-2”......)