
Basic/conceptual question about query performance with Cypher and Neo4j


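I am working on a project about credit card fraud. I have some generated sample data in a .CSV file (pipe-delimited), where each row holds a person's information, the transaction details, and the merchant name. Because the data is generated, each row also carries a flag indicating whether the transaction is fraudulent.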

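What I am trying to do is load the data into Neo4j, create nodes (persons, transactions, and merchants), and then visualize a graph of the fraudulent charges to see whether they share any common merchants. (I know there is a Neo4j sample dataset similar to this, but I am trying to apply the concept to a separate project.)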

I load the data, create the constraints, and then try my query, which seems to run forever.

Here are a few lines of sample data:

ssn|cc_num|first|last|gender|street|city|state|zip|lat|long|city_pop|job|dob|acct_num|profile|trans_num|trans_date|trans_time|unix_time|category|amt|is_fraud|merchant|merch_lat|merch_long
692-42-2939|5270441615999263|Eliza|Stokes|F|684 Abigayle Port Suite 372|Tucson|AZ|85718|32.3112|-110.9179|865276|Science writer|1962-12-06|563973647649|40_60_bigger_cities.json|2e5186427c626815e47725e59cb04c9f|2013-03-21|02:01:05|1363831265|misc_net|838.47|1|fraud_Greenfelder, Bartoletti and Davis|31.616203|-110.221915
692-42-2939|5270441615999263|Eliza|Stokes|F|684 Abigayle Port Suite 372|Tucson|AZ|85718|32.3112|-110.9179|865276|Science writer|1962-12-06|563973647649|40_60_bigger_cities.json|7d3f5eae923428c51b6bb396a3b50aab|2013-03-22|22:36:52|1363991812|shopping_net|907.03|1|fraud_Gerlach Inc|32.142740|-111.675048
692-42-2939|5270441615999263|Eliza|Stokes|F|684 Abigayle Port Suite 372|Tucson|AZ|85718|32.3112|-110.9179|865276|Science writer|1962-12-06|563973647649|40_60_bigger_cities.json|76083345f18c5fa4be6e51e4d0ea3580|2013-03-22|16:40:20|1363970420|shopping_pos|912.03|1|fraud_Morissette PLC|31.909227|-111.3878746

The sample file I am working with has about 60k transactions.

Below is my Cypher query/code.

USING PERIODIC COMMIT 1000
 LOAD CSV WITH HEADERS FROM "card_data.csv"
 AS line FIELDTERMINATOR '|'


 CREATE (p:Person { id: toInt(line.cc_num), name_first: line.first, name_last: line.last })
 CREATE (m:Merchant { id: line.merchant, name: line.merchant })
 CREATE (t:Transaction { id: line.trans_num, merchant_name: line.merchant, card_number:line.cc_num, amount:line.amt, is_fraud:line.is_fraud, trans_date:line.trans_date, trans_time:line.trans_time })

 create constraint on (t:Transaction) assert t.trans_num is unique;
 create constraint on (p:Person) assert p.cc_num is unique;

 MATCH (m:Merchant)
 WITH m
 MATCH (t:Transaction{merchant_name:m.merchant,is_fraud:1})
 CREATE (m)-[:processed]->(t)

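As you can see in the second MATCH clause, I am trying to restrict the match to fraudulent transactions only (is_fraud:1); of the roughly 65k transactions, 230 have is_fraud:1.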

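Any idea why this query runs endlessly? I have many more, much larger datasets that I would like to examine this way, and the results so far on small data are not encouraging (I am sure that is due to my lack of understanding, not a fault of Neo4j).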

2 Answers

  • 0

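    You don't show any index creation. To speed things up, you should create indexes on merchant_name and is_fraud, so that Neo4j does not have to scan all Transaction nodes sequentially for each merchant: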

    CREATE INDEX ON :Transaction(merchant_name)
    CREATE INDEX ON :Transaction(is_fraud)
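
    Indexes only help if the lookup can match at all. Going by the load statement in the question, two details are worth checking: Merchant nodes are created with `id` and `name` but no `merchant` property, so `m.merchant` evaluates to null, and LOAD CSV reads every field as a string, so `is_fraud` is stored as '1' rather than the integer 1. Below is a minimal sketch of the relationship query under those assumptions. It starts from the roughly 230 fraudulent transactions instead of from every merchant; the index on `:Merchant(name)` is an addition of this sketch, not something shown in the question.

    ```cypher
    // Assumption: the data was loaded with the question's CREATE statements,
    // so is_fraud holds the string '1' and Merchant nodes carry a name property.
    CREATE INDEX ON :Merchant(name);

    // Start from the few fraudulent transactions, then look up each merchant by name.
    MATCH (t:Transaction {is_fraud: '1'})
    MATCH (m:Merchant {name: t.merchant_name})
    MERGE (m)-[:processed]->(t);
    ```

    Prefixing the MATCH statement with EXPLAIN or PROFILE should show whether the plan uses an index seek (NodeIndexSeek) rather than a label scan (NodeByLabelScan). If the load created one Merchant node per CSV row, every duplicate with a matching name gets linked to the transaction.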
    
  • 0

    You are creating duplicate entries for merchants and persons.
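
    One way to confirm this, sketched under the assumption that the data was loaded with the question's CREATE statements: count how many Merchant nodes share a name. Since CREATE runs once per CSV row, each merchant and each card holder ends up with one node per transaction.

    ```cypher
    // Lists merchant names that are stored on more than one node.
    MATCH (m:Merchant)
    WITH m.name AS name, count(*) AS copies
    WHERE copies > 1
    RETURN name, copies
    ORDER BY copies DESC
    LIMIT 10;
    ```

    The load script that follows avoids the duplicates by using MERGE for persons and merchants.
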

    // not really needed if you don't merge transactions
    // and if you don't look up transactions by trans_num
    // create constraint on (t:Transaction) assert t.trans_num is unique;
    
    // can't a person use multiple credit cards?
    create constraint on (p:Person) assert p.cc_num is unique;
    
    create constraint on (p:Person) assert p.id is unique;
    create constraint on (m:Merchant) assert m.id is unique;
    
    
    
    
    USING PERIODIC COMMIT 1000
     LOAD CSV WITH HEADERS FROM "card_data.csv" AS line FIELDTERMINATOR '|'
    
    
    MERGE (p:Person { id: toInt(line.cc_num)})
       ON CREATE SET p.name_first=line.first, p.name_last=line.last
    MERGE (m:Merchant { id: line.merchant}) ON CREATE SET m.name = line.merchant
    
    // LOAD CSV yields strings, so is_fraud is converted to an integer here
    CREATE (t:Transaction { id: line.trans_num, card_number:line.cc_num, amount:line.amt, merchant_name: line.merchant,
            is_fraud:toInt(line.is_fraud), trans_date:line.trans_date, trans_time:line.trans_time })
    
    CREATE (p)-[:issued]->(t)
    
    // only connect fraudulent transactions to the merchant
    // (WHERE must follow WITH or MATCH, so carry m and t forward first)
    WITH m, t
    WHERE t.is_fraud = 1
    // also add indicator label to transaction for easier selection / processing later
    SET t:Fraudulent
    CREATE (m)-[:processed]->(t);
    

    Alternatively, you can connect all transactions to their merchants and indicate fraud only via a label or alternative relationship types.
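
    A minimal sketch of that alternative, assuming the same CSV columns and the unique constraints on `Person.id` and `Merchant.id` from above. The trimmed property list, the `toFloat` conversion, and the FOREACH/CASE idiom for conditional labelling are choices of this sketch rather than part of the original answer. The syntax follows the older Neo4j release used in this thread; newer releases spell some of it differently (for example `toInteger` instead of `toInt`).

    ```cypher
    USING PERIODIC COMMIT 1000
    LOAD CSV WITH HEADERS FROM "card_data.csv" AS line FIELDTERMINATOR '|'
    MERGE (p:Person {id: toInt(line.cc_num)})
    MERGE (m:Merchant {id: line.merchant})
    CREATE (t:Transaction {id: line.trans_num, amount: toFloat(line.amt), is_fraud: toInt(line.is_fraud)})
    CREATE (p)-[:issued]->(t)
    // every transaction is linked to its merchant, fraudulent or not
    CREATE (m)-[:processed]->(t)
    // FOREACH over a one- or zero-element list acts as a conditional SET
    FOREACH (ignored IN CASE WHEN t.is_fraud = 1 THEN [1] ELSE [] END |
      SET t:Fraudulent);

    // Merchants ranked by the number of fraudulent transactions they processed
    MATCH (m:Merchant)-[:processed]->(t:Fraudulent)
    RETURN m.id AS merchant, count(t) AS fraud_count
    ORDER BY fraud_count DESC
    LIMIT 20;
    ```

    With the label in place, the ranking query reads only the Fraudulent transactions, so it stays cheap even when the full transaction set is large.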
