Extract multiple pieces of information having optional fields with RegEx

I'd appreciate help understanding how to extract this information from a text file. One of the fields is optional (labelled "S" here).

The text looks like this:

NAME     Case No. Duration PLAN       ACCT DATE ST DATE A MODE   AMOUNT CRATE S AccountTotal
PETER             AB02651341 RN BUILDER IUL CTAT 02/05/15 02/05/15 01             380.00   0.0050            1.90
JOHNSON, DON A BF06010672 FY AGGVANT 15 NT1      02/02/15 02/01/15 01            83.04   0.0500            4.15
SARA             ZZ02659940 RN CUST GUAR          01/31/15 01/30/15 12        18,450.00- 0.0025            46.13-
MIKE              KH02979366 RN CUST GUAR        02/02/15 02/01/15 01             109.83   0.0025 .50         .14

Is it possible to output it like this (in an array or some other structure):

NAME     Case No. Duration PLAN       DATE ST DATE A MODE   AMOUNT CRATE S AccountTotal
PETER             AB02651341 RN BUILDER IUL CTAT 02/05/15 02/05/15 01             380.00   0.0050            1.90
JOHNSON, DON A BF06010672 FY AGGVANT 15 NT1      02/02/15 02/01/15 01            83.04   0.0500            4.15
SARA             ZZ02659940 RN CUST GUAR          01/31/15 01/30/15 12       -18,450.00  0.0025           -46.13
MIKE              KH02979366 RN CUST GUAR        02/02/15 02/01/15 01             109.83   0.0025 .50         .14

The final output would be something like this:

Array ( [0] => Array ( [NAME] => PETER [Case No.] => AB02651341 [Duration] => RN [PLAN] => BUILDER IUL CTAT [DATE ST] => 02/02/15 [DATE A] => 02/05/2015 [MODE] => 01 [AMOUNT] => 380.00 [CRATE] => 0.0050 [S] => [AccountTotal] => 1.90 ) 
        [1] => Array ( [NAME] => JOHNSON, DON A [Case No.] => BF06010672 [Duration] => FY [PLAN] => AGGVANT 15 NT1 [DATE ST] => 02/2/2015 [DATE A] => 02/01/15 [MODE] => 01 [AMOUNT] => 83.04 [CRATE] => 0.0500 [S] => [AccountTotal] => 4.15 ) 
        [2] => Array ( [NAME] => SARA [Case No.] => ZZ02659940 [Duration] => RN [PLAN] => CUST GUAR [DATE ST] => 01/31/2015 [DATE A] => 01/30/2015 [MODE] => 12 [AMOUNT] => -18,450.00 [CRATE] => 0.0025 [S] => [AccountTotal] => -46.13 ) 
        [3] => Array ( [NAME] => MIKE [Case No.] => KH02979366 [Duration] => RN [PLAN] => CUST GUAR [DATE ST] => 02/02/15 [DATE A] => 02/01/2015 [MODE] => 01 [AMOUNT] => 109.83 [CRATE] => 0.0025 [S] => .50 [AccountTotal] => .14 ) )

Maybe this will work?

$a = <<<EOT
  NAME     Case No. Duration PLAN       ACCT DATE ST DATE A MODE   AMOUNT CRATE S AccountTotal
  PETER             AB02651341 RN BUILDER IUL CTAT 02/05/15 02/05/15 01             380.00   0.0050            1.90
  JOHNSON, DON A BF06010672 FY AGGVANT 15 NT1      02/02/15 02/01/15 01            83.04   0.0500            4.15
  SARA             ZZ02659940 RN CUST GUAR          01/31/15 01/30/15 12        18,450.00- 0.0025            46.13-
  MIKE              KH02979366 RN CUST GUAR        02/02/15 02/01/15 01             109.83   0.0025 .50         .14
EOT;
$cols = array(
    'NAME'         => '\s+(.*?)',
    'Case No.'     => '\s+(\w\w\d{8})',
    'Duration'     => '\s(\w\w)',
    'PLAN'         => '\s+(.*?)',
    'DATE ST'      => '\s+(\d\d/\d\d/\d\d)',
    'DATE A'       => '\s+(\d\d/\d\d/\d\d)',
    'MODE'         => '\s+(\d\d)',
    'AMOUNT'       => '\s+(\-?.*?)',
    'CRATE'        => '\s+(\d+\.\d+)',
    'S'            => '\s+([\.\d]*)',
    'AccountTotal' => '\s+(\-?.*?)$',
);
$result = array();
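// Split on any line ending (\R) so this also works when PHP_EOL differs from the file's newlines
foreach (preg_split('/\R/', $a) as $row) {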
    if (preg_match('#' . implode(array_values($cols)) . '#', $row, $matches)) {
        // Move any trailing dash to the front of AMOUNT and
        // AccountTotal (a bit hackish - could be improved :)
        $matches[8]  = preg_replace('/(.*)-$/', '-$1', $matches[8]);
        $matches[11] = preg_replace('/(.*)-$/', '-$1', $matches[11]);
        $result[] = array_combine(array_keys($cols), array_slice($matches, 1));
    }
}
print_r($result);

Output:

Array
(
    [0] => Array
        (
            [NAME] => PETER
            [Case No.] => AB02651341
            [Duration] => RN
            [PLAN] => BUILDER IUL CTAT
            [DATE ST] => 02/05/15
            [DATE A] => 02/05/15
            [MODE] => 01
            [AMOUNT] => 380.00
            [CRATE] => 0.0050
            [S] => 
            [AccountTotal] => 1.90
        )
    [1] => Array
        (
            [NAME] => JOHNSON, DON A
            [Case No.] => BF06010672
            [Duration] => FY
            [PLAN] => AGGVANT 15 NT1
            [DATE ST] => 02/02/15
            [DATE A] => 02/01/15
            [MODE] => 01
            [AMOUNT] => 83.04
            [CRATE] => 0.0500
            [S] => 
            [AccountTotal] => 4.15
        )
    [2] => Array
        (
            [NAME] => SARA
            [Case No.] => ZZ02659940
            [Duration] => RN
            [PLAN] => CUST GUAR
            [DATE ST] => 01/31/15
            [DATE A] => 01/30/15
            [MODE] => 12
            [AMOUNT] => -18,450.00
            [CRATE] => 0.0025
            [S] => 
            [AccountTotal] => -46.13
        )
    [3] => Array
        (
            [NAME] => MIKE
            [Case No.] => KH02979366
            [Duration] => RN
            [PLAN] => CUST GUAR
            [DATE ST] => 02/02/15
            [DATE A] => 02/01/15
            [MODE] => 01
            [AMOUNT] => 109.83
            [CRATE] => 0.0025
            [S] => .50
            [AccountTotal] => .14
        )
)
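
The code above moves the trailing dash with two preg_replace calls and notes that this could be improved. One possible cleanup, if you want real numbers rather than strings, is a small helper like the sketch below. The name toNumber is made up for illustration, and it assumes "," is only ever a thousands separator:

```php
<?php
// Hypothetical helper (not part of the answer above): converts values such as
// "18,450.00-" or "-18,450.00" to a negative float, and "380.00" to a positive one.
function toNumber(string $value): float
{
    $value = trim($value);
    // Treat a dash at either end as the sign.
    $negative = substr($value, -1) === '-' || substr($value, 0, 1) === '-';
    // Assumption: "," is a thousands separator, so it can simply be dropped.
    $digits = str_replace([',', '-'], '', $value);
    return $negative ? -(float) $digits : (float) $digits;
}

var_dump(toNumber('18,450.00-')); // float(-18450)
var_dump(toNumber('.14'));        // float(0.14)
```

You could apply it to the AMOUNT and AccountTotal entries of each row after array_combine.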

You can put ? after a regexp to indicate that it is optional. So, if XXX is the regular expression for the earlier part of the line, you can write:

preg_match('/^XXX(?:\s+([\d.]+))?\s+([\d.]+)$/', $line, $match);

When the field is not provided, the capture group for the S field will be empty.
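
To see the optional group working on a whole row, here is a minimal sketch. The parseRow function, the named groups, and the stricter number patterns (which also accept a trailing "-") are my own assumptions rather than anything from the question, so adjust them to your real data. The full-row pattern stands in for XXX: without it, the optional group could just as well swallow the CRATE value.

```php
<?php
// Hypothetical helper: parse one data row in which the S column may be absent.
// Returns null for lines that do not look like a data row (e.g. the header).
function parseRow(string $line): ?array
{
    $pattern = '~^\s*(?<name>.+?)\s+(?<case>[A-Z]{2}\d{8})\s+(?<duration>[A-Z]{2})'
        . '\s+(?<plan>.+?)\s+(?<date_st>\d\d/\d\d/\d\d)\s+(?<date_a>\d\d/\d\d/\d\d)'
        . '\s+(?<mode>\d\d)\s+(?<amount>[\d,]*\.\d+-?)\s+(?<crate>\d+\.\d+)'
        . '(?:\s+(?<s>\d*\.\d+))?'               // the optional S column
        . '\s+(?<total>[\d,]*\.\d+-?)\s*$~';

    if (!preg_match($pattern, $line, $m)) {
        return null;
    }
    // preg_match returns numeric keys as well; keep only the named groups.
    return array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY);
}

$with    = parseRow('MIKE KH02979366 RN CUST GUAR 02/02/15 02/01/15 01 109.83 0.0025 .50 .14');
$without = parseRow('PETER AB02651341 RN BUILDER IUL CTAT 02/05/15 02/05/15 01 380.00 0.0050 1.90');

var_dump($with['s']);     // string(3) ".50"
var_dump($without['s']);  // string(0) ""
```

Because the group after S (the total) always matches, PHP fills the skipped S group with an empty string instead of leaving the key out.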